CICC v13.0 — Reverse Engineering Reference
CICC is NVIDIA's CUDA C-to-PTX compiler — the binary that transforms CUDA C++ source code (or LLVM bitcode) into PTX assembly for GPU execution. At 60 MB, it is one of the largest single compiler binaries in production use. This wiki documents its internal architecture, recovered from static analysis of the stripped x86-64 ELF binary using IDA Pro 8.x and Hex-Rays decompilation.
| Field | Value |
|---|---|
| Binary | cicc v13.0, 60,108,328 bytes, x86-64, stripped |
| Build | cuda_13.0.r13.0/compiler.36424714_0 |
| Decompilation | 80,562 functions, 80,281 recovered (99.65%), IDA Pro 8.x + Hex-Rays |
| Strings | 188,141 extracted |
| LLVM base | LLVM 20.0.0 (internal), bitcode producer ID "LLVM7.0.1" (NVVM compat) |
| LLVM pass classes | ~402 standard + 35 NVIDIA custom |
| CLI options | ~1,689 registered via cl::opt + 222 NVVMPassOptions slots |
| NVVM builtins | 770 (IDs 1–770, wyhash open-addressing table) |
| Default target | sm_75 (Turing) |
| Supported SMs | sm_75 (Turing) through sm_121f (Blackwell) |
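The builtin table noted above is described as a wyhash open-addressing structure mapping builtin names to IDs 1–770. As a rough illustration of that lookup scheme (not the recovered implementation — the hash function here is a stand-in for wyhash, and the names/IDs are hypothetical), an open-addressing probe looks like this:

```python
# Hypothetical sketch of an open-addressing builtin table: slots hold
# (name, id) pairs, linear probing resolves collisions, and a miss is
# signaled by reaching an empty slot. The real table uses wyhash; Python's
# built-in hash() stands in here purely to drive the probe sequence.

class BuiltinTable:
    def __init__(self, capacity=16):
        self.slots = [None] * capacity  # each slot: (name, builtin_id) or None

    def _probe(self, name):
        h = hash(name) % len(self.slots)   # stand-in for wyhash
        while True:
            yield h
            h = (h + 1) % len(self.slots)  # linear probing

    def insert(self, name, builtin_id):
        for i in self._probe(name):
            if self.slots[i] is None or self.slots[i][0] == name:
                self.slots[i] = (name, builtin_id)
                return

    def lookup(self, name):
        # terminates because the sketch keeps the table sparsely filled
        for i in self._probe(name):
            if self.slots[i] is None:
                return 0                   # 0 = not a builtin
            if self.slots[i][0] == name:
                return self.slots[i][1]

table = BuiltinTable()
table.insert("__nv_sinf", 101)   # IDs invented for illustration
table.insert("__nv_cosf", 102)
```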
Three Subsystems
CICC is not a monolithic compiler. It is composed of three largely independent subsystems, each with its own lineage, coding conventions, and internal data structures:
1. EDG 6.6 C++ Frontend (3.2 MB, 0x5D0000–0x8F0000) — A licensed commercial frontend from Edison Design Group that parses CUDA C++ source code and emits transformed C code. It operates as a source-to-source translator: CUDA kernel launch syntax (<<<>>>) is lowered to CUDA runtime API calls, memory space qualifiers (__shared__, __constant__) are resolved to address space annotations, and C++ templates/constexpr are fully evaluated. The output is not LLVM IR — it is C code that feeds into a second compilation phase. See EDG 6.6 Frontend.
2. NVVM Bridge (~4 MB, 0x8F0000–0x12CFFFF) — The glue layer between EDG and LLVM. It handles CLI parsing, architecture detection (23 SM variants with 3-column flag fan-out), the dual-path compilation dispatch (Path A via LibNVVM API, Path B standalone), the NVVMPassOptions knob system (222 per-pass configuration slots), and the 770-entry builtin resolution table. This layer is entirely NVIDIA-proprietary. See Entry Point & CLI and LLVM Optimizer.
3. LLVM 20.0.0 Backend (~45 MB, 0x12D0000–0x3BFFFFF) — A heavily modified LLVM fork that performs IR optimization and PTX code generation. NVIDIA has added 35 custom passes (MemorySpaceOpt, Rematerialization, BranchDist, LoopIndexSplit, Sinking2, etc.), a proprietary two-phase compilation model with per-function thread parallelism, and extensive modifications to the NVPTX backend for tensor core code generation across 5 GPU architecture generations. See Code Generation and PTX Emission.
Additionally, jemalloc 5.3.x (~400 functions at 0x12FC000) is statically linked, replacing the system allocator for improved memory allocation performance during compilation.
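To make the EDG source-to-source step concrete: kernel launch syntax is rewritten into ordinary C calls before any LLVM IR exists. The toy rewrite below illustrates the idea only — the real emitted C uses NVIDIA's internal stub conventions, and the `__cicc_launch` name is invented for this sketch:

```python
# Toy illustration of <<<>>> lowering: rewrite "kernel<<<grid, block>>>(args)"
# into a runtime-API-style call. The actual EDG output differs in form
# (stub functions, argument marshalling); this shows the source-to-source
# nature of the transformation, nothing more.
import re

LAUNCH = re.compile(r"(\w+)<<<(.+?),\s*(.+?)>>>\((.*)\)")

def lower_launch(line):
    return LAUNCH.sub(
        lambda m: f"__cicc_launch(\"{m.group(1)}\", {m.group(2)}, {m.group(3)}, {m.group(4)})",
        line)
```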
Dual-Path Architecture
A distinctive feature of cicc is its dual-path design — two complete copies of the compilation backend exist within the same binary, selected at runtime:
| | Path A (0x90xxxx) | Path B (0x126xxxx) |
|---|---|---|
| Purpose | LibNVVM API mode | Standalone mode |
| Simple compile | sub_902D10 | sub_1262860 |
| Multi-stage | sub_905EE0 (43KB) | sub_1265970 (48KB) |
| CLI parsing | sub_900130 | sub_125FB30 |
| Builtin table | sub_90AEE0 (109KB) | sub_126A910 (123KB) |
| Libdevice | unk_3EA0080 (455KB) | unk_420FD80 (455KB) |
| Version string | -nvvm-version=nvvm-latest | -nvvm-version=nvvm70 |
Runtime selection is controlled by v253 in sub_8F9C90 (the real main function). The default value (2) triggers an environment variable lookup through an obfuscated string comparison to determine which path to take. This design allows a single binary to serve both the nvcc driver toolchain and the LibNVVM runtime compilation API.
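The selector logic can be sketched as follows. This is a minimal illustration under stated assumptions, not the decompiled code: the environment variable name (`CICC_PATH_OVERRIDE`) and the value-to-path mapping are invented, since the binary's actual string comparison is obfuscated.

```python
# Hypothetical sketch of v253-style dispatch: explicit values force a path,
# while the default (2) defers to an environment lookup, mirroring the
# behavior described for sub_8F9C90. Variable and env names are invented.
import os

def select_path(v253, env=os.environ):
    if v253 == 0:
        return "A"   # LibNVVM API mode (assumed mapping)
    if v253 == 1:
        return "B"   # standalone mode (assumed mapping)
    # default value 2: consult the environment, as the obfuscated
    # string comparison is described as doing
    return "A" if env.get("CICC_PATH_OVERRIDE") == "libnvvm" else "B"
```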
Compilation Pipeline
Both paths converge on the same 5-stage pipeline:
CUDA C++ Source (.cu / .ci / .i)
│
├─ EDG 6.6 Frontend (sub_5D2A80)
│ ├─ lgenfe_main (sub_617BD0): 282-case CLI, 737 #defines
│ ├─ Parser: recursive-descent + declaration specifier state machine
│ ├─ Constexpr evaluator: 317KB tree-walking interpreter
│ └─ Backend: "Generating NVVM IR" → .int.c / .device.c / .stub.c
│
└─ NVVM/LLVM Pipeline
│
├─ IRGEN: EDG IL → LLVM IR translation (cicc's equivalent of Clang CodeGen)
│ Type translation (fixed-point iteration, address space mapping)
│ Expression/statement/function codegen (recursive AST walk)
│ CUDA semantic lowering (threadIdx→intrinsics, printf→vprintf, etc.)
│ Kernel metadata emission (nvvm.annotations)
│ Two copies: Path A (0x90xxxx) and Path B (0x126xxxx)
│
├─ LNK: Module linking + libdevice (455KB embedded bitcode)
│ Triple validation (must be nvptx64-)
│ IR version check (nvvmir.version metadata)
│
├─ OPT: Two-phase compilation (Phase I: whole-module, Phase II: per-function)
│ ~150 pass insertions via sub_12E54A0
│ Three language paths: "ptx" / "mid" / default
│ 35 NVIDIA custom passes interleaved with standard LLVM
│ Optional: concurrent per-function compilation (thread pool + jobserver)
│
├─ OPTIXIR: OptiX IR generation (optional, --emit-optix-ir)
│
└─ LLC: NVPTX backend code generation
SelectionDAG lowering (2.3 MB NVPTXTargetLowering)
19 MMA shapes × 11 data types for tensor core codegen
9 PTX register classes
StructurizeCFG (mandatory for PTX structured control flow)
→ .ptx output
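The stages above form a strictly linear driver, with OPTIXIR as an optional branch. A skeletal sketch (stage bodies are placeholders; only the ordering reflects the diagram):

```python
# Minimal driver sketch of the 5-stage pipeline: each stage consumes the
# previous stage's artifact. Artifact strings are placeholders standing in
# for real on-disk/in-memory products (.device.c, bitcode modules, PTX).

def run_pipeline(cu_source, emit_optix_ir=False):
    trace = []
    def stage(name, artifact):
        trace.append(name)
        return artifact
    c_src  = stage("EDG",   f"transformed-C({cu_source})")     # frontend
    ir     = stage("IRGEN", f"llvm-ir({c_src})")               # EDG IL -> LLVM IR
    linked = stage("LNK",   f"ir+libdevice({ir})")             # module linking
    optd   = stage("OPT",   f"optimized({linked})")            # two-phase opt
    if emit_optix_ir:
        stage("OPTIXIR", optd)                                 # optional branch
    ptx    = stage("LLC",   f"ptx({optd})")                    # NVPTX codegen
    return ptx, trace
```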
Subsystem Address Map
| Subsystem | Address Range | Size | Key Entry Point |
|---|---|---|---|
| jemalloc stats | 0x40D000–0x41FFFF | ~80KB | sub_40D5CA (vsnprintf) |
| Global constructors | 0x430000–0x5CFFFF | ~1.6 MB | cl::opt registration (~1,689 options) |
| EDG 6.6 Frontend | 0x5D0000–0x8EFFFF | 3.2 MB | sub_5D2A80 (orchestrator) |
| CLI / Real Main | 0x8F0000–0x96FFFF | 520 KB | sub_8F9C90 (real main) |
| Bitcode reader | 0x9F0000–0xAFFFFF | ~1 MB | sub_9F2A40 (parseFunctionBody) |
| LLVM verifier | 0xBF0000–0xC6FFFF | 500 KB | sub_BFC6A0 (visitCallInst) |
| LLVM passes | 0xC00000–0x12CFFFF | ~7 MB | InstCombine, GVN, DSE, LICM, etc. |
| PassManager / NVVM bridge | 0x12D0000–0x16FFFFF | 4.2 MB | sub_12E54A0 (pipeline assembly) |
| Backend / machine passes | 0x1700000–0x1EFFFFF | 8 MB | MRPA, Block Remat, Mem2Reg |
| SelectionDAG | 0x1F00000–0x20FFFFF | 2 MB | sub_20019C0 (LegalizeTypes, 348KB) |
| NVPTX emission | 0x2100000–0x21FFFFF | 1 MB | sub_215A3C0 (function headers) |
| New PM / pass registration | 0x2340000–0x23FFFFF | 768 KB | sub_2342890 (2,816-line registrar) |
| Loop passes | 0x2A00000–0x2DFFFFF | 4 MB | LoopVectorize, SLP, Unroll, etc. |
| NVPTX ISel + lowering | 0x3000000–0x36FFFFF | 7 MB | sub_33B0210 (intrinsic switch, 343KB) |
| Embedded libdevice | 0x3EA0080 / 0x420FD80 | 456 KB × 2 | LLVM bitcode (~400 math functions) |
Reading This Wiki
The wiki is organized around the compilation pipeline. Every page is written at reimplementation-grade depth for an audience of senior C++ developers with LLVM backend experience.
Section Index
- Pipeline Overview — End-to-end compilation flow diagram with links to every stage.
- Entry Point & CLI — CLI parsing, dual-path dispatch, architecture detection.
- EDG 6.6 Frontend — CUDA C++ to transformed C source-to-source translation.
- NVVM IR Generation — EDG IL tree to LLVM Module: types, expressions, statements, functions.
- LLVM Optimizer — Two-phase compilation, pipeline assembly, NVVMPassOptions.
- Code Generation — SelectionDAG, ISel, register allocation, scheduling.
- PTX Emission — AsmPrinter, directive emission, PTX body output.
- NVIDIA Custom Passes — 35 proprietary passes not in upstream LLVM.
- LLVM Pass Pipeline & Ordering — Complete pass registration, execution order per O-level, tier system.
- NVVM Builtins — 770-entry builtin table: hash structure, ID inventory, category breakdown.
- GPU Targets — SM feature gates, architecture detection, sm_75 through sm_121f.
- Data Structures — IR node layout, pattern database, DAG node, symbol table, NVVM container.
- Infrastructure — Alias analysis, MemorySSA, AsmPrinter, debug verification, NVPTX target.
- LTO & Module Optimization — Cross-TU inlining, devirtualization, GlobalOpt, ThinLTO import.
- Configuration — Three knob systems: ~1,689 cl::opt flags, 222 NVVMPassOptions slots, ~70 codegen knobs.
- Reference — Address spaces, register classes, NVPTX opcodes, GPU execution model.
- Function Map — Address-to-identity lookup for ~350 key functions with confidence levels.
- Binary Layout — Subsystem address map at pass granularity.
- Methodology — How this analysis was performed and how to assess confidence.
Reading Path 1: End-to-End Pipeline Understanding
Goal: understand how CUDA source becomes PTX, what each stage does, and how control flows between subsystems.
Read in this order:
- Pipeline Overview — The complete flow diagram. Establishes the 10 stages and their address ranges. Read this first to build the mental model that all other pages assume.
- Entry Point & CLI — How cicc is invoked, the 1,689-flag CLI, dual-path dispatch (Path A LibNVVM vs. Path B standalone), and the sub_8F9C90 real-main function.
- nvcc-to-cicc Interface — The flag translation layer between nvcc and cicc: the 40+ flag mappings and the 3-column architecture fan-out. Necessary context for understanding why certain flags exist.
- EDG 6.6 Frontend — The commercial C++ frontend. How CUDA syntax is lowered to C, the 737 configuration #defines, and the .int.c/.device.c/.stub.c output split.
- NVVM IR Generation — The EDG-to-LLVM bridge. Then follow the four sub-pages: Type Translation → Expressions → Statements → Functions.
- Libdevice Linking — The embedded 455KB bitcode library with 352 __nv_* math functions. Triple validation, version checking.
- LLVM Optimizer — The two-phase compilation model, the 49.8KB pipeline assembler (sub_12E54A0), pass ordering, and the NVVMPassOptions knob system. This is the longest and densest stage.
- Pipeline & Pass Ordering — The exact pass execution order at each O-level, the tier system, and the 526 registered passes.
- Code Generation — SelectionDAG lowering, instruction selection, register allocation, instruction scheduling. Hub page with links to deep dives.
- PTX Emission — AsmPrinter, directive headers, PTX body output, metadata emission.
Optional extensions after the core path:
- OptiX IR Generation — The alternative output mode for ray tracing workloads.
- Debug Info Pipeline — How -g debug metadata survives the optimizer.
- LTO & Module Optimization — Cross-module optimization when compiling multiple translation units.
- Concurrent Compilation — The Phase II thread pool and GNU Jobserver integration.
- GPU Execution Model — Background on warps, divergence, shared memory, and address spaces if you are new to GPU architecture.
Reading Path 2: Reimplementing a Specific Pass
Goal: reproduce the exact behavior of one NVIDIA custom pass or understand an LLVM pass modification deeply enough to write a compatible replacement.
For an NVIDIA custom pass (e.g., MemorySpaceOpt, Rematerialization, BranchDist):
- NVIDIA Custom Passes — Overview — Locate the pass in the inventory table. Note its category (module/function/loop/machine), its pipeline position, and its controlling knobs.
- The pass's dedicated page (e.g., MemorySpaceOpt, Rematerialization, Branch Distribution). Every dedicated page contains the function address, decompiled algorithm, data flow description, controlling knobs, and diagnostic strings.
- NVVMPassOptions — The 222-slot struct that controls per-pass enable/disable toggles and parametric thresholds. Find which slots your target pass reads.
- Pipeline & Pass Ordering — Determine exactly where the pass runs in the pipeline. Identify what analyses it depends on (must run before it) and what passes consume its results (run after it).
- Optimization Levels — Determine at which O-levels the pass is enabled, disabled, or parameterized differently.
- Function Map — Cross-reference the pass's internal function addresses with the master function map for confidence levels.
For a modified LLVM pass (e.g., InstCombine, GVN, DSE, LICM, LoopVectorize):
- The pass's dedicated page (e.g., InstCombine, GVN, DSE, LICM). These pages document NVIDIA's modifications relative to upstream LLVM 20.0.0.
- Alias Analysis & NVVM AA — The custom alias analysis chain. Nearly every optimization pass depends on AA, and NVIDIA's GPU-aware AA behaves differently from upstream (address-space-aware NoAlias for disjoint spaces, __restrict__ propagation).
- MemorySSA — The memory dependence representation used by DSE, LICM, and other memory-sensitive passes.
For a machine-level pass (e.g., Block Remat, MRPA, Machine Mem2Reg):
- Machine-Level Passes — The complete machine pass pipeline with per-pass algorithm descriptions.
- Register Allocation — The greedy RA algorithm with NVIDIA's occupancy-driven spill heuristics.
- Register Classes — The 9 PTX register classes and their constraints.
- NVPTX Machine Opcodes — The MachineInstr opcode reference.
Supporting references for any pass reimplementation:
- IR Node Layout — The internal IR data structures that passes operate on.
- Address Spaces — GPU address space semantics that many passes must respect.
- NVPTX Target Infrastructure — TargetMachine, TTI hooks, and target feature queries.
- Diagnostics — The three diagnostic systems (EDG, LLVM remarks, profuse framework) for reproducing pass-level reporting.
Reading Path 3: Debugging Correctness
Goal: diagnose a miscompilation, a crash, or incorrect PTX output by tracing the problem to a specific pass or pipeline stage.
Start with instrumentation and observability:
- Diagnostics & Optimization Remarks — The three independent diagnostic layers: EDG frontend errors, LLVM optimization remarks (-opt-bisect-limit, -Rpass=, -Rpass-missed=), and NVIDIA's profuse framework (profuseinline, profusegvn). This page tells you how to make cicc talk about what it is doing.
- Debug Info Verification — The three verification modes (verify-each, debugify-each, and JSON delta reporting). Use verify-each to detect the first pass that corrupts debug metadata.
- CLI Flags — Locate the flags for dumping IR at specific pipeline points: --print-after-all, --print-before-all, --filter-print-funcs=, --opt-bisect-limit=. Also the --passes= interface for running individual passes in isolation.
- Optimization Levels — Compare the pass pipeline at different O-levels. If a bug appears at -O2 but not -O1, the diff between their pipelines identifies the suspect passes.
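The -opt-bisect-limit workflow mentioned above is a binary search: find the smallest pass count at which the bug first reproduces. A driver sketch (here `compiles_incorrectly` is a placeholder for "invoke cicc with the given limit and test the resulting PTX"):

```python
# Binary search over the opt-bisect pass count. Invariant: the bug is
# absent at `lo` passes and present at `hi` passes; the loop narrows the
# window to the single pass that introduces the miscompilation.

def bisect_passes(max_passes, compiles_incorrectly):
    lo, hi = 0, max_passes
    if not compiles_incorrectly(hi):
        return None                  # bug does not depend on pass count
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if compiles_incorrectly(mid):
            hi = mid                 # culprit runs at or before `mid`
        else:
            lo = mid                 # culprit runs after `mid`
    return hi                        # first pass count that exhibits the bug
```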
Then isolate the pipeline stage:
- Pipeline Overview — Determine which stage produces the incorrect output. The pipeline is linear: EDG → IR Generation → Libdevice Linking → Optimizer → Codegen → Emission. The stage boundary where output first goes wrong narrows the search.
- NVVM IR Verifier — The 230KB three-layer verifier (module + function + intrinsic). It validates triples, address spaces, atomic restrictions, pointer cast rules, and architecture-gated intrinsic availability. A verification failure after a specific pass is a strong signal.
- Bitcode I/O — If the problem is in bitcode reading/writing (corrupted input, version mismatch), this page documents the reader at sub_9F2A40 and the writer.
Then investigate the suspect pass:
- NVIDIA Custom Passes or the relevant LLVM pass page — Read the algorithm description for the suspect pass. Look for documented edge cases, known limitations, and diagnostic strings that would appear in verbose output.
- NVVMPassOptions — Check whether the suspect pass has enable/disable knobs or threshold parameters that could be adjusted to confirm or rule it out.
- Environment Variables — Some passes are gated by environment variables (including obfuscated ones). Check whether any are influencing behavior.
For correctness issues specific to GPU semantics:
- Address Spaces — Incorrect address space resolution is a common source of silent miscompilation. Global vs. shared vs. local aliasing rules differ from CPU memory models.
- MemorySpaceOpt — This pass resolves generic pointers to specific address spaces. If it infers the wrong space, downstream code will access the wrong memory.
- Alias Analysis — If the alias analysis returns NoAlias for pointers that do alias, DSE/LICM/GVN will misoptimize. The process-restrict propagation is a known source of aggressive alias assumptions.
- StructurizeCFG — PTX requires structured control flow. If structurization produces incorrect flow blocks, the kernel will execute the wrong path.
- Dead Barrier Elimination and Dead Synchronization Elimination — Incorrect elimination of barriers or synchronization can cause race conditions that only manifest under specific warp configurations.
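The address-space inference that MemorySpaceOpt performs (resolving generic pointers to a specific space) can be summarized as a lattice join over reaching definitions: agreement keeps the specific space, disagreement falls back to generic. A minimal sketch under that assumption — the real pass is a fixed-point dataflow over LLVM IR, not this two-liner:

```python
# Lattice-join sketch of generic-pointer resolution: a pointer may be
# specialized only when every reaching definition agrees on its space.
# Space names follow the PTX model (global/shared/local/generic).

GENERIC = "generic"

def join(a, b):
    # meet of two inferred spaces
    return a if a == b else GENERIC

def infer_space(reaching_defs):
    space = None
    for s in reaching_defs:
        space = s if space is None else join(space, s)
    return space or GENERIC   # no information -> stay generic
```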
Reading Path 4: Tuning Performance
Goal: understand what cicc does at each optimization level, which passes are the performance-critical ones, and what knobs control their aggressiveness.
Start with the tuning infrastructure:
- Optimization Levels — The four standard levels (O0–O3) and three fast-compile tiers (Ofcmin/Ofcmid/Ofcmax). This page shows the exact pass pipeline diff between levels, including which passes are added, removed, or reparameterized at each step.
- NVVMPassOptions — The 222-slot per-pass configuration system. This is the primary tuning mechanism. The page documents every slot's type (boolean/integer/string), its default value, and which pass reads it.
- CLI Flags — The flag-to-pipeline routing tables. Locate flags that control pass thresholds (--inline-threshold=, --unroll-count=, etc.) and pass enable/disable toggles.
- LLVM Knobs — The ~1,689 cl::opt flags with their defaults, types, and controlling constructors.
- Environment Variables — Runtime environment overrides, including the obfuscated variables.
Then study the high-impact optimization passes:
- LLVM Optimizer — Understand the two-phase model. Phase I (whole-module) determines inlining decisions, inter-procedural memory space propagation, and global optimization. Phase II (per-function, potentially concurrent) does register-pressure-driven rematerialization and instruction scheduling. Tuning decisions in Phase I cascade into Phase II.
- Inliner Cost Model — Inlining is typically the single highest-impact optimization decision. This page documents the cost model thresholds, the caller/callee size heuristics, and NVIDIA's kernel-specific adjustments.
- LoopVectorize & VPlan — Loop vectorization for GPU SIMT. The VPlan infrastructure, cost model, and the NVIDIA TTI hooks that influence vectorization width decisions.
- Loop Unrolling — Unrolling thresholds, the NVIDIA-specific unroll heuristics, and the interaction with register pressure.
- Rematerialization — NVIDIA's IR-level rematerialization pass (67KB). Trades recomputation for register pressure reduction, which directly affects occupancy on GPU.
- Register Allocation — The greedy RA with occupancy-driven spill heuristics. Register count directly determines maximum occupancy.
- Instruction Scheduling — The scheduler subsystems and their interaction with hardware latency models.
For tensor core workloads specifically:
- Tensor / MMA Codegen — 19 MMA shapes across 11 data types. The instruction selection patterns, register allocation constraints, and WGMMA code generation for Hopper and Blackwell.
- Tensor / MMA Builtins — The builtin-to-intrinsic lowering for wmma, mma, and wgmma operations.
- SM 90 — Hopper — Hopper-specific features: TMA, WGMMA, asynchronous barriers, cluster launch.
- SM 100 — Blackwell — Blackwell-specific features: new MMA shapes, FP4/FP6 support, sparsity.
For understanding performance at the target level:
- GPU Targets — The SM feature gate matrix. Which features are enabled at each architecture level, and how architecture detection routes to different codegen paths.
- NVPTX Target Infrastructure — The TTI hooks that passes query for target-specific costs (memory latency, instruction throughput, register file size).
- Concurrent Compilation — If compile time itself is the bottleneck, understand the Phase II thread pool and GNU Jobserver integration to maximize parallelism.
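The two-phase split referenced throughout this path can be sketched as: Phase I runs once over the whole module, then Phase II optimizes each function independently, optionally on a thread pool. This is a structural illustration only; the pass bodies are placeholders and the real Phase II also coordinates with a GNU jobserver.

```python
# Skeleton of the two-phase model: whole-module decisions first, then
# per-function optimization that is trivially parallel because Phase I
# has already fixed the cross-function state.
from concurrent.futures import ThreadPoolExecutor

def compile_module(functions, jobs=1):
    # Phase I: whole-module decisions (inlining, memory space propagation);
    # sorting stands in for the documented function-priority ordering
    work_list = sorted(functions)

    def phase2(fn):
        # per-function work (rematerialization, scheduling) stand-in
        return f"optimized({fn})"

    if jobs > 1:
        with ThreadPoolExecutor(max_workers=jobs) as pool:
            return list(pool.map(phase2, work_list))   # order-preserving
    return [phase2(fn) for fn in work_list]
```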
Function Map
Address-to-identity lookup table. Confidence: VERY HIGH = string evidence, HIGH = strong structural evidence, MEDIUM = inferred from context/callgraph.
Top Functions by Size
| Function | Address | Size | Confidence |
|---|---|---|---|
| X86 AutoUpgrade (intrinsic rename, leftover from LLVM x86 target) | 0xA939D0 | 457KB | VERY HIGH |
| InstCombine::visitCallInst / visitIntrinsic | 0x10EE7A0 | 396KB | HIGH |
| SelectionDAG LegalizeTypes workhorse (ExpandOp/PromoteOp) | 0x20019C0 | 341KB | HIGH |
| New PassManager pipeline parser (function-level, 268 pass names) | 0x2368220 | 326KB | VERY HIGH |
| EDG constexpr expression evaluator core (124 operator opcodes, 9,075 lines) | 0x786210 | 317KB | VERY HIGH |
| SelectionDAG LegalizeOp main switch | 0x20ACAE0 | 295KB | HIGH |
| SelectionDAGBuilder::visit (IR → DAG) | 0x2081F00 | 261KB | HIGH |
| LLVM IR Verifier (visitCallInst), 298 verification messages | 0xBFC6A0 | 207KB | VERY HIGH |
| X86 Intrinsic Upgrade Helper (broadcastf32x4, compress, etc.) | 0xA8A170 | 195KB | HIGH |
| EDG IL tree walker #1 (297 self-recursive, 87 node types, 305 cases) | 0x7506E0 | 190KB | HIGH |
| EDG declaration specifier parser (393 LABEL_ gotos, NOT switch/case) | 0x7C0F00 | 184KB | HIGH |
| Bitcode Reader parseFunctionBody, 174 error strings | 0x9F2A40 | 182KB | VERY HIGH |
| EDG constexpr top-level dispatch (80 expression types + 62 intrinsics) | 0x77FCB0 | 150KB | HIGH |
| EDG IL tree copier/transformer (callback params a3/a4, template instantiation) | 0x766570 | 148KB | HIGH |
| SelectionDAG LegalizeTypes dispatch (967 case labels) | 0x1FFB890 | 137KB | HIGH |
| EDG declaration specifier state machine (80 token cases, 4,371 lines) | 0x672A20 | 132KB | VERY HIGH |
| je_malloc_conf_init (199 config strings) | 0x12FCDB0 | 129KB | VERY HIGH |
| computeKnownBits / SimplifyDemandedBits | 0x11A7600 | 125KB | VERY HIGH |
| EDG lgenfe_main (282-case CLI switch, 737 config macros, EDG 6.6) | 0x617BD0 | 123KB | VERY HIGH |
| NVVM Builtin Resolution table (post-opt, 770 entries) | 0x126A910 | 123KB | VERY HIGH |
| NVVMPassOptions init (4,786 lines, 221 slots in 4,512-byte struct) | 0x12D6300 | 125KB | VERY HIGH |
| PassOptionRegistry::lookupOption (hash table at registry+120) | 0x12D6170 | — | HIGH |
| PassOptionRegistry::getBoolOption (triple: '1'/true, 't'/true) | 0x12D6240 | — | HIGH |
| writeStringOption (24-byte entry to output struct) | 0x12D6090 | — | HIGH |
| writeBoolOption (16-byte entry to output struct) | 0x12D6100 | — | HIGH |
| 4-stage pipeline orchestrator (LNK/OPT/OPTIXIR/LLC), nvopt+nvllc objects | 0x12C35D0 | 41KB | VERY HIGH |
| Bitcode linker: triple validation, IR version check, symbol size matching | 0x12C06E0 | 63KB | VERY HIGH |
| NVVM IR version checker (nvvmir.version metadata, NVVM_IR_VER_CHK env) | 0x12BFF60 | 9KB | VERY HIGH |
| NVVM container format parser (arch, FTZ, IEEE, opt level extraction) | 0x12642A0 | — | HIGH |
| Concurrent worker entry (dispatches Phase I/II) | 0x12E7B90 | 3KB | HIGH |
| Concurrent compilation entry (jobserver, thread pool, split-module) | 0x12E1EF0 | 51KB | VERY HIGH |
| Function sorting by priority (insertion sort / introsort) | 0x12E0CA0 | — | HIGH |
| Per-function compilation callback (completion handler) | 0x12E8D50 | — | HIGH |
| Phase II per-function optimizer (sets qword_4FBB3B0=2) | 0x12E86C0 | — | HIGH |
| Concurrency eligibility check (counts defined functions) | 0x12D4250 | — | HIGH |
| GNU Jobserver init (parse MAKEFLAGS, create pipe, spawn pthread) | 0x16832F0 | — | HIGH |
| Bitcode Metadata Reader (parseMetadata) | 0xA09F80 | 121KB | VERY HIGH |
| EDG IL function body processor (14 params, scope stack management) | 0x627530 | 114KB | HIGH |
| EDG IL tree walker #2 (427 self-recursive, parallel traversal) | 0x760BD0 | 109KB | HIGH |
| EDG IL codegen (node type dispatch on byte+80, 2,589 lines) | 0x8BA620 | 108KB | HIGH |
| NVVM Builtin Resolution table (pre-opt, 770 entries) | 0x90AEE0 | 107KB | VERY HIGH |
| NVVM Builtin lowering engine (pre-opt, wgmma/tex/surf, 3571 lines) | 0x955A70 | 103KB | HIGH |
| New PassManager pipeline parser (CGSCC-level) | 0x2377300 | 103KB | HIGH |
Pipeline Functions
| Function | Address | Size | Confidence |
|---|---|---|---|
| main() thunk → sub_8F9C90 | 0x4396A0 | tiny | KNOWN |
| Real main: CLI parsing, wizard check, dispatch | 0x8F9C90 | 10KB | VERY HIGH |
| Simple compile entry (Path A) | 0x902D10 | — | HIGH |
| Simple compile entry (Path B) | 0x1262860 | — | HIGH |
| LibNVVM pipeline driver (Path A): 14-phase flow, libdevice linking, API dispatch | 0x905EE0 | 43KB | VERY HIGH |
| LibNVVM compilation entry (Path B): 4-stage pipeline, embedded builtins | 0x1265970 | 48KB | VERY HIGH |
| CUDA C++ Front-End stage (lgenfe): timer "CUDA C++ Front-End" | 0x905880 | 6KB | HIGH |
| NVVM IR Container → Module opt setup | 0x9047E0 | 10KB | HIGH |
| Backend SM config + EDG binding, triple construction | 0x908850 | 10KB | HIGH |
| LNK stage verbose callback | 0x903BA0 | 5KB | HIGH |
| LLC stage verbose callback | 0x903730 | 5KB | HIGH |
| CLI processing (Path A): -arch, -maxreg, -split-compile, -gen-lto | 0x900130 | — | HIGH |
| CLI processing (Path B) | 0x125FB30 | — | HIGH |
| EDG master orchestrator (setjmp recovery, timer callbacks) | 0x5D2A80 | 2KB | VERY HIGH |
| Backend entry: "Generating NVVM IR", file output (.int.c/.device.c/.stub.c), TileIR dlopen | 0x5E3AD0 | 11KB | VERY HIGH |
| Multi-stage orchestrator: .lnk.bc → .opt.bc → .ptx | 0x9685E0 | — | HIGH |
| Architecture detection: -arch → triple fan-out | 0x95EB40 | 15KB | VERY HIGH |
| NVVM option parsing (all -opt-, -llc-, -gen-*, -Xopt) | 0x9624D0 | — | HIGH |
| Flag mapping table (O0-O3, nvcc flag translation) | 0x8FE280 | — | HIGH |
| LLVM cl::opt bulk registration (~1500 options) | 0xB6EEA0 | — | HIGH |
| Timer/context creation ("CUDA C++ Front-End", "LibNVVM") | 0xC996C0 | — | HIGH |
EDG 6.6 Frontend
Core Orchestration
| Function | Address | Size | Confidence |
|---|---|---|---|
| EDG master orchestrator (setjmp recovery, timer callbacks) | 0x5D2A80 | 2KB | VERY HIGH |
| EDG lgenfe_main (282-case CLI switch, 737 config macros, EDG 6.6) | 0x617BD0 | 123KB | VERY HIGH |
| CLI option registration table (~300 options via sub_6101D0) | 0x610260 | 22KB | HIGH |
| Option fetcher (called in main loop of sub_617BD0) | 0x6140E0 | 6KB | HIGH |
| Backend entry: "Generating NVVM IR", file output (.int.c/.device.c/.stub.c), TileIR dlopen | 0x5E3AD0 | 11KB | VERY HIGH |
| Translation unit init (416-byte TU object, keyword init, parser entry) | 0x8D0BC0 | — | VERY HIGH |
| Semantic analysis init (zeroes 6 globals) | 0x8D0F00 | tiny | HIGH |
| Keyword table init (~350 keywords via sub_885C00) | 0x706250 | 30KB | VERY HIGH |
| TU finalization ("Generating Needed Template Instantiations") | 0x709330 | 5KB | HIGH |
| Register single keyword: (token_id, "keyword_string") | 0x885C00 | tiny | HIGH |
AST-to-Source Printer Cluster
| Function | Address | Size | Confidence |
|---|---|---|---|
| Main expression/statement emitter (61 self-references, recursive) | 0x5DBFC0 | 41KB | HIGH |
| Function declaration printer (__sti__, #pragma section, nv_linkonce_odr) | 0x5E13C0 | 44KB | HIGH |
| Statement printer (if/else/for/while/switch/case/return) | 0x5DFD00 | 26KB | HIGH |
| Declaration printer (linkage/storage, __builtin_va_alist) | 0x5D9330 | 12KB | HIGH |
| Scope/block printer (bit-fields, array dimensions) | 0x5DA0F0 | 13KB | HIGH |
| Struct/union/enum printer (#pragma pack) | 0x5DAD30 | 9KB | HIGH |
| Variable initializer printer (memcpy, aggregate init) | 0x5D80F0 | 17KB | HIGH |
| Inline asm printer (volatile, constraints, format specifiers) | 0x5DF1B0 | 11KB | HIGH |
| Identifier printer (keyword mangling: auto→__xauto) | 0x5D5A80 | 7KB | HIGH |
| Top-level declaration dispatcher | 0x5DB980 | 7KB | HIGH |
| Function parameter list printer (__text__/__surf__ annotations) | 0x5D7860 | 6KB | HIGH |
Parser & Declaration Processing
| Function | Address | Size | Confidence |
|---|---|---|---|
| Declaration specifier state machine (while/switch, 80 token cases) | 0x672A20 | 132KB | VERY HIGH |
| Declaration specifier parser (393 LABEL_ gotos, NOT switch/case) | 0x7C0F00 | 184KB | HIGH |
| Top-level declaration/declarator parser | 0x662DE0 | 61KB | HIGH |
| Overloaded function resolution (__builtin_ detection, OMP variants) | 0x6523A0 | 64KB | HIGH |
| Struct/union/class specifier processing | 0x66AC40 | 49KB | HIGH |
| Enum specifier processing | 0x66F9E0 | 39KB | HIGH |
| Block-level declaration/statement processor (largest in 0x630000 zone) | 0x63CAE0 | 67KB | HIGH |
| Declaration statement parsing (35 token refs, 14 diagnostics) | 0x661400 | 28KB | HIGH |
| Function declarator processing (parameter lists, return types) | 0x66DF40 | 24KB | HIGH |
| Declaration specifier combination validator | 0x668EE0 | 26KB | HIGH |
| Storage class specifier processor (_Thread_local validation) | 0x668230 | 9KB | HIGH |
| Primary declarator-to-IL conversion (type kind dispatch) | 0x6333F0 | 26KB | HIGH |
| Name/identifier processing | 0x64BAA0 | 46KB | HIGH |
| Builtin/intrinsic recognition (53 string refs, C++20/23 reflection) | 0x64A920 | 25KB | HIGH |
| IL function body processor (14 params, scope stack management) | 0x627530 | 114KB | HIGH |
| IL statement processing (16 params, IL walker/transformer) | 0x62C0A0 | 63KB | HIGH |
Type System
| Function | Address | Size | Confidence |
|---|---|---|---|
| Type conversion checker (recursive, vector type handling) | 0x713ED0 | 36KB | HIGH |
| Binary operation type checker (11 callers — very central) | 0x7115B0 | 17KB | HIGH |
| Usual arithmetic conversions (10 params) | 0x712770 | 12KB | HIGH |
| Type node comparator (parallel tree walk, canonicalization) | 0x7386E0 | 23KB | HIGH |
| Declaration-level type comparison | 0x739430 | 20KB | HIGH |
| Type-to-string emitter (19 callers, backbone of diagnostics) | 0x74A390 | 29KB | VERY HIGH |
| Constant expression emitter (alignof, sizeof, nullptr, zero-init) | 0x748000 | 45KB | HIGH |
| Declarator emitter (19 callers, paired with sub_74A390) | 0x74D110 | 10KB | HIGH |
| Type node deep-copy | 0x73A9D0 | 19KB | HIGH |
| Declaration node deep-copy (192 bytes = 12 x __m128i) | 0x73F780 | 6KB | HIGH |
| Operator overloadability checker | 0x73CC20 | 9KB | HIGH |
IL Tree Infrastructure
| Function | Address | Size | Confidence |
|---|---|---|---|
| IL tree walker #1 (297 self-recursive, 87 node types, 305 cases) | 0x7506E0 | 190KB | HIGH |
| IL tree walker #2 (427 self-recursive, parallel traversal) | 0x760BD0 | 109KB | HIGH |
| IL tree walker #3 (316 self-recursive) | 0x75C0C0 | 87KB | HIGH |
| IL tree copier/transformer (callback params a3/a4, template instantiation) | 0x766570 | 148KB | HIGH |
| Walker driver/setup (5 callbacks + flags) | 0x759B50 | 31KB | HIGH |
| Copier driver (parallel to sub_759B50) | 0x75B260 | 16KB | HIGH |
| Master walker driver (sets all 6 global callback pointers) | 0x75AFC0 | — | HIGH |
Constexpr Evaluator
| Function | Address | Size | Confidence |
|---|---|---|---|
| EDG constexpr expression evaluator core (124 operator opcodes, 9,075 lines) | 0x786210 | 317KB | VERY HIGH |
| Statement executor (declarations, loops, switch, compound blocks) | 0x795660 | 77KB | HIGH |
| Object member accessor (base classes, virtual bases, union tracking) | 0x79CCD0 | 67KB | HIGH |
| Aggregate initializer evaluator (arrays, structs, designated init) | 0x799B70 | 33KB | HIGH |
| Function call evaluator (argument binding, recursion limits) | 0x79B7D0 | 29KB | HIGH |
| EDG constexpr top-level dispatch (80 expression types + 62 intrinsics) | 0x77FCB0 | 150KB | HIGH |
| Type size calculator (Robin Hood hash memoization, 64MB cap) | 0x7764B0 | 18KB | HIGH |
| Loop/range-for evaluator | 0x7987E0 | 11KB | HIGH |
| Builtin call evaluator (dispatched from case 0x3D) | 0x77C870 | 18KB | HIGH |
| Aggregate initializer evaluator (struct/array/union at compile time) | 0x77D750 | 34KB | HIGH |
Preprocessor
| Function | Address | Size | Confidence |
|---|---|---|---|
| Main preprocessor token scanner (all C/C++ token kinds) | 0x7B8B50 | 59KB | HIGH |
| Macro expansion engine (99-entry predefined table, __VA_OPT__) | 0x81B8F0 | 77KB | HIGH |
| Numeric literal tokenizer (hex float, binary, digit separators) | 0x7B40D0 | 42KB | HIGH |
| Character classification / next-token dispatch (trigraphs, line splices) | 0x7BC390 | 29KB | HIGH |
| String literal scanner (escape processing, raw strings) | 0x7B6B00 | 13KB | HIGH |
| Macro body substitution (__VA_ARGS__, __VA_OPT__) | 0x8200E0 | 22KB | HIGH |
| Source character reader / tokenizer bootstrap | 0x7B2B10 | 16KB | HIGH |
| Preprocessing directive dispatcher | 0x7B8270 | 8KB | HIGH |
Template Engine
| Function | Address | Size | Confidence |
|---|---|---|---|
| Complete template instantiation engine (parameter lists, member iteration) | 0x7A9440 | 40KB | HIGH |
| Template argument type resolution/matching | 0x7410C0 | 42KB | HIGH |
| Template type instantiation handler | 0x743600 | 19KB | HIGH |
| Template instantiation engine (word_4F06418 SM-arch checks) | 0x5EBF70 | 30KB | HIGH |
| Template argument deduction engine (pattern matching, pack expansion) | 0x5FBCD0 | 38KB | HIGH |
Semantic Analysis
| Function | Address | Size | Confidence |
|---|---|---|---|
| Deep semantic analysis (29 SM-arch refs, 27 sub_8D* calls) | 0x6040F0 | 64KB | HIGH |
| Overload resolution main (43 SM-arch refs — highest) | 0x607B60 | 32KB | HIGH |
| Expression parsing/semantic ("Parsing Lambda", __nv_parent) | 0x609F00 | 58KB | HIGH |
| Declaration processing (9 SM version refs) | 0x5FE9C0 | 28KB | HIGH |
| Class hierarchy analysis (vtable layout, diamond inheritance) | 0x5F94C0 | 24KB | HIGH |
| Conversion function lookup (33 sub_8D* calls) | 0x5F4F20 | 21KB | HIGH |
| Operator overload resolution | 0x5F2920 | 23KB | HIGH |
| Declaration elaboration (type-spec strings "A;P", "O;F", "I", "B") | 0x84EC30 | 71KB | HIGH |
| Declaration semantic analysis (148 global refs, highest density) | 0x8708D0 | 63KB | HIGH |
CUDA-Specific Frontend
| Function | Address | Size | Confidence |
|---|---|---|---|
| Memory space attribute processing (__shared__, __constant__, __managed__) | 0x6582F0 | 22KB | HIGH |
| Declaration with memory space annotation (15 diagnostic calls) | 0x65F400 | 24KB | HIGH |
| Atomic builtin name generator (__nv_atomic_fetch_*) | 0x6BBC40 | 34KB | HIGH |
| CUDA device code generation master | 0x804B20 | 28KB | HIGH |
| CUDA registration stub (__cudaRegisterAll, __cudaRegisterEntry) | 0x806F60 | 8KB | VERY HIGH |
| Device stub generator ("__device_stub_%s", __cudaLaunch) | 0x808590 | 11KB | HIGH |
| CUDA kernel launch lowering (cudaGetParameterBufferV2) | 0x7F2B50 | 16KB | HIGH |
| Static init with CUDA memory space (__sti__, __constant__) | 0x801880 | 7KB | HIGH |
| Optimization flag configurator (109 flags from O-level) | 0x60D650 | 6KB | HIGH |
| SM-arch feature gate (56 qword_4F077A8 comparisons) | 0x60E7C0 | 12KB | HIGH |
Name Mangling (Itanium ABI)
| Function | Address | Size | Confidence |
|---|---|---|---|
| Primary mangling entry | 0x8E74B0 | 29KB | HIGH |
| Type mangling | 0x8E9FF0 | 26KB | HIGH |
| Type component mangling (__real__, __imag__) | 0x816460 | 24KB | HIGH |
| Builtin type mangling (DF16_, Cu6__bf16, u6__mfp8) | 0x80E340 | 23KB | HIGH |
| NVIDIA extension mangling (Unvdl, Unvdtl, Unvhdl) | 0x80FE00 | 8KB | HIGH |
| Special type mangling (basic_ostream, allocator substitution) | 0x80C5A0 | 11KB | HIGH |
| Expression mangling | 0x813790 | 13KB | HIGH |
Diagnostics & Support
| Function | Address | Size | Confidence |
|---|---|---|---|
| Diagnostic emitter (severity labels, ANSI color, word-wrap) | 0x681D20 | 37KB | VERY HIGH |
| SARIF JSON diagnostic output (ruleId, level, locations) | 0x6837D0 | 20KB | HIGH |
| Type name formatter (quoted type names for error messages) | 0x67FCF0 | 40KB | HIGH |
| EDG abort / __builtin_unreachable (478 callers!) | 0x721090 | tiny | VERY HIGH |
| Exit with status ("Compilation aborted/terminated") | 0x720FF0 | — | HIGH |
| IR node alloc with context (204 callers) | 0x724DC0 | — | HIGH |
| IR node free (196 callers) | 0x724E30 | — | HIGH |
| Get/create void type singleton at qword_4F07BA8 (145 callers) | 0x72C930 | — | HIGH |
| Arena allocator (63 callers) | 0x7247C0 | — | HIGH |
| IR node hash (polynomial: v10 += ch + 32*v10, 9 callers) | 0x72DB90 | 8KB | HIGH |
| Tracked heap allocation (linked list at qword_4F195F8) | 0x822B10 | — | HIGH |
| Hash table bucket chain finalizer | 0x823310 | — | HIGH |
| EDG heap pool allocator (152-byte, 416-byte, etc. entries) | 0x823970 | — | HIGH |
Class Layout & Vtable
| Function | Address | Size | Confidence |
|---|---|---|---|
| Class layout emitter (__vptr, __v_, __b_ prefixes) | 0x7E3EE0 | 7KB | HIGH |
| Virtual base offset calculator | 0x7E57B0 | 9KB | HIGH |
| Virtual call lowering (node_kind==103) | 0x7E88E0 | 11KB | HIGH |
| Class definition emitter (vtable, nested types, friends) | 0x7E9AF0 | 13KB | HIGH |
| Statement emission mega-function (largest in class layout zone) | 0x7EE560 | 45KB | HIGH |
| Class member emission (__cxa_atexit, __cxa_vec_cctor) | 0x7FEC50 | 48KB | HIGH |
| Function definition emission (ctor initializers, default args) | 0x7FCF80 | 17KB | HIGH |
LLVM cl::opt Registration Infrastructure
| Function | Address | Size | Confidence |
|---|---|---|---|
| Global option counter (atomic increment) | 0xC523C0 | — | HIGH |
| cl::Option::setArgStr(name, len) — Legacy PM | 0xC53080 | — | HIGH |
| cl::Option::addArgument() — Legacy PM | 0xC53130 | — | HIGH |
| cl::OptionCategory getter | 0xC57470 | — | HIGH |
| cl::opt name setter — New PM | 0x16B8280 | — | HIGH |
| cl::opt finalization — New PM | 0x16B88A0 | — | HIGH |
| SmallVector::grow() | 0xC8D5F0 | — | HIGH |
Key Constructors (cl::opt registration)
| Function | Address | Size | Confidence |
|---|---|---|---|
| ctor_010_0: TargetLibraryInfo VecFuncs table (9 vector math libs, 960 string xrefs, NOT decompiled) | 0x4397F0 | ~102KB | VERY HIGH |
| ctor_027: DOES NOT EXIST (phantom, no decompiled file) | 0x456120 | — | DISPROVED |
| ctor_036: LLVM version = "20.0.0" (via LLVM_OVERRIDE_PRODUCER fallback) | 0x48CC90 | 2KB | VERY HIGH |
| ctor_043_0: NVIDIA CICC-specific options (19 opts, XOR cipher hidden flag) | 0x48D7F0 | 30KB | VERY HIGH |
| MASTER pass/analysis registration (~172 init calls) | 0x4A5950 | 7KB | VERY HIGH |
| ctor_107_0: MC/Target options (131 opts, getenv("bar") backdoor) | 0x4A64D0 | 59KB | VERY HIGH |
| ctor_133_0: Known library function table (422 C/POSIX functions) | 0x4B0180 | 29KB | VERY HIGH |
| ctor_145: MISSING from decompilation (too large for Hex-Rays) | 0x4B4360 | ~99KB | HIGH |
| ctor_147_0: PassManager debug/print options | 0x4CC760 | 20KB | HIGH |
| ctor_156_0: CLI infrastructure (help, version, print-options) | 0x4CEB50 | 9KB | HIGH |
| ctor_186_0: Inliner heuristics (NVIDIA: profuseinline, inline-budget) | 0x4DBEC0 | 14KB | HIGH |
| ctor_201: GVN options (NVIDIA: profusegvn, gvn-dom-cache) | 0x4E0990 | 9KB | HIGH |
| ctor_214_0: LSR options (NVIDIA: disable-lsr-for-sharedmem32-ptr) | 0x4E4B00 | 8KB | HIGH |
| ctor_216_0: Loop Unrolling options (largest unroll ctor) | 0x4E5C30 | 21KB | HIGH |
| ctor_259_0: CICC core compiler options (debug-compile, maxreg) | 0x4F0FB0 | 17KB | HIGH |
| ctor_262_0: BranchDist pass options | 0x4F2830 | 10KB | HIGH |
| ctor_263_0: SCEV-CGP pass options (44 strings!) | 0x4F36F0 | 10KB | HIGH |
| ctor_264: IP-MSP knobs | 0x4F45B0 | — | HIGH |
| ctor_267_0: MemorySpaceOpt options (18 strings) | 0x4F54D0 | 10KB | HIGH |
| ctor_277_0: Rematerialization options (39 strings, remat-for-occ) | 0x4F7BE0 | 7KB | HIGH |
| ctor_335_0: MASTER codegen pass configuration (88 strings) | 0x507310 | 29KB | VERY HIGH |
| ctor_356_0: NVPTX SM enum + PTX version table (45 entries, sm_20–sm_121f) | 0x50C890 | 16KB | VERY HIGH |
| ctor_358_0: NVPTX pass enable/disable (43 strings, usedessa) | 0x50E8D0 | 21KB | HIGH |
| ctor_361_0: NV Remat Machine Block options (30 strings, nv-remat-*) | 0x5108E0 | 8KB | HIGH |
| ctor_376_0: LTO/bitcode/plugin options | 0x512DF0 | 39KB | HIGH |
| ctor_377_0: PassBuilder pipeline configuration (77 strings) | 0x516190 | 44KB | HIGH |
| ctor_388_0: Optimizer pipeline enables (enable-ml-inliner, etc.) | 0x51B710 | 15KB | HIGH |
| ctor_600_0: CodeGen/TargetMachine mega-options (118 strings) | 0x57F210 | 59KB | HIGH |
| ctor_605: SM processor table (45 entries, sm_20–sm_121f, PTX version map) | 0x584510 | 3KB | VERY HIGH |
| ctor_609_0: NVPTX backend options (25+ opts, usedessa, enable-nvvm-peephole) | 0x585D30 | 37KB | HIGH |
| ctor_637_0: disable-*Pass flag registration (48 flags) | 0x593380 | — | HIGH |
| ctor_701: MISSING data blob (likely instruction encoding tables) | 0x5A8850 | ~70KB | MEDIUM |
NVIDIA Custom Pass Implementations
| Function | Address | Size | Confidence |
|---|---|---|---|
| MemorySpaceOptPass registration | 0x2CDD6D0 | reg | HIGH |
| MemorySpaceOptPass factory | 0x2CDFF20 | factory | HIGH |
| MemorySpaceOpt core analysis | 0x2CDA660 | 10KB | HIGH |
| MemorySpaceOpt address space inference | 0x2CD7710 | 9KB | HIGH |
| IPMSPPass (interprocedural memory space) registration | 0x1C6FBC0 | reg | HIGH |
| RematerializationPass (IR-level) implementation | 0x1CE7DD0 | 13KB | HIGH |
| Machine Block Rematerialization | 0x2186D90 | 9KB | HIGH |
| BranchDistPass registration | 0x1C4B520 | reg | HIGH |
| LoopIndexSplitPass implementation | 0x1C7B2C0 | 11KB | HIGH |
| NVVMPeepholeOptimizerPass registration | 0x2CAF0F0 | reg | HIGH |
| ByValMem2RegPass | 0x2CD6510 | 350B | HIGH |
| BasicDeadBarrierEliminationPass | 0x2CD2690 | 366B | HIGH |
| CNPLaunchCheckPass (Dynamic Parallelism validation) | 0x1CEBC30 | reg | HIGH |
| PrintfLoweringPass | 0x1CB0B80 | name | HIGH |
| Pass registration master function (all 402+20 passes) | 0x2342890 | 32KB | VERY HIGH |
| Pass name listing (pipeline names for all passes) | 0x233C410 | — | HIGH |
MMA / Tensor Core Emission
| Function | Address | Size | Confidence |
|---|---|---|---|
| MMA instruction operand builder (shapes, types, rounding modes) | 0x21E74C0 | 17KB | VERY HIGH |
| tcgen05 Blackwell scaled MMA operands (scaleD, negA, negB, transA) | 0x21E8CD0 | 2KB | VERY HIGH |
| HMMA store-C (hmmastc), SM ≥ 70 | 0x21DFBF0 | 5KB | HIGH |
| HMMA load-A/B (hmmaldab), SM ≥ 70 | 0x21E0360 | 3KB | HIGH |
| HMMA load-C (hmmaldc), SM ≥ 70 | 0x21E0630 | 3KB | HIGH |
| HMMA MMA (hmmamma), SM ≥ 70 | 0x21E0870 | 4KB | HIGH |
| IMMA load-A/B (immaldab), SM ≥ 72 | 0x21E1280 | 4KB | HIGH |
| IMMA load-C (immaldc), SM ≥ 72 | 0x21E15D0 | 3KB | HIGH |
| IMMA store-C, SM ≥ 72 | 0x21E1830 | 5KB | HIGH |
| IMMA MMA w/ saturation (immamma), SM ≥ 72 | 0x21E1D20 | 6KB | HIGH |
| Binary MMA (bmmamma, b1 .and.popc/.xor.popc), SM ≥ 75 | 0x21E2280 | 6KB | HIGH |
| MMA address-space resolver (opcode → addrspace enum) | 0x21DEF90 | — | HIGH |
| tcgen05 scaled MMA operands (NVPTX backend copy) | 0x35F3E90 | — | HIGH |
| tcgen05.mma full instruction lowering (10 shape variants) | 0x36E9630 | — | HIGH |
| tcgen05.mma SelectionDAG lowering | 0x304E6C0 | — | HIGH |
| tcgen05 infrastructure ops (fence/wait/alloc/dealloc/cp/commit) | 0x30462A0 | — | HIGH |
PTX Emission
| Function | Address | Size | Confidence |
|---|---|---|---|
| Function header orchestrator (.entry/.func, params, attrs, pragmas) | 0x215A3C0 | — | VERY HIGH |
| Kernel attribute emission (.reqntid, .maxntid, cluster, .maxnreg) | 0x214DA90 | — | VERY HIGH |
| Stack frame emission (__local_depot, %SP, %SPL, register decls) | 0x2158E80 | 17KB | VERY HIGH |
| Register class → encoded ID (9 classes, 0x10000000–0x90000000) | 0x21583D0 | — | HIGH |
| Register class → PTX type suffix (.pred, .b16, .b32, .b64, .f32, .f64, .b128) | 0x2163730 | — | HIGH |
| Register class → PTX prefix (%p, %rs, %r, %rd, %f, %fd, %h, %hh, %rq) | 0x21638D0 | — | HIGH |
| GenericToNVVM pass registration ("generic-to-nvvm") | 0x215DC20 | — | VERY HIGH |
| GenericToNVVM pass body (addrspace 0→1 rewriting) | 0x215E100 | 36KB | HIGH |
| Module emission entry (global ctor rejection, DWARF init) | 0x215ACD0 | — | HIGH |
| Global variable emission (texref/surfref/samplerref/data) | 0x2156420 | — | HIGH |
| Atomic opcode emission (13 ops, scope prefix) | 0x21E5E70 | — | VERY HIGH |
| L2 cache-hinted atomic emission (Ampere+) | 0x21E6420 | — | HIGH |
| Memory barrier emission (membar.cta/gpu/sys, fence.sc.cluster) | 0x21E94F0 | — | HIGH |
| Cluster barrier emission (arrive/wait + relaxed) | 0x21E8EA0 | — | HIGH |
| Special register emission (%tid, %ctaid, %ntid, %nctaid) | 0x21E86B0 | — | VERY HIGH |
| Cluster special register emission (15 regs, SM 90+) | 0x21E9060 | — | HIGH |
| Address space conversion + MMA helpers (cvta, rowcol, abtype) | 0x21E7FE0 | — | HIGH |
Hash Infrastructure
| Function | Address | Size | Confidence |
|---|---|---|---|
| wyhash v4 hash function (multi-length dispatch) | 0xCBF760 | — | VERY HIGH |
| Thin wrapper → sub_CBF760 (hash for builtin names) | 0xC92610 | — | HIGH |
| Hash table insert-or-find (quadratic probing, triangular numbers) | 0xC92740 | — | VERY HIGH |
| Hash table find-only (same probing) | 0xC92860 | — | HIGH |
| Rehash at 75% load factor (double or tombstone cleanup) | 0xC929D0 | — | HIGH |
| String entry allocator (length+17, 8-byte aligned) | 0xC7D670 | — | HIGH |
NVVM Builtin Infrastructure
| Function | Address | Size | Confidence |
|---|---|---|---|
| Hash table insertion helper (pre-opt) | 0x90ADD0 | 56 lines | VERY HIGH |
| Builtin dispatcher (pre-opt): name → ID | 0x913450 | 27 lines | VERY HIGH |
| Builtin dispatcher (post-opt): name → ID | 0x12731E0 | 25 lines | VERY HIGH |
| Builtin lowering engine (pre-opt, wgmma/tex/surf, 3571 lines) | 0x955A70 | 103KB | HIGH |
| Builtin lowering engine (post-opt, 3408 lines) | 0x12B3FD0 | 101KB | HIGH |
Register Allocation
| Function | Address | Size | Confidence |
|---|---|---|---|
| Instruction constraint emission (180+ case opcode switch) | 0xB612D0 | 102KB | HIGH |
| SimplifyAndColor phase | 0x1081400 | 13KB | HIGH |
| SelectNodeForRemoval / Briggs criterion (K=15 at 3 locations) | 0x1090BD0 | 10KB | VERY HIGH |
| AssignColorsAndOptimize (address unverified, was erroneously listed as 0x12E1EF0) | 0x10841C0 | 11KB | MEDIUM |
| Operand constraint spec creator (type 14=GPR, 40=FP, 78=vec) | 0xA778C0 | — | HIGH |
| Final instruction emitter with allocated registers | 0xA78010 | — | HIGH |
jemalloc (Statically Linked, v5.3.x)
| Function | Address | Size | Confidence |
|---|---|---|---|
| je_stats_print_arena (per-arena stats, HPA shards) | 0x4134A7 | 83KB | HIGH |
| je_stats_print_bins (18 stat columns per bin) | 0x40F894 | 37KB | HIGH |
| je_stats_general (version, build config, runtime opts) | 0x411419 | 32KB | HIGH |
| je_stats_print (top-level: allocated, active, resident, mapped) | 0x417CBD | 14KB | HIGH |
| je_stats_print_large (large extent class stats) | 0x40EF06 | 13KB | HIGH |
| je_malloc_vsnprintf (custom format printer, avoids reentrancy) | 0x40D5CA | 21KB | HIGH |
| je_mutex_stats_read (mutex profiling counters) | 0x40E5B5 | 7KB | HIGH |
| je_malloc_conf_init (199 config strings) | 0x12FCDB0 | 129KB | VERY HIGH |
Optimizer Pipeline Assembly
Functions discovered during wiki writing (W101–W241). These assemble the LLVM optimization pipeline from NVVMPassOptions slots.
Pipeline Builders
| Function | Address | Size | Confidence |
|---|---|---|---|
| Master pipeline assembler (reads opts struct, ~150 pass-insertion decisions) | 0x12E54A0 | 50KB | VERY HIGH |
| Tier 0 full optimization sub-pipeline (~40 passes, base for O1/O2/O3) | 0x12DE330 | — | VERY HIGH |
| Tier 1/2/3 phase-specific sub-pipeline (phase-conditional pass insertion) | 0x12DE8F0 | — | VERY HIGH |
| Codegen pass dispatch (reads opts[200] optimization threshold) | 0x12DFE00 | 20.7KB | HIGH |
| OPT stage two-phase orchestrator (sets qword_4FBB3B0 to 1 or 2) | 0x12E7E70 | — | VERY HIGH |
| New-PM driver: pipeline name selector (O0/O1/O2/O3/Ofcmin/Ofcmid/Ofcmax) | 0x226C400 | — | HIGH |
| NVPTXTargetMachine creation (NVIDIA options, standalone path) | 0x12F4060 | 16KB | HIGH |
| OptiX IR generation core function | 0x12F9270 | ~6KB | HIGH |
Pass Factories (Pipeline Insertion Order)
Each factory creates a pass instance; referenced from sub_12E54A0, sub_12DE330, and sub_12DE8F0.
| Function | Address | Size | Confidence |
|---|---|---|---|
| NVVMReflect factory (~8 pipeline insertions) | 0x1857160 | — | HIGH |
| SCCP factory | 0x1842BC0 | — | HIGH |
| NVVMVerifier wrapper (creates context, invokes module verifier) | 0x12D4560 | — | HIGH |
| NVVMPredicateOpt factory (AggressiveInstCombine variant) | 0x18A3430 | — | HIGH |
| NVVMPredicateOpt variant / LoopRotate factory | 0x18A3090 | — | HIGH |
| ConstantMerge / GlobalDCE / LICM factory | 0x184CD60 | — | HIGH |
| FunctionAttrs factory (infers readonly, nounwind, etc.) | 0x1841180 | — | HIGH |
| LICM factory (parameter 0 = standard mode) | 0x195E880 | — | HIGH |
| LoopVectorize/SLP factory (7 params: width, thresholds) | 0x19B73C0 | — | HIGH |
| CGSCC standard pipeline factory (InlinerWrapper, 1–5 iterations) | 0x1A62BF0 | — | HIGH |
| PrintModulePass factory (debug dump, params: level, verbose) | 0x17060B0 | — | HIGH |
| JumpThreading / CVP factory (parameter: threshold) | 0x198DF00 | — | HIGH |
| EarlyCSE factory | 0x196A2B0 | — | HIGH |
| SROA factory | 0x1968390 | — | HIGH |
| DCE (DeadCodeElimination) factory | 0x18DEFF0 | — | HIGH |
| Sink/MemSSA factory (3 params: mode, flags) | 0x1869C50 | — | HIGH |
| NVVMLoopOpt/BarrierOpt / IV Demotion factory | 0x18B1DE0 | — | HIGH |
| NVVMIntrinsicLowering factory (level 0 = basic, level 1 = barrier) | 0x1CB4E40 | — | HIGH |
| MemCpyOpt factory | 0x1B26330 | — | HIGH |
| LoopUnroll / SpeculativeExecution factory (2 params) | 0x19C1680 | — | HIGH |
| ADCE (AggressiveDeadCodeElimination) factory | 0x1C76260 | — | HIGH |
| ADCE variant factory (separate pipeline position) | 0x1C6FCA0 | — | HIGH |
| SimplifyCFG factory (2 params: mode, flags) | 0x190BB10 | — | HIGH |
| InstructionSimplify factory | 0x1A7A9F0 | — | HIGH |
| NVVMRematerialization factory (IR-level) | 0x1A13320 | — | HIGH |
| Reassociate factory (parameter: tier) | 0x1B7FDF0 | — | HIGH |
| LoopStrengthReduce factory | 0x19CE990 | — | HIGH |
| NVVMBranchDist factory (two pipeline positions) | 0x1CB73C0 | — | HIGH |
| NVVMSinking2 factory (SM-specific late sinking) | 0x1CC60B0 | — | HIGH |
| NVVMGenericAddrOpt factory (generic address optimization) | 0x1CC71E0 | — | HIGH |
| NVVMReduction factory (SM-specific) | 0x1CC5E00 | — | HIGH |
| NVVMUnreachableBlockElim factory | 0x1CC3990 | — | HIGH |
| NVVMLateOpt factory (Tier 3 only) | 0x1C46000 | — | HIGH |
| NVVMLowerAlloca factory (dual gate: opts[2240] + opts[2280]) | 0x1CBC480 | — | HIGH |
| NVVMLowerBarriers factory (runs between LICM invocations) | 0x1C98160 | — | HIGH |
| Sinking2Pass fast-mode factory (flag=1, Ofcmin pipeline) | 0x18B3080 | — | HIGH |
| VerifierPass factory (late CFG cleanup guard at opts[4464]) | 0x1654860 | — | HIGH |
| NVIDIA loop pass factory (opts[3080] guard) | 0x1922F90 | — | MEDIUM |
| EarlyCSE MemorySSA variant / NVVMBarrierAnalysis factory | 0x18E4A00 | — | HIGH |
| EarlyCSE variant (v=1 if opts[3704]) | 0x1C8A4D0 | — | HIGH |
| NVVMAnnotationsProcessor factory | 0x215D9D0 | — | HIGH |
| NVIDIA Custom Inliner (CGSCC, 20,000-unit per-caller budget) | 0x1864060 | 75KB | VERY HIGH |
NVPTX Backend (SelectionDAG & ISel)
| Function | Address | Size | Confidence |
|---|---|---|---|
| NVPTXTargetLowering::LowerIntrinsicCall (among the largest functions in the binary) | 0x33B0210 | 343KB | VERY HIGH |
| NVPTXDAGToDAGISel::Select (ISel entry, hash-based cost table) | 0x3090F90 | 91KB | VERY HIGH |
| computeKnownBitsForTargetNode (112 opcodes, 399x sub_969240 calls) | 0x33D4EF0 | 114KB | HIGH |
| NVPTXTargetLowering::LowerCall (PTX .param calling convention) | 0x3040BF0 | 88KB | HIGH |
| LLVM standard InlineCostAnalysis (library function) | 0x30DC7E0 | 51KB | HIGH |
| Vector legalization type-split record mapping | 0x3302A00 | — | HIGH |
| Operand type classifier (reads byte_444C4A0) | 0x34961A0 | 26.6KB | HIGH |
NVVM Verifier Subsystem
| Function | Address | Size | Confidence |
|---|---|---|---|
| NVVMModuleVerifier (data layout, address space, triple validation) | 0x2C80C90 | 51KB | HIGH |
| NVVMIntrinsicVerifier (SM gates, types, MMA, atomics, tex/surf) | 0x2C7B6A0 | 143KB | VERY HIGH |
| Frontend verifier (convergent intrinsic SM-version gating) | 0x1C36530 | — | HIGH |
| NVVMIntrinsicLowering core engine (2,460 lines) | 0x2C63FB0 | 140KB | HIGH |
LTO Subsystem
| Function | Address | Size | Confidence |
|---|---|---|---|
| NVModuleSummary builder (ThinLTO, two-phase declaration merge) | 0xD7D4E0 | 74KB | HIGH |
| New PM CGSCC inliner (inside LazyCallGraph framework) | 0x2613930 | 69KB | HIGH |
| IP-MSP module-pass variant (LIBNVVM path, DenseMap-based) | 0x1C6A6C0 | 54KB | HIGH |
| LinkUserModules (wrapper around LLVM Linker::linkModules) | 0x12F5610 | ~4KB | HIGH |
LLVM IR Utility Functions
Common LLVM IR manipulation functions referenced across many passes.
| Function | Address | Size | Confidence |
|---|---|---|---|
| operator new / BumpPtrAllocator (SDNode, BasicBlock, pass objects) | 0x22077B0 | — | HIGH |
| Value::replaceAllUsesWith / salvageDebugInfo | 0xBD84D0 | — | HIGH |
| Instruction::eraseFromParent / SDUse remove from use list | 0xB43D60 | — | HIGH |
| getCalledFunction / BranchInst::getCondition | 0xB43CB0 | — | HIGH |
| Function::hasAttribute(N) (noimplicitfloat, optnone, convergent) | 0xB2D610 | — | HIGH |
| Function::getName / IR node name getter | 0xBD5D20 | — | HIGH |
| PHINode::Create / SDNode alloc variant (80 bytes) | 0xBD2DA0 | — | HIGH |
| hasAttribute(26) (convergent/varargs marker check) | 0xB91C10 | — | HIGH |
| TTI::getInstructionCost (IR-level) / MDString::getString | 0xB91420 | — | HIGH |
| Ref-count decrement on metadata/debug-info | 0xB91220 | — | HIGH |
| Ref-count increment on metadata/debug-info | 0xB96E90 | — | HIGH |
| Value::setName / SetValueName (assigns %name to IR value) | 0x164B780 | — | HIGH |
| IRBuilder::CreateBinOp / SCEV type extension (349x callers) | 0x1623A60 | — | HIGH |
| ReleaseDebugLoc / debug location list removal | 0x161E7C0 | — | HIGH |
| Fatal error emitter ("Broken module found, compilation aborted!") | 0x16BD130 | — | HIGH |
| Create binary OR instruction (opcode 27) | 0x15FB440 | — | HIGH |
| DataLayout::getPointerSizeInBits(addressSpace) | 0x15A9520 | — | HIGH |
| DataLayout::getStructLayout (struct size computation) | 0x15A9930 | — | HIGH |
| SCEV fold/normalize / NVVM AA address-space NoAlias query | 0x146F1B0 | — | HIGH |
| CombineTo / ReplaceAllUsesWith (DAG use-chain + worklist push) | 0xF162A0 | — | HIGH |
| Function cloner (coroutine resume/destroy) | 0xD2E510 | — | HIGH |
| Create runtime library call instruction (OpenMP, MMA, barriers) | 0x921880 | — | HIGH |
| Builtin function call emitter (pre-opt path, EDG builtins) | 0x1285290 | — | HIGH |
| Kernel metadata emitter (cluster_dim, blocksareclusters) | 0x93AE30 | ~5.6KB | HIGH |
| ExpandIntegerResult (type legalization, 632 case labels) | 0x201BB90 | 75KB | HIGH |
Machine-Level Infrastructure
| Function | Address | Size | Confidence |
|---|---|---|---|
| InstrEmitter DenseMap grow / rehash (hash: key*37) | 0x2E29BA0 | — | HIGH |
| TwoAddressInstruction DenseMap (SrcEqClassMap) | 0x1F4E3A0 | — | HIGH |
Binary Layout
This page is a visual guide to navigating the cicc v13.0 binary in IDA Pro. It covers the ELF structure, section layout, subsystem address ranges, embedded data payloads, and the statically linked jemalloc allocator. If you are opening this binary for the first time, start here to orient yourself before diving into individual subsystems.
ELF Overview
CICC is a statically linked, stripped x86-64 ELF binary. There are no dynamic symbol tables, no .dynsym, no DWARF debug info, and no export table. Every function name was removed at build time. IDA Pro recovers 80,562 functions; Hex-Rays successfully decompiles 80,281 of them (99.65%).
| Property | Value |
|---|---|
| File size | 60,108,328 bytes (57.3 MB) |
| Architecture | x86-64, little-endian |
| Linking | Fully static (no .interp, no PLT/GOT) |
| Stripped | Yes, all symbol tables removed |
| Build ID | cuda_13.0.r13.0/compiler.36424714_0 |
| Compiler | Built with GCC (inferred from CRT stubs and .init_array layout) |
| Allocator | jemalloc 5.3.x, statically linked (~400 functions) |
Because the binary is statically linked, libc, libpthread, and libm are all embedded. This inflates the raw function count but also means every call target resolves to a concrete address within the binary itself -- there are no external dependencies at runtime beyond the kernel syscall interface.
Address Space Map
The binary's .text section spans roughly 0x400000 to 0x3C00000. Within that 56 MB range, subsystems occupy contiguous, non-overlapping regions. The map below is the primary orientation tool for IDA Pro navigation.
0x400000 ┌─────────────────────────────────────────┐
│ CRT startup + libc stubs │ ~52 KB
0x40D000 ├─────────────────────────────────────────┤
│ jemalloc stats / vsnprintf │ ~80 KB
0x420000 ├─────────────────────────────────────────┤
│ (gap: misc libc, math, string ops) │ ~64 KB
0x430000 ├─────────────────────────────────────────┤
│ Global constructors (cl::opt reg) │ ~1.6 MB
│ ~1,689 LLVM command-line option objects │
0x5D0000 ├─────────────────────────────────────────┤
│ EDG 6.6 C++ Frontend │ 3.2 MB
│ Parser, constexpr evaluator, IL walker │
0x8F0000 ├─────────────────────────────────────────┤
│ CLI / Real Main / NVVM Bridge │ 520 KB
│ sub_8F9C90 (real main), dual-path dispatch│
0x960000 ├─────────────────────────────────────────┤
│ Architecture detection, NVVM options │ 576 KB
0x9F0000 ├─────────────────────────────────────────┤
│ Bitcode reader (parseFunctionBody) │ ~1 MB
0xAF0000 ├─────────────────────────────────────────┤
│ X86 AutoUpgrade (legacy, 457KB fn) │ ~1 MB
0xBF0000 ├─────────────────────────────────────────┤
│ LLVM IR Verifier │ 500 KB
0xC00000 ├─────────────────────────────────────────┤
│ LLVM Support / ADT library │ ~3.2 MB
│ (see detailed sub-map below) │
0x12D0000├─────────────────────────────────────────┤
│ PassManager / NVVM bridge │ 4.2 MB
│ Pipeline assembly (sub_12E54A0) │
0x12FC000├ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┤
│ jemalloc core (~400 functions) │ ~256 KB
0x1700000├─────────────────────────────────────────┤
│ Backend / machine passes │ 8 MB
│ RegAlloc, Block Remat, Mem2Reg │
0x1F00000├─────────────────────────────────────────┤
│ SelectionDAG │ 2 MB
│ LegalizeTypes (348KB), LegalizeOp │
0x2100000├─────────────────────────────────────────┤
│ NVPTX PTX emission │ 1 MB
0x2340000├─────────────────────────────────────────┤
│ New PM / pass registration │ 768 KB
│ 2,816-line registrar at sub_2342890 │
0x2A00000├─────────────────────────────────────────┤
│ Loop passes │ 4 MB
│ LoopVectorize, SLP, Unroll │
0x3000000├─────────────────────────────────────────┤
│ NVPTX ISel + lowering │ 7 MB
│ 343KB intrinsic switch (sub_33B0210) │
0x3700000├─────────────────────────────────────────┤
│ Machine-level passes (tail) │ ~3 MB
│ BlockPlacement, Outliner, StructurizeCFG │
0x3A00000├─────────────────────────────────────────┤
│ (trailing code, CRT finalization) │
└─────────────────────────────────────────┘
DATA SECTIONS:
0x3EA0080 Embedded libdevice bitcode (Path A) 456 KB
0x420FD80 Embedded libdevice bitcode (Path B) 456 KB
0x4F00000+ Global BSS (cl::opt storage, hash tables, state)
Detailed Subsystem Map at Pass Granularity
The coarse map above partitions the binary into ~18 zones. The following map refines every zone to individual-pass resolution, giving the factory address of each identified pass or subsystem entry point. Addresses prefixed with sub_ are IDA function names. Sizes in parentheses are decompiled C output; actual machine code is typically 2-3x smaller.
Zone 1: CRT, libc, jemalloc stats (0x400000 - 0x42FFFF)
0x400000 _start / CRT entry (ELF entry point)
0x40D5CA sub_40D5CA vsnprintf (jemalloc stats formatting)
0x420000 libc math/string helpers (memcpy, memset, strlen, etc.)
No LLVM or NVIDIA code lives here. Pure runtime support.
Zone 2: Global constructors (0x430000 - 0x5CFFFF)
The global constructors in this zone register ~1,689 cl::opt command-line option objects before main(). Each constructor registers an option string, description, default value, and storage pointer into the global option registry. The .init_array section holds the function pointers to these constructors.
Zone 3: EDG 6.6 C++ Frontend (0x5D0000 - 0x8EFFFF)
The complete Edison Design Group C++ frontend, version 6.6. Contains the lexer, parser, constexpr evaluator, template instantiator, overload resolver, IL walker/copier, diagnostic engine, SARIF output, and CUDA-specific extensions (kernel launch grammar, __shared__/__device__ memory space parsing, atomic builtin stubs).
| Function | Address | Size |
|---|---|---|
| EDG main entry (called from real main) | sub_5D2A80 | |
| Expression parser core | sub_610000-sub_62FFFF | 128 KB |
| Declaration processing | sub_750000-sub_76FFFF | 128 KB |
| Template / constexpr | sub_840000-sub_87FFFF | 256 KB |
| SARIF, diagnostics, keywords | sub_880000-sub_8EFFFF | 448 KB |
Zone 4: CLI / Real Main / Dual-Path Entry (0x8F0000 - 0x9EFFFF)
| Function | Address | Size |
|---|---|---|
| Real main (after CRT/jemalloc init) | sub_8F9C90 | |
| Path A CLI parsing (LibNVVM API mode) | sub_900130 | |
| Path A simple compile entry | sub_902D10 | |
| Path A multi-stage pipeline | sub_905EE0 | 43 KB |
| Path A builtin resolution table | sub_90AEE0 | 109 KB |
| Architecture detection, NVVM option parsing | sub_960000-sub_9EFFFF | 576 KB |
Zone 5: Bitcode Reader / X86 AutoUpgrade / Verifier (0x9F0000 - 0xBFFFFF)
| Sub-range | Contents |
|---|---|
| 0x9F0000-0xAEFFFF | Bitcode reader (sub_A24000 parseFunctionBody ~166KB) |
| 0xAF0000-0xBEFFFF | X86 AutoUpgrade (sub_A939D0 457KB -- legacy intrinsic upgrader) |
| 0xBF0000-0xBFFFFF | LLVM IR Verifier entry points |
Zone 6: LLVM Support Library (0xC00000 - 0xCAFFFF)
1,653 functions. Pure LLVM infrastructure -- no NVIDIA-specific modifications except a single !Flat address space annotation in the sample profile reader at sub_C29E70.
| Sub-range | Functions | Contents |
|---|---|---|
| 0xC00000-0xC0F000 | 65 | IR Verifier (sub_C05FA0 visitInstruction 75KB, sub_C0A940 verify 12KB) |
| 0xC0D4F0 | 1 | sub_C0D4F0 TargetRegistry::lookupTarget (8KB) |
| 0xC0F6D0 | 1 | sub_C0F6D0 IR module linker (48KB) |
| 0xC10000-0xC2FFFF | ~400 | InstrProf reader, Sample Profile reader/writer, hashing |
| 0xC30000-0xC3FFFF | 214 | ImmutableMap/Set, APInt printing |
| 0xC40000-0xC4FFFF | 197 | APInt core arithmetic (div, mul, shift) |
| 0xC50000-0xC5FFFF | 141 | CommandLine parser (cl::opt infrastructure) |
| 0xC60000-0xC6FFFF | 135 | JSON parser, debug counters, error handling |
| 0xC70000-0xC7FFFF | 114 | ConstantRange arithmetic |
| 0xC80000-0xC8FFFF | 194 | SHA-1 hash, regex, SmallVector, sorting |
| 0xC90000-0xC9FFFF | 139 | Timer/profiling, TimeTrace (Chrome trace) |
| 0xCA0000-0xCAFFFF | 186 | YAML lexer/parser, TypeSize, VFS |
Zone 7: NVVM Container, SCEV, DWARF, MC Layer (0xCB0000 - 0x10CFFFF)
This 4 MB zone contains LLVM mid-level infrastructure and the NVVM container format.
| Sub-range | Contents | Key functions |
|---|---|---|
| 0xCB0000-0xCBFA60 | YAML parser/emitter (libyaml) | sub_CB9640 main parser (26KB) |
| 0xCC0130-0xCCABA0 | LLVM Triple parsing | sub_CC0130 Triple_normalize (35KB) |
| 0xCCBB10-0xCDCA30 | NVVM container format | sub_CDD2D0 serialize, sub_CD1D80 deserialize, sub_CCD5F0 version validator (9KB) |
| 0xCD9990 | NVVM options parser (calls 60+ parse helpers) | |
| 0xD60000-0xD82000 | NV Module Summary / LTO | sub_D7D4E0 buildModuleSummary (74KB), sub_D81040 runOnModule (56KB) |
| 0xD83000-0xDFD000 | ScalarEvolution (SCEV) | SCEV framework, AddRecExpr, backedge analysis |
| 0xE00000-0xE0FFFF | DWARF debug info string/enum tables | |
| 0xE10000-0xE2FFFF | Itanium C++ name demangler | sub_E18BB0 parseExpr (47KB) |
| 0xE30000-0xEBFFFF | MC assembler layer | ELF/COFF/MachO section parsers, expression evaluator |
| 0xEC0000-0xED0000 | MC assembler directives | sub_ECB300 ELF section parser (40KB) |
| 0xED0000-0xEF8000 | InstrProf / MemProf reader | Profiling data infrastructure |
| 0xEF8000-0xF05000 | Bitstream remark serialization | |
| 0xF05000-0xF6FFFF | SelectionDAG infrastructure | DAG node creation, SDValue, EVT/MVT helpers |
| 0xF70000-0xF8FFFF | Loop vectorization runtime checks | sub_F77B70 vectorizeLoop (37KB), sub_F72730 canVectorizeMemory (29KB) |
| 0xF90000-0xFCFFFF | SimplifyCFG + code sinking | sub_FB0000 switch table gen, sub_FA0000 speculative exec |
| 0xFD0000-0xFEFFFF | AliasSet, register pressure tracking, CFG graphviz | |
| 0xFF0000-0x101FFFF | Block scheduling, RPO traversal, constant folding | |
| 0x1020000-0x103FFFF | Inline ASM + scheduling model | sub_1035170 CUTLASS kernel detection (41KB) |
| 0x1040000-0x106FFFF | Divergence analysis, DAG utilities, IR linker | |
| 0x1070000-0x10AFFFF | MC object emission, InstructionSimplify | sub_10ACA40 visitAdd (94KB) |
Zone 8: InstCombine Mega-Region (0x10D0000 - 0x122FFFF)
The single largest contiguous pass in the binary. NVIDIA's modified InstCombine spans 1.4 MB of code with three NVIDIA-custom opcodes (0x254D, 0x2551, 0x255F) for proprietary intrinsic folding.
| Sub-range | Contents | Key functions |
|---|---|---|
| 0x10D0000-0x10EFFFF | InstCombine visitors (casts, shifts, memory) | Various visitXxx functions |
| 0x10EE7A0 | InstCombine main visitor | sub_10EE7A0 (405KB / 9,258 lines -- second-largest function in the binary, after sub_A939D0) |
| 0x10F0000-0x1100000 | Sub-visitors for specific opcodes | |
| 0x1100000-0x1170000 | Intrinsic folding, demanded bits | sub_1169C30 intrinsic folder (87KB), sub_11A7600 computeKnownBits (127KB) |
| 0x1180000-0x119FFFF | InstCombine core worklist | sub_1190310 main dispatch (88KB) |
| 0x11A0000-0x11AFFFF | ValueTracking / KnownBits | sub_11AE870 SimplifyDemandedBits |
| 0x11B0000-0x11BFFFF | InstCombine tail (vector, extract/insert) | |
| 0x11D0000-0x11FFFFF | SimplifyLibCalls | Math function optimization |
| 0x11FF000-0x122FFFF | LLVM textual IR parser (LLParser) | |
Zone 9: NVVM Bridge / Builtin System / IR Codegen (0x1230000 - 0x12CFFFF)
This zone is the core NVIDIA bridge between the EDG frontend AST and the LLVM IR optimizer.
| Sub-range | Contents | Key functions |
|---|---|---|
| 0x1230000-0x125FFFF | LLVM IR codegen from AST | Expression, statement, type codegen |
| 0x125FB30 | Path B CLI parsing | sub_125FB30 (standalone/nvcc mode) |
| 0x1262860 | Path B simple compile | sub_1262860 |
| 0x1265970 | Path B multi-stage pipeline | sub_1265970 (48KB) |
| 0x126A7B0 | Builtin lookup helper | sub_126A7B0 |
| 0x126A910 | Builtin registration table | sub_126A910 (126KB) -- registers 717 builtins (IDs 1-770) |
| 0x12B3FD0 | Builtin resolution dispatch | sub_12B3FD0 (103KB) -- giant switch on builtin ID |
| 0x12C06E0 | Bitcode linker | sub_12C06E0 (libdevice linking) |
Zone 10: Pipeline Builder / Pass Options (0x12D0000 - 0x12FFFFF)
The pipeline assembler constructs the complete LLVM pass pipeline by calling factory functions whose addresses are scattered across the entire binary.
| Function | Address | Size |
|---|---|---|
| Module split-range helper | sub_12D3E60 | |
| Pass factory: creates NVIDIA custom pass | sub_12D4560 | 325 B |
| NVVMPassOptions initializer -- populates 222 pass option slots into 4,480-byte struct | sub_12D6300 | 125 KB |
| AddPass -- hash-table-based pass insertion into pipeline | sub_12DE0B0 | 3.5 KB |
| Tier 0 sub-pipeline builder (full optimization, 40 passes) | sub_12DE330 | 4.8 KB |
| Tier 1/2/3 sub-pipeline builder (85-pass superset, tier-gated) | sub_12DE8F0 | |
| Codegen dispatch -- routes to backend machine pass pipeline | sub_12DFE00 | |
| Master pipeline assembler -- 1,553 lines, two major pipelines (normal + fast) | sub_12E54A0 | 49.8 KB |
| Machine pass assembly (Pipeline B fast path) | sub_12EB010 | |
| Machine codegen execution | sub_12EC4F0 | |
| jemalloc core (~400 functions) | sub_12FC000+ | ~256 KB |
| malloc_conf_init (parses 199 config strings from MALLOC_CONF) | sub_12FCDB0 | 129 KB |
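The AddPass mechanism (sub_12DE0B0) pairs a name-keyed hash table with an ordered sequence, making duplicate detection O(1) while preserving execution order. A minimal Python model of that behavior -- the class and method names here are illustrative, not recovered from the binary:

```python
# Toy model of hash-table-based pass insertion (cf. sub_12DE0B0):
# a dict gives O(1) duplicate checks, a list keeps pipeline order.
class PipelineBuilder:
    def __init__(self):
        self._index = {}   # pass name -> first insertion position
        self._order = []   # execution order

    def add_pass(self, name, allow_duplicates=False):
        if name in self._index and not allow_duplicates:
            return False                      # refuse re-insertion
        self._index.setdefault(name, len(self._order))
        self._order.append(name)
        return True

    def pipeline(self):
        return list(self._order)

pb = PipelineBuilder()
pb.add_pass("nvvm-memspace-opt")
pb.add_pass("instcombine")
# NVVMIntrinsicLowering really is inserted multiple times in the pipeline:
pb.add_pass("nvvm-intrinsic-lower", allow_duplicates=True)
pb.add_pass("nvvm-intrinsic-lower", allow_duplicates=True)
print(pb.pipeline())
```

Logging the pass-name argument at each AddPass call (tip 7 under IDA Pro Navigation Tips) recovers exactly this kind of ordered sequence at runtime.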
Zone 11: IR Infrastructure / PassManager (0x1300000 - 0x16FFFFF)
Dense LLVM infrastructure: IR types, constants, instructions, metadata, use-lists, PassManager execution engine, IR linker, bitcode reader, regex, and DataLayout.
| Sub-range | Contents | Key functions |
|---|---|---|
0x1300000-0x135FFFF | IR constants, types, APInt, APFloat | |
0x1360000-0x13FFFFF | IR instructions, basic blocks, functions | sub_1361950 AssumptionCacheTracker |
0x1400000-0x14FFFFF | TargetLibraryInfo, pass scheduling | sub_149CCE0 TLI wrapper, sub_14A04B0 TLI creation, sub_14A3CD0 NVPTX TargetPassConfig |
0x1500000-0x15FFFFF | IR builder, GEP, PHI, branch creation | sub_15F83E0 conditional branch, sub_15F9210 load, sub_15F9650 store |
0x1600000-0x160FFFF | PassManager execution engine | sub_160FB70 PassManager::run, sub_1611EE0 PassManagerBuilder init |
0x1610000-0x162FFFF | Pass scheduling, metadata RAUW | sub_1619140 register target passes, sub_1619BD0 PassManager::finalize |
0x1630000-0x16FFFFF | IR Linker, bitcode reader, regex | sub_16786A0 IRLinker::run (61KB), sub_166A310 parseFunctionBody (60KB) |
Zone 12: InstCombine (NewPM) + Sanitizers + PGO (0x1700000 - 0x17FFFFF)
946 functions. Dominated by the New Pass Manager version of InstCombine (~600 functions, ~3.5 MB decompiled), with sanitizer instrumentation (MSan, TSan, coverage) and PGO/GCov infrastructure.
| Sub-range | Contents | Key functions |
|---|---|---|
| 0x1700000-0x17B0000 | InstCombine (NewPM) | sub_1743DA0 main visitor (168KB), sub_17A9010 liveness (111KB) |
| 0x17B0000-0x17BFFFF | GCov instrumentation | sub_17BF860 coverage notes (53KB) |
| 0x17C0000-0x17CFFFF | PGO indirect-call promotion | sub_17C2DB0 (39KB) |
| 0x17D0000-0x17DFFFF | MemorySanitizer | sub_17DDCE0 shadow propagation (58KB) |
| 0x17E0000-0x17EFFFF | PGO instrumentation | sub_17EEF60 InstrProfiling reader (81KB) |
| 0x17F0000-0x17FFFFF | ThreadSanitizer, SanitizerCoverage | sub_17FF260 TSan entry (51KB), sub_17F91F0 SanCov (44KB) |
| 0x17060B0 | PrintModulePass (debug dump, inserted ~30x in pipeline) | sub_17060B0 |
Zone 13: GVN + Scalar Passes + NVIDIA Custom IR Passes (0x1800000 - 0x1CFFFFF)
This 5 MB zone contains the bulk of LLVM's scalar optimization passes and all of NVIDIA's custom IR-level passes.
GVN family (0x1900000 - 0x193FFFF):
| Function | Address | Size |
|---|---|---|
| GVN::runOnFunction (core fixed-point iteration) | sub_1900BB0 | 83 KB |
| GVN PRE (Partial Redundancy Elimination) | sub_1906720 | 26 KB |
| NewGVN expression printing | sub_1930810 | 3 KB |
| NewGVN core value numbering | sub_1933B40 | 43 KB |
Standard scalar passes (0x1830000 - 0x1AFFFFF):
| Function (pipeline factory call) | Address | Size |
|---|---|---|
| InstructionCombining (Old PM wrapper) | sub_1832270 | |
| TailCallElim / JumpThreading | sub_1833EB0 | |
| FunctionAttrs | sub_1841180 | |
| SCCP (Sparse Conditional Constant Propagation) | sub_1842BC0 | |
| ConstantMerge / GlobalDCE | sub_184CD60 | |
| NVVMReflect | sub_1857160 | |
| IPConstantPropagation / ArgumentPromotion | sub_185D600 | |
| Sink / MemorySSA | sub_1869C50 | |
| NVVMPredicateOpt / SelectionOpt | sub_18A3430 | |
| LoopPass (barrier optimization) | sub_18B1DE0 | |
| DCE (Dead Code Elimination) | sub_18DEFF0 | |
| CorrelatedValuePropagation | sub_18EEA90 | |
| DSE (Dead Store Elimination) | sub_18F5480 | |
| DeadArgumentElimination | sub_18FD350 | |
| SimplifyCFG | sub_190BB10 | |
| LICM / LoopRotate | sub_195E880 | |
| LoopIndexSplit | sub_1952F90 | |
| LoopUnroll / LoopVectorize | sub_197E720 | |
| LoopSimplify / IndVarSimplify | sub_198DF00 | |
| SROA (Scalar Replacement of Aggregates) | sub_198E2A0 | |
| InstCombine variant | sub_19401A0 | |
| SROA variant / LoopUnswitch | sub_19B73C0 | |
| NVIDIA pass (unknown) | sub_19CE990 | |
| NVVMRematerialization (IR-level remat) | sub_1A13320 | |
| NVVMIRVerification | sub_1A223D0 | |
| LLVM standard pass pipeline (parameterized, called ~8x with different configs) | sub_1A62BF0 | |
| LoopIdiomRecognize / IndVarSimplify | sub_1A68E70 | |
| InstructionSimplify / ValueTracking | sub_1A7A9F0 | |
Loop unrolling + switch lowering (0x1B00000 - 0x1B7FFFF):
| Function | Address | Size |
|---|---|---|
| LoopUnroll main driver | sub_1B01A40 | 68 KB |
| Unroll-and-Jam | sub_1B07290 | 55 KB |
| Loop peeling | sub_1B0BF10 | 39 KB |
| Unroll prologue/epilogue generation | sub_1B12B90 | 65 KB |
| Code sinking (".sink.split") | sub_1B51110 | 51 KB |
| SimplifyCFG condition combining | sub_1B5C580 | 30 KB |
| Switch-to-lookup-table transformation | sub_1B60700 | 83 KB |
Loop/SLP vectorizer (0x1B80000 - 0x1BFFFFF):
| Function | Address | Size |
|---|---|---|
| LoopVectorize main driver ("loop-vectorize") | sub_1BB6740 | 43 KB |
| VPlan builder | sub_1BAB460 | 32 KB |
| SLP horizontal reduction ("slp-vectorizer") | sub_1BDDB00 | 47 KB |
| SLP shuffle/reorder engine | sub_1BD0660 | 62 KB |
NVVM module validation + configuration (0x1C00000 - 0x1C3FFFF):
| Function | Address | Size |
|---|---|---|
| NVVM codegen config parser (70+ knobs: AdvancedRemat, CSSACoalescing, DoMMACoalescing, PGO, OCGKnobs) | sub_1C20170 | 33 KB |
| NVVM compile mode parser (WHOLE_PROGRAM_NOABI/ABI, SEPARATE_ABI, opt level, debug info) | sub_1C21CE0 | 28 KB |
| Kernel attribute validator (cluster launch, parameter size, Hopper constraints) | sub_1C32740 | 30 KB |
| NVVM intrinsic lowering (tex/surf/syncwarp/ISBE/MAP/ATTR validation) | sub_1C36530 | 112 KB |
| NVVM module validator (data layout, target triple, UnifiedNVVMIR) | sub_1C3BC10 | 48 KB |
NVIDIA custom IR passes (0x1C40000 - 0x1CFFFFF):
This 1 MB block contains the majority of NVIDIA's proprietary IR-level optimization passes. Most entries below have no upstream LLVM equivalent; the remainder are NVIDIA-modified variants of standard passes (ADCE, EarlyCSE, GVN/LICM).
| Function | Address | Size | Role |
|---|---|---|---|
| Dead Synchronization Elimination -- removes redundant __syncthreads() barriers via fixed-point R/W dataflow | sub_1C47810 | 63 KB | dead-sync-elim |
| Alloca cloning / PHI insertion (mem2reg extension) | sub_1C4D210 | 69 KB | |
| NVIDIA pass helper (dead-sync / common-base infrastructure) | sub_1C585C0 | 39 KB | |
| Common Base Elimination -- removes redundant base address computations | sub_1C5DFC0 | 39 KB | common-base-elim |
| Block-level analysis infrastructure ("Processing", "Block") | sub_1C5FDC0 | 26 KB | |
| Base address bitcast helper ("baseValue", "bitCastEnd") | sub_1C637F0 | 28 KB | |
| Base Address Strength Reduction ("BaseAddressStrengthReduce") | sub_1C67780 | 59 KB | base-addr-sr |
| MemorySpaceOpt loop index analysis ("phi maxLoopInd") | sub_1C6A6C0 | 54 KB | |
| GVN or LICM variant | sub_1C6E800 | | |
| ADCE (Aggressive DCE) | sub_1C6FCA0 | | |
| MemorySpaceOpt function cloning -- specializes generic pointers to global/shared/local | sub_1C70910 | 75 KB | memspace-opt (core) |
| LoopIndexSplit -- splits loops on index conditions (three modes: all-but-one, single-iter, range-split) | sub_1C7B2C0 | 84 KB | loop-index-split |
| Memmove Unrolling -- forward/reverse element copy loops | sub_1C82A50 | 40 KB | lower-aggr-copies |
| Struct/Aggregate Splitting -- element-wise memcpy decomposition | sub_1C86CA0 | 73 KB | lower-aggr-copies |
| EarlyCSE / GVN variant | sub_1C8A4D0 | | |
| FP128/I128 Emulation -- replaces 128-bit ops with __nv_* library calls | sub_1C8C170 | 26 KB | lower-ops |
| MemorySpaceOpt entry (pipeline factory address) | sub_1C8E680 | | nvvm-memspace-opt |
| NVVMLowerBarriers / BarrierLowering | sub_1C98160 | | |
| MemorySpaceOpt address space resolution (warnings for illegal atomics on const/local) | sub_1CA2920 | 32 KB | |
| MemorySpaceOpt secondary resolver | sub_1CA9E90 | 28 KB | |
| Printf Lowering -- lowers printf to vprintf + local buffer packing | sub_1CB1E60 | 31 KB | printf-lowering |
| NVVMIntrinsicLowering (most frequently inserted pass, ~10 occurrences in pipeline) | sub_1CB4E40 | | nvvm-intrinsic-lower |
| NVVMBranchDist | sub_1CB73C0 | | branch-dist |
| RLMCAST transformation (register-level multicast) | sub_1CBFA40 | 75 KB | |
| NVVMSinking2 (NVIDIA enhanced code sinking) | sub_1CC60B0 | | sinking2 |
| IV Demotion -- narrows 64-bit induction variables to 32-bit ("demoteIV", "newBaseIV") | sub_1CD74B0 | 75 KB | iv-demotion |
| NLO (NVIDIA Live Output) helper ("nloNewAdd", "nloNewBit") | sub_1CDC1F0 | 35 KB | |
| Instruction classification / cost model (NLO/remat) | sub_1CDE4D0 | 80 KB | |
| Simplify Live Output (NLO pass -- "nloNewBit") | sub_1CE10B0 | 48 KB | |
| Rematerialization pull-in cost analysis ("Total pull-in cost") | sub_1CE3AF0 | 56 KB | |
| Rematerialization block executor ("remat_", "uclone_" prefixes) | sub_1CE67D0 | 32 KB | |
| NVVMRematerialization main driver -- live-in/live-out pressure analysis per block | sub_1CE7DD0 | 67 KB | remat |
| Final NVVM lowering / intrinsic cleanup | sub_1CEBD10 | | |
| Formal parameter space overflow checker | sub_1CEE970 | 27 KB | |
| NVVMPeephole | sub_1CEF8F0 | | nvvm-peephole |
| Instruction scheduling helper (physical register constraints) | sub_1CFDD60 | 49 KB | |
Zone 14: SelectionDAG ISel / CodeGenPrepare / Backend (0x1D00000 - 0x1EFFFFF)
| Sub-range | Contents | Key functions |
|---|---|---|
| 0x1D00000-0x1D60000 | SelectionDAG ISel core | sub_1D4BB00 bytecode interpreter (97KB, 131-case switch), sub_1D54C20 runOnMachineFunction (72KB, "sdagisel") |
| 0x1D1B0D0 | computeKnownBits | sub_1D1B0D0 (87KB, 62-case ISD switch) |
| 0x1D210A0 | SimplifyDemandedBits | sub_1D210A0 (46KB, 118-case switch, calls NVPTX hooks at sub_1F58D40) |
| 0x1D70000-0x1D7FFFF | CodeGenPrepare | sub_1D73760 address sinking (65KB, "sunkaddr") |
| 0x1D07BB0 | Pre-RA instruction scheduling | sub_1D07BB0 (57 KB) |
| 0x1D80000-0x1DFFFFF | Deque worklist, block splitting | sub_1D7AA30 (74KB, ".unlikely", ".cond.split") |
| 0x1E00000-0x1EFFFFF | Register allocation infrastructure | Greedy RA, live intervals, spill cost |
Zone 15: Backend CodeGen Infrastructure (0x1F00000 - 0x20FFFFF)
| Sub-range | Contents | Key functions |
|---|---|---|
| 0x1F00000-0x1F0C000 | ScheduleDAG infrastructure | sub_1F0A020 DAG builder/emitter (41KB) |
| 0x1F0BF50-0x1F0EBC0 | Shrink Wrapping | sub_1F0DCB0 core analysis (27KB, "shrink-wrap") |
| 0x1F10000-0x1F15000 | SlotIndexes + SpillPlacement | sub_1F10320 "slotindexes", sub_1F12110 "spill-code-placement" |
| 0x1F15000-0x1F1F000 | LiveInterval utilities | sub_1F19E60 "Impossible to implement partial COPY" |
| 0x1F20000-0x1F5FFFF | Register coalescer, VirtRegRewriter | |
| 0x1F58D40 | NVPTX target hook for SimplifyDemandedBits | sub_1F58D40 |
| 0x1F60000-0x1FFFFFF | TwoAddressInstruction, stack protection | |
| 0x2000000-0x20FFFFF | LegalizeTypes | sub_20019C0 (341KB -- third largest function in binary) |
Zone 16: NVPTX Target Backend (0x2100000 - 0x21FFFFF)
| Sub-range | Contents | Key functions |
|---|---|---|
| 0x2100000-0x210FFFF | Register allocation support | sub_210BC20 seedLiveRegs ("regalloc"), sub_210BE60 "ran out of registers" |
| 0x2110000-0x212FFFF | DAG type legalization/promotion | |
| 0x2130000-0x213FFFF | DAG combiners, ISel patterns | |
| 0x2140000-0x214FFFF | NVPTXAsmPrinter | PTX header/kernel emission |
| 0x2150000-0x215FFFF | PTX function/param emission | sub_215D9D0 NVVMAnnotationsProcessor / GenericToNVVM |
| 0x2160000-0x216FFFF | NVPTXTargetMachine | Pass pipeline, SubtargetInfo |
| 0x2170000-0x218AFFF | Atomics lowering, rematerialization (machine-level) | |
| 0x21BC000-0x21BFFFF | Alloca hoisting, image opt | |
| 0x21C0000-0x21CFFFF | MemorySpace lowering (machine-level) | |
| 0x21D0000-0x21DFFFF | DAG lowering mega-function, peephole, prolog/epilog | |
| 0x21E0000-0x21EFFFF | MMA/tensor codegen, atomics, special regs, cluster ops | |
| 0x21F0000-0x21FFFFF | Ldg transform, vec split, mem2reg, register pressure | |
Zone 17: New PM Pass Registration (0x2340000 - 0x23FFFFF)
| Function | Address | Size |
|---|---|---|
| Master pass registration -- registers all 526 passes (121 module + 174 function + 23 loop + 48 MF + analyses) into StringMap | sub_2342890 | ~2,816 lines |
| Print available passes (--print-pipeline-passes) | sub_233C410 | |
| Function pass pipeline text parser | sub_233F860 | |
| Module pipeline text parser | sub_2377300 | |
| Inner function/loop pipeline parser | sub_2368220 | |
| Alias analysis name resolver (globals-aa, basic-aa, scev-aa, tbaa) | sub_233BD40 | |
| Hash table insertion (pass_name -> constructor) | sub_E41FB0 | |
Zone 18: IPO / Attributor / OpenMP Optimization (0x2400000 - 0x29FFFFF)
| Sub-range | Contents | Key functions |
|---|---|---|
| 0x2400000-0x25FFFFF | Attributor framework | sub_251CD10 runTillFixpoint (53KB) |
| 0x2590000-0x265FFFF | Sanitizer instrumentation (ASan, HWASan) | |
| 0x266E000-0x269FFFF | OpenMP target offloading | sub_2686D90 runtime table (215KB, ~160 __kmpc_* entries), sub_26968A0 Generic-to-SPMD transform (61KB, "OMP120") |
| 0x2678420 | OpenMP state machine for generic kernels | sub_2678420 (41 KB) |
| 0x2680940 | Parallel region merging | sub_2680940 (52 KB) |
| 0x26A0000-0x29FFFFF | Coroutine support, LTO infrastructure, PGO lowering | |
Zone 19: Loop Transforms (0x2A00000 - 0x2CFFFFF)
| Function | Address | Size |
|---|---|---|
| LoopPeeling ("llvm.loop.peeled.count") | sub_2A07DE0 | 76 KB |
| LoopRotation (".lr.ph", "h.rot") | sub_2A0CFD0 | 65 KB |
| UnrollLoop main ("loop-unroll", "UnrollCount") | sub_2A15A20 | 85 KB |
| UnrollAndJamLoop ("loop-unroll-and-jam") | sub_2A1CF00 | 58 KB |
| Runtime unrolling (".epil.preheader", ".prol.preheader") | sub_2A25260 | 91 KB |
| IndVarSimplify IV widening ("iv.rem", ".sext", ".zext") | sub_2A76A40 | 67 KB |
| WidenIV / IV transformation | sub_2A79EE0 | 82 KB |
| Dead Synchronization Elimination (island -- the larger copy; see also sub_1C47810) | sub_2C84BA0 | 94 KB |
Note: sub_2C84BA0 is a second copy of the dead synchronization elimination pass located outside the main NVIDIA custom pass zone. This is the 94KB variant analyzed in depth (p2b.6-01), with the four-category fixed-point R/W dataflow algorithm and red-black tree maps.
Zone 20: Codegen Target Options / SelectionDAG Lowering (0x2D00000 - 0x2FFFFFF)
5,217 functions. Contains LLVM TargetMachine option registration and the core SelectionDAG infrastructure used by the NVPTX backend.
| Sub-range | Contents | Key functions |
|---|---|---|
| 0x2D00000-0x2D8FFFF | SelectionDAG core | DAG combine, node creation, legalization helpers |
| 0x2D97F20 | TargetOptions registration (all cl::opt for -march/-mcpu/-mattr/relocation/code model) | sub_2D97F20 (112 KB) |
| 0x2E00000-0x2FFFFFF | SelectionDAG continued | Type legalization, custom lowering, pattern matching |
Zone 21: NVPTX ISel + SelectionDAG Lowering (0x3000000 - 0x36FFFFF)
7 MB. The NVPTX instruction selection and target-specific DAG lowering.
| Sub-range | Contents | Key functions |
|---|---|---|
| 0x3000000-0x328FFFF | DAG node construction, EVT/MVT helpers | |
| 0x3290000-0x32FFFFF | NVPTXTargetLowering | sub_32E3060 LowerOperation dispatcher (111KB), sub_32A1EF0 type legalization (109KB), sub_32D2680 load/store lowering (81KB) |
| 0x3300000-0x33AFFFF | Intrinsic lowering (DAG level) | sub_33B0210 intrinsic switch (343KB) |
| 0x33B0000-0x36FFFFF | ISel pattern helpers, register info | |
Zone 22: NVPTX Instruction Selector / Machine Tail (0x3700000 - 0x3BFFFFF)
| Sub-range | Contents | Key functions |
|---|---|---|
| 0x3700000-0x37AFFFF | Table-driven instruction selector | sub_376DE90 main pattern matcher (138KB -- per-SM opcode legality gating via compressed table at offset 521536) |
| 0x372FEE0 | DAG operand tree copier (recursive) | sub_372FEE0 (104 KB) |
| 0x374DD20 | NVPTX custom lowering entry | sub_374DD20 (67 KB) |
| 0x3900000-0x396FFFF | NVIDIA register pressure / remat (machine-level) | sub_396A6C0 RP reporting ("Register Pressure: N"), sub_3964ED0 ".remat" naming |
| 0x3937240 | ABI Preserve directive emission | sub_3937240 (14 KB) |
| 0x395CFD0 | GEP Splitting pass | sub_395CFD0 (11 KB) |
| 0x395DD20 | DAG pattern computation | sub_395DD20 (66 KB) |
| 0x3970000-0x397FFFF | AsmPrinter / PTX emission | sub_3979400 emitFunctionBody (62KB), sub_397DF10 emitInlineAsm (30KB) |
| 0x3970E40 | BB print + .pragma "nounroll" | sub_3970E40 (18 KB) |
| 0x3980000-0x3BFFFFF | MC layer, DWARF, ELF emission | Object file writers, section management |
Pass Factory Address Summary
The pipeline assembler (sub_12E54A0) calls pass factory functions to construct the pipeline. Each factory address below is called directly from the pipeline builder and uniquely identifies a pass in the binary.
| Factory address | Pass identity | Type |
|---|---|---|
| sub_1654860 | BreakCriticalEdges | F |
| sub_17060B0 | PrintModulePass (debug dump) | M |
| sub_1832270 | InstructionCombining | F |
| sub_1833EB0 | TailCallElim / JumpThreading | F |
| sub_1841180 | FunctionAttrs | M |
| sub_1842BC0 | SCCP | F |
| sub_184CD60 | ConstantMerge / GlobalDCE | M |
| sub_1857160 | NVVMReflect | F |
| sub_185D600 | IPConstantPropagation | M |
| sub_1869C50 | Sink / MemorySSA | F |
| sub_18A3430 | NVVMPredicateOpt | F |
| sub_18B1DE0 | LoopPass (barrier opt) | F |
| sub_18DEFF0 | DCE | F |
| sub_18EEA90 | CorrelatedValuePropagation | F |
| sub_18F5480 | DSE | F |
| sub_18FD350 | DeadArgumentElimination | M |
| sub_190BB10 | SimplifyCFG | F |
| sub_195E880 | LICM / LoopRotate | F |
| sub_1952F90 | LoopIndexSplit | L |
| sub_197E720 | LoopUnroll / LoopVectorize | F |
| sub_198DF00 | LoopSimplify / IndVarSimplify | F |
| sub_198E2A0 | SROA | F |
| sub_19401A0 | InstCombine variant | F |
| sub_19B73C0 | SROA variant / LoopUnswitch | F |
| sub_19CE990 | NVIDIA pass (unknown) | F |
| sub_1A13320 | NVVMRematerialization (IR-level) | F |
| sub_1A223D0 | NVVMIRVerification | M |
| sub_1A62BF0 | LLVM standard pass pipeline (parameterized) | M |
| sub_1A68E70 | LoopIdiomRecognize | F |
| sub_1A7A9F0 | InstructionSimplify | F |
| sub_1B26330 | MemCpyOpt | F |
| sub_1B7FDF0 | Reassociate / Sinking | F |
| sub_1C4B6F0 | AlwaysInliner | M |
| sub_1C6FCA0 | ADCE | F |
| sub_1C8A4D0 | EarlyCSE | F |
| sub_1C8E680 | NVVMMemorySpaceOpt | M |
| sub_1C98160 | NVVMLowerBarriers | F |
| sub_1CB4E40 | NVVMIntrinsicLowering (~10 insertions) | F |
| sub_1CB73C0 | NVVMBranchDist | F |
| sub_1CC60B0 | NVVMSinking2 | F |
| sub_1CE7DD0 | NVVMRematerialization (main) | F |
| sub_1CEBD10 | Final NVVM lowering | F |
| sub_1CEF8F0 | NVVMPeephole | F |
| sub_1CB0F50 | ProfileSummaryInfoWrapper / NVVMModulePass | F |
| sub_12D4560 | NVVMVerifier / ModuleVerifier | M |
| sub_215D9D0 | NVVMAnnotationsProcessor | M |
| sub_149CCE0 | TargetLibraryInfoWrapperPass | M |
| sub_1BFB520 | TargetTransformInfoWrapperPass | F |
| sub_14A7550 | createVerifierPass / BasicAliasAnalysis | M |
| sub_1361950 | AssumptionCacheTracker | M |
Type: M = ModulePass, F = FunctionPass, L = LoopPass.
Embedded Data Payloads
Libdevice Bitcode
Two identical copies of NVIDIA's libdevice are embedded directly in the .rodata section as raw LLVM bitcode. Each copy is approximately 456 KB and contains around 400 math intrinsic implementations (__nv_sinf, __nv_expf, __nv_sqrtf, etc.). The duplication supports the dual-path architecture: Path A (LibNVVM API mode) references one copy at 0x3EA0080; Path B (standalone mode) references the other at 0x420FD80. The bitcode is linked into the user's module during the LNK phase via the bitcode linker at sub_12C06E0.
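Embedded bitcode blobs like these can be located without any symbol information by scanning for the raw LLVM bitcode magic `b"BC\xc0\xde"`. A minimal sketch, demonstrated on a synthetic byte string:

```python
# Locate embedded LLVM bitcode blobs by scanning for the raw bitcode
# magic b"BC\xc0\xde" (the first four bytes of every LLVM bitcode file).
def find_bitcode_offsets(data: bytes):
    magic = b"BC\xc0\xde"
    offsets, pos = [], data.find(magic)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(magic, pos + 1)
    return offsets

# Synthetic image with two "embedded" blobs, mimicking the dual copies
# in .rodata (real offsets differ, of course).
image = b"\x00" * 16 + b"BC\xc0\xde" + b"\x01" * 32 + b"BC\xc0\xde" + b"\x02" * 8
print(find_bitcode_offsets(image))  # -> [16, 52]
```

Applied to the full cicc image, the same scan surfaces both libdevice copies plus any other bitcode payloads.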
String Tables
IDA Pro extracts 188,141 strings from the binary. These fall into several categories:
| Category | Approximate count | Example |
|---|---|---|
| LLVM cl::opt descriptions | ~1,689 | "Enable aggressive reassociation" |
| LLVM error/diagnostic messages | ~5,000 | "Invalid bitcode signature" |
| EDG error messages | ~2,500 | "expected a declaration" |
| LLVM pass names | ~440 | "instcombine", "gvn", "nvvm-memspace-opt" |
| PTX instruction templates | ~800 | "mov.b32 %0, %1;" |
| NVVM builtin names | ~770 | "__nvvm_atom_cas_gen_i" |
| jemalloc config strings | ~200 | "background_thread", "dirty_decay_ms" |
| NVVM container field names | ~144 | "SmMajor", "FastMath.Ftz" |
| Miscellaneous (format strings, assertions) | ~170,000+ | "%s:%d: assertion failed" |
String cross-referencing is the single most productive technique for identifying functions in a stripped binary. The LLVM pass registration pattern is especially reliable: a string like "nvvm-memspace-opt" appears exactly once, in the constructor of that pass, which IDA locates via xref.
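The extraction step itself is simple; a minimal analogue of the string sweep IDA performs, finding printable-ASCII runs of a minimum length together with their offsets in a raw byte image:

```python
import re

# Minimal `strings`-style extractor: runs of printable ASCII (0x20-0x7e)
# of at least min_len bytes, returned with their file offsets.
def extract_strings(data: bytes, min_len: int = 5):
    return [(m.start(), m.group().decode("ascii"))
            for m in re.finditer(rb"[\x20-\x7e]{%d,}" % min_len, data)]

blob = b"\x00\x01nvvm-memspace-opt\x00\xffinstcombine\x00"
for off, s in extract_strings(blob):
    print(hex(off), s)
```

Feeding each extracted offset into an xref query is then exactly the pass-name identification workflow described above.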
NVVM Container Format
The binary includes a proprietary container format for wrapping LLVM bitcode with compilation metadata. The container uses a 24-byte binary header with magic 0x7F4E5C7D, followed by delta-encoded tag/value pairs (only fields that differ from defaults are serialized). There are 144 distinct tag IDs spanning core options (tags 1-39), compression metadata (tag 99), extended target options (tags 101-173), blob data (tags 201-218), and structured hardware descriptors (tags 401-402 for TMA/TCGen05 configurations). Serialization and deserialization are handled by sub_CDD2D0 and sub_CD1D80 respectively.
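A hedged sketch of reading such a container follows. Only the magic value 0x7F4E5C7D, the 24-byte header size, and the delta-encoding principle (only non-default fields serialized) come from the analysis above; the header field layout and the fixed 32-bit tag/value encoding used here are illustrative assumptions, not the recovered on-disk format:

```python
import struct

MAGIC = 0x7F4E5C7D  # container magic recovered from the binary

def read_container(data: bytes):
    # Hypothetical reader: real tag/value encoding is not reproduced here.
    magic, = struct.unpack_from("<I", data, 0)
    if magic != MAGIC:
        raise ValueError("not an NVVM container")
    payload = data[24:]                       # skip the 24-byte header
    tags = {}
    for i in range(0, len(payload) - 7, 8):   # delta-encoded pairs:
        tag, val = struct.unpack_from("<II", payload, i)
        tags[tag] = val                       # only non-default fields appear
    return tags

blob = struct.pack("<I", MAGIC) + b"\x00" * 20 + struct.pack("<II", 1, 75)
print(read_container(blob))  # -> {1: 75}
```

Recovering the true per-tag encodings would mean diffing sub_CDD2D0's serializer output against known inputs.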
jemalloc Integration
NVIDIA statically links jemalloc 5.3.x as the process-wide memory allocator. The jemalloc functions cluster around 0x12FC000 (approximately 400 functions). The configuration initialization function sub_12FCDB0 (129 KB, one of the largest functions in the binary) parses 199 configuration strings from the MALLOC_CONF environment variable.
Key jemalloc entry points visible in the binary:
| Function | Address |
|---|---|
| malloc_conf_init (199 config strings) | 0x12FCDB0 |
| vsnprintf (jemalloc stats formatting) | 0x40D5CA |
| Core arena management, tcache, extent allocator | 0x12FC000 range |
The jemalloc integration is significant for reverse engineering because it means malloc/free calls throughout the binary resolve to jemalloc's arena-based allocator rather than glibc's ptmalloc2. When tracing memory allocation patterns in IDA, look for calls into the 0x12FC000 range.
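jemalloc's documented MALLOC_CONF syntax is a comma-separated list of option:value pairs; a minimal parser mirroring what malloc_conf_init does with each recognized option string (the dict-based model is a simplification -- the real function validates each of the 199 option names):

```python
# Parse a MALLOC_CONF-style string ("key:value,key:value,...") into a dict.
def parse_malloc_conf(conf: str):
    opts = {}
    for item in filter(None, conf.split(",")):  # skip empty segments
        key, _, value = item.partition(":")
        opts[key] = value
    return opts

print(parse_malloc_conf("background_thread:true,dirty_decay_ms:5000"))
```

Setting MALLOC_CONF before launching cicc under a debugger and breaking in sub_12FCDB0 is a quick way to confirm the function's identity.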
Global Constructors
The region from 0x430000 to 0x5CFFFF (~1.6 MB) is dominated by global constructors that execute before main(). The primary purpose of these constructors is LLVM cl::opt registration: approximately 1,689 command-line option objects are initialized, each registering a string name, description, default value, and storage location into LLVM's global option registry.
The .init_array section contains function pointers to these constructors. They execute in linker-determined order and populate a global hash table that sub_8F9C90 (the real main) later queries during CLI parsing. In IDA Pro, navigating to any cl::opt constructor reveals the option name string and its associated global variable, which is invaluable for understanding what flag controls what behavior.
Additional global constructors handle:
- LLVM pass registration (`RegisterPass<T>` and `PassInfo` objects)
- LLVM target initialization (NVPTX target machine factory)
- jemalloc allocator bootstrapping
- EDG frontend static initialization tables
Dual-Path Code Duplication
A distinctive structural feature of the binary is the presence of two near-complete copies of the NVVM bridge and backend entry points. Path A (LibNVVM API mode) lives around 0x90xxxx; Path B (standalone/nvcc mode) lives around 0x126xxxx. Each path has its own:
| Component | Path A | Path B |
|---|---|---|
| Simple compile entry | sub_902D10 | sub_1262860 |
| Multi-stage pipeline | sub_905EE0 (43 KB) | sub_1265970 (48 KB) |
| CLI parsing | sub_900130 | sub_125FB30 |
| Builtin resolution table | sub_90AEE0 (109 KB) | sub_126A910 (123 KB) |
| Embedded libdevice ref | unk_3EA0080 | unk_420FD80 |
| Version string | nvvm-latest | nvvm70 |
In IDA, if you have identified a function in one path, search for a structurally similar function at the corresponding offset in the other path. The code is not byte-identical -- Path B is generally slightly larger due to additional standalone-mode logic -- but the control flow graphs are nearly congruent.
IDA Pro Navigation Tips
When opening cicc in IDA Pro for the first time, the auto-analysis will take several minutes due to the 60 MB size. The following workflow accelerates orientation:
1. Start with strings. Open the Strings window (Shift+F12) and filter for known LLVM pass names (`instcombine`, `gvn`, `nvvm-`). Each xref leads directly to a pass constructor or registration site.
2. Use the address map above. If you are looking at an address in the `0xC00000`-`0x12CFFFF` range, you are in LLVM optimization passes. The `0x3000000`-`0x36FFFFF` range is NVPTX instruction selection. The `0x5D0000`-`0x8EFFFF` range is EDG. Context narrows the search space immediately.
3. Watch for vtable patterns. LLVM passes are C++ classes with virtual methods. IDA's vtable reconstruction reveals inheritance hierarchies. Every `FunctionPass`, `ModulePass`, and `LoopPass` subclass has a vtable with `runOnFunction`/`runOnModule` at a consistent slot offset.
4. Anchor on mega-functions. The largest functions are the easiest to locate and serve as landmarks: `sub_A939D0` (457 KB, X86 AutoUpgrade), `sub_10EE7A0` (396 KB, InstCombine), `sub_20019C0` (341 KB, LegalizeTypes). These anchors partition the address space.
5. Follow the pipeline. Entry at `sub_8F9C90` calls into EDG at `sub_5D2A80`, pipeline assembly at `sub_12E54A0`, and PTX emission starting at `0x2100000`. Tracing callgraph edges from these known entry points maps out the entire compilation flow.
6. Mark jemalloc early. Identifying and labeling the jemalloc cluster at `0x12FC000` prevents wasted time reverse-engineering well-known allocator internals. The 199-string `malloc_conf_init` function is an unmistakable fingerprint.
7. Locate NVIDIA passes via factory addresses. The Pass Factory Address Summary table above maps every pipeline-inserted pass to its constructor address. In IDA, setting a breakpoint at `sub_12DE0B0` (AddPass) and logging the second argument reveals the exact pass insertion order at runtime.
Master Address-Range Map
The definitive quick-reference for "what lives at address X?" Every major address range in the cicc v13.0 binary, sorted by start address, consolidated from all subsystem pages in this wiki.
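The map can also be queried programmatically. A minimal lookup sketch using a handful of ranges copied from the .text table (only a sample of the zones is included here):

```python
import bisect

# "What lives at address X?" -- a sample of (start, end, subsystem) ranges
# from the master map, queried by binary search on sorted start addresses.
ZONES = [
    (0x5D0000, 0x8EFFFF, "EDG 6.6 C++ Frontend"),
    (0xC00000, 0xCAFFFF, "LLVM Support/ADT"),
    (0x10D0000, 0x122FFFF, "InstCombine mega-region"),
    (0x12FC000, 0x133FFFF, "jemalloc core"),
]
STARTS = [z[0] for z in ZONES]

def what_lives_at(addr: int):
    i = bisect.bisect_right(STARTS, addr) - 1
    if i >= 0 and addr <= ZONES[i][1]:
        return ZONES[i][2]
    return "unmapped"

print(what_lives_at(0x10EE7A0))  # -> InstCombine mega-region
```

Extending ZONES with every row of the table below turns this into a one-call replacement for manually scanning the map.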
.text Section (0x400000 - 0x3BFFFFF)
| Start | End | Size | Subsystem | Zone |
|---|---|---|---|---|
| 0x400000 | 0x40CFFF | 52 KB | CRT startup (_start, libc stubs) | 1 |
| 0x40D000 | 0x41FFFF | 80 KB | jemalloc stats (vsnprintf at sub_40D5CA) | 1 |
| 0x420000 | 0x42FFFF | 64 KB | libc helpers (memcpy, memset, strlen, math) | 1 |
| 0x430000 | 0x5CFFFF | 1.6 MB | Global constructors (~1,689 cl::opt registrations, pass/target init) | 2 |
| 0x5D0000 | 0x8EFFFF | 3.2 MB | EDG 6.6 C++ Frontend (parser, constexpr, templates, IL walkers, SARIF, preprocessor) | 3 |
| 0x8F0000 | 0x8FFFFF | 64 KB | Real main / CLI (sub_8F9C90 entry, flag mapping, XOR deobfuscator) | 4 |
| 0x900000 | 0x92FFFF | 192 KB | Path A entry (LibNVVM API: CLI parse, pipeline driver, builtin tables) | 4 |
| 0x930000 | 0x95FFFF | 192 KB | Path A builtins (pre-opt builtin lowering, 770-entry resolution) | 4 |
| 0x960000 | 0x9EFFFF | 576 KB | Architecture detection (-arch fan-out, NVVM option parsing) | 4 |
| 0x9F0000 | 0xAEFFFF | 1 MB | Bitcode reader (parseFunctionBody 166KB, metadata reader 121KB) | 5 |
| 0xAF0000 | 0xBEFFFF | 1 MB | X86 AutoUpgrade (sub_A939D0 457KB -- legacy intrinsic upgrader) | 5 |
| 0xBF0000 | 0xBFFFFF | 64 KB | LLVM IR Verifier (entry points, visitCallInst 207KB) | 5 |
| 0xC00000 | 0xCAFFFF | 704 KB | LLVM Support/ADT (APInt, CommandLine, ConstantRange, JSON, Timer, YAML, VFS) | 6 |
| 0xCB0000 | 0xCBFFFF | 64 KB | YAML parser/emitter (libyaml) | 7 |
| 0xCC0000 | 0xCCFFFF | 64 KB | LLVM Triple parsing (Triple_normalize 35KB) | 7 |
| 0xCCD000 | 0xCDFFFF | 76 KB | NVVM container format (serialize sub_CDD2D0, deserialize sub_CD1D80, 144 tags) | 7 |
| 0xCE0000 | 0xD5FFFF | 512 KB | NVVM options (container validators, option parsers) | 7 |
| 0xD60000 | 0xD82FFF | 140 KB | NV Module Summary / LTO (buildModuleSummary 74KB, runOnModule 56KB) | 7 |
| 0xD83000 | 0xDFFFFF | 500 KB | ScalarEvolution (SCEV) (AddRecExpr, backedge analysis, trip counts) | 7 |
| 0xE00000 | 0xE0FFFF | 64 KB | DWARF debug info (string/enum tables) | 7 |
| 0xE10000 | 0xE2FFFF | 128 KB | Itanium name demangler (parseExpr 47KB) | 7 |
| 0xE30000 | 0xEBFFFF | 576 KB | MC assembler layer (ELF/COFF/MachO section parsers, expression evaluator) | 7 |
| 0xEC0000 | 0xED0000 | 64 KB | MC directives (sub_ECB300 ELF section parser 40KB) | 7 |
| 0xED0000 | 0xEF8000 | 160 KB | InstrProf / MemProf reader (profiling data infrastructure) | 7 |
| 0xEF8000 | 0xF05000 | 52 KB | Bitstream remark serialization | 7 |
| 0xF05000 | 0xF6FFFF | 428 KB | SelectionDAG infrastructure (DAG node creation, SDValue, EVT/MVT helpers) | 7 |
| 0xF70000 | 0xF8FFFF | 128 KB | Loop vectorization runtime checks (vectorizeLoop 37KB, canVectorizeMemory 29KB) | 7 |
| 0xF90000 | 0xFCFFFF | 256 KB | SimplifyCFG + code sinking (switch table gen, speculative exec) | 7 |
| 0xFD0000 | 0xFEFFFF | 128 KB | AliasSet / register pressure (CFG graphviz) | 7 |
| 0xFF0000 | 0x101FFFF | 192 KB | Block scheduling (RPO traversal, constant folding) | 7 |
| 0x1020000 | 0x103FFFF | 128 KB | Inline ASM + scheduling model (CUTLASS kernel detection 41KB) | 7 |
| 0x1040000 | 0x106FFFF | 192 KB | Divergence analysis (DAG utilities, IR linker) | 7 |
| 0x1070000 | 0x10CFFFF | 384 KB | MC object emission + InstructionSimplify (visitAdd 94KB) | 7 |
| 0x10D0000 | 0x122FFFF | 1.4 MB | InstCombine mega-region (main visitor 396KB, KnownBits 125KB, SimplifyLibCalls, LLParser) | 8 |
| 0x1230000 | 0x12CFFFF | 640 KB | NVVM Bridge / IR codegen (AST-to-IR, Path B entry, builtin tables, bitcode linker) | 9 |
| 0x12D0000 | 0x12FBFFF | 176 KB | Pipeline builder (NVVMPassOptions 125KB, AddPass, tier builders, master assembler 50KB) | 10 |
| 0x12FC000 | 0x133FFFF | 256 KB | jemalloc core (~400 functions, malloc_conf_init 129KB) | 10 |
| 0x1340000 | 0x16FFFFF | 3.8 MB | IR infrastructure / PassManager (IR types, constants, instructions, metadata, execution engine, IR linker) | 11 |
| 0x1700000 | 0x17FFFFF | 1 MB | InstCombine (NewPM) + Sanitizers + PGO (MSan, TSan, coverage, GCov) | 12 |
| 0x1800000 | 0x18DFFFF | 896 KB | Standard scalar passes (InstructionCombining, TailCallElim, FunctionAttrs, SCCP, Sink, MemorySSA) | 13 |
| 0x18E0000 | 0x18FFFFF | 128 KB | DCE / CVP / DSE (Dead Code Elimination, CorrelatedValuePropagation, Dead Store Elimination) | 13 |
| 0x1900000 | 0x193FFFF | 256 KB | GVN family (runOnFunction 83KB, PRE 26KB, NewGVN 43KB) | 13 |
0x1940000 | 0x19FFFFF | 768 KB | Scalar passes continued (LICM, LoopRotate, LoopIndexSplit, LoopUnroll, SROA) | 13 |
0x1A00000 | 0x1AFFFFF | 1 MB | NVVMRematerialization / LLVM standard pipeline / InstructionSimplify | 13 |
0x1B00000 | 0x1B7FFFF | 512 KB | Loop unrolling + switch lowering (main driver 68KB, Unroll-and-Jam 55KB, peeling 39KB) | 13 |
0x1B80000 | 0x1BFFFFF | 512 KB | Loop/SLP vectorizer (LoopVectorize 43KB, VPlan 32KB, SLP 47KB+62KB) | 13 |
0x1C00000 | 0x1C3FFFF | 256 KB | NVVM module validation + config (codegen config 33KB, compile mode 28KB, intrinsic lowering 112KB, module validator 48KB) | 13 |
0x1C40000 | 0x1CFFFFF | 768 KB | NVIDIA custom IR passes (dead-sync-elim, common-base-elim, base-addr-sr, memspace-opt, loop-index-split, printf-lowering, iv-demotion, remat, peephole, sinking2, NLO) | 13 |
0x1D00000 | 0x1DFFFFF | 1 MB | SelectionDAG ISel / CodeGenPrepare (bytecode interpreter 97KB, address sinking 65KB) | 14 |
0x1E00000 | 0x1EFFFFF | 1 MB | Register allocation infrastructure (Greedy RA, live intervals, spill cost) | 14 |
0x1F00000 | 0x1FFFFFF | 1 MB | Backend codegen infrastructure (ScheduleDAG, ShrinkWrapping, SpillPlacement, register coalescer, TwoAddressInstruction) | 15 |
0x2000000 | 0x20FFFFF | 1 MB | LegalizeTypes (sub_20019C0 341KB -- third largest function) | 15 |
0x2100000 | 0x21FFFFF | 1 MB | NVPTX target backend (AsmPrinter, PTX emission, MMA/tensor codegen, atomics, TargetMachine) | 16 |
0x2200000 | 0x233FFFF | 1.25 MB | (gap: misc codegen, late passes) | -- |
0x2340000 | 0x23FFFFF | 768 KB | New PM pass registration (master registrar 2,816 lines, 526 passes, pipeline text parser) | 17 |
0x2400000 | 0x258FFFF | 1.6 MB | Attributor framework (runTillFixpoint 53KB) | 18 |
0x2590000 | 0x265FFFF | 832 KB | Sanitizer instrumentation (ASan, HWASan) | 18 |
0x2660000 | 0x269FFFF | 256 KB | OpenMP target offloading (194-entry __kmpc_* table, Generic-to-SPMD 61KB, state machine 41KB) | 18 |
0x26A0000 | 0x29FFFFF | 3.5 MB | Coroutines / LTO infrastructure / PGO lowering / EarlyCSE / SROA (NewPM) | 18 |
0x2A00000 | 0x2CFFFFF | 3 MB | Loop transforms (LoopPeeling, LoopRotation, UnrollLoop, IndVarSimplify, dead-sync-elim island) | 19 |
0x2D00000 | 0x2FFFFFF | 3 MB | Codegen target options / SelectionDAG lowering (TargetOptions 112KB, DAG combine, type legalization) | 20 |
0x3000000 | 0x36FFFFF | 7 MB | NVPTX ISel + DAG lowering (NVPTXTargetLowering 111KB, intrinsic switch 343KB, register info) | 21 |
0x3700000 | 0x37AFFFF | 704 KB | Table-driven instruction selector (main matcher 138KB, per-SM opcode gating) | 22 |
0x37B0000 | 0x38FFFFF | 1.3 MB | Late machine passes (inliner cost model at 0x38576C0, pipeline helpers) | 22 |
0x3900000 | 0x397FFFF | 512 KB | NVIDIA machine-level passes (register pressure, remat, ABI preserve, GEP split, AsmPrinter/PTX emission) | 22 |
0x3980000 | 0x399FFFF | 128 KB | MC layer / DWARF emission (object file writers, DWARF sections at 0x3990000-0x39DF000) | 22 |
0x39A0000 | 0x3BFFFFF | 2.4 MB | Trailing codegen (section management, CRT finalization) | 22 |
.rodata / .data Sections (0x3C00000+)
| Start | End | Size | Contents |
|---|---|---|---|
| 0x3C00000 | 0x3EAFFFF | ~2.7 MB | Read-only data (strings, jump tables, XOR-encrypted env vars at 0x3C23A7B) |
| 0x3EA0080 | 0x3F1FFFF | 456 KB | Embedded libdevice bitcode (Path A) |
| 0x3F252E0 | 0x3F3E6C0+ | varies | NVPTX tables (constraint type table, constraint word table, MVT tables) |
| 0x420FD80 | 0x428FFFF | 456 KB | Embedded libdevice bitcode (Path B) |
| 0x42812C0 | -- | varies | Obfuscated version strings (XOR+ROT13 ciphertext) |
| 0x444C4A0 | 0x4456580+ | varies | MVT tables (operand type, vector element count, scalarized MVT) |
| 0x4F00000+ | -- | large | BSS (cl::opt storage, hash tables, global state) |
Usage
Given an IDA address, find the row whose Start <= address < End. The Subsystem column tells you which component of cicc you are looking at. For pass-level detail within a zone, jump to the corresponding Zone section above.
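The lookup rule above can be sketched with a bisect over sorted Start addresses. The rows here are a small abridged sample of the table, and End values are treated as inclusive (the table's ranges end at 0x...FFFF boundaries):

```python
import bisect

# Abridged sample of the memory map above: (Start, End, Subsystem).
ZONES = [
    (0x5D0000, 0x8EFFFF, "EDG 6.6 C++ Frontend"),
    (0x10D0000, 0x122FFFF, "InstCombine mega-region"),
    (0x2100000, 0x21FFFFF, "NVPTX target backend"),
    (0x3000000, 0x36FFFFF, "NVPTX ISel + DAG lowering"),
]
STARTS = [start for start, _, _ in ZONES]

def zone_for(addr):
    """Return the subsystem whose Start <= addr <= End, or 'unknown'."""
    i = bisect.bisect_right(STARTS, addr) - 1
    if i >= 0 and ZONES[i][0] <= addr <= ZONES[i][1]:
        return ZONES[i][2]
    return "unknown"

print(zone_for(0x786210))  # EDG 6.6 C++ Frontend
```

With the full table loaded, the same function answers any "which subsystem is this IDA address in" query in O(log n).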
Cross-References
- Pipeline Overview -- compilation flow from entry to PTX emission
- LLVM Pipeline -- 526-pass registration table and tier execution order
- Optimizer -- two-phase model, AddPass mechanism, tier system
- Pass Inventory -- complete pass catalog with dedicated deep-dive pages
- NVVMPassOptions -- 222-slot pass configuration system
- Function Map -- address-to-identity lookup table
- CLI Flags -- flag-to-pipeline routing
Methodology
This page documents how the reverse engineering of cicc v13.0 was performed. It serves as both a transparency record -- so readers can assess the confidence of any claim in this wiki -- and as a practical guide for anyone who wants to reproduce or extend the analysis.
Scope and Scale
CICC is a 60 MB stripped x86-64 ELF binary with no debug symbols, no export table, and no DWARF information. The scale of the analysis:
| Metric | Value |
|---|---|
| Total functions detected | 80,562 |
| Functions decompiled | 80,281 (99.65%) |
| Strings extracted | 188,141 |
| LLVM base version | 20.0.0 (internal fork) |
| LLVM pass classes identified | ~402 standard + 35 NVIDIA custom |
| CLI options registered | ~1,689 cl::opt + 222 NVVMPassOptions |
| NVVM builtins catalogued | 770 (IDs 1-770) |
The 281 functions that Hex-Rays could not decompile are predominantly very small thunks, computed-jump trampolines, or hand-written assembly stubs in the CRT startup and jemalloc fast paths. None are in critical compiler logic.
Toolchain
All analysis was performed with IDA Pro 8.x and the Hex-Rays x86-64 decompiler. No dynamic analysis (debugging, tracing, instrumentation) was used -- the entire effort is static analysis of the binary at rest. Supplementary tools:
| Tool | Purpose |
|---|---|
| IDA Pro 8.x | Disassembly, auto-analysis, cross-referencing, type reconstruction |
| Hex-Rays decompiler | Pseudocode generation for all 80,281 recovered functions |
| IDA Python scripting | Bulk string extraction, function size enumeration, xref graph walking |
| Custom Python scripts | Callgraph analysis, module taxonomy, evidence indexing, pipeline tracing |
No runtime instrumentation, no strace/ltrace, no gdb breakpoints. Every finding derives from static analysis of the binary's code and data sections.
Function Identification Strategies
Identifying functions in a stripped binary of this size requires multiple complementary strategies. They are listed below in order of reliability.
String Cross-References (Highest Confidence)
LLVM is a string-rich codebase. Error messages, pass names, option descriptions, and assertion text are compiled into the binary. A string like "Running pass 'NVVMMemorySpaceOpt'" appears at exactly one address in .rodata, and IDA's xref from that string leads directly to the function that prints it. This is the most reliable identification technique and produces VERY HIGH confidence identifications.
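A sketch of this technique over the exported string database. The record fields here are assumptions modeled on the export description later on this page, not the real cicc_strings.json schema:

```python
# Toy records; field names ("addr", "value", "xrefs") are assumed.
strings = [
    {"addr": 0x3C10000, "value": "Running pass 'NVVMMemorySpaceOpt'",
     "xrefs": [0x1C45A10]},
    {"addr": 0x3C10040, "value": "dirty_decay_ms", "xrefs": [0x1301000]},
]

def anchor_functions(needle):
    """Collect functions referencing any string that contains needle."""
    hits = set()
    for rec in strings:
        if needle in rec["value"]:
            hits.update(rec["xrefs"])
    return sorted(hits)

print([hex(a) for a in anchor_functions("NVVMMemorySpaceOpt")])  # ['0x1c45a10']
```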
Specific high-value string patterns:
- LLVM pass registration: `"instcombine"`, `"gvn"`, `"nvvm-memspace-opt"` -- each appears in exactly one `RegisterPass` constructor or `PassInfo` initializer.
- `cl::opt` names: `"-nvvm-enable-remat"`, `"-nvvm-branch-dist-threshold"` -- each names a global variable and its registration constructor.
- Error messages with context: `"parseFunctionBody: ..."` (174 unique error strings in the bitcode reader), `"visitCallInst: ..."` (298 verification messages in the verifier).
- Timer names: `"CUDA C++ Front-End"`, `"LibNVVM"`, `"Optimizer"` -- appear in timer-creation calls that bracket pipeline stages.
- EDG error templates: `"expected a %s"`, `"declaration not allowed here"` -- 2,500+ diagnostic strings anchoring the frontend parser.
LLVM Pass Registration Patterns (Very High Confidence)
Every LLVM pass follows a predictable structural pattern. A pass class has a vtable with virtual methods at fixed offsets (runOnFunction at slot N, getAnalysisUsage at slot M). The pass registers itself via a global constructor that stores a PassInfo object containing the pass name string, the pass ID address, and a factory function pointer. By enumerating all .init_array entries that write a PassInfo-shaped structure, all ~437 passes were catalogued systematically.
The New Pass Manager (at sub_2342890, a 2,816-line registrar function) contains a massive string-to-pass-factory dispatch table with ~268 pass name entries. Decompiling this single function yields the name-to-address mapping for every New PM pass in the binary.
Vtable Analysis (High Confidence)
LLVM's class hierarchy is deep and regular. Pass -> FunctionPass -> LoopPass, Pass -> ModulePass, etc. Each level adds virtual methods at predictable vtable slots. By reconstructing vtable layouts (finding pointers to __cxa_pure_virtual for abstract methods, then tracing concrete overrides), the class hierarchy was reconstructed without debug symbols.
For the NVPTX backend specifically, vtable analysis identified NVPTXTargetLowering (2.3 MB of lowering logic), NVPTXInstrInfo, NVPTXRegisterInfo, and NVPTXFrameLowering as distinct classes with their own method tables.
Callgraph Propagation (High Confidence)
Once a function is identified with high confidence, its callees and callers gain contextual identity. If sub_12E54A0 is the pipeline assembly function (confirmed by string refs to pass names it registers), then the functions it calls to create individual passes are the pass factory functions. This propagation is transitive: identifying a factory function identifies its return type's vtable, which identifies the pass's runOnFunction method.
The pipeline orchestrator at sub_12C35D0 (41 KB) is a particularly productive anchor: it calls into the LNK, OPT, OPTIXIR, and LLC stages in sequence, and each stage's entry point was identified by following its callgraph edges.
Size and Structural Fingerprinting (Medium Confidence)
Some functions are identifiable by their size and structural characteristics alone. LLVM's InstCombine::visitCallInst is famously enormous (396 KB in this binary) because it handles every LLVM intrinsic. The SelectionDAG type legalizer (sub_20019C0, 341 KB, the third-largest function in the binary) contains a switch with 967 case labels. These mega-functions have no structural equivalents and can be identified by size alone with reasonable confidence.
Similarly, the EDG frontend's constexpr evaluator (sub_786210, 317 KB) is identifiable by its 124 case labels corresponding to C++ operator opcodes -- a characteristic that matches the known EDG evaluator design.
Known Library Fingerprinting (Medium Confidence)
jemalloc was identified by its 199 configuration string names ("background_thread", "dirty_decay_ms", "narenas", etc.), which are unique to jemalloc's malloc_conf_init function. Once the allocator library was identified, its ~400 functions were bulk-labeled, removing them from the analysis scope.
The X86 AutoUpgrade function (sub_A939D0, 457 KB) is an LLVM artifact -- leftover x86 intrinsic renaming code that ships in every LLVM-based binary regardless of target. It was identified by its intrinsic name strings ("llvm.x86.sse2.*", "llvm.x86.avx.*") and excluded from NVPTX-specific analysis.
Confidence Levels
Every function identification in this wiki carries one of four confidence levels:
| Level | Meaning | Basis |
|---|---|---|
| KNOWN | Identity is certain | Direct string evidence naming the function, or the function is a trivial thunk to a known target |
| VERY HIGH | Effectively certain | Multiple corroborating string references, structural match to known LLVM code, consistent callgraph position |
| HIGH | Strong identification | Single strong indicator (vtable match, size fingerprint, callgraph position) corroborated by context |
| MEDIUM | Probable identification | Inferred from callgraph context, parameter patterns, or structural similarity without direct string evidence |
Approximately 60% of identified functions are VERY HIGH or KNOWN confidence. The remaining 40% are HIGH or MEDIUM, concentrated in areas with fewer string anchors (machine-level passes, register allocation internals, EDG IL tree walkers).
Analysis Pipeline and Scripts
The manual IDA Pro work was augmented by a systematic scripted pipeline that processed the exported IDA databases into structured evidence. The pipeline operates in two phases: L0 (foundation) builds indices and classifies all 80,562 functions automatically, and L1 (module analysis) organizes functions into per-module directories with metadata for human review.
All scripts live in cicc/scripts/. The pipeline requires four JSON databases exported from IDA: cicc_functions.json (80,562 function records), cicc_strings.json (188,141 string records), cicc_xrefs.json (cross-reference records), and cicc_callgraph.json (call edge records). These exports are stored in cicc/databases/.
L0 Foundation Pipeline
The L0 pipeline runs as a single sequential batch via scripts/run_foundation_analysis.sh. Each step depends on the output of the previous step.
Step 0: Extract Wiki Knowledge (foundation/00_extract_wiki_knowledge.py)
Scans all existing wiki markdown files for hex addresses (regex \b0x[0-9a-fA-F]{6,}\b) and builds a ground-truth mapping of address-to-module from prior manual analysis. This seed data provides the highest-confidence module assignments (100% confidence) used to bootstrap the automated classifier.
Output: foundation/taxonomy/modules/wiki_known_functions.json, wiki_module_addresses.json.
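The harvesting step reduces to the quoted regex; a toy illustration:

```python
import re

# The regex quoted above: hex constants with 6 or more digits.
ADDR_RE = re.compile(r"\b0x[0-9a-fA-F]{6,}\b")

page = "The pipeline orchestrator at sub_12C35D0 lives at 0x12C35D0; see 0x8F9C90."
print(ADDR_RE.findall(page))  # ['0x12C35D0', '0x8F9C90']
```

Note that `sub_12C35D0` is not matched: the pattern requires a literal `0x` prefix, so symbol names containing hex digits do not pollute the address map.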
Step 1: Build Fast Lookup Indices (foundation/01_build_indices.py)
Loads the three IDA JSON databases (functions, strings, xrefs) and builds four pickle-serialized indices for O(1) lookup in subsequent steps:
- `addr_to_func.pkl` -- address to function metadata (name, size, instruction count, library/thunk flags).
- `string_to_xrefs.pkl` -- string address to string value and xref list.
- `func_to_callers.pkl` -- function name to list of caller names.
- `func_to_callees.pkl` -- function name to list of callee names.
Output: foundation/indices/.
Step 2: Classify Strings (foundation/02_classify_strings.py)
Applies four regex-based pattern sets to all 188,141 strings, classifying each into one or more semantic categories:
- Error messages: strings matching `error`, `failed`, `invalid`, `unsupported`, `expected`, etc.
- Optimization passes: strings matching `pass`, `optimize`, `transform`, `inline`, `unroll`, `gvn`, `licm`, etc.
- Architecture features: strings matching `sm_\d+`, `tensor`, `warp`, `FP4`, `blackwell`, `hopper`, etc.
- Debug messages: strings matching `debug`, `trace`, `dump`, `verbose`.
Each classified string retains its address and xref list, so the classifier output doubles as a "which functions reference optimization-related strings" index.
Output: foundation/taxonomy/strings/error_messages.json, optimization_passes.json, architecture_features.json, debug_messages.json, extracted_pass_names.json.
Step 3: Build Module Taxonomy (foundation/03_build_module_taxonomy.py)
The core classification engine. Assigns each of the 80,562 functions to one of eight compiler subsystem modules (or unknown) using four strategies applied in decreasing confidence order:
- Wiki ground truth (100% confidence) -- addresses found in wiki pages in Step 0.
- String content analysis (80% confidence) -- functions whose string xrefs match module-specific keyword patterns (e.g., a function referencing `"tensor"`, `"mma"`, or `"tcgen"` strings is classified as `tensor_core_codegen`).
- Call proximity propagation (30-60% confidence, 3 iterations) -- unclassified functions are assigned to the module voted by their callers (weighted 2x) and callees. A minimum of 2 votes is required. Each iteration propagates classifications outward from already-classified functions.
- Code location heuristics (40% confidence) -- address range rules for known code regions (e.g., `0x2F00000`-`0x3000000` maps to `register_allocation`).
The eight modules are: optimization_framework, register_allocation, compilation_pipeline, ptx_emission, instruction_selection, error_handling, tensor_core_codegen, architecture_detection.
Output: foundation/taxonomy/modules/function_to_module_map.json, module_list.json.
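The call-proximity voting (strategy 3) can be sketched on a toy graph. The data is illustrative, and treating weights as votes is an assumption about the script's exact logic:

```python
from collections import Counter

# Toy call graph; seed labels come from strategies 1-2.
callers = {"f3": ["f1", "f2"], "f4": ["f3"]}
callees = {"f3": ["f4"], "f4": []}
seed = {"f1": "ptx_emission", "f2": "ptx_emission"}

def propagate(labels, iterations=3, min_votes=2):
    labels = dict(labels)
    for _ in range(iterations):
        for fn in set(callers) | set(callees):
            if fn in labels:
                continue
            votes = Counter()
            for c in callers.get(fn, []):   # caller votes weighted 2x
                if c in labels:
                    votes[labels[c]] += 2
            for c in callees.get(fn, []):   # callee votes weighted 1x
                if c in labels:
                    votes[labels[c]] += 1
            if votes:
                module, n = votes.most_common(1)[0]
                if n >= min_votes:          # minimum of 2 votes required
                    labels[fn] = module
    return labels

print(propagate(seed))
```

After the first iteration `f3` is labeled from its two known callers; a later iteration then reaches `f4` through the newly labeled `f3`, which is exactly the outward propagation the step describes.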
Step 4: Analyze Call Graph (foundation/04_analyze_callgraph.py)
Computes three structural properties of the call graph:
- Entry points -- functions with zero callers and nonzero callees (top 100 by callee count). These are pipeline entry points, API functions, or global constructors.
- Leaf functions -- functions with zero callees and nonzero callers (top 1,000 by caller count). These are utility functions, allocators, and assertion handlers.
- Hot paths -- functions ranked by caller count (top 1,000). The highest-traffic functions in the binary.
Output: foundation/callgraph/entry_points.json, leaf_functions.json, hot_paths.json.
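On a toy caller/callee map, the first two queries reduce to two comprehensions (illustrative data only; the real script reads the Step 1 pickles):

```python
# Toy caller/callee maps.
func_to_callers = {"main": [], "helper": ["main", "init"], "init": []}
func_to_callees = {"main": ["helper"], "helper": [], "init": ["helper"]}

# Entry points: zero callers, nonzero callees.
entry_points = [f for f, cs in func_to_callers.items()
                if not cs and func_to_callees.get(f)]
# Leaf functions: zero callees, nonzero callers.
leaf_functions = [f for f in func_to_callees
                  if not func_to_callees[f] and func_to_callers.get(f)]

print(sorted(entry_points))  # ['init', 'main']
print(leaf_functions)        # ['helper']
```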
Step 5: Assign Priorities (foundation/05_assign_priorities.py)
Computes a composite priority score for each function to guide analysis effort allocation. The scoring formula:
- Size component: 1000 points for functions over 10 KB, 700 for 5-10 KB, 400 for 2-5 KB, 200 for 1-2 KB, 100 for 500 B-1 KB.
- Call frequency component: 500 points for 1000+ callers, 300 for 500+, 150 for 100+, 75 for 50+.
- Named function bonus: 200 points if the function has a recovered name (not a `sub_` placeholder).
- Critical module bonus: 300 points if the function belongs to a critical module (compilation_pipeline, tensor_core_codegen, architecture_detection, register_allocation, instruction_selection, ptx_emission).
Functions scoring 1000+ are tier CRITICAL, 500+ are HIGH, 200+ are MEDIUM, below 200 are LOW.
Output: foundation/priorities/scoring_report.json, critical.json, high.json, medium.json, low.json.
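A sketch of the scoring formula. Treating 1 KB as 1,024 bytes is an assumption; the text gives only the KB bands:

```python
def priority_score(size_bytes, callers, named, critical_module):
    """Composite score per Step 5 (byte thresholds assumed, 1 KB = 1,024 B)."""
    score = 0
    if size_bytes > 10 * 1024:
        score += 1000
    elif size_bytes > 5 * 1024:
        score += 700
    elif size_bytes > 2 * 1024:
        score += 400
    elif size_bytes > 1024:
        score += 200
    elif size_bytes > 512:
        score += 100
    if callers >= 1000:
        score += 500
    elif callers >= 500:
        score += 300
    elif callers >= 100:
        score += 150
    elif callers >= 50:
        score += 75
    if named:
        score += 200
    if critical_module:
        score += 300
    return score

def tier(score):
    if score >= 1000: return "CRITICAL"
    if score >= 500:  return "HIGH"
    if score >= 200:  return "MEDIUM"
    return "LOW"

s = priority_score(12 * 1024, 600, named=True, critical_module=True)
print(s, tier(s))  # 1800 CRITICAL
```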
Step 6: Generate Coverage Tracker (foundation/06_generate_coverage_tracker.py)
Aggregates all prior outputs into a master JSON tracker that records, per module and per function, the analysis status (pending/in-progress/complete), the assigned analyst, and the evidence quality score. This tracker serves as the coordination database for the L1 phase.
Output: foundation/coverage/tracker.json.
L1 Module Analysis Pipeline
The L1 pipeline runs via scripts/run_l1_programmatic.sh and requires L0 completion. It organizes CRITICAL and HIGH priority functions into per-module directories for systematic human review.
Step 1: Create Module Structure (modules/01_create_module_structure.py)
Creates the directory tree modules/{module}/functions/{critical,high}/ for each of the eight modules. MEDIUM and LOW tiers are intentionally excluded from L1 to focus effort on the most important functions.
Step 2: Extract Function Metadata (modules/02_extract_function_metadata.py)
For each CRITICAL and HIGH function, creates a directory modules/{module}/functions/{tier}/{address}/ containing a metadata.json file with: address, name, module, priority score, size, call frequency, scoring reasons, top 50 callers, top 50 callees, and paths to decompiled/disassembly/CFG files if they exist on disk.
Step 3: Generate Module READMEs (modules/03_generate_module_readmes.py)
Generates a skeleton README.md for each module with function counts, analysis progress tracking fields, and section headings for purpose, key functions, integration points, and data structures. These serve as the starting point for human-written module documentation.
Standalone Analysis Scripts
Six additional scripts perform targeted analyses independent of the L0/L1 pipeline:
analyze_nvvm_pipeline.py -- Loads the NVVM call graph (nvvm_callgraph.json, exported from the LibNVVM shared object analysis) and traces the compilation flow from nvvmCompileProgram. Identifies NVVM API entry points, finds LLVM optimization pass function symbols, traces call paths to depth 10, identifies hub functions (nodes with in-degree or out-degree above 10), and extracts the optimization pass ordering reachable from the compile entry point.
deep_pipeline_trace.py -- Performs deep BFS traversal (up to depth 15, width 100 per level) from nvvmCompileProgram through the NVVM call graph. Annotates each function with structural characteristics (LEAF, HUB, FANOUT, FANIN) and groups results by call depth to reveal the pipeline's stage boundaries. Also traces from secondary API entry points (nvvmVerifyProgram, nvvmAddModuleToProgram, nvvmCreateProgram).
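The depth-grouped BFS can be sketched as follows (toy graph; the real trace caps depth at 15 and width at 100 per level):

```python
from collections import deque

# Toy call graph rooted at the compile entry point.
graph = {
    "nvvmCompileProgram": ["parse", "optimize"],
    "parse": ["lex"], "optimize": ["gvn", "licm"],
    "lex": [], "gvn": [], "licm": [],
}

def by_depth(root, max_depth=15):
    """Group reachable functions by call depth to expose stage boundaries."""
    depths, seen = {}, {root}
    queue = deque([(root, 0)])
    while queue:
        node, d = queue.popleft()
        depths.setdefault(d, []).append(node)
        if d < max_depth:
            for nxt in graph.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, d + 1))
    return depths

print(by_depth("nvvmCompileProgram"))
```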
extract_pipeline_structure.py -- Parses the 188,141 strings database for disable-*Pass patterns and Disable * description strings to extract the complete list of optimization passes by name. Categorizes passes into groups (Dead Code Elimination, Loop Optimizations, Inlining, Memory, NVVM-Specific, Lowering, etc.) and reconstructs the 13-stage compilation pipeline from NVVM module loading through PTX code generation. Also extracts compilation mode information (fast-compile, split-compile, partial-link).
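A minimal version of the disable-flag scan. The exact regexes the script uses were not recovered, so these patterns are illustrative:

```python
import re

# Toy string pool; the real scan covers all 188,141 extracted strings.
pool = [
    "disable-licm", "disable-loop-unroll",
    "Disable Loop Invariant Code Motion",
    "unrelated text",
]

flag_re = re.compile(r"^disable-([a-z0-9-]+)$")   # disable-* flag names
desc_re = re.compile(r"^Disable (.+)$")            # "Disable *" descriptions

passes = sorted({m.group(1) for s in pool if (m := flag_re.match(s))})
descriptions = [m.group(1) for s in pool if (m := desc_re.match(s))]

print(passes)        # ['licm', 'loop-unroll']
print(descriptions)  # ['Loop Invariant Code Motion']
```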
analyze_performance_hotspots.py -- Loads the full function database (cicc_functions.json) and computes: global hotspot ranking (top 100 most-called functions), hot path chains (BFS from top 50 hotspots through callees, tracking weighted call frequency), size-efficiency analysis (bytes per call for each function), loop depth estimation (regex-based nesting analysis of decompiled C files), bottleneck identification (functions with 500+ callers), and module-level hotspot distribution.
catalog_optimization_framework.py -- Specialized script for the optimization_framework module. Reads per-function metadata from the L1 module directories, builds a critical function registry sorted by size, extracts HIGH-tier statistics (size tier distribution, top 20 most-called), scans decompiled code for optimization-related string patterns (pass references, iteration patterns, technique keywords), and identifies entry points (functions with 2 or fewer callers).
validate_callgraph.py -- Comprehensive validation system that cross-checks the call graph data against module classifications. Performs six verification analyses: cross-module call matrix verification (counting inter-module edges and sampling for spot-checks), entry point validation (confirming claimed entry points have zero callers), reachability analysis (BFS from main to find dead code), module dependency cycle detection (DFS on the module dependency graph), integration hotspot verification (functions called by all 8 modules), and bridge function identification (functions that both call into and are called from 2+ other modules).
Evidence Index Builders
Two versions of the evidence aggregation engine synthesize all data sources into per-function quality scores:
build_evidence_index.py (v1) -- Loads the exported databases (functions, callgraph, strings, xrefs, names, comments) and the function-to-module map into memory. For each of the 80,562 functions, counts eight evidence types (metadata, callers, callees, strings, xrefs, name pattern, size, module consistency) and computes a weighted confidence score (string evidence weighted highest at 20 points, callers and callees at 15 each, xrefs at 15, metadata and name at 10 each, module at 10, size at 5). Produces nine output files including quality tier assignments (GOLD >= 80%, SILVER >= 50%, BRONZE < 50%), citation density analysis, cross-reference statistics, and prioritized recommendations for further analysis.
build_evidence_index_v2.py (v2, optimized) -- Memory-efficient reimplementation that avoids loading the full xref list into memory. Instead of building complete xref lookup tables, it streams the xref file line-by-line and counts only. The callgraph is preprocessed into a caller/callee count map rather than a full edge list. Produces the same nine analysis files as v1 with identical quality tier logic. Recommended for systems with less than 32 GB RAM.
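The tier logic shared by both versions can be sketched with the weights quoted above, which conveniently sum to 100 (the boolean evidence flags are a simplification of the script's counts):

```python
# Weights quoted above: strings 20; callers, callees, xrefs 15 each;
# metadata, name, module 10 each; size 5. Total: 100.
WEIGHTS = {"strings": 20, "callers": 15, "callees": 15, "xrefs": 15,
           "metadata": 10, "name": 10, "module": 10, "size": 5}

def quality_tier(evidence):
    """GOLD >= 80%, SILVER >= 50%, BRONZE otherwise."""
    score = sum(WEIGHTS[k] for k, present in evidence.items() if present)
    pct = 100 * score / sum(WEIGHTS.values())
    if pct >= 80:
        return "GOLD"
    if pct >= 50:
        return "SILVER"
    return "BRONZE"

print(quality_tier({k: True for k in WEIGHTS}))  # GOLD
print(quality_tier({"strings": True, "callers": True,
                    "callees": True, "xrefs": True}))  # SILVER
```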
Cross-Module Dependency Analysis
07_analyze_cross_module_dependencies.py -- The most complex standalone analysis. Streams the full call graph (using ijson for memory-efficient parsing) four times to compute:
- Inter-module call matrix -- for each pair of the 8 modules, the number of call edges crossing the boundary.
- Module dependency depth -- per-module statistics on how many other modules each function depends on, identifying isolated functions and hub functions.
- Critical bridges -- functions that call into 3 or more other modules (top 100 by bridge count).
- Integration hotspots -- functions called by 3 or more other modules (top 100 by fan-in).
- Module dependency graph -- a JSON graph structure with weighted edges suitable for visualization.
- Integration patterns -- entry point modules (highest out-degree), utility hub modules (highest in-degree), and linear dependency chains.
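The inter-module call matrix reduces to a Counter over boundary-crossing edges (toy data; the real script streams cicc_callgraph.json with ijson):

```python
from collections import Counter

# Toy (caller, callee) edges and function-to-module map.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
module = {"a": "compilation_pipeline", "b": "ptx_emission", "c": "ptx_emission"}

matrix = Counter()
for caller, callee in edges:
    m1, m2 = module[caller], module[callee]
    if m1 != m2:                 # count only boundary-crossing edges
        matrix[(m1, m2)] += 1

print(dict(matrix))
```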
Data Flow and Directory Structure
The complete analysis data is organized as follows:
cicc/
databases/ # IDA exports (input data)
cicc_functions.json # 80,562 function records
cicc_strings.json # 188,141 string records
cicc_xrefs.json # cross-reference records
cicc_callgraph.json # call edge records
cicc_names.json # recovered names
cicc_comments.json # IDA comments
foundation/ # L0 pipeline output
indices/ # pickle indices for fast lookup
taxonomy/
modules/ # function-to-module map, module list
strings/ # classified string databases
callgraph/ # entry points, leaf functions, hot paths
priorities/ # priority scoring and tier assignments
coverage/ # master progress tracker
analyses/ # evidence index, quality tiers, cross-module data
modules/ # L1 pipeline output
{module}/
functions/
critical/{addr}/ # metadata.json per critical function
high/{addr}/ # metadata.json per high function
analysis/ # module-level analysis files
README.md # module documentation skeleton
decompiled/ # Hex-Rays output (per-function C files)
disasm/ # IDA disassembly output (per-function ASM files)
graphs/ # Control flow graphs (JSON and DOT)
scripts/ # All analysis scripts
foundation/ # L0 pipeline scripts (00-07)
modules/ # L1 pipeline scripts (01-03)
run_foundation_analysis.sh # L0 batch runner
run_l1_programmatic.sh # L1 batch runner
Verification Approaches
To verify any specific finding in this wiki:
1. Open IDA at the stated address. Every function identification includes an address. Navigate to it, press F5 to decompile, and check whether the decompiled code matches the described behavior.
2. Check string xrefs. For VERY HIGH and KNOWN identifications, search for the quoted string in IDA's Strings window. The xref should lead to the stated function address or a function that directly calls it.
3. Compare with upstream LLVM. CICC is based on LLVM 20.0.0. The LLVM source tree at the corresponding git tag contains the original implementations of all standard passes. Structural comparison (switch case counts, parameter counts, error message text) between the decompiled code and the LLVM source is the gold standard for verification.
4. Cross-reference the dual paths. Path A and Path B contain near-duplicate code. If a function is identified in Path A, the corresponding Path B function should exhibit the same structure. Agreement between the two paths increases confidence.
5. Trace from known entry points. Start at `sub_8F9C90` (real main, KNOWN confidence) and follow the call chain. Every function reachable from main through a chain of identified functions has a verified callgraph path.
6. Run the validation script. Execute `scripts/validate_callgraph.py` to cross-check the call graph against module classifications. The script produces a `CALLGRAPH_VALIDATION_REPORT.json` with quantitative metrics: entry point accuracy, cross-module call counts, reachability percentage, bridge function inventory, and module dependency cycles. A healthy analysis should show entry point confidence above 90% and reachability above 80%.
7. Re-run the evidence index. Execute `scripts/foundation/build_evidence_index_v2.py` to regenerate quality tier assignments. Compare the GOLD/SILVER/BRONZE percentages against the expected distribution (majority SILVER or GOLD for classified functions). Functions that drop to BRONZE after a wiki edit indicate a regression in evidence consistency.
Reproducing the Full Analysis
To reproduce this analysis from scratch:
1. Obtain the binary. Install CUDA Toolkit 13.0. The binary is at `<cuda>/nvvm/bin/cicc`. SHA-256 and build string `cuda_13.0.r13.0/compiler.36424714_0` must match.
2. Run IDA auto-analysis. Open cicc in IDA Pro 8.x with default x86-64 analysis settings. Allow auto-analysis to complete (5-10 minutes for a binary of this size). Accept the detected compiler (GCC).
3. Batch decompile. Run the following IDA Python script to decompile all functions and export per-function C files:

   ```python
   import idautils, ida_hexrays, idc

   for func_ea in idautils.Functions():
       try:
           cfunc = ida_hexrays.decompile(func_ea)
           name = idc.get_func_name(func_ea)
           addr = f"0x{func_ea:X}"
           with open(f"decompiled/{name}_{addr}.c", "w") as f:
               f.write(str(cfunc))
       except:
           pass
   ```

4. Export databases. Use IDA Python to export the five JSON databases (functions, strings, xrefs, callgraph, names) to `cicc/databases/`. The function export should iterate `Functions()` and record address, name, size, instruction count, is_library, is_thunk, callers, and callees for each. The string export should iterate IDA's string list and record address, value, and xrefs.
5. Run L0 foundation pipeline.

   ```
   cd cicc/scripts
   bash run_foundation_analysis.sh
   ```

   This executes Steps 0-6 in sequence, producing all indices, classifications, and the coverage tracker. Expected runtime: 2-5 minutes on a modern machine.
6. Run L1 module setup.

   ```
   bash run_l1_programmatic.sh
   ```

   This creates the per-module directory structure, extracts metadata for CRITICAL and HIGH functions, and generates module README skeletons. Expected runtime: under 1 minute.
7. Run standalone analyses (optional, for deeper investigation):

   ```
   python3 analyze_nvvm_pipeline.py         # NVVM pipeline trace
   python3 deep_pipeline_trace.py           # Deep BFS from nvvmCompileProgram
   python3 extract_pipeline_structure.py    # Pass extraction from strings
   python3 analyze_performance_hotspots.py  # Hotspot ranking
   python3 validate_callgraph.py            # Validation report
   ```

8. Run evidence indexing (optional, for quality assessment):

   ```
   cd foundation
   python3 build_evidence_index_v2.py
   ```

9. Begin manual analysis. With the foundation data in place, start from the CRITICAL priority list and the string anchors described in the Function Identification Strategies section above. The Function Map page is the primary lookup table.
Dependencies
The analysis scripts require only the Python 3.8+ standard library with one exception: 07_analyze_cross_module_dependencies.py uses ijson for streaming JSON parsing of the large callgraph file. Install with pip install ijson. All other scripts use only json, pickle, re, collections, pathlib, statistics, dataclasses, and typing.
Binary Address Sweep Reports
In addition to the automated scripts, the analysis produced 90+ raw binary sweep reports stored in cicc/raw/. Each report covers a contiguous address range (typically 128 KB to 512 KB) and contains per-function identification notes, string evidence citations, structural observations, and confidence assessments. The reports are named by address range (e.g., p1.3-01-sweep-0x8F0000-0x90FFFF.txt covers the compilation pipeline entry region) and organized into 10 sweep phases corresponding to the binary's major sections. A second round of sweeps (p2-* and p2a-p2g) provides focused analyses of specific subsystems (EDG frontend, IR generation, optimization passes, SelectionDAG, register allocation, scheduling, configuration).
These raw reports are the primary source material from which the wiki pages were written. They are not cleaned or edited for presentation -- they contain working notes, false starts, and corrections made during the analysis process.
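The sweep-report naming convention is regular enough to parse mechanically. A minimal sketch (the regex and helper name are my own; only the example filename comes from the reports):

```python
import re

# Hypothetical parser for sweep-report names such as
# "p1.3-01-sweep-0x8F0000-0x90FFFF.txt" -> (phase, index, start, end).
SWEEP_RE = re.compile(
    r"^(?P<phase>p[\w.]+)-(?P<idx>\d+)-sweep-"
    r"0x(?P<start>[0-9A-Fa-f]+)-0x(?P<end>[0-9A-Fa-f]+)\.txt$"
)

def parse_sweep_name(name):
    m = SWEEP_RE.match(name)
    if not m:
        return None
    return (m["phase"], int(m["idx"]), int(m["start"], 16), int(m["end"], 16))

print(parse_sweep_name("p1.3-01-sweep-0x8F0000-0x90FFFF.txt"))
# -> ('p1.3', 1, 9371648, 9502719)
```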
Limitations and Known Gaps
This analysis has several inherent limitations:
- No dynamic validation. All findings are from static analysis. Runtime behavior under specific inputs (unusual SM targets, edge-case CUDA constructs) has not been verified.
- EDG internals are partially opaque. The EDG frontend is a licensed third-party component. Unlike the LLVM-derived portions of the binary, its internal data structures have no public source or literature to compare against, making identification harder. The IL tree format and scope management structures are identified at MEDIUM confidence.
- Inlined functions are invisible. If the compiler inlined a function during the build of cicc itself, that function has no standalone address and cannot be independently identified. Some small LLVM utility functions (SmallVector operations, StringRef comparisons) are likely inlined throughout.
- Proprietary NVIDIA code has no public reference. The 35 custom NVIDIA passes, the NVVM bridge layer, and the NVVMPassOptions system have no upstream source to compare against. These are identified purely from string evidence and structural analysis.
- Version-specific. All findings apply to cicc v13.0 (build cuda_13.0.r13.0/compiler.36424714_0). Addresses, function sizes, and pass counts will differ in other CUDA toolkit versions.
- Module classification accuracy degrades at the boundary. The automated taxonomy assigns ~60% of functions with high confidence (wiki ground truth or strong string evidence). The remaining functions are classified by call-proximity propagation or address-range heuristics at 30-60% confidence. Functions at module boundaries may be misclassified; the validate_callgraph.py script quantifies this.
- Callgraph completeness depends on IDA's xref analysis. Indirect calls through function pointers (vtable dispatch, callback registrations) are not fully captured by IDA's static analysis. The call graph is therefore a lower bound on the true call relationships. This primarily affects LLVM's pass manager dispatch and the EDG frontend's visitor-pattern implementations.
Version Tracking
This page documents the exact version identifiers embedded in the cicc v13.0 binary and the version relationships between its components. Every version listed here was recovered from string constants, constructor initializations, or binary header fields in the stripped ELF binary. This is the single source of truth for version-related questions across the wiki.
Version Summary
| Component | Version | Evidence |
|---|---|---|
| cicc binary | v13.0 | Build string cuda_13.0.r13.0/compiler.36424714_0 |
| CUDA Toolkit | 12.8 | Toolkit release that ships cicc v13.0 |
| LLVM base (internal) | 20.0.0 | ctor_036 at 0x48CC90 falls back to "20.0.0"; string "llvm-mc (based on LLVM 20.0.0)" at sub_E7A190 |
| Bitcode producer (emitted) | "LLVM7.0.1" | ctor_154 at 0x4CE640 writes "7.0.1" to producer global |
| EDG frontend | 6.6 | String "Based on Edison Design Group C/C++ Front End, version 6.6" |
| NVVM IR version (user code) | 3.2 | Metadata gate at sub_157E370: major == 3, minor <= 2 |
| NVVM IR version (libdevice) | 2.0 | !nvvmir.version = !{i32 2, i32 0} -- always-compatible sentinel |
| NVVM container format | 1.x | Header field version_major = 1, version_minor <= 0x41 |
| NVVM debug info version | 3.2 | Container header nvvm_debug_major = 3, nvvm_debug_minor <= 2 |
| Embedded libdevice | libdevice.10.bc | 455,876 bytes, 352 functions, triple nvptx64-nvidia-gpulibs |
| GCC emulation (EDG) | 8.1 | DEFAULT_GNU_VERSION = 80100 |
| Clang emulation (EDG) | 9.1 | DEFAULT_CLANG_VERSION = 90100 |
| jemalloc | 5.3.x | ~400 statically linked functions at 0x12FC000 |
| Default PTX ISA (sm_90) | 8.5 | .version 8.5 computed from PTXVersion / 10, PTXVersion % 10 |
| Default SM target | sm_75 | Hardcoded strcpy("compute_75") in sub_900130 and sub_125FB30 |
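The PTX ISA row's computation can be sketched directly; the helper name is mine, the /10 and %10 split comes from the table:

```python
def ptx_version_directive(ptx_version: int) -> str:
    # PTXVersion is stored as a single integer (e.g. 85 for PTX ISA 8.5);
    # the emitted directive splits it as major = v / 10, minor = v % 10.
    return f".version {ptx_version // 10}.{ptx_version % 10}"

assert ptx_version_directive(85) == ".version 8.5"  # sm_90 default per the table
```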
LLVM Version: The Dual Identity
CICC has two LLVM version identities. Internally, it is an LLVM 20.0.0 fork -- all modern instruction opcodes, metadata formats, type encodings, and pass infrastructure from LLVM 20 are present. Externally, the bitcode it emits identifies itself as "LLVM7.0.1" in the producer field.
The reason is historical: NVVM IR 2.0 was defined against LLVM 7.0.1. The entire NVVM toolchain ecosystem (libNVVM, nvcc's device pipeline, nvdisasm, third-party NVVM IR consumers) standardized on "LLVM7.0.1" as the format identifier. Changing the producer string would require a coordinated update across the entire CUDA toolkit and all downstream consumers.
Binary evidence:
- ctor_036 at 0x48CC90: reads the LLVM_OVERRIDE_PRODUCER environment variable, falls back to "20.0.0" (the true version).
- ctor_154 at 0x4CE640: reads LLVM_OVERRIDE_PRODUCER, falls back to "7.0.1" (the compatibility marker). This is the constructor that runs for the bitcode writer path.
- sub_E7A190: contains the string "llvm-mc (based on LLVM 20.0.0)".
- sub_1538EC0 (writeModule): emits "LLVM" + "7.0.1" = "LLVM7.0.1" as the IDENTIFICATION_BLOCK producer.
Both constructors accept the LLVM_OVERRIDE_PRODUCER environment variable to override the default. Setting it changes the embedded producer string in output bitcode.
See Bitcode Reader/Writer for the full dual-producer mechanism.
EDG 6.6 Frontend
The EDG (Edison Design Group) frontend is a licensed commercial C/C++ frontend. Version 6.6 occupies 3.2 MB of code at 0x5D0000--0x8F0000. The version string is embedded literally as "Based on Edison Design Group C/C++ Front End, version 6.6" and is accessible via the --version flag.
EDG 6.6 in cicc is configured to emulate GCC 8.1 (DEFAULT_GNU_VERSION = 80100) and Clang 9.1 (DEFAULT_CLANG_VERSION = 90100). It supports C++23 as the newest C++ standard and C23 as the newest C standard, with C++17 as the default mode.
See EDG 6.6 Frontend for the full frontend documentation.
NVVM IR Version
The NVVM IR version is a metadata tuple (major, minor) embedded in every NVVM bitcode module via the !nvvmir.version named metadata node. CICC v13.0 has two distinct version contexts:
User code: the IR generation phase (sub_9151E0) emits !nvvmir.version with the current version tuple. The version checker at sub_157E370 enforces major == 3 and minor <= 2, making 3.2 the current maximum accepted version. Modules with major != 3 or minor > 2 are rejected with "Broken module found, compilation aborted!".
Libdevice: the embedded libdevice.10.bc carries !nvvmir.version = !{i32 2, i32 0}. The version (2, 0) is hard-coded in the version checker (sub_12BDA30) as an always-compatible sentinel -- it passes the check regardless of the current NVVM IR version. This ensures the embedded math library is compatible with any user module.
Container format: the NVVM container binary header stores version fields separately at offsets 0x06--0x07 (nvvm_ir_major, nvvm_ir_minor). These track the container-level IR spec version and may differ from the bitcode-level metadata tuple.
Bypass: setting NVVM_IR_VER_CHK=0 in the environment disables version validation entirely, allowing any version tuple to pass.
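The three rules (gate, sentinel, env bypass) combine into a small predicate. A sketch with illustrative names; the constants come from the checkers described above:

```python
import os

def nvvm_version_accepted(major, minor, env=None):
    env = os.environ if env is None else env
    # NVVM_IR_VER_CHK=0 disables version validation entirely.
    if env.get("NVVM_IR_VER_CHK") == "0":
        return True
    # libdevice's (2, 0) tuple is hard-coded as an always-compatible sentinel.
    if (major, minor) == (2, 0):
        return True
    # User modules must satisfy major == 3 and minor <= 2.
    return major == 3 and minor <= 2

assert nvvm_version_accepted(3, 2, env={})      # current maximum accepted
assert nvvm_version_accepted(2, 0, env={})      # libdevice sentinel
assert not nvvm_version_accepted(3, 3, env={})  # "Broken module found..."
assert nvvm_version_accepted(9, 9, env={"NVVM_IR_VER_CHK": "0"})  # bypass
```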
See Bitcode Reader/Writer for the version gate implementation and NVVM Container for the container-level version fields.
Embedded Libdevice
The embedded libdevice is libdevice.10.bc, a 455,876-byte LLVM bitcode library containing 352 GPU-optimized math functions. Two identical copies are statically embedded in the binary:
| Copy | Address | Pipeline |
|---|---|---|
| A | unk_3EA0080 | LibNVVM mode (Path A) |
| B | unk_420FD80 | Standalone mode (Path B) |
Key properties:
- Target triple: nvptx64-nvidia-gpulibs
- Function count: 352 (all alwaysinline nounwind)
- NVVM IR version: (2, 0) -- always-compatible sentinel
- Producer: "clang version 3.8.0 (tags/RELEASE_380/final)" -- the Clang version that originally compiled libdevice (not indicative of cicc's own compiler version)
- NVVMReflect calls: uses __nvvm_reflect("__CUDA_FTZ"), __nvvm_reflect("__CUDA_ARCH"), and __nvvm_reflect("__CUDA_PREC_SQRT") for runtime specialization
The libdevice.10.bc naming convention carries forward from the CUDA 5.0 era. The 10 in the filename originally indicated "compute capability 1.0 and above" (i.e., universal), not a version number.
See Libdevice Linking for the linking algorithm, version validation, and NVVMReflect interaction.
NVVM Container Format Version
The NVVM container binary envelope uses its own versioning scheme, independent of the NVVM IR version:
| Field | Offset | Value | Meaning |
|---|---|---|---|
| version_major | 0x04 | 1 | Container format major |
| version_minor | 0x05 | <= 0x41 | Container format minor |
| nvvm_ir_major | 0x06 | 2 | NVVM IR spec major (container-level) |
| nvvm_ir_minor | 0x07 | <= 0x62 | NVVM IR spec minor (container-level) |
| nvvm_debug_major | 0x08 | 3 | Debug info format major |
| nvvm_debug_minor | 0x09 | <= 2 | Debug info format minor |
| llvm_major | 0x0A | encoded | LLVM version, combined encoding (see below) |
| llvm_minor | 0x0B | encoded | Combined with llvm_major: major * 100 + minor = 2000 |
The container's LLVM version encoding stores the combined value 20 * 100 + 0 = 2000, confirming the internal LLVM 20.0.0 base.
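A minimal reader for these version bytes might look like the following; the offsets and field names come from the table, while the little-endian pairing of the two LLVM bytes is an assumption (it matches the x86-64 host) and the magic/flag bytes before offset 0x04 are not modeled:

```python
import struct

def parse_container_versions(header: bytes) -> dict:
    # Version-related bytes of the NVVM container header, per the table above.
    names = ["version_major", "version_minor", "nvvm_ir_major", "nvvm_ir_minor",
             "nvvm_debug_major", "nvvm_debug_minor", "llvm_major", "llvm_minor"]
    out = dict(zip(names, struct.unpack_from("8B", header, 0x04)))
    # Assumed little-endian combination: 20 * 100 + 0 = 2000 for LLVM 20.0.0.
    out["llvm_combined"] = out["llvm_major"] | (out["llvm_minor"] << 8)
    return out

# Synthetic header carrying the v13.0 values from the table; the first four
# bytes stand in for the unmodeled magic/flag fields.
hdr = b"\x00\x00\x00\x00" + bytes([1, 0x41, 2, 0x62, 3, 2, 0xD0, 0x07])
assert parse_container_versions(hdr)["llvm_combined"] == 2000
```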
See NVVM Container for the full binary format specification.
Version Cross-Reference Matrix
How versions flow through the pipeline:
EDG 6.6 Frontend
|
v
NVVM IR Generation
(emits nvvmir.version = {3, 2})
|
+----+----+
| |
libdevice user IR
(version 2,0) (version 3,2)
| |
+----+----+
|
NVVM IR Version Check
(gate: major==3, minor<=2)
(sentinel: 2,0 always passes)
|
LLVM 20.0.0 Optimizer
|
Bitcode Writer
(producer: "LLVM7.0.1")
|
NVVM Container Serializer
(container version 1.x, LLVM encoded as 2000)
|
v
.ptx / .optixir output
Future Updates
This wiki documents cicc v13.0 from CUDA 12.8. When a new CUDA toolkit release ships a newer cicc binary, the following version fields are the most likely to change:
- LLVM base version: NVIDIA periodically rebases on newer LLVM releases. A jump from 20.0.0 to a later version would change the internal string, the container LLVM encoding, and potentially add new passes, opcodes, and metadata formats.
- EDG version: EDG releases track independently of LLVM. A bump from 6.6 to a later version would affect C++ standard support, keyword handling, and the frontend error catalog.
- NVVM IR version minor: the minor field (currently 2 in the major == 3 series) may increment to accommodate new metadata kinds or intrinsic conventions without breaking the major version.
- PTX ISA version: new SM targets require new PTX versions. sm_100 Blackwell already uses a higher PTX version than sm_90 Hopper.
- SM target range: new GPU architectures add new SM numbers. The sm_75--sm_121 range in v13.0 will expand in future releases.
The bitcode producer string ("LLVM7.0.1") is unlikely to change in the near term -- doing so would break backward compatibility with the entire NVVM IR ecosystem. The libdevice version sentinel (2, 0) is similarly stable because the version checker special-cases it.
To update this wiki for a new cicc version:
- Extract the build string (search for cuda_XX.Y.rXX.Y/compiler.).
- Check ctor_036 for the LLVM version fallback string.
- Check the EDG version string at sub_617BD0.
- Check the NVVM IR version gate constants at the version checker function.
- Measure the embedded libdevice size and function count.
- Verify the NVVM container header version fields.
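The build-string step can be automated with a byte-level regex over the raw binary image. A sketch; the pattern is derived from the v13.0 build string format and may need widening for other releases:

```python
import re

# Matches build identifiers shaped like "cuda_13.0.r13.0/compiler.36424714_0".
BUILD_RE = re.compile(rb"cuda_\d+\.\d+\.r\d+\.\d+/compiler\.\w+")

def find_build_strings(data: bytes):
    # Scan a raw binary image for embedded CUDA build identifiers.
    return [m.group().decode() for m in BUILD_RE.finditer(data)]

blob = b"\x00garbage\x00cuda_13.0.r13.0/compiler.36424714_0\x00more"
print(find_build_strings(blob))  # -> ['cuda_13.0.r13.0/compiler.36424714_0']
```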
Cross-References
- Bitcode Reader/Writer -- producer string mechanism, version gate implementation
- NVVM Container -- container binary format with version header fields
- Libdevice Linking -- embedded math library, version sentinel
- EDG 6.6 Frontend -- frontend version, GCC/Clang emulation modes
- Binary Layout -- build ID string, ELF properties
- Entry Point & CLI -- dual-path dispatch, version string arguments
- Environment Variables -- LLVM_OVERRIDE_PRODUCER, NVVM_IR_VER_CHK
- Debug Info Verification -- debug version field (3.2)
Compilation Pipeline Overview
This page maps the complete end-to-end flow of a CUDA compilation through cicc v13.0, from the initial CLI invocation to the final PTX text output. Each stage is a self-contained subsystem with its own address range, data structures, and failure modes. The links below lead to dedicated pages with reimplementation-grade detail for every stage.
Pipeline Diagram
nvcc
|
v
+===========================================================+
| cicc (60 MB, 80,562 functions) |
| |
| 1. CLI Parsing & Dispatch -----> [entry.md] |
| | argv/envp, flag translation, arch detection |
| | dual-path select: Path A (LibNVVM) / Path B |
| v |
| 2. nvcc-to-cicc Interface -----> [nvcc-interface.md] |
| | flag tree (40+ mappings), 3-column arch fan-out |
| | mode cookies: 0xABBA=CUDA, 0xDEED=OpenCL |
| v |
| 3. EDG 6.6 Frontend -----------> [edg.md] |
| | CUDA C++ --> transformed C (.int.c/.device.c) |
| | 737 config #defines, GCC 8.1 / Clang 9.1 emu |
| v |
| 4. NVVM IR Generation ---------> [ir-generation.md] |
| | EDG IL tree --> LLVM Module (NVVM IR) |
| | address spaces, kernel metadata, builtins |
| v |
| 5. Libdevice Linking ----------> [../infra/libdevice-linking.md]
| | embedded 455KB bitcode, 352 __nv_* math fns |
| | target triple validation, NVVM version check |
| v |
| 6. LLVM Optimizer -------------> [optimizer.md] |
| | two-phase model (analysis -> codegen-oriented) |
| | 49.8KB pipeline assembler, ~150 pass insertions |
| | concurrent per-function Phase II |
| v |
| 7. LTO Pipeline ---------------> [../lto/index.md] |
| | cross-TU inlining, devirt, GlobalOpt |
| | closed-world GPU model: no dlopen, no .so |
| v |
| 8. Code Generation ------------> [codegen.md] |
| | SelectionDAG, ISel, RegAlloc, MachineIR passes |
| | 37 MB of code, largest subsystem |
| v |
| 9. PTX Emission ---------------> [emission.md] |
| | .entry/.func headers, register decls, .loc/.file |
| | AsmPrinter, GenericToNVVM addrspace rewrite |
| v |
| OUTPUT: .ptx file (or NVVM bitcode, or OptiX IR) |
+===========================================================+
Side paths:
* OptiX IR (--emit-optix-ir) ----> [optix-ir.md]
* Debug info (all stages) -------> [debug-info-pipeline.md]
Stage Descriptions
1. Entry Point & CLI Parsing
The real main (sub_8F9C90, 10KB) parses argv, detects wizard mode via NVVMCCWIZ=553282, selects the target architecture (default sm_75), and dispatches into one of two compilation paths. Path A serves the LibNVVM API; Path B serves standalone nvcc invocations. Both paths are functionally identical but duplicated in the binary at different address ranges. See Entry Point & CLI.
2. nvcc-to-cicc Interface
The flag translation layer (sub_8FE280) rewrites nvcc-facing flags into cicc-facing flags through a std::map red-black tree, then a second stage (sub_95EB40) fans each flag out into three columns targeting EDG, OPT, and LLC separately. Mode cookies (0xABBA for CUDA, 0xDEED for OpenCL) select language-specific behavior. See nvcc-to-cicc Interface.
3. EDG 6.6 Frontend
A licensed commercial frontend (3.2 MB, 0x5D0000--0x8F0000) parses CUDA C++ source and emits transformed C code into .int.c, .device.c, and .stub.c files. CUDA syntax (<<<>>>, __shared__, __device__) is fully resolved in this stage. The output is C source, not LLVM IR. See EDG 6.6 Frontend.
4. NVVM IR Generation
Translates the EDG intermediate language (IL) tree into an LLVM Module with proper NVPTX address space annotations, nvvm.annotations kernel metadata, and lowered builtins. This is cicc's equivalent of Clang's lib/CodeGen, but operates on EDG's proprietary IL node format. See NVVM IR Generation and its sub-pages for expressions, statements, functions, and types.
5. Libdevice Linking
A 455,876-byte LLVM bitcode library containing 352 GPU-optimized math functions (__nv_sinf, __nv_expf, etc.) is embedded directly in the cicc binary. The linker validates the nvptx64- target triple, checks NVVM IR version metadata, and merges the library into the compilation module. No filesystem access is required. See Libdevice Linking.
6. LLVM Optimizer
A proprietary two-phase pipeline (sub_12E54A0, 49.8KB) runs ~150 passes: Phase I performs module-wide analysis, Phase II performs codegen-oriented transforms with optional per-function parallelism using a jobserver or thread pool. All behavior is controlled by the 222-slot NVVMPassOptions system. See LLVM Optimizer and Pipeline & Ordering.
7. LTO Pipeline
Exploits the GPU's closed-world compilation model (no dlopen, no shared libraries, no symbol interposition) for aggressive cross-TU inlining, whole-program devirtualization, and global variable promotion. Activated in separate compilation mode (nvcc -dc), but GlobalOpt and the inliner run even in single-TU mode. See LTO & Module Optimization.
8. Code Generation
The largest subsystem (37 MB, 0x1700000--0x35EFFFF) lowers optimized LLVM IR to NVPTX MachineInstr through SelectionDAG construction, type legalization, instruction selection via a three-level pattern match engine (900KB), pressure-driven greedy register allocation, and ~30 machine-level passes including tensor core codegen for HMMA/IMMA/WGMMA/tcgen05. See Code Generation.
9. PTX Emission
The AsmPrinter (sub_31EC4F0, 72KB) walks the final MachineFunction and emits PTX text: .entry/.func headers with kernel attributes, register declarations for 9 register classes, .loc/.file debug directives, and instruction mnemonics. A GenericToNVVM pass rewrites any remaining generic address space references before emission. See PTX Emission.
Side Paths
OptiX IR -- When --emit-optix-ir is passed, the pipeline replaces LLC with an OPTIXIR stage that serializes the optimized LLVM module for the OptiX ray tracing runtime's continuation-based execution model. See OptiX IR Generation.
Debug Info -- Debug metadata flows through all stages: generated in IR-gen, preserved or stripped in the optimizer (5 stripping passes), verified after each pass, and emitted as .loc/.file PTX directives. See Debug Info Pipeline.
Internal Pipeline Encoding
Internally, cicc represents the active pipeline stages as a bitmask:
| Stage | Internal Name | Bit | Description |
|---|---|---|---|
| LNK | Libdevice link | 0x01 | Merge embedded math library |
| OPT | Optimizer | 0x02 | LLVM IR optimization (Phase I + II) |
| OPTIXIR | OptiX IR | 0x40 | OptiX serialization (mutually exclusive with LLC) |
| LLC | Code generation | 0x04 | SelectionDAG through PTX emission |
The standard CUDA compilation bitmask is LNK | OPT | LLC = 0x07. OptiX mode uses 0x43.
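The bitmask arithmetic is easy to verify; a sketch using the constants from the table (the helper name is mine):

```python
# Stage bits from the table above.
LNK, OPT, LLC, OPTIXIR = 0x01, 0x02, 0x04, 0x40

def pipeline_mask(*stages: int) -> int:
    mask = 0
    for s in stages:
        mask |= s
    return mask

# Standard CUDA compilation: link libdevice, optimize, run codegen.
assert pipeline_mask(LNK, OPT, LLC) == 0x07
# OptiX mode swaps LLC for the OPTIXIR serializer.
assert pipeline_mask(LNK, OPT, OPTIXIR) == 0x43
```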
Cross-References
- Binary Layout -- address ranges for every subsystem
- Function Map -- master index of recovered function addresses
- CLI Flags -- complete flag catalog
- Optimization Levels -- what changes at -O0/-O1/-O2/-O3
- NVIDIA Custom Passes -- 35 proprietary passes inserted into the LLVM pipeline
- NVPTX Target Infrastructure -- TargetMachine, TTI, SubtargetFeatures
Entry Point & CLI
The cicc binary has a surprisingly complex entry point. Rather than a straightforward main → compile → exit flow, it implements a dual-path architecture where the same binary can operate as either a LibNVVM-based compiler (Path A) or a standalone compiler (Path B), selected at runtime through environment variables and obfuscated string comparisons. This design allows NVIDIA to ship a single binary that serves both the nvcc toolchain and the LibNVVM API.
The entry point region (0x8F0000–0x96FFFF, ~520 KB) handles CLI parsing, architecture detection with a 3-column flag fan-out system, and dispatch into one of several compilation pipelines. A hidden "wizard mode" gated behind an environment variable with a magic number enables developer diagnostics that are otherwise completely inaccessible.
| main() thunk | 0x4396A0 (16 bytes) — return sub_8F9C90(argc, argv, envp) |
| Real main | sub_8F9C90 (10,066 bytes, 1,990 lines) |
| Wizard mode | getenv("NVVMCCWIZ") == 553282 → byte_4F6D280 = 1 |
| Default arch | compute_75 / sm_75 (Turing) |
| Flag catalog | sub_9624D0 (75KB, 2,626 lines, 4 output vectors) |
| Architecture map | sub_95EB40 (38KB, 23 architectures, 3-column fan-out) |
| Flag translation | sub_8FE280 (red-black tree at qword_4F6D2A0, 40+ nvcc→cicc mappings) |
| Pipeline stages | LNK → OPT → [OPTIXIR] → LLC |
| Dual path | Path A (sub_905EE0) / Path B (sub_1265970) |
| Libdevice | Path A: unk_3EA0080 / Path B: unk_420FD80 (455,876 bytes each) |
| Arch bitmask | 0x60081200F821 (validates SM 75–121) |
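The arch bitmask decodes cleanly if one assumes that bit N of the mask represents SM 75+N; that mapping is my assumption, only the constant comes from the binary:

```python
ARCH_MASK = 0x60081200F821  # validation mask from the table above

def supported_sms(mask: int, base: int = 75):
    # Assumed encoding: bit N set means SM (base + N) passes validation.
    return [base + bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]

assert supported_sms(ARCH_MASK) == [75, 80, 86, 87, 88, 89, 90,
                                    100, 103, 110, 120, 121]
```

Under this reading the mask accepts sm_75, sm_80, sm_86-sm_90, sm_100, sm_103, sm_110, sm_120, and sm_121 -- but it also sets a bit where sm_88 would sit and none for sm_101, so the one-bit-per-SM interpretation should be treated as approximate.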
Architecture
main (0x4396A0, 16B thunk)
│
└─ sub_8F9C90 (10KB, REAL MAIN)
│
├─ getenv("NVVMCCWIZ") == 553282 → wizard mode
├─ sub_16C5290: extract program name from argv[0]
│
├─ ARGUMENT LOOP (v15 = 1..argc)
│ ├─ -o <file> → v257 (output)
│ ├─ -nvvmir-library <path> → v256 (libdevice)
│ ├─ -lgenfe/-libnvvm/-lnk/-opt/-llc → v263 (mode)
│ ├─ -arch/-mcpu/--nv_arch → v242 (SM number)
│ ├─ --emit-optix-ir → v243=1, v258=1
│ ├─ -nvc → v258=1
│ ├─ -irversion → print IR version, exit
│ ├─ .bc/.ci/.i/.ii/.cup/.optixir → s (input file)
│ └─ obfuscated option → v253 (0 or 1)
│
├─ v253 RESOLUTION (if still == 2)
│ └─ getenv(obfuscated) → compare → set v253 = 0 or 1
│
├─ DISPATCH (v263 × v253)
│ ├─ v263==0, v253==1 → sub_902D10 (simple Path A)
│ ├─ v263==0, v253==0 → sub_1262860 (simple Path B)
│ ├─ v263==1 → sub_905E50 / sub_12658E0 (lgenfe)
│ ├─ v263≥2, v253==1 → sub_905EE0 (multi-stage Path A)
│ └─ v263≥2, v253==0 → sub_1265970 (multi-stage Path B)
│
└─ CLEANUP: free all vectors, strings, argv copy
Real Main — sub_8F9C90
The exported main() at 0x4396A0 is a 16-byte thunk that immediately tail-calls sub_8F9C90 — the actual entry point. This function is a monolithic CLI parser and dispatcher: it copies argv into a local buffer, checks for wizard mode, iterates over all arguments accumulating state in ~12 local variables, resolves the compilation path, and finally dispatches to the appropriate pipeline function. The entire function is a single 10KB basic-block-heavy control flow graph with ~80 branch targets.
| Field | Value |
|---|---|
| Address | 0x8F9C90–0x8FC3E2 |
| Size | 10,066 bytes |
| Stack frame | 0x978 (2,424) bytes |
| Local buffers | v284[2096] for argv copy (stack if argc ≤ 256, else heap) |
Argument Handling and Argv Copy
The function begins with a defensive copy of argv into a local buffer. When 8 * argc fits within 0x800 bytes (argc ≤ 256), the copy lives in v284[2096] on the stack. For larger argument lists -- which can occur during complex nvcc invocations with many pass-through flags -- it allocates heap memory via sub_16CD150. This copy is necessary because the argument loop modifies pointers (advancing i to skip flag values), and the caller's argv must not be disturbed.
if (8 * argc > 0x800)
v284 = sub_16CD150(8 * argc); // heap alloc for large argc
// else use stack buffer v284[2096]
memcpy(v284, argv, 8 * argc); // copy all pointers
After copying, sub_16C5290 extracts the base program name from argv[0] -- stripping directory prefixes -- and stores it in dest. This name appears in error messages and verbose output throughout the pipeline.
Key Local Variables
The function's behavior is controlled by two critical dispatch variables: v253 (which compilation backend to use) and v263 (which phase of the pipeline to invoke). These are accumulated during the argument loop and combined after parsing to select one of ~10 possible code paths. The interaction between them creates a matrix of behaviors that covers everything from simple single-file compilation to multi-stage LibNVVM pipeline processing.
| Variable | Init | Purpose |
|---|---|---|
| v253 | 2 | Dispatch mode: 0=Path B, 1=Path A, 2=default (needs env resolution) |
| v263 | 0 | Invocation mode: 0=default, 1=lgenfe, 2=libnvvm, 3=lnk, 4=opt, 6=llc |
| v242 | 0 | Target architecture (SM number) |
| v258 | 0 | NVC flag |
| v243 | 0 | OptiX IR flag |
| v259 | 0 | Verbose (only effective in wizard mode) |
| v261 | 0 | Dryrun |
| v262 | 0 | Keep intermediates (only effective in wizard mode) |
| s | NULL | Input file path |
| v257 | NULL | Output file path |
| v256 | NULL | NVVM IR library path |
| v266 | vector | Pass-through options vector |
Wizard Mode
v10 = getenv("NVVMCCWIZ"); // 0x8F9D36
if (v10 && strtol(v10, NULL, 10) == 553282) // 0x8F9D92
byte_4F6D280 = 1;
Global byte_4F6D280 gates the effectiveness of -v, -keep, -dryrun. Without wizard mode, these flags are silently ignored — v259 and v262 stay 0. This is a deliberate anti-reverse-engineering measure: even if someone discovers the -v flag, it does nothing without the magic environment variable. The magic number 553282 (0x87142) appears to be arbitrary.
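In Python terms the gate is simply the following; the variable name and magic value come from the decompiled check above, while the strict int() parse is slightly narrower than C's strtol (which tolerates trailing garbage):

```python
import os

def wizard_mode_enabled() -> bool:
    # Mirrors: getenv("NVVMCCWIZ") followed by strtol(value, NULL, 10) == 553282.
    value = os.environ.get("NVVMCCWIZ")
    if value is None:
        return False
    try:
        return int(value, 10) == 553282
    except ValueError:
        # strtol would return a partial parse here; int() rejects outright.
        return False
```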
Invocation Modes (v263)
The v263 variable determines which stage of the compilation pipeline cicc enters. When nvcc invokes cicc directly, v263 stays at 0 (default). But cicc can also be invoked in sub-pipeline mode — for example, -lnk runs only the linking phase, -opt runs only the optimizer, and -llc runs only code generation. This is how the multi-stage pipeline works: the outer driver calls cicc multiple times with different -lXXX flags, or a single invocation with -libnvvm runs all stages internally.
Each mode has its own format for the -discard-value-names flag, which tells the LLVM backend whether to strip IR value names (reducing memory usage). The different formats exist because each sub-pipeline stage has its own option namespace:
| v263 | Flag | Mode | discard-value-names format |
|---|---|---|---|
| 0 | (none) | Default (nvcc invocation) | -discard-value-names |
| 1 | -lgenfe | EDG frontend linkage | --discard_value_names=1 (underscores) |
| 2 | -libnvvm | LibNVVM API | -discard-value-names=1 (dashes) |
| 3 | -lnk | Linker | -lnk-discard-value-names=1 |
| 4 | -opt | Optimizer | -opt-discard-value-names=1 |
| 5 | (internal) | Undocumented (sets v278 high byte) | — |
| 6 | -llc | Standalone LLVM codegen | — |
Input File Extensions
Input files are identified by extension during the argument loop. The last matching file wins (s is overwritten each time). Unrecognized arguments are added to the v266 pass-through vector and forwarded to sub-pipelines. The .cup extension has a special restriction — it's only accepted when the preceding argument is --orig_src_path_name or --orig_src_file_name, which are metadata flags inserted by nvcc to track the original source file.
| Extension | Format | Condition |
|---|---|---|
| .bc | LLVM bitcode | Always accepted |
| .ci | CUDA intermediate (preprocessed) | Always accepted |
| .i | Preprocessed C/C++ | Always accepted |
| .ii | Preprocessed C++ | Always accepted |
| .cup | CUDA source | Only after --orig_src_path_name or --orig_src_file_name |
| .optixir | OptiX IR | Always accepted |
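The extension handling, including the .cup guard, can be sketched as follows; the names are mine, the rules come from the table:

```python
ALWAYS_ACCEPTED = (".bc", ".ci", ".i", ".ii", ".optixir")
CUP_GUARDS = ("--orig_src_path_name", "--orig_src_file_name")

def is_input_file(arg, prev_arg=None):
    # .cup is only treated as an input file when the preceding argument is one
    # of nvcc's source-tracking metadata flags.
    if arg.endswith(".cup"):
        return prev_arg in CUP_GUARDS
    return arg.endswith(ALWAYS_ACCEPTED)

assert is_input_file("kernel.bc")
assert not is_input_file("kernel.cup", "-o")
assert is_input_file("kernel.cup", "--orig_src_path_name")
```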
Obfuscated Strings
At 0x8F98A0, sub_8F98A0 decrypts strings using an XOR + ROT13-like cipher:
v40 = v37 ^ (-109 * ((offset + 97) ^ 0xC5));
// then ROT13 on alphabetic characters
This hides an environment variable name and option prefix from static analysis. The decrypted strings control the v253 (Path A vs Path B) resolution when no explicit mode is specified.
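A speculative Python rendering of the two-stage cipher. The keystream formula is transcribed from the decompilation, but the byte ordering and offset convention are assumptions; the sketch just renders both stages and verifies that they invert (using the NV_NVVM_VERSION name that later resolution logic decrypts):

```python
import codecs

def keystream_byte(offset: int) -> int:
    # Transcribed from: v37 ^ (-109 * ((offset + 97) ^ 0xC5)), truncated to a byte.
    return (-109 * ((offset + 97) ^ 0xC5)) & 0xFF

def decrypt(data: bytes) -> str:
    # Stage 1: XOR each byte with a position-dependent keystream byte.
    xored = bytes(b ^ keystream_byte(i) for i, b in enumerate(data))
    # Stage 2: ROT13 on alphabetic characters.
    return codecs.decode(xored.decode("latin-1"), "rot_13")

def encrypt(text: str) -> bytes:
    # Inverse, for round-trip testing: ROT13 first, then the same XOR stage.
    rotated = codecs.encode(text, "rot_13").encode("latin-1")
    return bytes(b ^ keystream_byte(i) for i, b in enumerate(rotated))

assert decrypt(encrypt("NV_NVVM_VERSION")) == "NV_NVVM_VERSION"
```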
Error Messages
| Message | Condition | Address |
|---|---|---|
| "Missing output file\n" | -o with no next argument | 0x8FA365 |
| "Missing NVVM IR library file\n" | -nvvmir-library with no next arg | 0x8FAB34 |
| "Unparseable architecture: " + value | Invalid arch string | Multiple |
| "Missing input file\n" | No recognized input file | 0x8FBEAD |
| "Recognized input file extensions are: .bc .ci .i .cup .optixir" | After missing input | 0x8FBE97 |
| "Error: Output file was not specified (See -o option).\n" | Multi-stage without -o | 0x8FB655 |
The v253 Dispatch Variable
The v253 variable is the single most important dispatch control in the entire entry point. It determines whether the compilation uses Path A (the EDG/PTX-producing pipeline) or Path B (the standalone LLVM-based pipeline). Understanding its resolution logic is essential to reproducing cicc's behavior.
Initialization and Explicit Setting
v253 begins at 2 (unresolved default). During the argument loop, obfuscated string matching can set it directly:
| Source | Value | Meaning |
|---|---|---|
| Initial default | 2 | Needs environment variable resolution |
| Obfuscated option suffix matches byte_3C23AC3 | 1 | Path A explicitly requested |
| Obfuscated option suffix matches byte_3C23AB4 | 0 | Path B explicitly requested |
Environment Variable Resolution
When v253 remains at 2 after argument parsing (the common case), cicc resolves it through the obfuscated environment variable NV_NVVM_VERSION (decrypted from byte_3C23A9F). The resolution has two sub-cases depending on the target architecture:
if (v253 == 2) {
env = getenv(decrypt(byte_3C23A9F)); // NV_NVVM_VERSION
if (env matches decrypt(byte_3C23A82)) // "nvvm-latest"
v253 = 1; // Path A
else if (env matches decrypt(byte_3C23A7B)) // "nvvm70"
v253 = 0; // Path B
else if (v242 > 99 && !v258) // SM >= 100, not -nvc
v253 = 0; // Path B (new architectures default to standalone)
else
v253 = 1; // Path A (legacy default)
}
The architectural threshold at SM 100 (Blackwell) is notable: for SM < 100, the default is Path A (the EDG frontend path). For SM >= 100, unless the -nvc flag is present, the default switches to Path B. This suggests NVIDIA is migrating newer architectures toward the standalone LLVM pipeline, possibly as a precursor to eventually deprecating the EDG-based path.
Version Strings Injected per Path
After v253 is resolved and for multi-stage modes (v263 >= 3), the entry point injects a version string into the pass-through options:
| v253 | Injected string | Semantics |
|---|---|---|
| 1 (Path A) | "-nvvm-version=nvvm-latest" (25 bytes from xmmword_3C23BC0) | Targets the latest NVVM IR specification |
| 0 (Path B) | "-nvvm-version=nvvm70" (20 bytes) | Targets NVVM 7.0 IR (frozen at LLVM 7.0.1 bitcode format) |
This version string propagates through the entire pipeline, controlling bitcode compatibility, intrinsic name resolution, and metadata format expectations.
Post-Parse Dispatch Logic
After the argument loop terminates, the dispatch logic combines v253 and v263 to select the target function. The combined keep-and-verbose flag v260 = v262 & v259 is also computed -- both wizard-mode flags must be active for intermediate file retention and verbose logging to function simultaneously.
Simple Dispatch (v263 == 0)
When cicc is invoked without any -lXXX mode flag (the standard nvcc invocation path):
if (v253 == 1)
v8 = sub_902D10(dest, 0, &v266, s, v257, v256, v260, v262, v261);
// Path A: CLI → lgenfe → LibNVVM pipeline
else
v8 = sub_1262860(dest, 0, &v266, s, v257, v256, v260, v262, v261);
// Path B: CLI → standalone LLVM pipeline
Both functions receive identical parameter signatures: program name, zero (unused), pass-through options, input file, output file, libdevice path, verbose+keep, keep, and dryrun. The return value becomes the process exit code.
lgenfe Dispatch (v263 == 1)
The -lgenfe mode builds a full argv-style array with the program name as the first entry, followed by all v266 pass-through options. This argv is then passed to one of two function pairs:
| v253 | Init function | Pipeline function |
|---|---|---|
| 1 (Path A) | sub_B6EEA0 (LLVMContext + metadata kind registration) | sub_905880 (EDG lgenfe) |
| 0 (Path B) | sub_1602D10 (standalone context initialization) | sub_1265340 (standalone lgenfe) |
The init functions create the LLVM context and register the 42+ metadata kinds used throughout the pipeline (dbg, tbaa, prof, noalias, etc.). These must be registered before any IR construction begins.
Multi-Stage Dispatch (v263 >= 2)
For -libnvvm, -lnk, -opt, and -llc modes, the dispatch constructs a CompilationState structure with input/output strings, extra arguments, and the v278 mode byte, then calls:
| v253 | Function | Size | Role |
|---|---|---|---|
| 1 | sub_905EE0 | 43 KB | Path A multi-stage pipeline driver |
| 0 | sub_1265970 | 48 KB | Path B multi-stage pipeline driver |
For -libnvvm (v263 == 2), the extra args are taken directly from v266 without prepending the program name. For -lnk/-opt/-llc (v263 >= 3), the appropriate version string (nvvm-latest or nvvm70) is appended to the pass-through options before dispatch.
Cleanup
After the pipeline function returns, sub_8F9C90 performs deterministic cleanup in reverse allocation order: the v281 extra-argument char** array and each entry, the v275 output string, the s2 input string, each element of the v266 pass-through vector, the vector's backing buffer, the dest program name, and the v282 argv copy buffer (if heap-allocated). The return value v8 is 0 on success, 1 on argument errors, or the pipeline function's return code (stored in v264).
Path A — EDG → LibNVVM Pipeline
Path A is the full CUDA C++ compilation path. It starts with the EDG 6.6 C++ frontend parsing CUDA source code into an IL tree, then converts that IL into LLVM IR via the lgenfe (LLVM Generation Front End) stage, and finally runs the LibNVVM pipeline to optimize and lower the IR to PTX. This is the path taken when cicc is invoked by nvcc for .cu file compilation, and it represents the standard CUDA compilation flow that most users encounter.
Path A Orchestrator — sub_902D10
The orchestrator is a 9 KB function that sequences the three major stages of Path A compilation. It acts as the conductor between the CLI processing layer, the EDG frontend, and the LibNVVM optimizer/codegen.
| Field | Value |
|---|---|
| Address | 0x902D10 |
| Size | ~9 KB |
| Timer | Creates 8-byte timer via sub_22077B0 → sub_B6EEA0 |
Execution flow:
1. Timer creation. Allocates and initializes an 8-byte timing context. The sub_B6EEA0 init function also registers the 42+ LLVM metadata kinds (dbg=1, tbaa=2, prof=3, ... noalias.addrspace=42) that all subsequent IR construction depends on. This is why timer creation happens first: the metadata registration is a side effect of context initialization.
2. CLI processing. Calls sub_900130 (39 KB) to parse the accumulated CLI flags into structured forms: command buffer v58, emit-llvm-bc flag v52, architecture compute/SM numbers v55/v56, and file paths. On failure: "Error processing command line: <cmd>\n".
3. Include path setup. If an input file is present (v64), calls sub_C98ED0 to configure system and user include paths for the EDG frontend.
4. EDG frontend (lgenfe). Calls sub_905880 with timer name "CUDA C++ Front-End". This stage:
   - Allocates an 880-byte module object via sub_BA8740
   - Processes lgenfe CLI options from the options struct
   - In dryrun mode: skips execution, frees the module, returns null
   - On success: returns a module pointer and sets the output path
5. LibNVVM pipeline. If lgenfe succeeds (module pointer is non-null), calls sub_905EE0 with the module for the full optimization and codegen pipeline.
6. Time profiler output. After pipeline completion, checks sub_C96F30() for active profiling. If profiling is enabled, writes timing data to the output file via sub_C9C600. Failure emits: "Error: Failed to write time profiler data.\n".
7. Cleanup. Frees the timer (sub_B6E710), option strings, and option arrays.
EDG Frontend Stage — sub_905880
The lgenfe stage bridges the EDG 6.6 C++ frontend to LLVM IR generation. This is where CUDA C++ source code becomes NVVM IR.
| Field | Value |
|---|---|
| Address | 0x905880 |
| Size | ~6 KB |
| Timer label | "CUDA C++ Front-End" |
| Module size | 880 bytes (allocated by sub_BA8740) |
The function reconstructs a verbose command line for diagnostic output (quoting paths for --orig_src_file_name, --orig_src_path_name, --compiler_bindir, --sdk_dir), builds an argument array, and calls sub_908750(numArgs, argArray, opt_level) to create the LLVM module. On success, it copies the output path into the module at offset 21*8 and, if the keep flag is set via a3->byte[66], calls sub_905860 to write intermediate files.
The actual EDG parsing and IL-to-IR conversion happens inside sub_908750, which eventually calls sub_617BD0 — the lgenfe_main function documented in the EDG Frontend page.
EDG Module Binding — sub_908850
After the EDG frontend produces its IL tree, sub_908850 (10 KB) bridges the output to the LLVM backend. This function performs the critical step of configuring the LLVM module's data layout and target triple based on the target architecture.
Data layout strings are selected based on unk_4F06A68 (address space width):
| Width | p3 flag | Data layout string |
|---|---|---|
| 8 (64-bit) | unk_4D0461C set | "e-p:64:64:64-p3:32:32:32-i1:8:8-..." (167 chars) |
| 8 (64-bit) | Not set | "e-p:64:64:64-i1:8:8-..." (155 chars) |
| 4 (32-bit) | — | "e-p:32:32:32-i1:8:8-..." (155 chars) |
The p3:32:32:32 component enables 32-bit pointers in address space 3 (shared memory), which is critical for SM architectures where shared memory accesses use 32-bit addressing even in 64-bit compilation mode.
Target triple is set to "nvptx64-nvidia-cuda" for 64-bit or "nvptx-nvidia-cuda" for 32-bit. The function also:
- Creates a 496-byte target info structure via sub_AE3F70
- Iterates global function declarations, marking device functions for compilation via sub_91CA00
- Iterates global variables, processing initializers for device-side storage via sub_9172F0
- Runs LLVM module verification via sub_B89FE0 -- on failure: "there was an error in verifying the lgenfe output!"
- Stores the module globally at unk_4F6D2F8
LibNVVM Pipeline Driver — sub_905EE0
This 43 KB function is the core of Path A. It orchestrates the full compilation through 14 sequential phases, using an interesting indirection mechanism: rather than calling LibNVVM API functions directly, it resolves them at runtime through sub_12BC0F0(id) — a dispatch function that takes a numeric ID and returns a function pointer.
| Field | Value |
|---|---|
| Address | 0x905EE0 |
| Size | 43 KB (1,268 lines) |
| Timer | "LibNVVM" |
| Orchestrator | sub_902D10 (simple mode) |
14-Phase Compilation Flow
The compilation proceeds through these phases sequentially. Phases 2.1–2.14 are the core compilation unit lifecycle: create, populate, configure, compile, extract results, destroy. The -keep flag (when wizard mode is active) causes intermediate .lnk.bc and .opt.bc files to be written to disk, which is invaluable for debugging the pipeline.
| Phase | Action |
|---|---|
| 0 | Verbose command-line reconstruction |
| 1 | Input file handling — NVVM container (sub_9047E0) or regular bitcode |
| 2.1 | Pipeline infrastructure: verbose stream, address space query, module prep |
| 2.2 | nvvmCreateCU (dispatch ID 2151) |
| 2.3 | nvvmCUAddModule (dispatch ID 4660) |
| 2.4 | Check for -opt/-llc direct stage invocation |
| 2.5 | Parse and add additional modules from extra args |
| 2.6 | Libdevice linking: external file or embedded unk_3EA0080 (455,876 bytes) |
| 2.7 | Intermediate file path construction (.lnk.bc, .opt.bc) |
| 2.8 | Option parsing via sub_9685E0 → 4 stage vectors (lnk/opt/lto/llc) |
| 2.9 | Verbose stage logging + callback registration |
| 2.10 | Load 37 LLVM options from off_4B90FE0 via nvvmSetOptionStrings (ID 0xFEED) |
| 2.11 | nvvmCUCompile (ID 0xBEAD) with phase code 57069 (0xDEED) |
| 2.12 | Get compilation log/warnings |
| 2.13 | Get PTX output |
| 2.14 | nvvmDestroyCU (ID 21257 = 0x5309) |
| 3 | Error/warning output to caller |
| 4 | Write output file (text or binary detection via ELF magic) |
| 5 | Timer stop |
Input File Handling — Phase 1
Phase 1 has two sub-paths based on the a3->byte[65] container flag:
Path 1A: NVVM IR Container. When the input is an NVVM container (a binary format wrapping IR plus compilation options), sub_9047E0 (10 KB) parses it. The container format encodes the target SM version, FTZ mode, precision settings, and IEEE mode. The parser extracts these and converts them to LLVM CLI flags:
// Pseudo-code for container option extraction
push("-march=nvptx");
push("-mcpu=sm_" + str(container->sm_version / 10));
if (container->flags[200] & 0x20) push("-nvptx-f32ftz");
if (container->flags[200] & 0x80) push("-nvptx-prec-sqrtf32=1");
else push("-nvptx-prec-sqrtf32=0");
push(container->flags[204] ? "-nvvm-ieee-mode=S" : "-nvvm-ieee-mode=T");
if (container->mode == 2) push("--device-c"); // relocatable compilation
If parsing fails, the error message is "Invalid NVVM IR Container" (error code 259).
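The extraction logic above can be sketched as a runnable C fragment. The struct below is an assumed simplification that mirrors the decompiled offsets (flags[200], flags[204], mode), not the container's verified binary layout:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container view -- field positions follow the decompiled
 * pseudo-code above, not a verified on-disk layout. */
typedef struct {
    unsigned sm_version;      /* e.g. 750 for sm_75 */
    unsigned char flags[256];
    int mode;                 /* 2 == relocatable (--device-c) */
} nvvm_container;

/* Emit the LLVM CLI flags derived from the container, as in Path 1A.
 * Returns the number of options written into out[]. */
static int container_to_flags(const nvvm_container *c, char out[8][64]) {
    int n = 0;
    strcpy(out[n++], "-march=nvptx");
    snprintf(out[n++], 64, "-mcpu=sm_%u", c->sm_version / 10);
    if (c->flags[200] & 0x20) strcpy(out[n++], "-nvptx-f32ftz");
    strcpy(out[n++], (c->flags[200] & 0x80) ? "-nvptx-prec-sqrtf32=1"
                                            : "-nvptx-prec-sqrtf32=0");
    strcpy(out[n++], c->flags[204] ? "-nvvm-ieee-mode=S"
                                   : "-nvvm-ieee-mode=T");
    if (c->mode == 2) strcpy(out[n++], "--device-c");
    return n;
}
```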
Path 1B: Regular LLVM bitcode. For raw .bc files, the function creates a timer object, configures the SM architecture via sub_B6F950, opens the file via sub_C7EAD0, and parses it into an LLVM module via sub_A01950.
LibNVVM API Dispatch IDs
Internal function sub_12BC0F0(id) returns API function pointers by numeric ID. This indirection exists because the LibNVVM API is implemented within the same binary — these aren't dynamically-linked external functions but rather internal call points resolved through a dispatch table. The hex IDs double as a form of internal documentation:
| ID | Hex | Function |
|---|---|---|
| 2151 | 0x0867 | nvvmCreateCU |
| 4111 | 0x100F | nvvmGetCompiledResult |
| 4660 | 0x1234 | nvvmCUAddModule |
| 17185 | 0x4321 | nvvmCUSetExtraArgs |
| 21257 | 0x5309 | nvvmDestroyCU |
| 41856 | 0xA380 | nvvmGetCompiledResultSize (returns the log size) |
| 46903 | 0xB737 | nvvmGetCompiledResultLog |
| 46967 | 0xB777 | nvvmGetErrorString |
| 48813 | 0xBEAD | nvvmCUCompile |
| 48879 | 0xBEEF | Callback registrar |
| 61451 | 0xF00B | nvvmGetCompiledResultPTXSize |
| 62298 | 0xF37A | nvvmCUAddModuleFromBuffer |
| 65261 | 0xFEED | nvvmSetOptionStrings |
The complete dispatch table in sub_12BC0F0 contains 26 ID entries (25 distinct targets, since ID 2167 aliases 2151) implemented as a binary search tree on the ID value:
| ID | Hex | Target | Semantic Name |
|---|---|---|---|
| 2151 | 0x0867 | sub_12BB090 | nvvmCreateCU |
| 2167 | 0x0877 | sub_12BB090 | (alias) |
| 3911 | 0x0F47 | sub_12BBF40 | nvvmCUSetProgressCallback |
| 4111 | 0x100F | sub_12BA8F0 | nvvmGetCompiledResult |
| 4606 | 0x11FE | sub_12BA330 | nvvmCULinkModule |
| 4660 | 0x1234 | sub_12BC650 | nvvmCUAddModule |
| 8320 | 0x2080 | sub_12BB400 | nvvmCUSetOption |
| 11245 | 0x2BED | sub_12BB290 | nvvmCUGetLog |
| 17185 | 0x4321 | sub_12BBD80 | nvvmCUSetExtraArgs |
| 21257 | 0x5309 | sub_12B9C40 | nvvmDestroyCU |
| 23294 | 0x5AFE | sub_12BAF10 | nvvmVerify |
| 41856 | 0xA380 | sub_12BA220 | nvvmGetCompiledResultSize |
| 45242 | 0xB0BA | sub_12BAB40 | nvvmCUGetWarnings |
| 46903 | 0xB737 | sub_12BA7C0 | nvvmGetCompiledResultLog |
| 46967 | 0xB777 | sub_12B9980 | nvvmGetErrorString |
| 48813 | 0xBEAD | sub_12BA110 | nvvmCUCompile |
| 48879 | 0xBEEF | sub_12BACF0 | nvvmCURegisterCallback |
| 49522 | 0xC172 | sub_12BA470 | nvvmCUGetIR |
| 51966 | 0xCAFE | sub_12B9A50 | nvvmGetVersion |
| 56495 | 0xDCEF | sub_12B9A40 | (unknown) |
| 57005 | 0xDEAD | sub_12B9C00 | nvvmInit |
| 61451 | 0xF00B | sub_12BA560 | nvvmGetCompiledResultPTXSize |
| 61453 | 0xF00D | sub_12BA6A0 | nvvmCURegisterLNKCallback |
| 61806 | 0xF16E | sub_12BAA30 | nvvmCUGetOptIR |
| 62298 | 0xF37A | sub_12BC8B0 | nvvmCUAddModuleFromBuffer |
| 65261 | 0xFEED | sub_12B9AB0 | nvvmSetOptionStrings |
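The hardcoded comparison tree is equivalent to a binary search over a sorted ID table. The sketch below models it as data rather than code, mapping each ID to its recovered semantic name (handler addresses omitted); this is a reconstruction for illustration, not the binary's actual implementation:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Sorted (by ID) mirror of the sub_12BC0F0 dispatch table. */
typedef struct { unsigned id; const char *name; } dispatch_entry;

static const dispatch_entry kDispatch[] = {
    {2151, "nvvmCreateCU"},      {2167, "nvvmCreateCU"}, /* alias ID */
    {3911, "nvvmCUSetProgressCallback"}, {4111, "nvvmGetCompiledResult"},
    {4606, "nvvmCULinkModule"},  {4660, "nvvmCUAddModule"},
    {8320, "nvvmCUSetOption"},   {11245, "nvvmCUGetLog"},
    {17185, "nvvmCUSetExtraArgs"}, {21257, "nvvmDestroyCU"},
    {23294, "nvvmVerify"},       {41856, "nvvmGetCompiledResultSize"},
    {45242, "nvvmCUGetWarnings"}, {46903, "nvvmGetCompiledResultLog"},
    {46967, "nvvmGetErrorString"}, {48813, "nvvmCUCompile"},
    {48879, "nvvmCURegisterCallback"}, {49522, "nvvmCUGetIR"},
    {51966, "nvvmGetVersion"},   {56495, "(unknown)"},
    {57005, "nvvmInit"},         {61451, "nvvmGetCompiledResultPTXSize"},
    {61453, "nvvmCURegisterLNKCallback"}, {61806, "nvvmCUGetOptIR"},
    {62298, "nvvmCUAddModuleFromBuffer"}, {65261, "nvvmSetOptionStrings"},
};

static int cmp_id(const void *k, const void *e) {
    unsigned id = *(const unsigned *)k;
    unsigned eid = ((const dispatch_entry *)e)->id;
    return (id > eid) - (id < eid);
}

/* Binary search, equivalent to the hardcoded compare-and-branch tree;
 * returns NULL for an unknown ID, as sub_12BC0F0 does. */
static const char *nvvm_dispatch_name(unsigned id) {
    const dispatch_entry *e = bsearch(&id, kDispatch,
        sizeof kDispatch / sizeof kDispatch[0], sizeof kDispatch[0], cmp_id);
    return e ? e->name : NULL;
}
```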
Public LibNVVM API vs Internal CU API
The dispatch table above reveals a critical architectural detail: cicc's internal API uses compilation unit semantics (nvvmCreateCU, nvvmCUAddModule, nvvmCUCompile), while the public LibNVVM shared library (libnvvm.so) exports a different API surface using program semantics (nvvmCreateProgram, nvvmAddModuleToProgram, nvvmCompileProgram). The public API is documented in NVIDIA's nvvm.h header; the internal API exists only within cicc and is never exported.
Evidence for this mapping comes from nvlink's -dlto code path, which dynamically loads libnvvm.so via dlsym() and resolves symbols by their public names:
// nvlink sub_4BC290 — loads libnvvm.so for device LTO
dlsym(handle, "nvvmCreateProgram"); // → internally nvvmCreateCU
dlsym(handle, "nvvmCompileProgram"); // → internally nvvmCUCompile
dlsym(handle, "nvvmGetCompiledResultSize");
dlsym(handle, "nvvmGetCompiledResult");
dlsym(handle, "nvvmDestroyProgram"); // → internally nvvmDestroyCU
The complete mapping between the public libnvvm.so API (as used by external callers like nvlink and user programs) and cicc's internal CU dispatch IDs:
| Public API (libnvvm.so) | Internal Name | Dispatch ID | Hex | Target |
|---|---|---|---|---|
| nvvmCreateProgram | nvvmCreateCU | 2151 | 0x0867 | sub_12BB090 |
| nvvmAddModuleToProgram | nvvmCUAddModule | 4660 | 0x1234 | sub_12BC650 |
| nvvmLazyAddModuleToProgram | nvvmCUAddModuleFromBuffer | 62298 | 0xF37A | sub_12BC8B0 |
| nvvmCompileProgram | nvvmCUCompile | 48813 | 0xBEAD | sub_12BA110 |
| nvvmVerifyProgram | nvvmVerify | 23294 | 0x5AFE | sub_12BAF10 |
| nvvmGetCompiledResultSize | nvvmGetCompiledResultPTXSize | 61451 | 0xF00B | sub_12BA560 |
| nvvmGetCompiledResult | nvvmGetCompiledResult | 4111 | 0x100F | sub_12BA8F0 |
| nvvmGetProgramLogSize | nvvmGetCompiledResultSize | 41856 | 0xA380 | sub_12BA220 |
| nvvmGetProgramLog | nvvmGetCompiledResultLog | 46903 | 0xB737 | sub_12BA7C0 |
| nvvmDestroyProgram | nvvmDestroyCU | 21257 | 0x5309 | sub_12B9C40 |
Note the naming confusion in the internal API: nvvmGetCompiledResultSize (ID 0xA380) returns the log size, while nvvmGetCompiledResultPTXSize (ID 0xF00B) returns the actual PTX output size. The public API resolves this with clearer names (nvvmGetProgramLogSize vs nvvmGetCompiledResultSize).
The internal-only API entries have no public equivalents:
| Internal Name | Dispatch ID | Hex | Target | Purpose |
|---|---|---|---|---|
| nvvmInit | 57005 | 0xDEAD | sub_12B9C00 | One-time initialization of LLVM infrastructure |
| nvvmGetVersion | 51966 | 0xCAFE | sub_12B9A50 | Returns internal NVVM version tuple |
| nvvmGetErrorString | 46967 | 0xB777 | sub_12B9980 | Maps nvvmResult code to human-readable string |
| nvvmSetOptionStrings | 65261 | 0xFEED | sub_12B9AB0 | Bulk-loads LLVM CLI option table (37 entries) |
| nvvmCUSetExtraArgs | 17185 | 0x4321 | sub_12BBD80 | Passes additional argc/argv to compilation |
| nvvmCUSetOption | 8320 | 0x2080 | sub_12BB400 | Sets a single compilation option |
| nvvmCUSetProgressCallback | 3911 | 0x0F47 | sub_12BBF40 | Registers progress/cancellation callback |
| nvvmCURegisterCallback | 48879 | 0xBEEF | sub_12BACF0 | Registers stage-boundary callback (verbose output) |
| nvvmCURegisterLNKCallback | 61453 | 0xF00D | sub_12BA6A0 | Registers LNK-stage-specific callback |
| nvvmCUGetLog | 11245 | 0x2BED | sub_12BB290 | Alternative log retrieval interface |
| nvvmCUGetWarnings | 45242 | 0xB0BA | sub_12BAB40 | Retrieves warning-only messages |
| nvvmCUGetIR | 49522 | 0xC172 | sub_12BA470 | Retrieves intermediate LLVM IR after linking |
| nvvmCUGetOptIR | 61806 | 0xF16E | sub_12BAA30 | Retrieves optimized IR (post-OPT stage); also used by -irversion |
| nvvmCULinkModule | 4606 | 0x11FE | sub_12BA330 | Explicit module linking (separate from add-then-compile) |
| (unknown) | 56495 | 0xDCEF | sub_12B9A40 | Unknown (its target immediately precedes nvvmGetVersion's sub_12B9A50) |
| (alias) | 2167 | 0x0877 | sub_12BB090 | Alias for nvvmCreateCU (same target, different ID) |
The nvvmCUGetOptIR function at sub_12BAA30 serves double duty: it is both the post-optimization IR retrieval API and the target of sub_12BC0E0 (a thunk called from sub_8F9C90 for the -irversion flag). When the user passes -irversion, the real main calls sub_12BC0E0 which dispatches to sub_12BAA30, which returns the IR version tuple as major * 100 + minor. This value is printed to stdout and the process exits immediately.
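The tuple-to-integer collapse used for -irversion can be written out as a minimal sketch, assuming only the major * 100 + minor encoding described above:

```c
#include <assert.h>

/* -irversion reporting: the version tuple returned through sub_12BAA30 is
 * collapsed to a single integer as major * 100 + minor before printing. */
static int encode_ir_version(int major, int minor) {
    return major * 100 + minor;
}

/* Inverse, for reading the printed value back into a tuple. */
static void decode_ir_version(int code, int *major, int *minor) {
    *major = code / 100;
    *minor = code % 100;
}
```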
The sub_12BC0F0 Dispatch Mechanism
sub_12BC0F0 is a ~3 KB function at 0x12BC0F0 that implements a binary search tree over the 26 dispatch IDs. The function takes a single unsigned int argument (the ID) and returns a function pointer (void*). The tree is hardcoded as a series of comparison-and-branch instructions, not as a data-driven lookup table.
// Pseudocode for sub_12BC0F0(unsigned int id)
void* nvvm_dispatch(unsigned int id) {
// Binary search over 26 IDs
if (id < 17185) {
if (id < 4660) {
if (id == 2151 || id == 2167) return sub_12BB090;
if (id == 3911) return sub_12BBF40;
if (id == 4111) return sub_12BA8F0;
if (id == 4606) return sub_12BA330;
} else {
if (id == 4660) return sub_12BC650;
if (id == 8320) return sub_12BB400;
if (id == 11245) return sub_12BB290;
}
} else {
// ... upper half of the tree
if (id == 48813) return sub_12BA110; // 0xBEAD
if (id == 65261) return sub_12B9AB0; // 0xFEED
// etc.
}
return NULL; // unknown ID
}
The hex IDs are deliberately memorable patterns used as a form of internal documentation: 0xDEAD = init, 0xBEAD = compile, 0xBEEF = callback, 0xCAFE = version, 0xFEED = options, 0xF00D = LNK callback, 0xF00B = result size. The secondary ID 0x0877 (2167) is an alias for 0x0867 (2151) and dispatches to the same sub_12BB090 target, suggesting an internal API version migration where both old and new IDs must remain functional.
Dual-Path Initialization
The two compilation paths (Path A and Path B) use independent initialization sequences, creating a dual-path initialization architecture where the same underlying LLVM infrastructure is bootstrapped through different entry points. This is why two copies of libdevice, two LLVM options tables, and two sets of verbose callbacks exist.
Path A initialization (EDG → LibNVVM):
sub_B6EEA0 — Creates LLVMContext + registers 42+ metadata kinds
(dbg=1, tbaa=2, prof=3, ... noalias.addrspace=42)
sub_900130 — 39 KB CLI parser for Path A flags
sub_905880 — EDG frontend produces LLVM module (880-byte object)
sub_908850 — Binds module to target: data layout, triple, verification
→ sub_905EE0 enters LibNVVM pipeline with module
Path B initialization (Standalone):
sub_1602D10 — Creates standalone LLVMContext (no EDG metadata assumptions)
sub_125FB30 — 8 KB CLI parser for Path B flags
sub_1265340 — Pre-compilation setup (configure output path, timer)
→ sub_1265970 enters LibNVVM pipeline with bitcode input
The version resolver sub_12B9F70 at 0x12B9F70 is shared between both paths and determines which NVVM IR compatibility mode to use. It reads two obfuscated environment variables in sequence:
// Pseudocode for sub_12B9F70(unsigned int sm_version)
int nvvm_version_resolve(unsigned int sm_version) {
// Try NV_NVVM_VERSION first (decrypted from 0x3C23A90)
char *env = getenv(decrypt("NV_NVVM_VERSION"));
if (!env) {
// Fallback: try LIBNVVM_NVVM_VERSION (decrypted from 0x42812F0)
env = getenv(decrypt("LIBNVVM_NVVM_VERSION"));
}
if (env) {
if (strcmp(env, "nvvm70") == 0) return 0; // Path B mode
if (strcmp(env, "nvvm-latest") == 0) return 1; // Path A mode
}
// Default: SM >= 100 uses Path B, SM < 100 uses Path A
return (sm_version > 99) ? 0 : 1;
}
This function is called from both sub_8F9C90 (the real main, for v253 resolution) and sub_12BB580 (inside the LibNVVM compilation unit initialization). The dual call-site ensures that the version mode is consistent regardless of whether the compiler was invoked via CLI or via the LibNVVM API.
The nvvmInit function (ID 0xDEAD, sub_12B9C00) performs one-time LLVM infrastructure initialization. It is called implicitly during nvvmCreateCU (sub_12BB090) via a pthread_once guard at dword_4F92D9C. The initialization includes:
- Registering LLVM target triples (nvptx64-nvidia-cuda, nvptx-nvidia-cuda)
- Initializing the NVPTX target machine factory
- Setting up the LLVM pass registry
- Configuring thread-safety based on LIBNVVM_DISABLE_CONCURRENT_API (byte_4F92D70)
When byte_4F92D70 == 1 (concurrent API disabled), the pipeline operates in single-threaded mode — no pthread_mutex locks are acquired around compilation unit operations, and Phase II concurrent optimization is disabled regardless of the module's function count.
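The implicit-initialization pattern (nvvmCreateCU triggering nvvmInit exactly once through a pthread_once guard) can be sketched as follows, with a counter standing in for the real LLVM bootstrapping:

```c
#include <assert.h>
#include <pthread.h>

/* Sketch of the implicit one-time init: nvvmCreateCU runs nvvmInit through
 * a pthread_once guard (dword_4F92D9C in the binary), so the LLVM
 * infrastructure is bootstrapped exactly once no matter how many
 * compilation units are created. */
static pthread_once_t init_guard = PTHREAD_ONCE_INIT;
static int init_count = 0;   /* stands in for the real LLVM bootstrapping */

static void nvvm_init_once(void) { init_count++; }

static int nvvm_create_cu(void **handle) {
    pthread_once(&init_guard, nvvm_init_once);
    *handle = &init_count;   /* placeholder handle */
    return 0;                /* NVVM_SUCCESS */
}
```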
Internal API Usage Sequence
The complete sequence of dispatch table calls during a standard Path A compilation (from sub_905EE0):
1. sub_12BC0F0(2151) → nvvmCreateCU(&handle)
Creates compilation unit. Calls nvvmInit via pthread_once on first use.
2. sub_12BC0F0(46967) → nvvmGetErrorString
Saved for later error message formatting.
3. sub_12BC0F0(4660) → nvvmCUAddModule(handle, IR_data, IR_size, NULL)
Adds the user's LLVM bitcode module.
4. sub_12BC0F0(21257) → nvvmDestroyCU
Saved as cleanup function pointer (not called yet).
5. sub_12BCB00 [thunk] → nvvmCUAddModuleFromBuffer(handle, buf, size, NULL)
Called N times: once per additional module from extra args,
once for libdevice (embedded or external).
6. sub_12BC0F0(48879) → nvvmCURegisterCallback
Registers verbose stage callbacks:
sub_903BA0 with ID 61453 (LNK stage)
sub_903730 with ID 47710 (LLC stage)
When -keep mode active, also registers:
sub_9085A0 with ID 64222 (OPT output → .opt.bc file)
sub_908220 with ID 56993 (LLC output → final file)
7. sub_12BC0F0(65261) → nvvmSetOptionStrings(opts_table, 37)
Loads 37 LLVM backend configuration strings from off_4B90FE0.
Calls sub_1C31130() internally to register/reset LLVM options.
8. sub_12BC0F0(48813) → nvvmCUCompile(handle, 57069)
Main compilation. Phase code 57069 (0xDEED) triggers full
LNK → OPT → [OPTIXIR] → LLC pipeline in sub_12C35D0.
9. sub_12BC0F0(17185) → nvvmCUSetExtraArgs(handle, argc, argv)
Passes additional arguments collected from the CLI.
10. sub_12BC0F0(41856) → nvvmGetCompiledResultSize(handle, &log_size)
Queries the compilation log size.
11. sub_12BC0F0(46903) → nvvmGetCompiledResultLog(handle, log_buf)
Retrieves the compilation log (warnings/errors).
12. sub_12BC0F0(61451) → nvvmGetCompiledResultPTXSize(handle, &ptx_size)
Queries the PTX output size.
13. sub_12BC0F0(4111) → nvvmGetCompiledResult(handle, ptx_buf)
Copies the generated PTX into the caller's buffer.
14. sub_12BC0F0(21257) → nvvmDestroyCU(&handle)
Destroys the compilation unit, frees all internal resources.
Path B (sub_1265970) follows the identical sequence but uses off_4C6EEE0 for the options table (step 7), unk_420FD80 for the embedded libdevice (step 5), and appends "-nvvm-version=nvvm70" instead of "-nvvm-version=nvvm-latest" to the pipeline arguments.
nvvmResult Error Codes
The nvvmGetErrorString function (ID 0xB777, sub_12B9980) maps integer result codes from all API functions to descriptive strings:
| Code | Constant | Description |
|---|---|---|
| 0 | NVVM_SUCCESS | Operation completed successfully |
| 1 | NVVM_ERROR_OUT_OF_MEMORY | Memory allocation failed |
| 2 | NVVM_ERROR_PROGRAM_CREATION_FAILURE | Failed to create compilation unit |
| 3 | NVVM_ERROR_IR_VERSION_MISMATCH | Incompatible NVVM IR version detected |
| 4 | NVVM_ERROR_INVALID_INPUT | Malformed input (bad bitcode, wrong magic) |
| 5 | NVVM_ERROR_INVALID_PROGRAM | Null or invalid compilation unit handle |
| 6 | NVVM_ERROR_INVALID_IR | IR failed verification |
| 7 | NVVM_ERROR_INVALID_OPTION | Unrecognized compilation option |
| 8 | NVVM_ERROR_NO_MODULE_IN_PROGRAM | Compilation unit has no modules added |
| 9 | NVVM_ERROR_COMPILATION | Compilation failed (linker, optimizer, or codegen error) |
| 10 | NVVM_ERROR_CANCELLED | Compilation cancelled by user callback |
The pipeline orchestrator sub_12C35D0 maps its internal return codes to these: 0 → NVVM_SUCCESS, 7 → NVVM_ERROR_INVALID_OPTION, 9 → NVVM_ERROR_COMPILATION, 10 → NVVM_ERROR_CANCELLED, 100 → NVVM_ERROR_COMPILATION (post-pipeline verification failure).
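That return-code mapping can be written out as a runnable sketch. The nvvmResult names follow the table above; folding unknown internal codes into NVVM_ERROR_COMPILATION is an assumption for the default case:

```c
#include <assert.h>

enum nvvmResult {
    NVVM_SUCCESS = 0, NVVM_ERROR_OUT_OF_MEMORY = 1,
    NVVM_ERROR_PROGRAM_CREATION_FAILURE = 2, NVVM_ERROR_IR_VERSION_MISMATCH = 3,
    NVVM_ERROR_INVALID_INPUT = 4, NVVM_ERROR_INVALID_PROGRAM = 5,
    NVVM_ERROR_INVALID_IR = 6, NVVM_ERROR_INVALID_OPTION = 7,
    NVVM_ERROR_NO_MODULE_IN_PROGRAM = 8, NVVM_ERROR_COMPILATION = 9,
    NVVM_ERROR_CANCELLED = 10,
};

/* sub_12C35D0's internal-code -> nvvmResult mapping, as recovered above.
 * 100 (post-pipeline verification failure) folds into NVVM_ERROR_COMPILATION. */
static enum nvvmResult map_pipeline_result(int internal) {
    switch (internal) {
        case 0:   return NVVM_SUCCESS;
        case 7:   return NVVM_ERROR_INVALID_OPTION;
        case 9:   return NVVM_ERROR_COMPILATION;
        case 10:  return NVVM_ERROR_CANCELLED;
        case 100: return NVVM_ERROR_COMPILATION;
        default:  return NVVM_ERROR_COMPILATION; /* assumption: unknown -> generic */
    }
}
```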
37 LLVM Options from off_4B90FE0
Phase 2.10 loads a hardcoded table of 37 LLVM option strings from off_4B90FE0 (296 bytes = 37 pointers). These are static, compiled-in LLVM backend configuration flags that are injected into every compilation unit via nvvmSetOptionStrings (ID 0xFEED). The options include target architecture flags (-march=nvptx64, -mcpu=sm_XX), math precision controls (-nvptx-f32ftz, -nvptx-prec-sqrtf32=), optimization levels, debug info flags, and NVPTX-specific feature knobs. The sub_12B9AB0 target function calls sub_1C31130() -- the LLVM option registration/reset function -- to apply them.
Embedded Libdevice
A notable implementation artifact: two identical copies of the libdevice bitcode are statically embedded in the binary. Each is 455,876 bytes (~445 KB) of LLVM bitcode containing more than 400 math functions (__nv_sin, __nv_cos, __nv_exp, __nv_log, __nv_sqrt, etc.) plus atomic operation helpers and FP16/BF16 conversion routines. The duplication exists because Path A and Path B have separate initialization sequences and the linker did not deduplicate the .rodata sections.
When the user provides -nvvmir-library <path>, the external file is used instead. This allows overriding the built-in math library — useful for testing custom libdevice builds.
| Path | Address | Size | Purpose |
|---|---|---|---|
| Path A | unk_3EA0080 | 455,876 bytes | Default libdevice for LibNVVM mode |
| Path B | unk_420FD80 | 455,876 bytes | Default libdevice for standalone mode |
Verbose Callbacks and Intermediate Files
Phase 2.9 registers callback functions that fire at pipeline stage boundaries. When verbose mode is active, these callbacks produce reconstructed command-line output for each stage:
[ "<src>" -lnk -nvvmir-library "<path>" "<input>" -o "<file>.lnk.bc" <opts> -nvvm-version=nvvm-latest ]
[ "<src>" -llc "<llc_path>" -o "<output>" <opts> -nvvm-version=nvvm-latest ]
The callback registration uses sub_12BC0F0(48879) (ID 0xBEEF = nvvmCURegisterCallback) with stage-specific callback IDs:
| Callback | ID | Stage |
|---|---|---|
| sub_903BA0 | 61453 | LNK stage output |
| sub_903730 | 47710 | LLC stage output |
| sub_9085A0 | 64222 | OPT output (keep mode) |
| sub_908220 | 56993 | LLC output (keep mode) |
Intermediate file paths (.lnk.bc for linked-but-unoptimized, .opt.bc for optimized-but-not-yet-codegen'd) are always constructed as strings, but the actual files are only written to disk when the -keep flag is active in wizard mode.
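A sketch of the suffixing scheme, assuming the stage suffix is simply appended to the output stem (the exact base-name derivation inside cicc is unverified):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Phase 2.7 builds the intermediate names as <stem>.<stage>.bc; the strings
 * are always constructed, but written to disk only in -keep wizard mode. */
static void make_stage_path(const char *stem, const char *stage,
                            char *out, size_t cap) {
    snprintf(out, cap, "%s.%s.bc", stem, stage);
}
```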
Path A Error Messages
All errors from sub_905EE0 are written to stderr via sub_223E0D0. Error categories:
| Category | Prefix | Example |
|---|---|---|
| File I/O | "<src>: " | "error in open <file>", "input file <f> read error" |
| LibNVVM API | "libnvvm: error: " | "failed to create the libnvvm compilation unit" |
| Output | "<src>: " | "IO error: <system_error_msg>" |
| Fatal | (none) | "basic_string::append" (std::string overflow at 0x3FFFFFFFFFFFFFFF) |
The error code from LibNVVM API calls maps to nvvmResult: 0 = success, 1 = out of memory, 4 = invalid input, 5 = invalid compilation unit (null handle).
Path B — Standalone cicc Pipeline (sub_1265970)
Path B is the standalone compilation path used when cicc is invoked with LLVM bitcode input (.bc files), by the LibNVVM API directly, or as the default for SM >= 100 architectures. Despite the different entry point, it shares the same underlying LLVM infrastructure as Path A — the difference is in how modules are loaded and how the pipeline stages are orchestrated. Path B appends -nvvm-version=nvvm70 to the optimizer arguments, indicating it targets the NVVM 7.0 IR specification (corresponding to LLVM 7.0.1 bitcode format, the version NVIDIA froze their IR compatibility at).
The 4-stage pipeline (LNK → OPT → OPTIXIR → LLC) runs in-memory: each stage takes an LLVM Module, transforms it, and passes it to the next stage. The OPTIXIR stage is optional and only active when --emit-optix-ir is specified. A user-provided cancellation callback can abort compilation between stages (return code 10).
| Field | Value |
|---|---|
| Address | 0x1265970 |
| Size | ~48 KB (1,371 lines) |
| Timer | "LibNVVM" (same name as Path A) |
| Version string | -nvvm-version=nvvm70 |
Path B Entry — sub_1262860
sub_1262860 (418 lines) is the command-line entry point for Path B, analogous to sub_902D10 for Path A. It parses CLI flags, initializes the compilation context, and calls sub_1265970 for the actual compilation.
| Field | Value |
|---|---|
| Address | 0x1262860 |
| Timer init | sub_1602D10 (standalone context, contrasted with Path A's sub_B6EEA0) |
| CLI parser | sub_125FB30 (Path B's equivalent of Path A's sub_900130) |
The flow is: allocate timer handle → parse CLI via sub_125FB30 → configure output path → call sub_1265340 for pre-compilation setup → call sub_1265970 for compilation → write output. Output can go to stdout if the output path is "-", handled by sub_125C500. On failure: "\n Error processing command line: <details>".
Path B Compilation Orchestrator — sub_1265970
This 48 KB function mirrors sub_905EE0's role but with Path B's initialization and context. It handles both LibNVVM API invocations (when a11 = 1) and CLI invocations (when a11 = 0), with the same 14-phase structure as Path A but using Path B's context objects and the nvvm70 version string.
Key behavioral differences from Path A:
- Context initialization. Path B uses sub_1602D10 for context init (rather than sub_B6EEA0), which creates a standalone LLVM context without the EDG frontend's metadata registration assumptions.
- NVVM IR container handling. Container parsing is performed by sub_12642A0 (Path B's container parser) rather than sub_9047E0.
- Embedded libdevice address. Uses unk_420FD80 (the second copy) rather than unk_3EA0080.
- LLVM options table. Loads 37 options from off_4C6EEE0 (Path B's copy) rather than off_4B90FE0.
- Verbose callbacks. Registers sub_1263280 (ID 61453) and sub_12636E0 (ID 47710) for LNK and LLC stage output respectively, and sub_1268040/sub_1267CC0 for keep-mode output.
- Version string. Always appends "-nvvm-version=nvvm70" rather than "-nvvm-version=nvvm-latest".
4-Stage Pipeline Orchestrator — sub_12C35D0
The orchestrator creates two backend objects — nvopt (512 bytes, the optimizer) and nvllc (480 bytes, the code generator) — and wires them together with the stage dispatch structure. Each stage is controlled by a bit in a stage bitmask derived from sub_12D2AA0, which parses architecture and options into per-stage configuration.
| Field | Value |
|---|---|
| Address | 0x12C35D0 |
| Size | 41 KB (1,446 lines) |
| Backend objects | nvopt (512 bytes) + nvllc (480 bytes) |
| Stage | Bit | Timer String | Core Function |
|---|---|---|---|
| LNK | 0x01 | "LNK" / "LibNVVM module linking step." | sub_12C06E0 (63 KB, module linker) |
| OPT | 0x80 | "OPT" / "LibNVVM optimization step." | sub_12E7E70 (full LLVM pipeline) |
| OPTIXIR | 0x40 | "OPTIXIR" / "LibNVVM Optix IR step." | sub_12F9270 (OptiX IR gen) |
| LLC | 0x04 | "LLC" / "LibNVVM code-generation step." | sub_12F5100 (SelectionDAG codegen) |
Pipeline stage bitmask (from sub_12D2AA0): bit 0=LNK, bit 2=LLC, bit 5=verify, bit 6=OPTIXIR, bit 7=OPT.
Return codes: 0=success, 7=parse failure, 9=link/layout/verification error, 10=cancelled, 100=post-pipeline verification failure.
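The bitmask decoding can be sketched as follows; the bit assignments and the pipeline ordering (LNK → OPT → OPTIXIR → LLC) follow the stage table above:

```c
#include <assert.h>
#include <string.h>

/* Stage bits recovered from sub_12D2AA0's configuration output. */
enum {
    STAGE_LNK     = 1 << 0,  /* 0x01 */
    STAGE_LLC     = 1 << 2,  /* 0x04 */
    STAGE_VERIFY  = 1 << 5,  /* 0x20 */
    STAGE_OPTIXIR = 1 << 6,  /* 0x40 */
    STAGE_OPT     = 1 << 7,  /* 0x80 */
};

/* Render the enabled stages in pipeline order; returns how many are active. */
static int list_stages(unsigned mask, const char *out[4]) {
    int n = 0;
    if (mask & STAGE_LNK)     out[n++] = "LNK";
    if (mask & STAGE_OPT)     out[n++] = "OPT";
    if (mask & STAGE_OPTIXIR) out[n++] = "OPTIXIR";
    if (mask & STAGE_LLC)     out[n++] = "LLC";
    return n;
}
```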
Backend Object Initialization
The orchestrator allocates and initializes two backend objects with distinct vtables:
// nvllc — code generator backend (480 bytes)
v8 = sub_22077B0(480);
sub_12EC960(v8, "nvllc", 5);
v8->vtable = &unk_49E7FF0;
// nvopt — optimizer backend (512 bytes)
v10 = sub_22077B0(512);
sub_12EC960(v10, "nvopt", 5);
v10->vtable = &unk_49E6A58;
v10->sub_vtable = &unk_49E6B20; // at offset +60*8
v10->plugin_slots[0..2] = 0; // offsets 61-63 cleared
A stage dispatch structure (vtable &unk_49E6B38) links the OPT output to the LLC input and stores the cancellation callback pointer.
Cancellation Callback
Between every pipeline stage, the orchestrator checks an optional user-provided cancellation callback stored at state[26]:
cancellation_fn = state[26];
if (cancellation_fn && cancellation_fn(state[27], 0))
return 10; // CANCELLED
This mechanism allows the LibNVVM API caller to abort a long-running compilation. Return code 10 propagates up through the entire call chain, causing sub_8F9C90 to return 10 as the process exit code.
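A minimal sketch of the inter-stage check, assuming the decompiled layout above (in the binary the callback lives at state[26] and its user-data pointer at state[27]; the names and function shapes here are ours):

```c
#include <stddef.h>

#define RC_CANCELLED 10  /* orchestrator return code for a cancelled compile */

typedef int (*cancel_fn)(void *user_data, int reserved);

/* Check the optional cancellation callback between pipeline stages. */
int check_cancel(cancel_fn fn, void *user_data) {
    if (fn && fn(user_data, 0))
        return RC_CANCELLED;   /* abort: 10 propagates up the call chain */
    return 0;                  /* continue with the next stage */
}

/* Example callback that always requests cancellation. */
int always_cancel(void *user_data, int reserved) {
    (void)user_data; (void)reserved;
    return 1;
}
```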
Two-Phase Optimization (OPT Stage)
The OPT stage calls sub_12E7E70, which implements a two-phase optimization protocol. Both phases call the same underlying pipeline function sub_12E54A0, but a TLS variable qword_4FBB3B0 is set to 1 or 2 to indicate which phase is active:
| Phase | TLS value | Purpose |
|---|---|---|
| Phase I | 1 | Analysis + early IR optimization (module-level, CGSCC, function passes) |
| Phase II | 2 | Backend optimization + codegen preparation (lowering, legalization) |
| Complete | 3 | Compilation finished for this module |
Between phases, sub_12D4250 checks concurrency eligibility: if the module contains more than one defined function (non-declaration), and the options permit it, Phase II can run with multiple threads. Thread count is determined from opts[1026] or falls back to get_nprocs(). When concurrency is enabled, sub_12E7B90 is the concurrent worker entry point.
For single-function modules, the optimizer skips the two-phase protocol entirely and runs a single un-phased call to sub_12E54A0 -- no phase counter is set, and the optimizer executes both analysis and backend passes in one invocation.
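The thread-count decision described above can be sketched as follows. The opts[1026] slot and the get_nprocs() fallback are from the decompilation; the function shape and the exact "explicit value wins" ordering are our assumptions:

```c
#include <sys/sysinfo.h>  /* get_nprocs (glibc) */

/* Sketch of Phase II thread-count selection. */
int phase2_thread_count(int opt_value, int defined_functions) {
    if (defined_functions <= 1)
        return 1;             /* single-function modules stay serial */
    if (opt_value > 0)
        return opt_value;     /* explicit thread count from opts[1026] */
    return get_nprocs();      /* fall back to the online CPU count */
}
```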
Data Layout Validation
After the LLC stage but before returning, the orchestrator validates the module's data layout string. If the module has no data layout:
"DataLayoutError: Data Layout string is empty"
→ return 9
On layout mismatch, it produces a detailed diagnostic:
"<error details>\nExample valid data layout:\n64-bit: <reference_layout>"
The reference layout string is loaded from off_4CD4948[0].
Module Linker — sub_12C06E0
The LNK stage's core function (63KB) links multiple LLVM bitcode modules into a single module. This is where user code gets linked with the libdevice math library and any additional modules. The linker performs several validation steps to catch incompatible IR early — before the expensive optimization and codegen stages:
- Bitcode magic validation: checks for `0x42,0x43,0xC0,0xDE` (raw LLVM bitcode, "BC\xC0\xDE") or `0xDE,0xC0,0x17,0x0B` (bitcode wrapper header, 0x0B17C0DE little-endian). Anything else → error code 9.
- Triple validation: every module's target triple must start with `nvptx64-`. Modules without a triple get a clear error: `"Module does not contain a triple, should be 'nvptx64-'"`.
- IR version compatibility: `sub_12BFF60` reads `"nvvmir.version"` metadata (2- or 4-element tuples: major.minor or major.minor.debug_major.debug_minor). The `NVVM_IR_VER_CHK` environment variable can disable this check entirely (set to `"0"`), useful when mixing IR from different CUDA toolkit versions.
- Symbol size matching: for multi-module linking, compares the byte sizes of identically-named globals across modules. Size computation uses type codes (1=half(16b), 2=float(32b), 3=double(64b), 7=ptr, 0xB=integer, 0xD=struct, 0xE=array). A mismatch produces: `"Size does not match for <sym> in <mod> with size X specified in <other> with size Y."`
Single-module fast path: When only one module is present (after adding user code and libdevice), the linker returns it directly via sub_1C3DFC0 without invoking the full linking machinery.
Multi-module linking: For N > 1 modules, the linker copies the primary module's target triple to all secondary modules, then calls sub_12F5610 to perform the LLVM link. After user modules are linked, builtin modules (from a1[3..4]) are linked via sub_1CCEBE0, followed by target feature configuration via sub_1CB9110 and sub_1619140.
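The magic-byte check can be sketched as below. The two 4-byte sequences are from the decompilation; in LLVM's own terms, "BC\xC0\xDE" is the raw bitcode magic and 0xDE 0xC0 0x17 0x0B is the wrapper header magic (0x0B17C0DE stored little-endian). The function name is ours:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if the buffer begins with a recognized bitcode magic;
   in the binary, anything else yields error code 9. */
int bitcode_magic_ok(const uint8_t *buf, size_t len) {
    static const uint8_t raw[4]     = { 0x42, 0x43, 0xC0, 0xDE };
    static const uint8_t wrapper[4] = { 0xDE, 0xC0, 0x17, 0x0B };
    if (len < 4)
        return 0;
    return memcmp(buf, raw, 4) == 0 || memcmp(buf, wrapper, 4) == 0;
}
```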
NVVM IR Version Checker — sub_12BFF60
The version checker reads "nvvmir.version" named metadata and validates it against the compiler's expected version range.
| Field | Value |
|---|---|
| Address | 0x12BFF60 |
| Size | ~9 KB (362 lines) |
| Metadata key | "nvvmir.version" |
| Debug metadata | "llvm.dbg.cu" |
Version tuples come in two forms:
- 2-element: `(major, minor)` — IR version only. Special case: `(2, 0)` always passes.
- 4-element: `(major, minor, debug_major, debug_minor)` — IR version plus debug info version. Special case: `debug_major == 3, debug_minor <= 2` always passes.
The NVVM_IR_VER_CHK environment variable is checked multiple times throughout the validation. When set to "0", all version checks are bypassed, returning 0 (compatible). This is a critical escape hatch for mixing bitcode from different CUDA toolkit versions.
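The two recovered always-pass special cases can be modeled as below; the full range comparison against the compiler's expected version window is omitted, and the function names are ours:

```c
/* (2, 0) IR version tuples always pass validation. */
int ir_version_always_passes(int major, int minor) {
    return major == 2 && minor == 0;
}

/* Debug info versions 3.0 through 3.2 always pass validation. */
int dbg_version_always_passes(int dbg_major, int dbg_minor) {
    return dbg_major == 3 && dbg_minor <= 2;
}
```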
Memory Management
jemalloc — The Global Allocator
cicc statically links a jemalloc 5.x allocator in the address range 0x12FC000–0x131FFFF (~400 functions). This replaces the system malloc/free entirely. The jemalloc configuration parser (sub_12FCDB0, 131,600 bytes -- the largest single function in this range) handles the MALLOC_CONF environment variable and /etc/malloc.conf symlink, supporting dozens of tuning options: abort, cache_oblivious, metadata_thp, trust_madvise, retain, dss, tcache, narenas, percpu_arena, background_thread, san_guard_small, san_guard_large, and more.
The choice of jemalloc over glibc's allocator is significant for compiler workloads. jemalloc's thread-local caching (tcache) and arena-per-CPU design (percpu_arena) reduce contention during the concurrent Phase II optimization, where multiple threads may be simultaneously allocating and freeing IR nodes, instruction objects, and analysis results.
The jemalloc stats subsystem (functions at 0x400000–0x42FFFF) provides comprehensive per-arena statistics including allocation counts, active/dirty/muzzy page tracking, mutex contention metrics, and HPA hugify counts. These can be triggered via MALLOC_CONF="stats_print:true".
EDG Memory Regions — sub_822260
The EDG 6.6 frontend uses a custom memory region system configured with USE_MMAP_FOR_MEMORY_REGIONS = 1. During post-parse validation in sub_617BD0 (lgenfe_main), sub_822260() is called 11 times to initialize memory regions 1 through 11. These regions serve as arena-style allocators for different categories of EDG internal data:
- Token buffers (preprocessor token storage)
- IL node pools (intermediate language tree nodes)
- Symbol tables (name→declaration mappings)
- Type representations (structural type information)
The mmap-backed regions grow by mapping additional pages on demand, avoiding the fragmentation problems that would occur with individual malloc calls for the millions of small, short-lived objects the frontend creates during parsing. Region cleanup happens in bulk when the frontend completes -- all pages for a region are unmapped at once rather than individually freed.
The EDG heap allocator cluster at 0x821000–0x823FFF includes tracked allocation (sub_822B10/sub_822B90) with a 1024-entry inline tracking array (unk_4F19620, 1024 * 24 bytes) that overflows to heap when exceeded. The tracking count is maintained in dword_4F19600. The finalization function sub_823310 walks bucket chains to free all tracked allocations.
Large Argument Lists
The argv copy in sub_8F9C90 uses a threshold-based allocation strategy:
if (8 * argc <= 0x800) // argc <= 256
v284 = stack_buffer; // 2096 bytes on stack
else
v284 = sub_16CD150(8 * argc); // heap allocation
This avoids heap allocation for the common case (most cicc invocations have fewer than 256 arguments) while handling the worst case gracefully. The heap path uses sub_16CD150 (a realloc-like wrapper), and the buffer is freed during cleanup if it was heap-allocated.
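The threshold decision reduces to a single comparison; this helper (name ours, threshold from the decompilation) just reports which path the argv copy takes:

```c
#include <stdint.h>

/* 8 bytes per argv pointer; more than 0x800 bytes (256 pointers)
   forces the heap path. */
int argv_copy_uses_heap(int argc) {
    return (uint64_t)8 * argc > 0x800;
}
```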
Signal Handling and Crash Recovery
EDG Signal Handler
The EDG frontend registers a signal handler at 0x723610 during initialization:
// signal handler (0x723610)
void handler(int sig) {
write(STDERR_FILENO, "\n", 1);
dword_4F0790C = 1; // set "interrupted" flag
sub_7235F0(9); // initiate orderly shutdown
}
This handler is registered for SIGINT, allowing the compiler to be interrupted gracefully during long frontend operations (template instantiation, constexpr evaluation). The global dword_4F0790C flag is checked periodically by the parser loop, enabling cooperative cancellation.
LLVM Crash Recovery
The LLVM infrastructure provides its own crash handling via the print-on-crash and print-on-crash-path CLI options (registered in the 0x4F0000–0x51FFFF range). When enabled, the LLVM pass manager dumps the current IR to a specified path on any unhandled signal (SIGSEGV, SIGABRT, etc.). This is separate from the EDG handler and covers the optimization and codegen phases.
Concurrent API Protection
The global constructor at 0x4A5810 checks LIBNVVM_DISABLE_CONCURRENT_API. When set (to any value), byte_4F92D70 = 1 disables thread-safe LibNVVM API usage. The pipeline orchestrator (sub_12C35D0) uses pthread_once(&dword_4F92D9C, init_routine) for one-time setup, and TLS at __readfsqword(0)-24 stores exception handling stack frames while __readfsqword(0)-32 stores the cleanup function sub_12BCC20. These TLS slots ensure that concurrent compilations in the same process do not corrupt each other's state.
Timer Infrastructure
Compilation timing is implemented through a hierarchical timer system. Timer creation (sub_C996C0) takes a label and context string; timer stop (sub_C9AF60) records the elapsed time. The timer hierarchy is:
"CUDA C++ Front-End" ← EDG parsing + IL-to-IR conversion (Path A only)
└─ "LibNVVM" ← Full optimization + codegen pipeline
├─ "LNK" ← Module linking (sub_12C06E0)
├─ "OPT" ← LLVM optimization (sub_12E7E70)
│ ├─ "Phase I" ← Analysis + early optimization
│ └─ "Phase II" ← Backend optimization + codegen prep
├─ "OPTIXIR" ← OptiX IR generation (optional)
└─ "LLC" ← SelectionDAG codegen (sub_12F5100)
The profiler is controlled by sub_C96F30() (returns nonzero when active). Timer data is written to the output file after compilation via sub_C9C600 (Path A) or sub_16DD960 (Path B). The -time flag or environment variable controls activation. The timer names appear in the profiler output, making them essential for identifying compilation bottlenecks.
Architecture Detection — sub_95EB40
One of the most important functions in cicc: the architecture detection system translates a single user-facing flag like -arch=compute_90a into three independent flag strings, one for each pipeline stage. This 3-column fan-out is necessary because the EDG frontend, the LLVM optimizer, and the LLVM backend each use different flag formats to specify the target architecture. The mapping is stored in a std::map<string, ArchTriple> in a red-black tree at a1+248.
| Column | Target | Example |
|---|---|---|
| Column 1 | EDG frontend | -R __CUDA_ARCH=750 |
| Column 2 | Optimizer | -opt-arch=sm_75 |
| Column 3 | LLC backend | -mcpu=sm_75 |
Architecture Validation Bitmask
Before the 3-column mapping is consulted, the architecture number is validated against a hardcoded 64-bit bitmask. This is a fast rejection filter: the SM number minus 75 gives a bit index, and if that bit isn't set in the constant 0x60081200F821, the architecture is rejected. This means cicc v13.0 has a fixed, compile-time-determined set of supported architectures — you cannot add new SM targets without rebuilding the binary.
offset = arch_number - 75;
if (offset > 0x2E || !_bittest64(&0x60081200F821, offset))
→ ERROR: "is an unsupported option"
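The rejection filter can be written as a plain shift-and-mask test. The constant and the 0x2E bound are from the decompilation; the function name is ours:

```c
#include <stdint.h>

/* Fast architecture-rejection filter: bit (sm - 75) of the recovered
   bitmask must be set for the target to be accepted. */
int sm_supported(int sm) {
    const uint64_t mask = 0x60081200F821ULL;
    int offset = sm - 75;
    if (offset < 0 || offset > 0x2E)   /* out of the 47-bit window */
        return 0;
    return (int)((mask >> offset) & 1);
}
```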
Valid architectures (bit positions in 0x60081200F821). Note the gaps — SM 76–79, 81–85, 91–99, 101–102, 104–109, and 111–119 are all absent:
| Bit | SM | Generation |
|---|---|---|
| 0 | 75 | Turing |
| 5 | 80 | Ampere |
| 11 | 86 | Ampere |
| 12 | 87 | Ampere (Jetson Orin) |
| 13 | 88 | Ada (undocumented) |
| 14 | 89 | Ada Lovelace |
| 15 | 90 | Hopper |
| 25 | 100 | Blackwell |
| 28 | 103 | Blackwell |
| 35 | 110 | Jetson Thor |
| 45 | 120 | Blackwell (sm120) — RTX 50xx / Pro |
| 46 | 121 | Blackwell (sm120) — DGX Spark |
Suffix handling: a and f variants share the base SM number for validation but get distinct -mcpu=sm_XXa/-mcpu=sm_XXf strings.
Architecture Parsing in the EDG Frontend
The EDG frontend (sub_617BD0, option ID 0x52 = --nv_arch) performs its own independent architecture parsing that produces three global variables:
| Global | Address | Purpose |
|---|---|---|
| unk_4D045E8 | 0x4D045E8 | SM compute version (integer: 75, 80, ..., 121) |
| unk_4D045E4 | 0x4D045E4 | Accelerated flag (1 if suffix a) |
| unk_4D045E0 | 0x4D045E0 | Fast flag (1 if suffix f; also sets accelerated=1) |
The f suffix (fast-mode) is new to SM >= 100 architectures. When present, it implies a forward-compatible feature set that may not exactly match the base SM version's capabilities.
Flag Catalog — sub_9624D0
The flag catalog is the second-largest function in the entry point range at 75KB. It takes the raw CLI arguments and sorts them into four output vectors — one per pipeline stage (lnk, opt, lto, llc). This is the translation layer between user-facing flags and the internal per-stage options that each pipeline component understands.
A clever detail: the function takes a "mode cookie" parameter (a4) that distinguishes CUDA compilation (0xABBA) from OpenCL compilation (0xDEED). Several flags behave differently depending on this cookie — for example, -prec-div=0 maps to -nvptx-prec-divf32=1 in CUDA mode but -nvptx-prec-divf32=0 in OpenCL mode, reflecting the different default precision expectations of the two languages.
| Field | Value |
|---|---|
| Address | 0x9624D0 |
| Size | 75KB (2,626 lines) |
| Mode cookie | a4: 0xABBA=CUDA, 0xDEED=OpenCL |
| Output vectors | lnk, opt, lto, llc (32-byte std::string elements with SSO) |
-Ofast-compile Levels
NVIDIA's -Ofast-compile is a compile-time vs runtime-performance tradeoff. At "max" level, it disables memory space optimization and LSA optimization entirely — these are expensive analysis passes that improve runtime performance but slow compilation significantly. The "mid" and "min" levels provide intermediate points. This feature is targeted at iterative development workflows where compile speed matters more than code quality.
| Level String | Internal Value | Effect |
|---|---|---|
| "max" | 2 | Most optimizations skipped, forces -lsa-opt=0 -memory-space-opt=0 |
| "mid" | 3 | Medium speedup |
| "min" | 4 | Minimal speedup |
| "0" | 1 → reset to 0 | Disabled |
Error: "libnvvm : error: -Ofast-compile specified more than once". Only one -Ofast-compile per compilation is allowed.
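The level-string decode can be sketched as a simple mapping. Internal values are from the table above; the decompilation stores "0" as 1 and immediately resets it to 0, modeled here as a direct 0. The function name and the -1 unknown-value convention are ours:

```c
#include <string.h>

/* Map an -Ofast-compile level string to its internal value. */
int ofast_compile_level(const char *s) {
    if (strcmp(s, "max") == 0) return 2;
    if (strcmp(s, "mid") == 0) return 3;
    if (strcmp(s, "min") == 0) return 4;
    if (strcmp(s, "0")   == 0) return 0;  /* disabled */
    return -1;                            /* unknown level string */
}
```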
Flag-to-Pipeline Routing (Selected)
This table shows how a single user-facing flag gets split into per-stage options. The pattern reveals NVIDIA's compilation architecture: the LNK stage communicates via -R macro definitions (these become #defines visible to the linker), the OPT stage uses NVIDIA-specific optimizer flags (-opt-use-*), and the LLC stage uses LLVM backend flags (-nvptx-*). Some flags like -ftz=1 propagate to all three stages, while others like -aggressive-inline only affect the optimizer.
| User Flag | LNK Forward | OPT Forward | LLC Forward |
|---|---|---|---|
| -ftz=1 | -R __CUDA_FTZ=1 | -nvptx-f32ftz | -nvptx-f32ftz |
| -prec-div=1 (CUDA) | -R __CUDA_PREC_DIV=1 | -opt-use-prec-div=true | -nvptx-prec-divf32=2 |
| -prec-div=0 (CUDA) | — | -opt-use-prec-div=false | -nvptx-prec-divf32=1 |
| -prec-sqrt=1 | -R __CUDA_PREC_SQRT=1 | — | -nvptx-prec-sqrtf32=1 |
| -fma=1 | — | — | -nvptx-fma-level=1 |
| -fast-math (CUDA) | -R __CUDA_USE_FAST_MATH=1 | -opt-use-fast-math | — |
| -unsafe-math | -R FAST_RELAXED_MATH=1 -R __CUDA_FTZ=1 | -opt-use-fast-math -nvptx-f32ftz | -nvptx-fma-level=1 -nvptx-f32ftz |
| -aggressive-inline | — | -inline-budget=40000 | — |
| -new-nvvm-remat | — | — | -enable-new-nvvm-remat=true -nv-disable-remat=true -rp-aware-mcse=true |
nvcc→cicc Flag Translation — sub_8FE280
When cicc is invoked by nvcc (the CUDA compiler driver), the flags arrive in nvcc's format and need to be translated to cicc's internal format. This translation happens through a red-black tree at qword_4F6D2A0, populated once on first use (guarded by qword_4F6D2C8). Each entry maps an nvcc flag to a pair: an EDG passthrough string and a cicc internal string. Some flags only affect one side — for example, -fmad=1 has no EDG equivalent (FMA is a backend concern) but maps to cicc's -fma=1. Others are dual-mapped: -O0 becomes both --device-O=0 for EDG and -opt=0 for cicc.
| nvcc Flag | EDG Passthrough | cicc Internal |
|---|---|---|
| -O0..-O3 | --device-O=N | -opt=N |
| -fmad=1 | — | -fma=1 |
| -prec_sqrt=1 | — | -prec-sqrt=1 |
| -Ofast-compile=max | — | -Ofast-compile=max |
| -Ofc=max | — | -Ofast-compile=max (alias) |
| --emit-optix-ir | --emit-lifetime-intrinsics | --emit-optix-ir |
| -discard-value-names | --discard_value_names=1 | -discard-value-names=1 |
Environment Variables
cicc checks 20 distinct environment variables across its subsystems. The six NVIDIA-specific variables are the most important for understanding and reimplementing the entry point behavior:
| Variable | Function | Effect |
|---|---|---|
| NVVMCCWIZ | sub_8F9C90 | Set to 553282 → enables wizard mode (byte_4F6D280 = 1) |
| NVVM_IR_VER_CHK | sub_12BFF60 | Set to "0" → disables NVVM IR version checking |
| LIBNVVM_DISABLE_CONCURRENT_API | ctor at 0x4A5810 | Any value → disables thread-safe API (byte_4F92D70 = 1) |
| NV_NVVM_VERSION | sub_8F9C90, sub_12B9F70 | "nvvm70" or "nvvm-latest" → controls Path A/B default and IR compat mode |
| LIBNVVM_NVVM_VERSION | sub_12B9F70 | Same as NV_NVVM_VERSION (checked as fallback) |
| LLVM_OVERRIDE_PRODUCER | ctors at 0x48CC90, 0x4CE640 | Overrides the producer string in output bitcode metadata |
The NV_NVVM_VERSION and LIBNVVM_NVVM_VERSION variables are obfuscated in the binary using the same XOR+ROT13 cipher as the CLI option strings. They are decrypted from 0x3C23A90 and 0x42812F0 respectively.
Key Global Variables
These globals persist across the entire compilation and are accessed from multiple subsystems. The wizard mode flag and flag mapping tree are set during CLI parsing and read throughout the pipeline. The embedded libdevice addresses are compile-time constants (.rodata), while the data model width is set during architecture configuration.
| Variable | Purpose |
|---|---|
| byte_4F6D280 | Wizard mode flag (gates -v, -keep) |
| qword_4F6D2A0 | Flag mapping red-black tree root |
| qword_4F6D2C8 | Tree initialization guard |
| byte_4F6D2D0 | --partial-link active flag |
| byte_4F6D2DC | --force-llp64 active flag |
| unk_3EA0080 | Embedded libdevice bitcode (Path A, 455,876 bytes) |
| unk_420FD80 | Embedded libdevice bitcode (Path B, 455,876 bytes) |
| off_4B90FE0 | LLVM options table (Path A, 37 entries) |
| off_4C6EEE0 | LLVM options table (Path B, 37 entries) |
| unk_4F06A68 | Data model width (8=64-bit, 4=32-bit) |
| unk_4D0461C | Enable p3:32:32:32 in data layout (shared mem 32-bit ptrs) |
| byte_4F92D70 | Concurrent API disabled flag |
| dword_4F92D9C | pthread_once guard for one-time pipeline setup |
| qword_4FBB3B0 | TLS: optimization phase counter (1=Phase I, 2=Phase II, 3=done) |
| unk_4F6D2F8 | Global module pointer (set by sub_908850 after EDG binding) |
Function Map — Entry Point Cluster
| Function | Address | Size | Role |
|---|---|---|---|
| main() thunk → sub_8F9C90 | 0x4396A0 | 16 B | -- |
| String deobfuscation (XOR + ROT13) | 0x8F98A0 | ~512 B | -- |
| Push string to std::vector<std::string> | 0x8F9C20 | ~128 B | -- |
| Real main — CLI parser + dispatcher | 0x8F9C90 | 10,066 B | -- |
| nvcc→cicc flag translation (red-black tree) | 0x8FE280 | ~4 KB | -- |
| Path A CLI processing | 0x900130 | 39 KB | -- |
| Path A orchestrator (simple mode) | 0x902D10 | ~9 KB | -- |
| LLC stage verbose callback | 0x903730 | ~5 KB | -- |
| LNK stage verbose callback | 0x903BA0 | ~5 KB | -- |
| NVVM IR container parser (Path A) | 0x9047E0 | 10 KB | -- |
| CUDA C++ Front-End (lgenfe stage) | 0x905880 | ~6 KB | -- |
| lgenfe single-stage wrapper (Path A) | 0x905E50 | ~256 B | -- |
| LibNVVM pipeline driver (Path A) | 0x905EE0 | 43 KB | -- |
| Backend SM config + EDG module binding | 0x908850 | 10 KB | -- |
| Architecture detection (3-column fan-out) | 0x95EB40 | 38 KB | -- |
| Flag catalog (4 output vectors) | 0x9624D0 | 75 KB | -- |
| Pipeline option parser (4 stage vectors) | 0x9685E0 | ~8 KB | -- |
| Path B CLI processing | 0x125FB30 | ~8 KB | -- |
| Path B entry (simple mode) | 0x1262860 | ~4 KB | -- |
| Path B LNK verbose callback | 0x1263280 | ~1 KB | -- |
| Path B OPT verbose callback | 0x12636E0 | ~1 KB | -- |
| NVVM container parser (Path B) | 0x12642A0 | ~3 KB | -- |
| Path B pre-compilation setup | 0x1265340 | ~4 KB | -- |
| lgenfe single-stage wrapper (Path B) | 0x12658E0 | ~256 B | -- |
| LibNVVM compilation entry (Path B) | 0x1265970 | 48 KB | -- |
| LibNVVM API dispatch table (25 entries) | 0x12BC0F0 | ~3 KB | -- |
| Thunk → sub_12BC8B0 (nvvmCUAddModuleFromBuffer) | 0x12BCB00 | ~64 B | -- |
| NVVM IR version checker | 0x12BFF60 | ~9 KB | -- |
| Module linker (LNK stage core) | 0x12C06E0 | 63 KB | -- |
| 4-stage pipeline orchestrator | 0x12C35D0 | 41 KB | -- |
| Stage bitmask parser | 0x12D2AA0 | ~4 KB | -- |
| Concurrency eligibility check | 0x12D4250 | ~2 KB | -- |
| Two-phase optimizer entry | 0x12E7E70 | ~8 KB | -- |
| Concurrent worker entry point | 0x12E7B90 | ~4 KB | -- |
| LLC core (SelectionDAG codegen) | 0x12F5100 | ~12 KB | -- |
| OptiX IR generator | 0x12F9270 | ~6 KB | -- |
| Path B context initialization | 0x1602D10 | ~2 KB | -- |
Cross-References
- EDG Frontend — `sub_617BD0` (lgenfe_main), the 282-case CLI dispatch inside the EDG 6.6 frontend
- NVVM Container Format — container parsing by `sub_9047E0` (Path A) and `sub_12642A0` (Path B)
- Optimizer Pipeline — the OPT stage driven by `sub_12E7E70` (two-phase optimization)
- IR Generation — module creation via `sub_908850` (EDG module binding)
- PTX Emission — the LLC stage's PTX output via `sub_12F5100`
nvcc-to-cicc Interface Contract
When nvcc compiles device code, it invokes cicc as an external process, passing the preprocessed CUDA source (or LLVM bitcode) along with a carefully translated set of flags. cicc never sees the raw -fmad=1 or -prec_sqrt=0 flags that the user typed on the nvcc command line -- those are rewritten through a flag translation table: a global std::map red-black tree at qword_4F6D2A0, populated by sub_8FE280. This page documents the complete interface contract: how nvcc invokes cicc, how flags are translated, how the mode cookie selects CUDA vs. OpenCL behavior, what input formats are accepted, and what output modes are available.
The flag translation is split into two stages. Stage 1 (sub_8FE280) translates nvcc-facing flags into cicc-facing flags, producing a dual-slot result with an EDG front-end flag and an internal cicc flag. Stage 2 (sub_95EB40) further expands each cicc-facing flag into a three-column architecture mapping, routing each flag to the EDG frontend, the NVVM optimizer, and the LLC backend. The composition of these two stages means a single nvcc flag like -fmad=1 silently becomes -fma=1 internally, contributes nothing to the EDG or OPT argument vectors, and reaches the LLC backend as -nvptx-fma-level=1 (with --emit-llvm-bc always injected separately).
| Flag translation tree | sub_8FE280 -- global std::map at qword_4F6D2A0, 40+ entries |
| Tree guard | qword_4F6D2C8 (set to 1 after first initialization) |
| Tree node size | 72+ bytes: key at +32, length at +40, FlagPair* at +64 |
| CLI parser (Path A) | sub_900130 (39 KB, 12 parameters) |
| Flag catalog (Path A/B) | sub_9624D0 (75 KB, 2,626 lines, 4 output vectors) |
| 3-column arch table | sub_95EB40 (38 KB, 23 architectures, 3-column fan-out) |
| Mode cookies | 0xABBA = CUDA, 0xDEED = OpenCL |
| Default architecture | compute_75 / sm_75 (Turing) |
| Input extensions | .bc, .ci, .i, .cup, .optixir, .ii |
| Default opt level | -opt=3 (O3) |
Invocation Contract
nvcc invokes cicc as a subprocess with a single input file and a set of translated flags. The general invocation form is:
cicc [mode-flags] [translated-flags] [pass-through-flags] -o <output> <input>
For the standard CUDA compilation path (no explicit -lXXX mode flag), cicc enters sub_8F9C90 (real main, 10,066 bytes at 0x8F9C90), parses all arguments into ~12 local variables, resolves the Path A / Path B dispatch variable v253, and calls one of:
- Path A (EDG pipeline): `sub_902D10` -- invokes `sub_900130` for CLI parsing, then the EDG frontend via `sub_905880`, then the LibNVVM pipeline via `sub_905EE0`.
- Path B (standalone LLVM pipeline): `sub_1262860` -- similar flow but through the standalone LLVM infrastructure at `0x1262860`.
Path selection is controlled by v253, which defaults to 2 (unresolved) and is resolved through the obfuscated environment variable NV_NVVM_VERSION. For SM >= 100 (Blackwell and later), the default is Path B unless the -nvc flag is present. For SM < 100, the default is Path A. See Entry Point for the full dispatch matrix.
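Ignoring the NV_NVVM_VERSION override, the default path resolution described above reduces to a small decision function. The SM threshold and the -nvc semantics are from the text; the enum and function shape are ours:

```c
typedef enum { PATH_A = 0, PATH_B = 1 } Path;

/* Default Path A/B selection when NV_NVVM_VERSION leaves v253 unresolved. */
Path default_path(int sm, int has_nvc_flag) {
    if (sm >= 100)                              /* Blackwell and later */
        return has_nvc_flag ? PATH_A : PATH_B;  /* -nvc forces the EDG path */
    return PATH_A;                              /* pre-Blackwell default */
}
```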
When cicc is invoked in multi-stage mode (-lnk, -opt, -llc, -libnvvm), the entry point dispatches to sub_905EE0 (Path A, 43 KB) or sub_1265970 (Path B, 48 KB), which orchestrate the LNK, OPT, and LLC sub-pipelines internally.
Parameter Passing to sub_900130
The Path A CLI parser sub_900130 receives 12 parameters and performs a two-pass argument scan:
unsigned int sub_900130(
const char *input_file, // a1: input filename
const char *opencl_src, // a2: OpenCL source path (NULL for CUDA)
const char *output_file, // a3: output filename
__int64 *arg_vector, // a4: pointer to std::vector<std::string>
char mode_flag, // a5: mode flag (0=normal, 1=special)
__int64 job_desc, // a6: output compilation job struct
__int64 error_out, // a7: error string output
_BYTE *m64_flag, // a8: output - set to 1 if -m64 seen
_BYTE *discard_names, // a9: output - set to 1 if -discard-value-names
__int64 trace_path, // a10: device time trace path
__int64 trace_pid, // a11: trace PID
__int64 trace_env // a12: trace env value
);
// Returns: 0 = success, 1 = error
Pass 1: Scans for -arch flag via sub_8FD0D0, extracts architecture string.
Pass 2: Iterates all arguments, looking each up in the red-black tree at qword_4F6D2A0. For tree hits, the EDG slot is pushed to the EDG argument vector (v145) and the cicc slot is pushed to the backend argument vector (v148). For tree misses, sequential string comparisons handle extended flags (-maxreg=N, -split-compile=N, --Xlgenfe, --Xlibnvvm, etc.).
Before any user flags, sub_900130 unconditionally injects:
- `--emit-llvm-bc` into the EDG argument vector
- `--emit-nvvm-latest` into the backend argument vector
After all arguments are processed, architecture strings are appended:
- `--nv_arch` + `sm_XX` to EDG arguments
- `-arch=compute_XX` to backend arguments
Mode Cookies
The sub_9624D0 flag catalog function takes a fourth parameter a4 that selects the language mode. This is not a user-visible flag -- it is passed internally by the pipeline orchestrator.
| Cookie | Hex | Decimal | Language |
|---|---|---|---|
| 0xABBA | 0xABBA | 43,962 | CUDA compilation |
| 0xDEED | 0xDEED | 57,069 | OpenCL compilation |
The cookie affects multiple behaviors:
Precision division routing. In CUDA mode (0xABBA), -prec-div=0 maps to -nvptx-prec-divf32=1 (not 0) at LLC, while -prec-div=1 maps to -nvptx-prec-divf32=2. In OpenCL mode (0xDEED), the mapping is straightforward: -prec-div=0 maps to -nvptx-prec-divf32=0, -prec-div=1 to -nvptx-prec-divf32=1, and OpenCL additionally supports -prec-div=2 mapping to -nvptx-prec-divf32=3.
Fast-math routing. In CUDA mode, -fast-math maps to -R __CUDA_USE_FAST_MATH=1 for EDG and -opt-use-fast-math for OPT, with no LLC flag. In OpenCL mode, -fast-math maps to -R FAST_RELAXED_MATH=1 -R __CUDA_FTZ=1 for EDG and -opt-use-fast-math -nvptx-f32ftz for OPT.
Default precision. -prec-sqrt defaults to 1 (precise) in CUDA mode, 0 (imprecise) in OpenCL mode.
Discard value names. In CUDA mode (0xABBA), without explicit override, value names are discarded by default (a1+232 = 1), generating -lnk-discard-value-names=1, -opt-discard-value-names=1, and -lto-discard-value-names=1. In OpenCL mode (0xDEED), this only applies when (a13 & 0x20) is set (LTO generation active).
OptiX IR emission. The --emit-optix-ir flag is only valid when the cookie is 0xABBA or 0xDEED.
Internal compile call. The LibNVVM compile function nvvmCUCompile (dispatch ID 0xBEAD) is called with phase code 57,069 (0xDEED) regardless of the outer cookie -- this is the internal LibNVVM compile phase code, not a language selector.
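The cookie-dependent -prec-div routing to the LLC flag -nvptx-prec-divf32=N can be sketched as below. The value mappings are from the text; the function shape and the -1 convention for rejected combinations are ours:

```c
#define COOKIE_CUDA   0xABBA
#define COOKIE_OPENCL 0xDEED

/* Return the -nvptx-prec-divf32 value for a given mode cookie and
   -prec-div setting, or -1 if the mode does not accept that setting. */
int prec_div_llc_value(int cookie, int prec_div) {
    if (cookie == COOKIE_CUDA) {
        if (prec_div == 0) return 1;   /* CUDA: 0 maps to 1, not 0 */
        if (prec_div == 1) return 2;
    } else if (cookie == COOKIE_OPENCL) {
        if (prec_div == 0) return 0;   /* OpenCL: straightforward mapping */
        if (prec_div == 1) return 1;
        if (prec_div == 2) return 3;   /* OpenCL-only level */
    }
    return -1;
}
```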
Flag Translation Table
sub_8FE280 populates a global std::map<std::string, FlagPair*> in the red-black tree at qword_4F6D2A0. Each FlagPair is a 16-byte struct with two slots: slot 0 for the EDG frontend passthrough, slot 1 for the internal cicc flag. The function is called exactly once, guarded by qword_4F6D2C8.
Red-Black Tree Structure
qword_4F6D2A0 -- tree root pointer (std::_Rb_tree)
dword_4F6D2A8 -- sentinel node (tree.end())
qword_4F6D2B0 -- root node pointer
qword_4F6D2B8 -- begin iterator (leftmost node)
qword_4F6D2C8 -- initialization guard (1 = already built)
Each node is 72+ bytes:
| Offset | Field |
|---|---|
| +0 | Color (0=red, 1=black) |
| +8 | Parent pointer |
| +16 | Left child pointer |
| +24 | Right child pointer |
| +32 | Key data pointer (std::string internals) |
| +40 | Key length |
| +48 | Key capacity |
| +64 | Value pointer (FlagPair*) |
Lookup is via sub_8FE150 (lower_bound + insert-if-not-found). Insert is via sub_8FDFD0 (allocate node + rebalance). Comparison uses standard std::string::compare.
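A minimal sketch of the dual-slot translation table: the real cicc uses a std::map red-black tree, but a linear scan over a few entries recovered from the mapping table below is enough to show the slot-0/slot-1 split. The struct and function names are ours; NULL models the `<null>` (no flag emitted) case:

```c
#include <string.h>
#include <stddef.h>

typedef struct {
    const char *nvcc_flag;
    const char *edg_slot;    /* slot 0: EDG frontend passthrough */
    const char *cicc_slot;   /* slot 1: internal cicc flag */
} FlagPair;

static const FlagPair flag_table[] = {
    { "-O3",          "--device-O=3", "-opt=3"       },  /* dual-mapped */
    { "-fmad=1",      NULL,           "-fma=1"       },  /* backend only */
    { "-prec_sqrt=1", NULL,           "-prec-sqrt=1" },  /* _ -> - */
};

/* Look up an nvcc flag; NULL on miss (caller falls through to the
   sequential string comparisons for extended flags). */
const FlagPair *flag_lookup(const char *flag) {
    for (size_t i = 0; i < sizeof flag_table / sizeof flag_table[0]; i++)
        if (strcmp(flag_table[i].nvcc_flag, flag) == 0)
            return &flag_table[i];
    return NULL;
}
```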
Complete nvcc-to-cicc Mapping
The table below shows every entry in the sub_8FE280 red-black tree. Slot 0 is forwarded to the EDG frontend; slot 1 is forwarded to the cicc backend pipeline. <null> means no flag is generated for that slot.
| nvcc flag | EDG passthrough (slot 0) | cicc internal (slot 1) | Notes |
|---|---|---|---|
| -m32 | --m32 | <null> | |
| -m64 | --m64 | <null> | Also sets *a8 = 1 |
| -fast-math | <null> | -fast-math | |
| -ftz=1 | <null> | -ftz=1 | |
| -ftz=0 | <null> | -ftz=0 | |
| -prec_sqrt=1 | <null> | -prec-sqrt=1 | Underscore to hyphen |
| -prec_sqrt=0 | <null> | -prec-sqrt=0 | Underscore to hyphen |
| -prec_div=1 | <null> | -prec-div=1 | Underscore to hyphen |
| -prec_div=0 | <null> | -prec-div=0 | Underscore to hyphen |
| -fmad=1 | <null> | -fma=1 | fmad renamed to fma |
| -fmad=0 | <null> | -fma=0 | fmad renamed to fma |
| -O0 | --device-O=0 | -opt=0 | Dual-mapped |
| -O1 | --device-O=1 | -opt=1 | Dual-mapped |
| -O2 | --device-O=2 | -opt=2 | Dual-mapped |
| -O3 | --device-O=3 | -opt=3 | Dual-mapped |
| -Osize | <null> | -Osize | |
| -Om | <null> | -Om | |
| -Ofast-compile=max | <null> | -Ofast-compile=max | |
| -Ofc=max | <null> | -Ofast-compile=max | Alias |
| -Ofast-compile=mid | <null> | -Ofast-compile=mid | |
| -Ofc=mid | <null> | -Ofast-compile=mid | Alias |
| -Ofast-compile=min | <null> | -Ofast-compile=min | |
| -Ofc=min | <null> | -Ofast-compile=min | Alias |
| -Ofast-compile=0 | <null> | <null> | No-op |
| -Ofc=0 | <null> | <null> | No-op alias |
| -g | --device-debug | -g | Dual-mapped |
| -show-src | <null> | -show-src | |
| -disable-allopts | <null> | -disable-allopts | |
| -disable-llc-opts | <null> | disable-llc-opts | |
| -w | -w | -w | Dual-mapped |
| -Wno-memory-space | <null> | -Wno-memory-space | |
| -disable-inlining | <null> | -disable-inlining | |
| -aggressive-inline | <null> | -aggressive-inline | |
| --kernel-params-are-restrict | --kernel-params-are-restrict | -restrict | Dual-mapped, renamed |
| -allow-restrict-in-struct | <null> | -allow-restrict-in-struct | |
| --device-c | --device-c | --device-c | Dual-mapped |
| --generate-line-info | --generate-line-info | -generate-line-info | Dual-mapped |
| --enable-opt-byval | --enable-opt-byval | -enable-opt-byval | Dual-mapped |
| --no-lineinfo-inlined-at | <null> | -no-lineinfo-inlined-at | |
| --keep-device-functions | --keep-device-functions | <null> | EDG only |
| --emit-optix-ir | --emit-lifetime-intrinsics | --emit-optix-ir | Triggers lifetime intrinsics in EDG |
| -opt-fdiv=0 | <null> | -opt-fdiv=0 | |
| -opt-fdiv=1 | <null> | -opt-fdiv=1 | |
| -new-nvvm-remat | <null> | -new-nvvm-remat | |
| -disable-new-nvvm-remat | <null> | -disable-new-nvvm-remat | |
| -disable-nvvm-remat | <null> | -disable-nvvm-remat | |
| -discard-value-names | --discard_value_names=1 | -discard-value-names=1 | Also sets *a9 = 1 |
| -gen-opt-lto | <null> | -gen-opt-lto | |
Key translation patterns:
- Underscore to hyphen: nvcc uses underscores (-prec_sqrt), cicc uses hyphens (-prec-sqrt).
- Rename: -fmad becomes -fma internally.
- Dual-mapping: -O0 through -O3 emit both an EDG flag (--device-O=N) and a cicc flag (-opt=N).
- Alias expansion: -Ofc=X is silently rewritten to -Ofast-compile=X.
- Implicit dependency: --emit-optix-ir adds --emit-lifetime-intrinsics to the EDG frontend, enabling lifetime intrinsic generation that the OptiX IR output path requires.
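These patterns can be modeled as a small lookup table. The sketch below is illustrative Python, not the binary's actual data structure (the real tree is a std::map built by sub_8FE280); the specific keys shown are simplified from the rows above:

```python
# Illustrative subset of the nvcc -> (EDG flag, cicc flag) translation tree.
# A <null> column in the table above is modeled as None.
TREE = {
    "-prec_sqrt=1":     (None, "-prec-sqrt=1"),        # underscore -> hyphen
    "-fmad=1":          (None, "-fma=1"),              # rename
    "-O3":              ("--device-O=3", "-opt=3"),    # dual-mapped
    "-Ofc=max":         (None, "-Ofast-compile=max"),  # alias expansion
    "-Ofast-compile=0": (None, None),                  # no-op
    "--emit-optix-ir":  ("--emit-lifetime-intrinsics", "--emit-optix-ir"),  # implicit dep
}

def translate(flag):
    """Return (edg_flag, cicc_flag), or None on a tree miss.

    On a miss, the real binary falls through to sub_900130's sequential
    string comparisons (see "Extended Flags" below in the source wiki).
    """
    return TREE.get(flag)
```

The point of the model: a single nvcc flag can fan out to zero, one, or two internal flags, and aliases collapse before any later phase sees them.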
Extended Flags (Not in Tree)
The following flags are handled by sequential string comparison in sub_900130 when a tree lookup misses:
| nvcc flag | Expansion | Notes |
|---|---|---|
-maxreg=N | -maxreg=<N> to backend | |
-split-compile=N | -split-compile=<N> to OPT | Error if specified twice |
-split-compile-extended=N | -split-compile-extended=<N> to OPT | Mutually exclusive with -split-compile |
--Xlgenfe <arg> | <arg> to EDG | |
--Xlibnvvm <arg> | <arg> to backend | |
--Xlnk <arg> / -Xlnk <arg> | -Xlnk + <arg> to backend | |
--Xopt <arg> / -Xopt <arg> | -Xopt + <arg> to backend | |
--Xllc <arg> / -Xllc <arg> | -Xllc + <arg> to backend | |
-Xlto <arg> | <arg> to LTO vector | |
-covinfo <file> | -Xopt -coverage=true -Xopt -covinfofile=<file> | |
-profinfo <file> | -Xopt -profgen=true -Xopt -profinfofile=<file> | |
-profile-instr-use <file> | -Xopt -profuse=true -Xopt -proffile=<file> | |
-lto | -gen-lto to backend; enables LTO | |
-olto <file> | -gen-lto-and-llc + flag + next arg | |
--promote_warnings | -Werror to backend; flag to EDG | |
-inline-info | -Xopt -pass-remarks=inline + missed + analysis | |
-jump-table-density=N | -jump-table-density=<N> to backend | |
-opt-passes=<val> | -opt-passes=<val> to backend | |
--orig_src_file_name <val> | --orig_src_file_name + <val> to EDG | |
--force-llp64 | Pass to EDG; sets byte_4F6D2DC = 1 | |
--partial-link | Complex: may add -memdep-cache-byval-loads=false to OPT and LLC | Sets byte_4F6D2D0 = 1 |
--tile-only | Pass to EDG + --tile_bc_file_name + output path | |
--device-time-trace | Pass to EDG; next arg becomes trace path | |
-jobserver | -jobserver to backend or pass to EDG |
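A few of these expansions can be sketched directly from the table. This is a hypothetical Python model of the sequential fallback, reproducing only the -covinfo, -profinfo, and -split-compile rows:

```python
def expand_extended(flag, arg=None):
    """Sketch of sub_900130's sequential fallback for a few table rows.

    Returns the argument list forwarded onward (simplified: the real code
    appends to per-phase vectors rather than returning a list).
    """
    if flag == "-covinfo":
        return ["-Xopt", "-coverage=true", "-Xopt", f"-covinfofile={arg}"]
    if flag == "-profinfo":
        return ["-Xopt", "-profgen=true", "-Xopt", f"-profinfofile={arg}"]
    if flag.startswith("-split-compile="):
        return [flag]          # forwarded verbatim to the OPT phase
    raise KeyError(flag)       # remaining rows omitted from this sketch
```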
Input Extensions
Input files are identified by extension during the argument loop in sub_8F9C90. The last matching file wins (the input variable s is overwritten each time). Extension matching proceeds by checking trailing characters: last 3 for .bc/.ci, last 2 for .i, last 3 for .ii, last 4 for .cup, last 8 for .optixir.
| Extension | Format | Condition | Address |
|---|---|---|---|
.bc | LLVM bitcode | Always accepted | 0x8FAA0A |
.ci | CUDA intermediate (preprocessed) | Always accepted | 0x8FAA29 |
.i | Preprocessed C/C++ | Always accepted | 0x8FA9xx |
.ii | Preprocessed C++ | Always accepted | 0x8FBF7E |
.cup | CUDA source | Only after --orig_src_path_name or --orig_src_file_name | 0x8FBFC4 |
.optixir | OptiX IR | Always accepted | 0x8FC001 |
Unrecognized arguments (those failing both tree lookup and sequential matching, and lacking a recognized extension) are silently appended to the v266 pass-through vector, which is forwarded to sub-pipelines.
If no input file is found after parsing all arguments:
Missing input file
Recognized input file extensions are: .bc .ci .i .cup .optixir
Note that .ii is not mentioned in the error message despite being accepted -- this appears to be a minor oversight in the error string.
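The extension gate can be summarized in a short sketch. This is a simplified Python model (the real code compares fixed trailing-character counts rather than calling endswith, and the function name is hypothetical):

```python
ACCEPTED = (".bc", ".ci", ".i", ".ii", ".optixir")

def classify_arg(arg, prev_arg=None):
    """Sketch of the input-file gate in sub_8F9C90."""
    if arg.endswith(ACCEPTED):
        return "input"
    # .cup is only accepted when preceded by an nvcc metadata flag
    if arg.endswith(".cup") and prev_arg in ("--orig_src_path_name",
                                             "--orig_src_file_name"):
        return "input"
    return "passthrough"   # lands in the v266 pass-through vector
```

Note how the model reproduces the gate: a bare .cup file is silently treated as pass-through, while .bc/.ci/.i/.ii/.optixir are always inputs.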
Output Modes
cicc can produce several output formats, controlled by the combination of flags in the a13 compilation mode bitmask. The bitmask is accumulated during flag parsing in sub_9624D0:
| a13 Value | Mode | Output Format |
|---|---|---|
0x07 | Default (all phases) | PTX text assembly |
0x10 | Debug/line-info | PTX with debug metadata |
0x21 | -gen-lto | LTO bitcode (.lto.bc) |
0x23 | -lto (full LTO) | LTO bitcode + link |
0x26 | -link-lto | Linked LTO output |
0x43 | --emit-optix-ir | OptiX IR (.optixir) |
0x80 | -gen-opt-lto | Optimized LTO bitcode |
0x100 | --nvvm-64 | 64-bit NVVM mode modifier |
0x200 | --nvvm-32 | 32-bit NVVM mode modifier |
The default output is PTX text, written through the LLC backend's PTX printer. The output file path is specified by -o <file> (fatal if missing in multi-stage modes). When no output path is provided in simple mode, sub_900130 constructs a .ptx filename from the input.
PTX Text Output (Default)
The standard path runs all four internal phases: LNK (IR linking), OPT (NVVM optimizer), optionally OptiX IR emission, then LLC (code generation). The LLC backend writes PTX assembly text to the output file. In sub_905EE0, the output writing (Phase 4) checks the first bytes of the result for ELF magic (0x7F, 0xED) to detect accidentally binary output; if the mode is text mode (0) and ELF headers are present, it indicates an internal error.
LTO Bitcode Output
When -lto or -gen-lto is active, cicc produces LLVM bitcode instead of PTX. The -gen-lto flag sets a13 = (a13 & 0x300) | 0x21 and adds -gen-lto to the LTO argument vector. The -gen-lto-and-llc variant additionally runs LLC after producing the LTO bitcode, generating both outputs. The -olto flag takes a next argument (the LTO optimization level) and combines LTO bitcode generation with LLC execution.
OptiX IR Output
The --emit-optix-ir flag sets a13 = (a13 & 0x300) | 0x43. In the flag translation tree, it also injects --emit-lifetime-intrinsics into the EDG frontend, enabling lifetime intrinsic emission that is required for the OptiX IR format. In the flag catalog (sub_9624D0), it additionally routes -do-ip-msp=0 and -do-licm=0 to the optimizer, disabling interprocedural memory space promotion and LICM for OptiX compatibility.
Split Compile
The -split-compile=N flag (or -split-compile-extended=N) routes to the optimizer as -split-compile=<N> (or -split-compile-extended=<N>). These are mutually exclusive and error if specified more than once ("split compilation defined more than once"). When -split-compile-extended is used, it also sets the flag at a1+1644 to 1. The split compile mechanism divides the compilation unit into N partitions for parallel processing.
Exit Codes
The process exit code is the return value of sub_8F9C90 (real main), stored in v8:
| Code | Meaning | Source |
|---|---|---|
| 0 | Success | Normal compilation; -irversion query |
| 1 | Argument error | Missing input file, missing output file, CLI parse failure |
| v264 | Pipeline error | Return code from sub_905EE0 / sub_1265970 / sub_905880 |
Within the pipeline, error codes from sub_905EE0 are set via *a8:
| *a8 Value | Meaning |
|---|---|
| 0 | Success (NVVM_SUCCESS) |
| -1 | File open/read error |
| 1 | NVVM_ERROR_OUT_OF_MEMORY |
| 4 | NVVM_ERROR_INVALID_INPUT |
| 5 | NVVM_ERROR_INVALID_CU (null compilation unit) |
Error messages are written to qword_4FD4BE0 (stderr stream) via sub_223E0D0. All LibNVVM-originated errors are prefixed with "libnvvm : error: ". Representative errors:
- "Error processing command line: <cmd>" (from sub_900130 failure)
- "Missing input file" / "Missing output file"
- "<src>: error in open <file>" (file I/O)
- "libnvvm: error: failed to create the libnvvm compilation unit"
- "libnvvm: error: failed to add the module to the libnvvm compilation unit"
- "libnvvm: error: failed to get the PTX output"
- "Invalid NVVM IR Container" (error code 259, from sub_C63EB0)
- "Error opening '<file>': file exists!" / "Use -f command line argument to force output"
- "Error: Failed to write time profiler data."
- "Unparseable architecture: <val>"
- "libnvvm : error: <flag> is an unsupported option"
- "libnvvm : error: <flag> defined more than once" (duplicate -maxreg, etc.)
Special Behaviors
.cup Extension Gate
The .cup extension (CUDA preprocessed source) is only accepted as an input file when the preceding argument is --orig_src_path_name or --orig_src_file_name. These are metadata flags inserted by nvcc to track the original source file path for diagnostic messages. The check is:
// At 0x8FBFC4 and 0x8FBFDE:
if (strcmp(argv[i-1], "--orig_src_path_name") == 0 ||
strcmp(argv[i-1], "--orig_src_file_name") == 0) {
s = argv[i]; // accept .cup as input
}
This means cicc will silently ignore a .cup file that appears without a preceding metadata flag. When accepted, the .cup extension triggers --orig_src_path_name / --orig_src_file_name handling in sub_900130, which forwards the original source path to the EDG frontend for accurate error location reporting.
-Ofc Alias Handling
The -Ofc=X form is a shorthand alias for -Ofast-compile=X, handled entirely within the sub_8FE280 flag translation tree. The tree contains six entries for fast-compile control:
| Tree Key | cicc Internal | Effect |
|---|---|---|
-Ofast-compile=max | -Ofast-compile=max | Identity |
-Ofc=max | -Ofast-compile=max | Alias |
-Ofast-compile=mid | -Ofast-compile=mid | Identity |
-Ofc=mid | -Ofast-compile=mid | Alias |
-Ofast-compile=min | -Ofast-compile=min | Identity |
-Ofc=min | -Ofast-compile=min | Alias |
-Ofast-compile=0 | <null> | No-op |
-Ofc=0 | <null> | No-op alias |
The aliasing happens at the tree level, before sub_9624D0 ever sees the flag. By the time the flag catalog processes the argument, -Ofc=max and -Ofast-compile=max are indistinguishable. See Optimization Levels for what each fast-compile tier actually does.
In sub_9624D0, -Ofast-compile is stored at offset a1+1640 as an integer:
| Level string | Integer value | Behavior |
|---|---|---|
"0" | 1 | Disabled (then reset to 0) |
"max" | 2 | Most optimizations skipped; forces -lsa-opt=0, -memory-space-opt=0 |
"mid" | 3 | Medium pipeline |
"min" | 4 | Close to full optimization |
Any other value produces: "libnvvm : error: -Ofast-compile called with unsupported level, only supports 0, min, mid, or max".
Only one -Ofast-compile is permitted per invocation. A second occurrence triggers: "libnvvm : error: -Ofast-compile specified more than once".
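The level parsing and duplicate check can be sketched as follows. This is an illustrative Python model: the dict stands in for the a1 structure, and the function name is an assumption, not a symbol from the binary:

```python
# Level-string -> stored integer, per the table above (stored at a1+1640).
LEVELS = {"0": 1, "max": 2, "mid": 3, "min": 4}

def set_fast_compile(state, level):
    """Sketch of -Ofast-compile handling in sub_9624D0."""
    if state.get("fast_compile"):      # slot already non-zero: duplicate flag
        raise ValueError("-Ofast-compile specified more than once")
    if level not in LEVELS:
        raise ValueError("only supports 0, min, mid, or max")
    state["fast_compile"] = LEVELS[level]
```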
Discard Value Names
The -discard-value-names flag has complex interaction semantics. In the tree, it dual-maps to --discard_value_names=1 (EDG, note underscores) and -discard-value-names=1 (cicc, note hyphens). Additionally, per-phase overrides are possible via -Xopt -opt-discard-value-names=0, -Xlnk -lnk-discard-value-names=0, or -Xlto -lto-discard-value-names=0.
In CUDA mode, without explicit flags, value names are discarded by default. In OpenCL mode, the default only applies when LTO generation is active (a13 & 0x20). This reflects the fact that value names are useful for debugging but waste memory in production builds.
Wizard Mode Interaction
The -v (verbose), -keep (keep intermediates), and -dryrun flags are parsed in sub_8F9C90 but are only effective when wizard mode is active. Wizard mode is gated on the NVVMCCWIZ environment variable: when getenv("NVVMCCWIZ") returns the string "553282", byte_4F6D280 is set to 1. Without wizard mode, these flags are silently accepted but have no effect -- v259 (verbose) and v262 (keep) remain 0. This appears to be a deliberate anti-reverse-engineering measure.
Default Values When Flags Are Absent
When a flag is not explicitly provided, sub_9624D0 applies these defaults (checking stored-value sentinels):
| Flag | Default Value | Sentinel Offset |
|---|---|---|
-opt= | -opt=3 (O3) | a1+400 |
-arch=compute_ | -arch=compute_75 (Turing) | a1+560 |
-ftz= | -ftz=0 (no flush-to-zero) | a1+592 |
-prec-sqrt= | -prec-sqrt=1 (CUDA) / -prec-sqrt=0 (OpenCL) | a1+624 |
-prec-div= | -prec-div=1 (precise) | a1+656 |
-fma= | -fma=1 (enabled) | a1+688 |
-opt-fdiv= | -opt-fdiv=0 | a1+464 |
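The sentinel mechanism can be sketched abstractly: each slot starts at an "unset" marker, and sub_9624D0 substitutes the default only if the marker survives parsing. The Python below is a simplified model with a subset of the table's flags; names and the sentinel value are assumptions:

```python
SENTINEL = -1   # stand-in for the unset marker in the a1+N slots

DEFAULTS = {"opt": "-opt=3", "arch": "-arch=compute_75",
            "ftz": "-ftz=0", "fma": "-fma=1"}

def finalize_flags(slots):
    """Any slot still holding the sentinel receives the table's default."""
    out = []
    for name, default in DEFAULTS.items():
        value = slots.get(name, SENTINEL)
        out.append(default if value == SENTINEL else value)
    return out
```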
Configuration
Four Output Vectors
sub_9624D0 builds four independent std::vector<std::string> that are serialized into char** arrays at function exit:
| Vector | Seed | Output | Pipeline Phase |
|---|---|---|---|
v324 (LNK) | "lnk" | a5/a6 | Phase 1: IR linker |
v327 (OPT) | "opt" | a7/a8 | Phase 2: NVVM optimizer |
v330 (LTO) | (none) | a9/a10 | Phase 3: LTO passes |
v333 (LLC) | "llc" | a11/a12 | Phase 4: Code generation |
Each vector element is a 32-byte std::string with SSO. At exit, elements are serialized via malloc(8 * count) for the pointer array and malloc(len+1) + memcpy for each string.
Architecture Bitmask Validation
Architecture validation in sub_9624D0 uses a 64-bit bitmask 0x60081200F821:
// The decompiler renders the mask inline; it must live in a local
// (or memory operand) for _bittest64 to take its address.
uint64_t mask = 0x60081200F821;
unsigned offset = arch_number - 75;
if (offset > 0x2E || !_bittest64((__int64 *)&mask, offset))
    // error: "is an unsupported option"
Valid architectures (bit positions): SM 75, 80, 86, 87, 88, 89, 90, 100, 103, 110, 120, 121. The a/f sub-variants share the base SM number for bitmask validation but receive distinct routing in sub_95EB40.
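The mask can be decoded mechanically to confirm that list -- each set bit at position b corresponds to SM 75+b:

```python
MASK = 0x60081200F821

# Recover the valid SM list from the bit positions (offset = SM - 75).
# 0x2F = 47 covers the highest valid offset (121 - 75 = 46).
valid_sms = [75 + bit for bit in range(0x2F) if (MASK >> bit) & 1]
assert valid_sms == [75, 80, 86, 87, 88, 89, 90, 100, 103, 110, 120, 121]
```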
Compilation Mode Flags Bitmask (a13)
The a13 parameter in sub_9624D0 is an IN/OUT bitmask tracking compilation mode:
| Bit/Mask | Source Flag | Meaning |
|---|---|---|
0x07 | (default) | Phase control: all phases active |
0x10 | -g, --generate-line-info | Debug/line-info enabled |
0x20 | -gen-lto, -gen-lto-and-llc | LTO generation enabled |
0x21 | -gen-lto | Gen-LTO mode |
0x23 | -lto | Full LTO mode |
0x26 | -link-lto | Link-LTO mode |
0x43 | --emit-optix-ir | OptiX IR emission mode |
0x80 | -gen-opt-lto | Optimized LTO lowering |
0x100 | --nvvm-64 | 64-bit NVVM mode |
0x200 | --nvvm-32 | 32-bit NVVM mode |
0x300 | (mask) | 64/32-bit mode bits mask |
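The update pattern the table implies -- mode-setting flags replace everything except the 64/32-bit bits -- can be shown in a short sketch (the helper name is hypothetical; the masking expression is taken from the flag descriptions above):

```python
def set_mode(a13, mode_bits):
    """Sketch of how mode-setting flags update a13: the 64/32-bit bits
    (mask 0x300) are preserved, all other mode bits are replaced."""
    return (a13 & 0x300) | mode_bits

a13 = 0x100                  # --nvvm-64 seen earlier in the argument loop
a13 = set_mode(a13, 0x43)    # --emit-optix-ir
assert a13 == 0x143          # OptiX mode set, 64-bit bit preserved
```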
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
sub_8F9C90 | 0x8F9C90 | 10,066 B | Real main entry point |
sub_8FE280 | 0x8FE280 | ~35 KB | Flag translation tree builder (nvcc -> cicc) |
sub_8FE150 | 0x8FE150 | -- | Tree lookup (lower_bound + insert) |
sub_8FDFD0 | 0x8FDFD0 | -- | Tree insert + rebalance |
sub_8FD0D0 | 0x8FD0D0 | -- | Architecture flag scanner (first pass) |
sub_900130 | 0x900130 | 39 KB | CLI processing Path A (12 params) |
sub_902D10 | 0x902D10 | ~9 KB | Path A orchestrator |
sub_904450 | 0x904450 | -- | Push flag to argument vector |
sub_905880 | 0x905880 | ~6 KB | EDG frontend stage |
sub_905EE0 | 0x905EE0 | 43 KB | Path A multi-stage pipeline driver |
sub_908220 | 0x908220 | -- | LLC output callback (ID 56993) |
sub_908850 | 0x908850 | -- | Triple construction (nvptx64-nvidia-cuda) |
sub_9085A0 | 0x9085A0 | -- | OPT output callback (ID 64222) |
sub_95EB40 | 0x95EB40 | 38 KB | 3-column architecture mapping table builder |
sub_9624D0 | 0x9624D0 | 75 KB | Flag catalog (4 output vectors, ~111 flags) |
sub_1262860 | 0x1262860 | -- | Path B simple dispatch |
sub_1265970 | 0x1265970 | 48 KB | Path B multi-stage pipeline driver |
Global Variables
| Address | Variable | Purpose |
|---|---|---|
qword_4F6D2A0 | Flag tree root | std::map root for sub_8FE280 |
dword_4F6D2A8 | Flag tree sentinel | tree.end() |
qword_4F6D2B0 | Flag tree root node | Root node pointer |
qword_4F6D2B8 | Flag tree begin | Leftmost node (begin iterator) |
qword_4F6D2C8 | Init guard | Set to 1 after sub_8FE280 first call |
byte_4F6D2D0 | Partial-link flag | Set by --partial-link |
byte_4F6D2DC | LLP64 flag | Set by --force-llp64 |
unk_4F06A68 | Data model width | 8 = 64-bit, 4 = 32-bit |
unk_4D0461C | Address space 3 flag | Enables p3:32:32:32 in datalayout |
byte_4F6D280 | Wizard mode | Set by NVVMCCWIZ=553282 |
Cross-References
- Entry Point & CLI -- full sub_8F9C90 analysis, Path A/B dispatch, wizard mode
- CLI Flag Inventory -- complete flag listing across all five parsing sites
- Optimization Levels -- O0-O3 and fast-compile tier pipeline details
- Environment Variables -- NVVMCCWIZ, NV_NVVM_VERSION
- EDG Frontend -- what happens after EDG flags are forwarded
- OptiX IR -- OptiX IR emission pipeline
- Optimizer -- how -opt=N and fast-compile flags affect the optimization pipeline
EDG 6.6 Frontend
NVIDIA licenses the Edison Design Group (EDG) C/C++ front end — a commercial compiler frontend used by several major compilers including Intel ICC. In cicc v13.0, EDG version 6.6 occupies 3.2 MB of code (0x5D0000–0x8F0000), making it the largest single subsystem in the binary. Unlike most modern compilers that parse directly to an SSA-based IR, EDG operates as a source-to-source translator: it parses CUDA C++ source code and emits transformed C code containing CUDA runtime API calls. This output is then fed into a second compilation phase that produces NVVM IR (LLVM bitcode). This two-stage design means the CUDA language extensions (kernel launch syntax, memory space qualifiers, device/host function annotations) are resolved entirely within EDG, and the LLVM-based backend never sees raw CUDA syntax.
The EDG frontend is configured at compile time through 737 #define macros, including GCC 8.1 emulation mode and Clang 9.1 emulation mode. Exceptions are disabled by default — CUDA device code cannot use C++ exceptions — while RTTI remains enabled for dynamic_cast support in host-side code that interacts with device objects.
| EDG version | 6.6 (string: "Based on Edison Design Group C/C++ Front End, version 6.6") |
| Entry symbol | lgenfe_main (string at sub_617BD0) |
| GCC emulation | 8.1 (DEFAULT_GNU_VERSION = 80100) |
| Clang emulation | 9.1 (DEFAULT_CLANG_VERSION = 90100) |
| C++ standards | C++98, C++11, C++14, C++17, C++20, C++23 (unk_4F07778 = year code) |
| C standards | C99, C11, C18, C23 |
| Exceptions | Disabled by default (DEFAULT_EXCEPTIONS_ENABLED = 0) |
| RTTI | Enabled by default (DEFAULT_RTTI_ENABLED = 1) |
| Target model | LP64 (TARG_SIZEOF_POINTER = 8, TARG_SIZEOF_LONG = 8) |
| Backend | C-codegen (BACK_END_IS_C_GEN_BE = 1) — emits C source, not LLVM IR directly |
| Functions | ~5,000 in range, 300+ above 5KB |
Architecture
The compilation flow through EDG has four major phases: CLI parsing (282-case switch), translation unit initialization (keyword tables, parser bootstrapping), parsing and semantic analysis (the bulk of the 3.2 MB), and backend code emission (generating three output files: .int.c for internal declarations, .device.c for device code, and .stub.c for host-side launch stubs). Error recovery uses setjmp/longjmp — any of the 478 call sites that invoke the abort handler (sub_721090) will unwind back to the orchestrator rather than crashing the process.
sub_5D2A80 (orchestrator, setjmp error recovery)
│
├─ sub_617BD0 (lgenfe_main: 282-case CLI switch, 737 config #defines)
│ ├─ sub_610260 (register 300+ CLI options)
│ └─ sub_6140E0 (option fetcher loop)
│
├─ sub_8D0BC0 (translation unit init)
│ ├─ sub_706250 (keyword table: ~350 keywords via sub_885C00)
│ ├─ sub_858C60 (parser entry)
│ └─ sub_709290 (finalize)
│
├─ sub_709330 ("Generating Needed Template Instantiations", "Wrapping up translation unit")
│
└─ sub_5E3AD0 (backend entry: "Generating NVVM IR")
├─ Opens .int.c / .device.c / .stub.c output files
├─ sub_5DB980 (top-level declaration dispatcher)
│ ├─ sub_5E13C0 (function declaration printer, 44KB)
│ ├─ sub_5DBFC0 (expression printer, 41KB, 61 self-references)
│ ├─ sub_5DFD00 (statement printer, 26KB)
│ ├─ sub_5D80F0 (initializer printer)
│ ├─ sub_5DAD30 (struct/union/enum printer)
│ └─ sub_5DF1B0 (inline asm printer)
│
└─ dlopen("libTileIRCompiler_shared.so") [optional, gated by dword_4D045A0]
└─ dlsym("cudacc_back_end") — 17-entry function pointer table
Timer callbacks record "Front end time", "Back end time", and "Total compilation time" via sub_7211D0.
Orchestrator — sub_5D2A80
The master entry point for the entire frontend. Uses setjmp for non-local error recovery — when any of the ~5,000 EDG functions detects an unrecoverable error (type system inconsistency, parser corruption, internal assertion failure), it calls sub_721090, which longjmps back to this function. The 478 call sites that reference the abort handler demonstrate just how pervasive error checking is throughout the frontend — roughly 10% of all functions in the EDG range can trigger a fatal abort.
| Global | Purpose |
|---|---|
unk_4D045D8 | Phase callback (prints "Generating NVVM IR" etc.) |
unk_4D04744 | Timer enable flag |
unk_4F074B0 | Error flag (frontend errors occurred) |
unk_4F074A8 | Warning count |
qword_4F076F0 | Input source filename |
Frontend Entry — sub_617BD0 (lgenfe_main)
At 123KB and 3,113 decompiled lines, lgenfe_main is the largest function in the EDG range. The name "lgenfe" stands for "LLVM-generating front end" — a hint that this function was originally designed for a different backend before NVIDIA adopted the EDG+LLVM architecture. The function is divided into three distinct regions: a massive 282-case switch for CLI option parsing (2,000 lines), a post-parse validation phase that checks for conflicting options and enforces CUDA-specific constraints, and a file I/O setup phase that installs 11 signal handlers and returns a pointer to the configured compilation context.
Signature: (int argc, __int64 argv).
Structure
| Region | Lines | Content |
|---|---|---|
| A | 164–2157 | 282-case switch on option ID (v6) |
| B | 2157–2700 | Post-parse validation and cross-option consistency |
| C | 2700–3113 | File I/O setup, 11 signal handlers, return &qword_4D046F0 |
Architecture Parsing (case 0x52)
compute_75, compute_80, compute_86, compute_87, compute_88, compute_89
compute_90, compute_90a
compute_100, compute_100a, compute_100f
compute_103, compute_103a, compute_103f
compute_110, compute_110a, compute_110f
compute_120, compute_120a, compute_120f
compute_121, compute_121a, compute_121f
Storage: unk_4D045E8 = SM number, unk_4D045E4 = a suffix flag, unk_4D045E0 = f suffix flag.
Configuration Emission (case 0xE1)
Emits 737 #define macros to configure the EDG compiler. Key defines:
| Define | Value | Meaning |
|---|---|---|
VERSION_NUMBER | "6.6" | EDG frontend version |
EDG_MAIN | "lgenfe_main" | Entry point symbol |
DEFAULT_GNU_VERSION | 80100 | Emulate GCC 8.1 |
DEFAULT_CLANG_VERSION | 90100 | Emulate Clang 9.1 |
DEFAULT_EXCEPTIONS_ENABLED | 0 | CUDA: no exceptions |
TARG_SIZEOF_POINTER | 8 | 64-bit pointers |
TARG_SIZEOF_LONG_DOUBLE | 16 | 128-bit long double |
TARG_LITTLE_ENDIAN | 1 | x86-64 host |
USE_SOFTFLOAT | 1 | Software FP for constexpr |
ABI_COMPATIBILITY_VERSION | 9999 | Maximum ABI compat |
MODULE_MAX_LINE_NUMBER | 250000 | Max lines per module |
CLI Option Registration — sub_610260
Registers ~300 options via sub_6101D0(id, name, flag, ...). CUDA-specific options include:
| ID | Name | Purpose |
|---|---|---|
| 51 | no-device-int128 | Disable __int128 on device |
| 59 | emit-llvm-bc | Emit LLVM bitcode directly |
| 60 | device-debug | Device-side debug info |
| 68 | force-volatile | Force volatile on memory space (global/shared/constant/local/generic/all) |
| 73 | kernel-params-are-restrict | All kernel pointer params are __restrict__ |
| 82 | nv_arch | compute_XX architecture selection |
| 93 | device-c | Separate compilation mode |
| 105 | tile-only | TileIR-only compilation |
| 124 | extended-lambda | Extended lambda support (--expt-extended-lambda) |
| 132 | emit-lifetime-intrinsics | LLVM lifetime intrinsics |
Translation Unit Processing
Translation unit processing is where EDG transitions from CLI configuration to actual compilation. The init function sets up the lexer, allocates the translation unit data structure (416 bytes), populates the keyword table with ~350 entries, and enters the recursive-descent parser. EDG uses a keyword-registration model where each keyword is individually registered with its token ID — this allows NVIDIA to add CUDA-specific keywords (like __shared__ or __nv_fp8_e4m3) without modifying the core parser grammar.
Init — sub_8D0BC0
- Reset token state (dword_4F063F8 = 0)
- Call sub_727950 (lexer init)
- Allocate 416-byte TU object via sub_823970
- Call sub_706250 — keyword table init (~350 keywords)
- Call parser entry (sub_858C60 or PCH path sub_852E40)
- Call sub_709290 — finalize
Keyword Registration — sub_706250
30KB. Calls sub_885C00(token_id, "keyword_string") ~350 times. Initializes 30+ subsystems before keyword registration. Categories:
- C89 keywords: auto, break, case, const, continue, default, do, double, else, ...
- C99 additions: _Bool, _Complex, _Generic, _Atomic, restrict, inline
- C11/C23: _Static_assert, _Thread_local, _Alignas, _Alignof, constexpr, typeof
- C++ keywords: class, template, virtual, namespace, using, try, catch, throw, ...
- C++20: co_yield, co_return, co_await, requires, concept
- Type traits (~80): __is_pod, __is_abstract, __is_trivially_copyable, __has_virtual_destructor, ...
- NVIDIA extensions: __nv_is_extended_device_lambda_closure_type, __nv_is_extended_host_device_lambda_closure_type, __nv_is_extended_device_lambda_with_preserved_return_type
- EDG internal: __edg_type__, __edg_vector_type__, __edg_neon_vector_type__, __edg_scalable_vector_type__
Version-gated by dword_4F077C4 (language mode), unk_4F07778 (standard year), qword_4F077B4 (feature flags).
Finalization — sub_709330
Strings: "Generating Needed Template Instantiations", "Wrapping up translation unit". Calls sub_8B18F0 for C++ template instantiation when dword_4F077C4 == 2.
Preprocessor
EDG includes its own preprocessor rather than relying on an external cpp. This is standard for EDG-based compilers — the preprocessor is tightly integrated with the parser to handle complex interactions between macros and C++ syntax (e.g., __VA_OPT__ in C++20, which requires the preprocessor to understand syntactic context). The preprocessor occupies ~250KB across four major functions and maintains a 99-entry predefined macro table plus a 25-entry feature-test macro table.
Token Scanner — sub_7B8B50 (59KB)
The main preprocessor tokenizer. Handles all C/C++ token kinds: identifiers, numbers (delegates to sub_7B40D0), string literals, operators, punctuators, UCN sequences. Detects C++20 module/import keywords via string comparison.
Numeric Literal Parser — sub_7B40D0 (42KB)
Second-largest preprocessor function. Handles: integer suffixes (u/U/l/L/ll/LL), float suffixes (f/F/l/L), hex floats (0x...p...), binary literals (0b...), C++14 digit separators (').
Macro Expander — sub_81B8F0 (77KB)
The central macro expansion engine. Features:
- __VA_ARGS__ (C99) and __VA_OPT__ (C++20) support
- 99-entry predefined macro table at off_4B7C440 (stride 40 bytes)
- 25-entry feature-test macro table at off_4B7C360
- Recursion limit: 300 expansions (error 0xE3)
- Intrinsic type-trait macros: __type_pack_element, __is_signed, __make_integer_seq, __is_pointer
Character Scanner — sub_7BC390 (29KB)
Giant switch on character value. Handles trigraph sequences, line splices, multi-byte characters, comment detection (// and /*).
Parser & Declaration Processing
The parser subsystem is the largest part of the EDG frontend — over 1 MB of code spread across dozens of functions. EDG uses a recursive-descent parser augmented with a declaration-specifier state machine. The state machine design is necessary because C/C++ declaration specifiers can appear in any order (const unsigned long long int and int long unsigned long const are identical), requiring the parser to accumulate specifiers into bitmasks and resolve the final type only after all specifiers have been consumed.
NVIDIA's major contribution to the parser is the CUDA type extension infrastructure: 19 new FP8/FP6/FP4/MX-format type tokens (339–354) for Blackwell's tensor core operations, 9 address-space qualifier tokens (272–280) for GPU memory spaces, and 4 memory-space declaration specifiers (133–136) that piggyback on the existing width-modifier field. These extensions are grafted onto EDG's type system in a way that minimizes changes to the core parser logic — CUDA qualifiers reuse existing state variables with previously-unused value ranges.
Declaration Specifier State Machine — sub_672A20 (132KB, 4,371 lines)
The central parser function and one of the most complex functions in the binary. A while(2)/switch dispatcher on token codes from word_4F06418[0] with ~80 case labels. It accumulates type specifiers, qualifiers, storage-class specifiers, and CUDA address-space qualifiers from the token stream into a set of bitmask variables, then constructs the final type node from the accumulated state.
State Variables
| Variable | Stack | Bits | Role |
|---|---|---|---|
v325 | [rsp+B8h] | uint | Type specifier kind (see table below) |
v327 | [rsp+C0h] | uint64 | Specifier category bitmask |
v307 | [rsp+90h] | int | CV-qualifier accumulation bits |
v302 | [rsp+78h] | uint | Long count (0=none, 1=long, 2=long long) |
v305 | [rsp+84h] | uint | Signedness/width — reused for CUDA (4–7) |
v299 | [rsp+68h] | int | _Complex (1) / _Imaginary (2) tracking |
Type Specifier Kind (v325)
| Value | Meaning | Token Case |
|---|---|---|
| 0 | None yet | — |
| 2 | char | 80 |
| 3 | wchar_t | 165 |
| 4 | bool / _Bool | 128 / 120 |
| 5 | float | 126 |
| 6 | double | 127 |
| 7 | void | 180 |
| 8 | signed / __int8 | 93 / 239 |
| 9 | __float128 | 331 |
| 12 | int (explicit) | 89 |
| 14 | __float16 | 332 |
| 15 | short / half | 85 |
| 16 | _Float16 | 333 |
| 17 | __bf16 | 334 |
| 19 | bfloat16 | 335 |
| 20 | Resolved typedef/CUDA type name | scope lookup |
| 21 | struct/union/enum tag | 101/104/151 |
| 23 | decltype() | 183 |
| 24 | auto (deduced) | 186 |
| 25 | Resolved identifier type | C++ lookup |
| 26 | Error recovery type | diagnostic |
Specifier Bitmask (v327)
| Bit | Mask | Meaning |
|---|---|---|
| 0 | 0x1 | Storage class (extern/static/etc.) |
| 1 | 0x2 | CV-qualifier seen |
| 2 | 0x4 | Type specifier seen |
| 3 | 0x8 | friend specifier |
| 4 | 0x10 | __declspec / attribute seen |
| 5 | 0x20 | explicit specifier |
| 6 | 0x40 | inline specifier |
| 7 | 0x80 | _Thread_local / thread_local |
| 10 | 0x400 | typeof / decltype |
| 12 | 0x1000 | __declspec() already processed |
| 13 | 0x2000 | explicit(bool) already processed |
| 14 | 0x4000 | _Noreturn / [[noreturn]] |
| 15 | 0x8000 | _Atomic |
CV-Qualifier Bits (v307)
| Bit | Mask | Qualifier |
|---|---|---|
| 0 | 0x01 | const (case 81) |
| 1 | 0x02 | volatile (case 107) |
| 2 | 0x04 | restrict / __restrict (cases 118/119) |
| 3 | 0x08 | __unaligned (case 263 with parens) |
| 4 | 0x10 | __ptr32 (case 264) |
| 5 | 0x20 | __ptr64 (case 265) |
| 6 | 0x40 | __sptr / __uptr (case 266) |
Duplicate CV qualifiers trigger diagnostic 83.
CUDA Memory Space Tokens (133–136)
These piggyback on the signedness/width field v305 with values 4–7:
| Token | Keyword | v305 | v325 | Formula |
|---|---|---|---|---|
| 133 | __shared__ | 4 | 2 | Special case |
| 134 | __device__ | 5 | 8 | token - 129 |
| 135 | __constant__ | 6 | 8 | token - 129 |
| 136 | __managed__ | 7 | 8 | token - 129 |
Clean separation: values 0–3 = standard C width modifiers, 4–7 = CUDA address-space qualifiers. The type-construction switch handles both ranges.
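The token-to-field mapping can be captured in a few lines. A sketch assuming the table above (the function name is hypothetical):

```python
def memory_space_fields(token):
    """Map CUDA memory-space tokens 133-136 to (v305, v325).

    __shared__ (133) is special-cased with a different type-specifier
    kind; the rest use the decompilation's "token - 129" formula.
    """
    assert 133 <= token <= 136
    if token == 133:                # __shared__
        return 4, 2
    return token - 129, 8           # 134 -> 5, 135 -> 6, 136 -> 7
```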
CUDA Extended Type Tokens (339–354)
| Token | Type | Format |
|---|---|---|
| 236 | __nv_fp8_e4m3 | FP8 |
| 339 | __nv_fp8_e5m2 | FP8 |
| 340–343 | __nv_fp8x{2,4}_e{4m3,5m2} | FP8 vector |
| 344–345 | __nv_fp6_e{2m3,3m2} | FP6 |
| 346–347 | __nv_fp6x2_e{2m3,3m2} | FP6 vector |
| 348–349 | __nv_mxfp8_e{4m3,5m2} | MX-format FP8 |
| 350–351 | __nv_mxfp6_e{2m3,3m2} | MX-format FP6 |
| 352 | __nv_mxfp4_e2m1 | MX-format FP4 |
| 353 | __nv_satfinite | Saturation type |
| 354 | __nv_e8m0 | Exponent-only E8M0 |
All resolve via sub_6911B0() → type node, then set v325=20, v327|=4.
CUDA Address Space Qualifier Tokens (272–280)
| Token | Keyword | Space ID | Handler |
|---|---|---|---|
| 272 | __attribute__((address_space(N))) | parsed int | sub_6210B0 |
| 273 | __global__ | 0 | sub_667B60(0,...) |
| 274 | __shared__ (addr space) | 2 | sub_667B60(2,...) |
| 275 | __constant__ (addr space) | 3 | sub_667B60(3,...) |
| 276 | __generic__ | — | sub_72B620(type, cv) |
| 277 | __nv_tex_surf_handle_t | — | sub_72BA30(unk_4F06A51) |
| 278 | __nv_buffer_handle_t | — | sub_72BA30(unk_4F06A60) |
| 279 | __nv_grid_constant | — | sub_72C390() |
| 280 | __nv_is_extended_device_lambda | — | sub_72C270() |
Type Construction Functions
| Function | Purpose | Trigger |
|---|---|---|
| sub_72BA30(code) | Fundamental signed integer type | int, short, long, long long |
| sub_72BC30(code) | CUDA extended-width integer | CUDA mode + v305 > 3 |
| sub_72BCF0(code) | Unsigned fundamental type | unsigned combos |
| sub_72BDB0(code) | CUDA unsigned extended type | CUDA mode + unsigned |
| sub_72BF70() | float type | v325 == 5 |
| sub_72C030() | double type | v325 == 6 |
| sub_72C0F0() | long double type | long + double |
| sub_72C1B0() | __float128 type | v325 == 9 |
| sub_72C610(kind) | Float-by-kind (mapped from v325) | FP8/FP6/BF16/etc. |
| sub_72C6F0(kind) | _Complex float variant | v299 == 1 |
| sub_72C7D0(kind) | _Imaginary float variant | v299 == 2 |
| sub_72C930(code) | Error/placeholder type | diagnostic issued |
| sub_72CBA0() | Dependent type | v325 == 25 |
| sub_72CBE0(...) | __int128 type | v325 == 1 |
| sub_73C570(type, cv, flags) | Apply CV-qualifiers to type | post-construction |
Accumulation Flow
- Initialize: all state variables to 0
- Loop: read word_4F06418[0], dispatch through switch — set bitmask bits, update kind/cv/width
- Exit: unrecognized token → LABEL_8 (default exit)
- Type construction: switch on v325 × v302 × v305 → call appropriate sub_72B*/sub_72C*
- CV application: sub_73C570 wraps the type with const/volatile/restrict
- Return: type stored at ds->field_272, CV bits at ds->field_120
Declaration Specifier Parser — sub_7C0F00 (184KB, 3,953 lines)
Uses goto-driven dispatch (393 LABEL_ references) — NOT a switch/case. This is a massive state machine for declaration specifier resolution. Self-recursive at line 2407 with flags=20 for nested declarator parsing.
Top-Level Declaration Parser — sub_662DE0 (61KB)
Declarator parsing — handles pointer (*), reference (&/&&), array ([]), and function (()) declarators. Uses SSE __m128i for bulk struct copying of 64-byte EDG type nodes.
Overload Resolution — sub_6523A0 (64KB)
The master overload resolution function. Given a declaration being introduced and a set of existing candidates from name lookup, it decides whether the declaration is a new overload, a redeclaration, or an error. At 2,448 decompiled lines with 39 diagnostic call sites, it is one of the heaviest diagnostic emitters in the frontend.
Candidate collection uses a 72-byte ranking context (v320 on stack) and dispatches to one of three collectors: sub_644100 for non-member/ADL candidates, sub_648CF0 for member + using-declaration candidates (chosen when C++ mode, prior declaration exists, and the class has base classes or is a template), or sub_6418E0 for C-linkage functions. The best candidate is selected by sub_641B60.
__builtin_ prefix forwarding (lines 2060-2162): after resolution, if the resolved symbol is a bodyless non-member function, the resolver checks if a compiler builtin equivalent exists. It hardcodes three function names by length: "abs" (3), "ceil" (4), "strlen" (6). For each, it constructs "__builtin_" + name in a scratch buffer at qword_4F06C50, looks it up via sub_878540, then compares parameter types via sub_8DED30(type1, type2, 0x100004) (exact match + qualification conversion). On match, the builtin's scope entry is linked into the user function's auxiliary data at offset +256 field 8.
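The forwarding check can be sketched as follows. This is a reconstruction, not the decompiled code: the helper name `builtin_equivalent` is hypothetical, and the scratch-buffer construction at qword_4F06C50 is modeled here with a plain `std::string`.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Sketch of the __builtin_ prefix forwarding check (helper name
// hypothetical). The resolver hardcodes three candidates and keys the
// comparison on length before comparing the bytes, mirroring the
// "by length" dispatch seen in the decompilation.
inline std::string builtin_equivalent(const char *name) {
    static const char *candidates[] = {"abs", "ceil", "strlen"};
    size_t n = std::strlen(name);
    for (const char *c : candidates) {
        if (std::strlen(c) == n && std::strcmp(c, name) == 0)
            return std::string("__builtin_") + name;  // scratch-buffer construction
    }
    return {};  // no compiler builtin equivalent; resolution stands as-is
}
```

In the binary, the constructed name is then looked up via sub_878540 and the parameter lists are compared before the builtin is linked in.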
OpenMP variant dispatch (lines 727-752): when unk_4D03A10 is set, the resolver renames the declaration to "<name>$$OMP_VARIANT%06d" using a monotonic counter unk_4D03A0C. This creates unique internal names for each device/host specialization.
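The renaming scheme is simple enough to sketch directly. The function name is hypothetical; the real counter is the global unk_4D03A0C, modeled here as a function-local static.

```cpp
#include <cstdio>
#include <string>

// Sketch of the $$OMP_VARIANT internal renaming (helper name
// hypothetical). Each call consumes one value of a monotonic counter,
// producing the zero-padded "%06d" suffix observed in the binary.
inline std::string omp_variant_name(const std::string &name) {
    static unsigned counter = 0;  // stands in for unk_4D03A0C; never reset
    char suffix[8];
    std::snprintf(suffix, sizeof suffix, "%06u", counter++);
    return name + "$$OMP_VARIANT" + suffix;
}
```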
Constexpr/consteval propagation (lines 2288-2301): gated by unk_4F07778 (C++ standard year). For C++11 and later, byte +204 of the scope entry is bit-packed with three globals: bits 5-6 = unk_4F06C58 (constexpr disposition), bits 1-2 = unk_4F06C5A (consteval disposition), bits 3-4 = unk_4F06C59 (immediate-function flag). Diagnostic 2383 fires on constexpr mismatch between declaration and definition.
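The bit-packing of byte +204 can be expressed as a small helper. This is a sketch with hypothetical parameter names; the bit positions are as recovered above.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the scope-entry byte +204 packing (parameter names
// hypothetical). Bits 5-6 carry the constexpr disposition, bits 1-2
// the consteval disposition, bits 3-4 the immediate-function flag.
inline uint8_t pack_constexpr_byte(unsigned constexpr_disp,
                                   unsigned consteval_disp,
                                   unsigned immediate_flag) {
    return static_cast<uint8_t>(((constexpr_disp & 3u) << 5) |
                                ((consteval_disp & 3u) << 1) |
                                ((immediate_flag & 3u) << 3));
}
```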
Device/host overload sets: CUDA allows the same function name to have both __host__ and __device__ overloads. EDG does not treat execution space as part of the function signature for overload resolution purposes -- the standard C++ overload rules apply first, and execution space filtering happens later during code generation. The $$OMP_VARIANT renaming mechanism is used for OpenMP dispatch variants that need distinct host/device specializations, but regular CUDA __host__/__device__ overloads rely on the backend's execution space filtering rather than frontend overload resolution. This means that if two functions have identical C++ signatures but differ only in __host__ vs __device__, they are treated as redeclarations (not overloads) at the EDG level, and the execution space annotation at scope entry offset +198 determines which version survives into device or host code.
CUDA Memory Space Processing — sub_6582F0 (22KB)
Validates __shared__, __constant__, __managed__ attributes on declarations. Emits a diagnostic for automatic variables declared in inappropriate memory spaces.
Type System
Type Node Layout (192 bytes = 12 x __m128i)
| Offset | Size | Field |
|---|---|---|
| +8 | 8 | Next pointer (linked lists) |
| +40 | 8 | Name pointer |
| +48 | 1 | Declaration kind byte |
| +80 | 1 | Entity kind byte |
| +140 | 1 | TYPE KIND DISCRIMINATOR (the central dispatch key) |
| +160 | 8 | Inner/child type pointer (typedef chains, pointer bases) |
| +168 | 8 | Member list / parameter chain |
| +173 | 1 | Specifier/node kind byte |
| +176 | 2 | Entity kind (uint16, dispatch key for constexpr evaluator) |
| +185 | 1 | CV-qualifier bits (bit 0=const, 1=volatile, 2=restrict) |
| +200 | 1 | Attribute flags |
Type kind discriminator values at offset +140:
| Value | Type | Notes |
|---|---|---|
| 0 | void | |
| 1 | error type | Sentinel |
| 2–4 | fundamental (char, int, ...) | |
| 5 | pointer | Follows +160 chain |
| 6 | pointer-to-member | |
| 7 | function type | Complex: 17 sub-kinds for calling conventions |
| 8 | array | Element count at +128, element type at +160 |
| 9–11 | class / struct / union | Members at +168 |
| 12 | typedef / cv-qualified | Follow +160 for underlying type (critical: skip in type-walk loops) |
| 13 | enum | |
| 14 | void (incomplete) | |
| 15 | vector | Element count at +128 |
| 19 | decltype | |
| 21 | placeholder / auto |
Scope Table Entry (776 bytes)
Indexed by dword_4F04C64 into base qword_4F04C68:
| Offset | Field |
|---|---|
| +0 | Scope identifier |
| +4 | Scope kind (5=namespace, 6=class, 7=function, 8=block, 9=enum, 12=template) |
| +6–10 | Flag bytes |
| +24 | Name list head |
| +32 | Name list tail |
| +208 | Class type pointer |
| +232 | Deferred list |
| +328 | Template info |
| +552 | Parent scope index |
| +624 | Declaration pointer |
| +680 | Linkage specification |
Type Comparison — sub_7386E0 (23KB)
The core type equivalence engine. Takes two type node pointers packed in an __int128 and a flags word, returns boolean equality. The flags word controls comparison mode: bits 0-1 select cv-qualifier strictness (0=strict, 1=relaxed, 2=overload), bit 2 enables template matching (class-equivalence shortcuts), and bit 5 enables anonymous-class structural comparison.
Entry sequence: both types are first canonicalized through sub_72EC50, which peels through chains of non-template typedef aliases. The canonicalizer checks three fields on the elaborated type node: +173 == 12 (typedef kind), +176 == 1 (single-member), and +170 bit 4 == 0 (no template specialization). If all hold, it unwraps one level via sub_72E9A0 and loops. This means typedef int MyInt; typedef MyInt YourInt; canonicalizes YourInt directly to int.
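The peeling loop can be sketched with a simplified node. The struct layout here is hypothetical; the named fields stand in for the byte offsets the real canonicalizer checks (+173, +176, +170 bit 4, and the +160 chain).

```cpp
#include <cassert>

// Minimal sketch of the typedef-peeling canonicalizer. Field names are
// hypothetical stand-ins for the recovered byte offsets.
struct TypeNode {
    int kind;              // stands in for byte +173 (12 = typedef)
    bool single_member;    // stands in for +176 == 1
    bool template_spec;    // stands in for +170 bit 4
    TypeNode *underlying;  // stands in for the +160 chain
};

inline TypeNode *canonicalize(TypeNode *t) {
    // Unwrap one alias level per iteration, as sub_72EC50 does via
    // sub_72E9A0, until a non-typedef (or template alias) is reached.
    while (t && t->kind == 12 && t->single_member && !t->template_spec)
        t = t->underlying;
    return t;
}
```

With `typedef int MyInt; typedef MyInt YourInt;`, the loop takes `YourInt` through `MyInt` straight to `int` in two iterations.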
After canonicalization, a quick-reject compares three header bytes without recursing: byte +24 (type kind) must match exactly, bytes +25 XOR must be zero for bits 0x03 (const/volatile) and 0x40 (restrict), and byte +26 XOR must be zero for bit 0x04. Any mismatch short-circuits to return 0.
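The quick-reject reduces to three masked byte comparisons. A sketch, treating the type node as a raw byte buffer (the function name is hypothetical; offsets and masks are as recovered):

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the three-byte quick-reject. Byte +24 is the type kind,
// byte +25 carries cv bits, byte +26 extra flags; only the masked
// bits participate in the XOR comparison.
inline bool quick_reject(const uint8_t *a, const uint8_t *b) {
    if (a[24] != b[24]) return true;          // kind must match exactly
    if ((a[25] ^ b[25]) & 0x43) return true;  // const/volatile (0x03) + restrict (0x40)
    if ((a[26] ^ b[26]) & 0x04) return true;  // byte +26, bit 0x04
    return false;                             // survives to the main switch
}
```

Bits outside the masks (e.g. 0x04 in byte +25) are ignored, so two nodes differing only in unmasked bits proceed to the full recursive comparison.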
The main switch dispatches on 38 type kinds. Key cases for CUDA:
- Case 1 (fundamental): compares sub-kind at +56, extra flags at +58 (bits 0x3A), and the base type chain at +72. For integer sub-kind (sub_kind == 'i'), follows a resolution chain to find the underlying class scope. In template matching mode (flags bit 2), uses sub_8C7520 to check whether two class instantiations share the same primary template, then sub_89AB40 to compare template argument lists. This path handles CUDA's exotic numeric types (__nv_fp8_e4m3, __nv_fp8_e5m2, etc.), which are represented as fundamental types with distinct sub-kinds.
- Case 3 (class/struct/union): fast identity via scope pointer equality, then unique-ID shortcut via dword_4F07588. For anonymous classes with template matching, calls sub_740200 to extract canonical member lists and performs structural comparison. This is relevant for CUDA lambda closure types, which are anonymous classes.
- Case 33 (using-declaration/alias): in overload mode (flags bit 1), performs a hash table lookup via *qword_4D03BF8 to retrieve base class triples and compares them element-by-element. This ensures that two using declarations resolving to different base classes are treated as distinct for overload discrimination.
Overload mode specifics (flags & 2): the post-switch check additionally verifies that both types agree on the presence/absence of the +80 "extra declaration" pointer. Template parameters are forced unequal (never match for overload purposes without being identical). Scope pointer equivalence is verified via unique-ID for using-declaration discrimination.
CUDA type equivalence: the NVIDIA-specific float types (__nv_fp8_e4m3, __bf16, _Float16, etc.) each have distinct sub-kind values at type node +56 (see the type mangling table: sub-kind 0 = _Float16, 1 = __fp16, 9 = __bf16, 0xA = _Float16 alternate, 0xB = _Float32, 0xC = _Float64, 0xD = _Float128). The type comparison treats them as distinct fundamental types -- _Float16 and __fp16 are NOT equivalent despite both being 16-bit floats. The half type in CUDA maps to _Float16 (sub-kind 0 or 0xA depending on context), while __half in cuda_fp16.h is a wrapper struct (type kind 9, class/struct), so half and __half are never type-equivalent at the EDG level. User code relies on implicit conversions defined in the CUDA headers, not on type equivalence.
Type-to-String Emitter — sub_74A390 (29KB, 19 callers)
The backbone type printer. Walks type nodes recursively, emitting textual representation for diagnostics. Handles NVIDIA-specific types: __surface_type__, __texture_type__, __nv_bool.
IL Tree Infrastructure
EDG represents parsed code as an Intermediate Language (IL) tree — a rich AST that preserves full C++ semantic information including template instantiation state, scope chains, and type qualifiers. The IL is not LLVM IR; it is EDG's proprietary tree representation that predates the LLVM integration. All semantic analysis, template instantiation, and overload resolution operate on this tree.
The IL tree is traversed by four structurally identical walker functions that share the same 87 node-type dispatch table. The walkers are instantiated from a common template with different callback functions — a design pattern where the traversal logic is fixed but the action at each node is parameterized through function pointers stored in six global variables. This callback-driven walker system is central to EDG's architecture: template instantiation, type checking, code emission, and tree copying all use the same walker infrastructure with different callbacks.
| Function | Size | Self-recursive Calls | Purpose |
|---|---|---|---|
| sub_7506E0 | 190KB | 297 | Primary walker |
| sub_760BD0 | 109KB | 427 | Parallel walker (deeper traversal) |
| sub_75C0C0 | 87KB | 316 | Third-pass walker |
| sub_766570 | 148KB | 2 | Copier/transformer (takes callback params) |
Walker Callback System
Six global function pointers form the visitor dispatch table:
| Global | Role |
|---|---|
| qword_4F08028 | Node pointer remapper (called before recursion) |
| qword_4F08020 | Linked-list child remapper |
| qword_4F08038 | String field processor |
| qword_4F08030 | Pre-visit callback (return nonzero to skip) |
| qword_4F08040 | Post-visit callback |
| dword_4F08014 | Skip-shared-nodes flag |
| dword_4F08018 | Clear/detach mode (null out fields for ownership transfer) |
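The walker pattern itself can be sketched compactly. This is a reconstruction with hypothetical names: the two callbacks model qword_4F08030 (pre-visit, nonzero return skips the subtree) and qword_4F08040 (post-visit, runs after the children); the real walkers dispatch over 87 node types rather than a uniform child list.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Sketch of the callback-driven IL walker (all names hypothetical).
// The traversal logic is fixed; the action at each node is
// parameterized through the callbacks.
struct ILNode {
    int id;
    std::vector<ILNode *> children;
};

struct WalkerCallbacks {
    std::function<int(ILNode *)> pre;    // nonzero => skip this subtree
    std::function<void(ILNode *)> post;  // runs after the children
};

inline void walk(ILNode *n, const WalkerCallbacks &cb) {
    if (cb.pre && cb.pre(n)) return;
    for (ILNode *c : n->children) walk(c, cb);
    if (cb.post) cb.post(n);
}
```

Template instantiation, type checking, and tree copying would each supply a different callback pair over the same `walk`.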
IL Node Types (87 types, from walker case labels)
| ID | Type | ID | Type |
|---|---|---|---|
| 1 | source_file | 28 | integral_constant |
| 2 | scope (15 sub-kinds) | 29 | float_constant |
| 3 | type_qualifier | 30 | expression (generic) |
| 4 | simple_type | 41 | call_expression |
| 5 | pointer_type | 42 | cast_expression |
| 6 | function_type (17 sub-kinds) | 43 | conditional_expression |
| 7 | class_type | 44 | string_literal |
| 8 | enum_type | 48 | template_argument (4 sub-kinds) |
| 9 | array_type | 59 | concept_expression (10 sub-kinds) |
| 10 | bitfield_type | 65 | type_list (core linked list) |
| 13 | statement (30+ sub-kinds) | 75 | block/compound_statement |
| 23 | scope_entry (root) | 76 | access_specifier |
Deep Copy — sub_766570 with sub_8C2C50
sub_8C2C50 calls sub_766570 with copy callback sub_8C38E0 and list-copy callback sub_8C3810. Node size table at qword_4B6D500[node_type] provides memcpy sizes. Critical for template instantiation.
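The size-table-driven copy is straightforward to sketch. The table contents and node layout here are invented for illustration; the real table lives at qword_4B6D500 and is indexed by the node-type id.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Sketch of the node-size-table copy (table values hypothetical).
// The real copier additionally invokes per-field callbacks to fix up
// child pointers; this shows only the memcpy core.
static const size_t kNodeSize[] = {0, 24, 48, 16};  // indexed by node type

inline void *clone_node(const void *node) {
    int type;
    std::memcpy(&type, node, sizeof type);     // node type is the first field
    size_t n = kNodeSize[type];
    void *copy = std::malloc(n);
    std::memcpy(copy, node, n);                // shallow byte copy of the node
    return copy;                               // child fixup is the callback's job
}
```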
Constexpr Evaluator
The constexpr evaluator is arguably the most technically impressive subsystem in the EDG frontend. It is a complete tree-walking interpreter that can execute arbitrary C++ code at compile time, implementing the full C++20 constexpr specification including heap allocation (constexpr new), string literals, virtual function dispatch, and complex control flow. At 317KB for the expression evaluator alone, plus 77KB for the statement executor and ~200KB in supporting functions, it constitutes nearly 20% of the entire EDG frontend.
The evaluator operates on EDG's IL tree directly — it does not compile to bytecode or any intermediate form. Instead, it recursively walks expression and statement nodes, maintaining its own memory model (a 3-tier page arena), variable bindings (an open-addressing hash table), and lifetime tracking (scope epoch counters). This design trades execution speed for implementation simplicity and guaranteed semantic fidelity with the compiler's own type system.
Signature:
bool constexpr_eval_expr(
constexpr_ctx *ctx, // a1: evaluation context (hash table, arena, flags)
expr_node **expr, // a2: expression AST node
__m128i *result, // a3: output value slot (16 or 32 bytes)
char *frame_base // a4: stack frame base pointer for lifetime tracking
);
Expression Evaluator — sub_786210 (317KB, 9,075 lines)
The largest function in the entire EDG frontend. Two-level dispatch: outer switch on expression kind *(a2+24), inner switch on operator code *(a2+56) with 124 cases.
Outer Switch — Expression Kinds
| Kind | Hex | Meaning | Notes |
|---|---|---|---|
| 0 | 0x00 | Void/empty | Sets ctx+132 flag bits |
| 1 | 0x01 | Operator expression | → 124-case inner switch on *(a2+56) |
| 2 | 0x02 | Variable reference | Hash table lookup, kind==1(const) or kind==3(constexpr) |
| 3 | 0x03 | Function reference / enumerator | Subkind==5: has constexpr body → recurse |
| 4 | 0x04 | Literal (int/float constant) | Immediate return — value is in the node |
| 5–6 | 0x05–06 | String / compound literal | C++20 mode required (dword_4F077C4 == 2) |
| 7 | 0x07 | Function call | Most complex case (~1200 lines) |
| 10 | 0x0A | Parenthesized expression | Recurse on a2[7] |
| 11 | 0x0B | Member access (->) | Navigate member hierarchy via type-size table |
| 17 | 0x11 | Lambda expression | Save/restore ctx+72, execute body via sub_7987E0 |
| 18 | 0x12 | Capture variable | Hash table lookup by a2[7] |
| 20 | 0x14 | Address-of | Set flags a3+8 = 0x20 (IS_SYMBOLIC) |
| 23 | 0x17 | sizeof / alignof | Delegate to sub_620D80 |
| 24 | 0x18 | Subscript (array[index]) | Bounds check, compute elem_size * index |
| 27 | 0x1B | Implicit conversion | Navigate chain, recurse on inner |
| 31 | 0x1F | Requires expression (C++20) | Execute body via sub_79B7D0 |
| 32 | 0x20 | Type trait | sub_693DC0 → xmmword_4F08280/xmmword_4F08290 |
| 33 | 0x21 | SFINAE / substitution failure | Template context check, sub_6F2300 |
Inner Switch — Operator Codes (124 cases, selected)
| Cases | Category | Operations |
|---|---|---|
| 0–1 | Assignment | = / initialization (ref types: 32-byte memcpy) |
| 3–4 | Conversion | Lvalue-to-rvalue via sub_7A0070 |
| 5 | Type cast | static_cast — massive dispatch: int→int(sub_622780), float→float(sub_709EF0), int→float(sub_710280), ptr→ptr(sub_770010) |
| 14–15 | Member access | . and -> — offset via sub_8D5CF0, virtual base via sub_771030 |
| 16–17 | Pointer arithmetic | Subtraction, ptrdiff_t via sub_7764B0 |
| 20, 29 | Comparison | ==, != via sub_7759B0 |
| 26–28 | Unary | ++, --, unary minus (sub_621DB0) |
| 30–31 | Vector ops | Element-wise comparison loop, broadcast |
| 39–45 | Arithmetic | +(sub_621270), -(sub_6215F0), *(sub_621F20), /(sub_6220A0), %(sub_6220C0), <<(sub_70BBE0), >>(sub_70BCF0) — all with overflow/divzero checks |
| 46–49 | Bitwise | &, \|, ^, ~ |
| 50–57 | Logical | &&, \|\| with short-circuit evaluation |
| 58–59 | Detailed comparison | Integer(sub_621000), float(sub_70BE30), pointer(address+symbolic) |
| 64 | Spaceship | <=> → strong_ordering values at unk_4F06BD8–unk_4F06C30 |
| 73–84 | Compound assignment | += through ^= with lifetime validation, const-check (diag 0x1318) |
| 91–93 | Conditional | Ternary ?:, array subscript (bounds-checked, error 0xA84) |
| 94–95 | Virtual dispatch | Vtable lookup → sub_79CCD0 |
| 96–97 | Allocation | Placement new / operator new |
| 103 | Exception | throw (always fails in constexpr) |
| 105–108 | Delegated | → sub_77FCB0 (builtin operators) |
Value Slot Layout (16 bytes at a3)
| Offset | Size | Field |
|---|---|---|
| 0–7 | 8 | Primary value (integer, IEEE float, or arena pointer) |
| 8 | 1 | Flags byte (see below) |
| 9–11 | 3 | Alignment info, compound assignment tracking |
| 12–15 | 4 | Scope epoch ID (lifetime validation) |
Extended slot (32 bytes for reference types) adds secondary address at +16 and frame base at +24.
Flags Byte (offset +8)
| Bit | Mask | Name | Meaning |
|---|---|---|---|
| 0 | 0x01 | IS_POINTER | Value is an indirect pointer |
| 1 | 0x02 | IS_PAST_END | One-past-the-end pointer |
| 2 | 0x04 | HAS_CLEANUP | Destructor chain at +16 |
| 3 | 0x08 | HAS_SUBOBJECT | Refers to a subobject |
| 4 | 0x10 | HAS_BITFIELD | Bitfield offset in bits 8–31 |
| 5 | 0x20 | IS_SYMBOLIC | Unresolved symbolic reference |
| 6 | 0x40 | IS_CONST | From a const declaration |
| 7 | 0x80 | IS_ARRAY_MEMBER | Part of array storage |
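The flags byte maps onto a small enum. The constant names below are hypothetical (the masks are as recovered), and the bitfield-offset helper reflects one plausible reading of "bitfield offset in bits 8–31", treating the flags byte as the low byte of a 32-bit word.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the value-slot flags byte (names hypothetical, masks as
// recovered from the evaluator).
enum SlotFlags : uint8_t {
    IS_POINTER      = 0x01,  // value is an indirect pointer
    IS_PAST_END     = 0x02,  // one-past-the-end pointer
    HAS_CLEANUP     = 0x04,  // destructor chain at +16
    HAS_SUBOBJECT   = 0x08,  // refers to a subobject
    HAS_BITFIELD    = 0x10,  // bitfield offset in bits 8-31
    IS_SYMBOLIC     = 0x20,  // unresolved symbolic reference
    IS_CONST        = 0x40,  // from a const declaration
    IS_ARRAY_MEMBER = 0x80,  // part of array storage
};

// Assumed layout: flags in the low byte, bitfield offset above it.
inline uint32_t bitfield_offset(uint32_t flags_word) {
    return (flags_word & HAS_BITFIELD) ? (flags_word >> 8) : 0;
}
```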
Statement Executor — sub_795660 (77KB)
Dispatch on *(a2+40) — statement kind:
| Case | Kind | Notes |
|---|---|---|
| 0 | Declaration | Arena alloc → eval initializer → insert into scoped hash table |
| 1–4 | If / if-else / if-init / if-constexpr | Condition → bool via sub_620EE0 → branch |
| 5 | While loop | Step counter at ctx+120, limit from qword_4D042E0 (~1M). Error 0x97F on exceeded. |
| 6 | Jump (break/continue/goto) | Sets control flow bits: bit 1=continue, bit 2=break, bit 3=goto |
| 7,15,24 | Null/empty | Return success |
| 8 | Return | Walk call chain at ctx+72, store result, set "returned" flag |
| 11 | Expression statement | Evaluate for side effects via sub_7987E0 |
| 12 | For loop | Init → alloc → [condition → body → increment → cleanup] loop |
| 13 | Do-while | Delegates to sub_7A0E60 |
| 14 | Range-based for | 4 temp slots via sub_77A250, iterator advance via sub_7A0470 |
Memory Management — 3-Tier Page Arena
| Tier | Location | Page Size | Threshold | Purpose |
|---|---|---|---|---|
| Primary | ctx+16/ctx+24 | 64KB | default | Expression evaluation temporaries |
| Secondary | ctx+144/ctx+152 | 64KB | lazy init (ctx+132 & 8) | Variable declarations |
| Tertiary | ctx+80 | 64KB | nullable | String/compound literals |
Overflow: allocations >1024 bytes go to heap via sub_822B10(size+16), forming a singly-linked list from ctx+32. Freed by walking until scope epoch matches.
Value slot header: type pointer at offset -8 (8 bytes), lifetime bits at offset -9 (1 byte, bit 0 = "initialized").
Scope epoch: monotonic counter at ctx+128. Hash table at ctx+56/ctx+64 maps epoch → page state. Arena rewound on scope exit.
Hash Table (ctx+0/ctx+8)
Open-addressing with 16-byte entries [key, value]. Hash: key_pointer >> 3. Collision: linear probing. Doubles at 2 * count > capacity (via sub_7704A0). Secondary table at ctx+56/ctx+64/ctx+68 uses 4-byte integer keys (scope epoch IDs).
Diagnostic Codes
| Code | Hex | Meaning |
|---|---|---|
| 61 | 0x3D | Division by zero |
| 2431 | 0x97F | Step limit exceeded |
| 2692 | 0xA84 | Array index out of bounds |
| 2695 | 0xA87 | Unsupported jump in constexpr |
| 2698 | 0xA8A | Null pointer dereference |
| 2705 | 0xA91 | Negative shift count |
| 2707 | 0xA93 | Integer overflow/underflow |
| 2712 | 0xA98 | Use of uninitialized variable |
| 2721 | 0xAA1 | Not a constant expression (generic) |
| 2727 | 0xAA7 | Invalid type conversion |
| 2735 | 0xAAF | Pointer below array start |
| 2751 | 0xABF | Access outside lifetime |
| 2766 | 0xACE | Modification through null pointer |
| 2959 | 0xB8B | Missing return in constexpr function |
| 3007 | 0xBBF | reinterpret_cast in constexpr |
| 3022 | 0xBCE | Call to undefined constexpr function |
Silent mode: ctx+132 bit 5 (0x20) suppresses diagnostics (SFINAE contexts).
Constexpr and CUDA: Host-Side Evaluation of Device Code
A key architectural question for any CUDA compiler is whether constexpr functions annotated __device__ are evaluated at host compile time. In cicc v13.0, the answer is yes, conditionally. The constexpr evaluator operates entirely within the EDG frontend, which runs on the host. When a constexpr __device__ function is used in a context requiring a constant expression (template argument, array bound, static_assert, constexpr variable initializer), the evaluator executes it using its tree-walking interpreter regardless of the function's execution space annotation. The execution space attributes (__device__, __host__, __global__) are semantic annotations for code generation, not for the constexpr evaluator -- the evaluator sees only the IL tree and does not distinguish between host and device function bodies.
This works because EDG's constexpr evaluator uses software floating point (USE_SOFTFLOAT = 1 in the 737-define configuration block). All floating-point arithmetic in constexpr contexts goes through the softfloat library (sub_70B8D0 add, sub_70B9E0 sub, sub_70BBE0 mul, sub_70BCF0 div, sub_709EF0 convert) rather than the host CPU's FPU. This guarantees that constexpr evaluation of device code produces results consistent with IEEE 754 semantics regardless of the host platform's floating-point behavior. The softfloat library handles all precision levels including _Float16, __bf16, _Float32, _Float64, and __float128.
SM architecture gates influence constexpr relaxations. The global qword_4F077A8 (SM version) gates certain constexpr features:
- SM >= 89 (qword_4F077A8 > 0x15F8F): relaxed constexpr rules for variables with incomplete types
- dword_4F077C4 == 2: C++20 features including constexpr new, constexpr string literals, and constexpr member access (expression evaluator cases 5/6)
- dword_4D04880: C++14 relaxed constexpr (loops, local variable mutation, multiple return statements)
- C++23/26 extensions: constexpr try-catch (statement executor case 14), constexpr placement new (expression evaluator case 103), constexpr dynamic_cast (error 0xBB7)
The evaluator enforces a step limit (qword_4D042E0, default ~1M iterations) to prevent infinite loops in constexpr evaluation. This limit applies uniformly to both host and device constexpr functions. When exceeded, diagnostic 0x97F ("constexpr evaluation step limit exceeded") is emitted.
One important consequence: __global__ (kernel) functions cannot be constexpr because they have no return value in the conventional sense -- they are launched asynchronously. The parser enforces this at the declaration specifier level, not in the constexpr evaluator.
Supporting Functions
| Function | Size | Role |
|---|---|---|
| sub_79CCD0 | 67KB | Object member accessor (base classes, virtual bases, union tracking) |
| sub_799B70 | 33KB | Aggregate initializer (arrays, structs, designated init, brace elision) |
| sub_79B7D0 | 29KB | Function call evaluator (argument binding, body execution, recursion limits) |
| sub_7987E0 | 11KB | Statement list executor entry |
| sub_77FCB0 | 150KB | Top-level dispatch (80 expression types + 62-entry intrinsic table) |
| sub_7764B0 | 18KB | Type size calculator (Robin Hood hash memoization, 64MB cap) |
| sub_7707D0 | — | Clone constexpr object |
| sub_7790A0 | — | Trivial aggregate copy |
| sub_7A0070 | — | Lvalue-to-rvalue load |
| sub_77F5C0 | — | Bounds check (ptr, type → idx, err, size) |
| sub_76FFC0 | — | Run cleanup/destructor chain |
Bigint Library (sub_621*)
| Function | Operation |
|---|---|
| sub_621000 | compare(a, width_a, b, width_b) → {-1,0,1} |
| sub_621270 | add(dst, src, width, overflow_out) |
| sub_6215F0 | sub(dst, src, width, overflow_out) |
| sub_621F20 | mul(dst, src, width, overflow_out) |
| sub_6220A0 | div(dst, src, width, divzero_out) |
| sub_6220C0 | mod(dst, src, width, divzero_out) |
| sub_621DB0 | negate(dst) |
| sub_620EE0 | to_int(value, width, result_out) |
Float Library (sub_70B*)
| Function | Operation |
|---|---|
| sub_70B8D0 | add(type, lhs, rhs, dst, inexact, exception) |
| sub_70B9E0 | sub |
| sub_70BAF0 | negate |
| sub_70BBE0 | mul |
| sub_70BCF0 | div |
| sub_70BE30 | compare(type, lhs, rhs, nan_result) → {-1,0,1,NaN} |
| sub_709EF0 | convert(src, src_prec, dst, dst_prec, inexact) |
Key Globals
| Variable | Purpose |
|---|---|
| dword_4F077C4 | C++ standard version (2 = C++20, enables constexpr new/string) |
| dword_4D04880 | C++14 relaxed constexpr (enables loops, mutation) |
| qword_4D042E0 | Max constexpr evaluation steps (~1M) |
| xmmword_4F08280 | Canonical constexpr TRUE |
| xmmword_4F08290 | Canonical constexpr FALSE |
| qword_4F08380 | Global type-size hash table base |
| qword_4F08060 | Global allocator function pointer (constexpr new detection) |
CUDA-Specific Extensions
NVIDIA's extensions to the EDG frontend fall into four categories: memory space qualifiers that map to GPU address spaces, kernel launch syntax that gets lowered to CUDA runtime API calls, registration stubs that tell the CUDA runtime about compiled kernels, and atomic builtin generation for the C++11 atomics model on GPU. These extensions are concentrated in the 0x650000–0x810000 range and reference SM architecture version globals extensively — many features are gated by qword_4F077A8 comparisons against architecture thresholds.
CUDA Keyword Extensions
NVIDIA extends the EDG keyword table with execution space qualifiers, memory space qualifiers, and type intrinsics. These exist in four distinct layers -- registered keywords, declaration specifier tokens, address space attribute tokens, and extended type tokens -- each integrated differently into the EDG parser infrastructure.
The critical architectural fact: __device__, __host__, and __global__ are not keywords in the EDG keyword table. They are processed through the C/C++ attribute system, where EDG maps them to internal single-character codes. The declaration specifier state machine (sub_672A20) and the address space handler together resolve these attributes into symbol-table fields that downstream passes consume.
Token ID Inventory
NVIDIA uses four non-contiguous token ID ranges:
| Range | Category | Count | Registration |
|---|---|---|---|
| 133-136 | Memory space declaration specifiers | 4 | Hardcoded in sub_672A20 switch |
| 236, 339-354 | Extended numeric types (FP8/FP6/FP4/MX) | 17 | Resolved via sub_6911B0 |
| 272-280 | Address space qualifier / special type tokens | 9 | Hardcoded handlers in sub_672A20 |
| 328-330 | NVIDIA type trait intrinsics | 3 | Registered via sub_885C00 in sub_706250 |
Only tokens 328-330 use the standard sub_885C00(token_id, "keyword") registration path. All other CUDA tokens are wired directly into parser switch cases, bypassing the keyword table entirely.
Execution Space Qualifiers -- Attribute Path
__device__, __host__, and __global__ are recognized by the attribute parser, which stores them as single-character codes at declaration context offset +269. The complete internal attribute character map (sub_5C79F0 at 0x5C79F0):
| Char | Hex | Attribute | Scope Entry Bits |
|---|---|---|---|
| 'V' | 0x56 | __host__ | -- (host is the default) |
| 'W' | 0x57 | __device__ | +198 bit 4 (0x10) |
| 'X' | 0x58 | __global__ | +198 bit 4 (0x10) AND bit 5 (0x20) |
| 'Y' | 0x59 | __tile_global__ | -- |
| 'Z' | 0x5A | __shared__ | -- (stored in +136 as space code 3) |
| '[' | 0x5B | __constant__ | -- (stored in +136 as space code 2) |
| '\' | 0x5C | __launch_bounds__ | Arguments at decl+336 struct |
| ']' | 0x5D | __maxnreg__ | -- |
| '^' | 0x5E | __local_maxnreg__ | -- |
| '_' | 0x5F | __tile_builtin__ | -- |
| 'f' | 0x66 | __managed__ | -- (stored in +136 as space code 5) |
| 'k' | 0x6B | __cluster_dims__ | Arguments at cluster config struct |
| 'l' | 0x6C | __block_size__ | -- |
| 'r' | 0x72 | __nv_pure__ | -- |
The attribute character code at +269 is consumed by sub_6582F0 (declaration-side validation) and sub_65F400 (definition-side validation). These functions never see the CUDA qualifier as a keyword token -- they only see the resolved character code.
Execution space at scope entry offset +198 is the authoritative record of a function's execution space for all downstream passes:
- Bit 4 (0x10): function is __device__ or __global__ -- activates device-scope variable validation
- Bit 5 (0x20): function is __global__ (kernel entry point) -- triggers kernel metadata emission via sub_12735D0, which emits ("kernel", 1) to LLVM IR
- Bit 2 (0x04) at offset +199: full_custom_abi flag
When a function has bit 5 set, the attribute emitter also iterates the parameter array (40-byte entries at decl+16) and emits ("grid_constant", param_index) for each parameter where byte +33 is nonzero. The preserve-register struct at decl+336 (three int32 fields: data, control, after) is consumed and cleared (set to -1) after emission.
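Decoding the +198 byte reduces to a priority check, since __global__ sets both bits. A sketch with hypothetical enum names (bit masks as recovered):

```cpp
#include <cassert>
#include <cstdint>

// Sketch of decoding the execution-space byte at scope entry +198
// (enum names hypothetical). __global__ sets bits 4 AND 5, so bit 5
// must be tested first.
enum class ExecSpace { Host, Device, Kernel };

inline ExecSpace exec_space(uint8_t byte198) {
    if (byte198 & 0x20) return ExecSpace::Kernel;  // bit 5: __global__
    if (byte198 & 0x10) return ExecSpace::Device;  // bit 4: __device__
    return ExecSpace::Host;                        // neither bit: host default
}
```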
Memory Space Declaration Specifiers (Tokens 133-136)
These piggyback on the signedness/width field v305 in the declaration specifier state machine with values 4-7, cleanly separated from the standard C width modifiers (0-3):
| Token | Keyword | v305 Value | v325 Value | Formula |
|---|---|---|---|---|
| 133 | __shared__ | 4 | 2 | Special case |
| 134 | __device__ | 5 | 8 | token - 129 |
| 135 | __constant__ | 6 | 8 | token - 129 |
| 136 | __managed__ | 7 | 8 | token - 129 |
The type construction switch in sub_672A20 branches on v305 > 3 to invoke CUDA-specific type constructors (sub_72BC30 for signed, sub_72BDB0 for unsigned) instead of the standard C type constructors used for v305 values 0-3.
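The fan-out from the table above can be sketched as follows. The control flow is reconstructed from the recovered formulas; `SpecState` and the helper names are hypothetical, while `v305`/`v325` mirror the decompiler's variable names.

```cpp
#include <cassert>

// Sketch of the token 133-136 fan-out (logic reconstructed). All four
// tokens share the "token - 129" formula for v305; __shared__ (133)
// gets the special-case v325 value.
struct SpecState { int v305 = 0; int v325 = 0; };

inline void apply_memspace_token(SpecState &s, int token) {
    s.v305 = token - 129;             // 133..136 -> 4..7
    s.v325 = (token == 133) ? 2 : 8;  // __shared__ special case
}

inline bool uses_cuda_type_ctor(const SpecState &s) {
    return s.v305 > 3;                // branch to sub_72BC30/sub_72BDB0
}
```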
Address Space Qualifier Tokens (272-280)
Processed by dedicated handlers in the declaration specifier parser:
| Token | Keyword | Handler | Argument |
|---|---|---|---|
| 272 | __attribute__((address_space(N))) | sub_6210B0 | Parses integer N |
| 273 | __global__ (addr space annotation) | sub_667B60(0, ...) | Space ID = 0 |
| 274 | __shared__ (addr space annotation) | sub_667B60(2, ...) | Space ID = 2 |
| 275 | __constant__ (addr space annotation) | sub_667B60(3, ...) | Space ID = 3 |
| 276 | __generic__ | sub_72B620(type, cv) | -- |
| 277 | __nv_tex_surf_handle_t | sub_72BA30(unk_4F06A51) | Texture/surface handle |
| 278 | __nv_buffer_handle_t | sub_72BA30(unk_4F06A60) | Buffer handle |
| 279 | __nv_grid_constant | sub_72C390() | Grid-constant marker |
| 280 | __nv_is_extended_device_lambda | sub_72C270() | Lambda closure check |
Note the dual role of __shared__ and __constant__: each appears both as a memory space declaration specifier (tokens 133 and 135) and as an address space qualifier (tokens 274-275); __global__ likewise has an address space qualifier form (token 273) alongside its execution space attribute path. The declaration specifier path stores the result in the symbol-table entry's memory_space_code at offset +136 and memory_space_flags at offset +156. The address space qualifier path stores the result in the EDG type node's qualifier word at offset +18 (values 1=global, 32=shared, 33=constant). Both representations flow downstream: the symbol-table code controls declaration validation, while the type qualifier controls LLVM pointer type construction in sub_911D10.
The __grid_constant__ qualifier (token 279, handler sub_72C390) marks kernel parameters as grid-constant -- the parameter is read-only across all thread blocks and may be placed in constant memory by the backend. This is an SM 70+ feature.
NVIDIA Type Trait Keywords (Tokens 328-330)
The only CUDA tokens registered through the standard sub_885C00 keyword registration path. Always registered -- not gated by any version, language mode, or feature flag:
| Token | Keyword | Registration |
|---|---|---|
| 328 | __nv_is_extended_device_lambda_closure_type | sub_885C00(328, ...) |
| 329 | __nv_is_extended_host_device_lambda_closure_type | sub_885C00(329, ...) |
| 330 | __nv_is_extended_device_lambda_with_preserved_return_type | sub_885C00(330, ...) |
These type traits are used by CUDA's extended lambda machinery to query whether a lambda closure type carries device or host-device execution space annotations. They participate in SFINAE and if constexpr contexts for compile-time dispatch between host and device lambda implementations.
The lambda mangling extensions in sub_80FE00 use the execution space information from these traits to choose between three proprietary Itanium ABI mangling prefixes: Unvdl (device lambda), Unvdtl (device template lambda), and Unvhdl (host-device lambda). The selection is based on flag byte +92 of the closure descriptor, where bit 5 (0x20) marks an extended CUDA lambda, bit 4 (0x10) marks host-device, and bit 2 (0x04) marks a template lambda.
Extended Numeric Type Tokens (236, 339-354)
Blackwell tensor core operations require exotic floating-point formats. Each of these tokens is resolved via sub_6911B0() to a type node; the declaration specifier state machine then sets v325=20 and v327 |= 4:
| Token | Type | Format | Width |
|---|---|---|---|
| 236 | __nv_fp8_e4m3 | FP8 | 8b |
| 339 | __nv_fp8_e5m2 | FP8 | 8b |
| 340-341 | __nv_fp8x2_e{4m3,5m2} | FP8 vector | 16b |
| 342-343 | __nv_fp8x4_e{4m3,5m2} | FP8 vector | 32b |
| 344-345 | __nv_fp6_e{2m3,3m2} | FP6 | 6b |
| 346-347 | __nv_fp6x2_e{2m3,3m2} | FP6 vector | 12b |
| 348-349 | __nv_mxfp8_e{4m3,5m2} | MX-format FP8 | 8b |
| 350-351 | __nv_mxfp6_e{2m3,3m2} | MX-format FP6 | 6b |
| 352 | __nv_mxfp4_e2m1 | MX-format FP4 | 4b |
| 353 | __nv_satfinite | Saturation modifier | -- |
| 354 | __nv_e8m0 | Exponent-only E8M0 | 8b |
These types are represented as fundamental types with distinct sub-kind values at type node +56 in the EDG type system. The type comparison engine (sub_7386E0, case 1) compares sub-kind, extra flags at +58 (bits 0x3A), and the base type chain at +72 to ensure each format is treated as a distinct type.
Attribute Processing Pipeline
The complete pipeline from CUDA source keyword to LLVM IR metadata:
CUDA source: __global__ void kernel() __launch_bounds__(256, 2)
|
v
Phase 1: Attribute parser → char code 'X' (0x58) at decl context +269
|
v
Phase 2: Declaration specifier state machine (sub_672A20)
→ scope entry +198 bit 5 set (kernel)
|
v
Phase 3: Post-parse fixup (sub_5D0FF0)
→ __launch_bounds__(256, 2) extracted to launch config struct
|
v
Phase 4: CUDA attribute validator (sub_826060)
→ validates __launch_bounds__ on __global__ function
→ diagnostic 0xDCE (3534) if __launch_bounds__ on non-kernel
→ diagnostic 0xE83 (3715) if values out of range
→ diagnostic 0xE87 (3719) if __launch_bounds__ + __maxnreg__ conflict
|
v
Phase 5: Attribute emission to LLVM IR (sub_12735D0)
→ emits ("kernel", 1) from bit 5 of decl+198
→ emits ("grid_constant", N) per qualifying parameter
|
v
Phase 6: Kernel metadata generation (sub_93AE30)
→ "nvvm.maxntid" = "256,1,1"
→ "nvvm.minctasm" = "2"
Memory Space Attributes
sub_6582F0 (22KB) and sub_65F400 (28KB) validate __shared__, __constant__, __managed__ on variable declarations and definitions respectively. Token cases 133-136 in the parser handle these as first-class declaration specifiers. The validation logic enforces CUDA semantics: __shared__ variables cannot have initializers (shared memory is not initialized on kernel launch), __constant__ variables must have static storage duration, and __managed__ variables require unified memory support on the target architecture.
Symbol Table Memory Space Encoding
Memory space is tracked in two locations within each symbol-table entry:
| Offset | Size | Field | Values |
|---|---|---|---|
| +136 | 1 byte | memory_space_code | 0=default, 1=__device__, 2=__constant__, 3=__shared__, 5=__managed__ |
| +156 | 1 byte | memory_space_flags | bit 0=device, bit 1=shared, bit 2=constant, bit 4=thread_local interaction |
| +157 | 1 byte | Extended flags | bit 0=managed |
The dual encoding exists because the flags are additive from parsed attributes (multiple attributes can be OR'd in) while the code is the single resolved value used by downstream passes. The code at +136 is set by sub_735FB0 (symbol entry constructor) and queried throughout the compiler.
Declaration-Side Validation -- sub_6582F0
The validation follows a ten-phase pipeline:
1. __managed__ pre-resolution: when dword_4F04C5C == dword_4F04C34 (host-only mode), managed variables are silently downgraded to __device__ (space code 1) and the extern flag is cleared.
2. Extern handling: sets bit 0 of decl context +122 and the is_extern tracking variables.
3. Type normalization: checks function-type declarations against CUDA criteria via sub_8D4C10; emits diagnostic 891 for function types with memory space.
4. Specifier processing: calls sub_6413B0 against the current compilation target.
5. Prior-declaration conflict detection: looks up the existing symbol and compares memory space codes. A mismatch with dword_4F077C4 == 2 (separate compilation) triggers diagnostic 172 (warning 4).
6. New symbol creation: sub_735FB0(type_ptr, space_code, target_id, is_new_decl).
7. __managed__ namespace binding: validates the namespace name via sub_703C10; checks class/struct type compatibility (diagnostic 1560 on failure).
8. Storage class adjustments: processes constant-space read-only flags.
9. Device-scope enforcement: when scope +198 bit 4 is set (inside a __device__/__global__ function), local variables cannot carry device memory qualifiers. Diagnostic 3484: "an automatic variable may not be declared as __device__". The memory space name in the message follows the priority cascade __constant__ > __managed__ > __shared__ > __device__.
10. Final fixup: type validation (sub_8D9350), attribute propagation (sub_8756F0), "main" function warnings (diagnostic 2948), thread-safety analysis (sub_826000).
Memory Space Mutual Exclusivity
The code consistently enforces these combinations:
| Combination | Diagnostic | Severity |
|---|---|---|
| __shared__ + __constant__ | 3481 | error |
| __constant__ + __managed__ | 3568 | error |
| __constant__ + __shared__ + __managed__ | 3568 | error |
| thread_local + __device__ | 892 | error |
| thread_local + any device space | 3578 | error |
| auto variable + __device__/__constant__/__managed__ | 3484 | error |
| __shared__ + initializer | 3510 | error |
| __constant__ in device-function scope | 3512 | error |
| register + device memory | 3485/3688 | error |
| volatile + __constant__ | 1378 | error |
| redeclaration with different memory space | 3499 | warning 5 |
The diagnostic name-string priority cascade (__constant__ > __managed__ > __shared__ > __device__) appears identically in six locations: sub_6582F0 lines 734-739, sub_65F400 lines 541-549 and 927-935, sub_5C6B80 lines 22-34, sub_667550 lines 87-98, and sub_5D9330 (the symbol printer).
Address Space Flow: EDG to LLVM to PTX
| CUDA Source | EDG Token | Symbol +136 | Type Qualifier +18 | LLVM AS | PTX Directive |
|---|---|---|---|---|---|
| __device__ int x; | 134 | 1 | 1 | 1 | .global |
| __shared__ int x; | 133 | 3 | 32 | 3 | .shared |
| __constant__ int x; | 135 | 2 | 33 | 4 | .const |
| __managed__ int x; | 136 | 5 | 1 | 1 | .global + runtime registration |
| (local in kernel) | -- | 0 | -- | 0/5 | .local/.param |
The EDG type node qualifier word (offset +18, masked to 0x7FFF) carries address space through the type system. During EDG-to-LLVM type translation, sub_911D10 reads this qualifier from pointer/reference types (kind 75/76) and maps to LLVM address space numbers via sub_5FFE90. __managed__ variables are compiled as __device__ (LLVM address space 1) with additional runtime registration calls generated by sub_806F60 for unified memory management.
Kernel Launch Lowering — sub_7F2B50 (16KB)
Transforms CUDA's <<<gridDim, blockDim, sharedMem, stream>>> kernel launch syntax into CUDA runtime API calls. The lowered sequence allocates a parameter buffer via cudaGetParameterBufferV2, copies kernel arguments into it, and launches with cudaLaunchDeviceV2. For the simpler launch path, it generates __cudaPushCallConfiguration followed by individual __cudaSetupArg/__cudaSetupArgSimple calls. This lowering happens entirely within EDG — by the time the code reaches the LLVM backend, kernel launches are ordinary function calls.
Registration Stub Generator — sub_806F60
Generates the __cudaRegisterAll function, which calls:
__cudaRegisterEntry, __cudaRegisterVariable, __cudaRegisterGlobalTexture, __cudaRegisterGlobalSurface, __cudaRegisterManagedVariable, __cudaRegisterBinary, ____cudaRegisterLinkedBinary.
Host-side stubs generated by sub_808590: "__device_stub_%s", "__cudaLaunch", "__cudaSetupArg", "__cudaSetupArgSimple".
Atomic Builtin Generator — sub_6BBC40 (34KB)
Constructs __nv_atomic_fetch_{add,sub,and,xor,or,max,min} names with type suffixes (_s, _f, _u) and width (_%u).
SM Architecture Gates
Two functions configure ~160 optimization/feature flags based on SM version:
| Function | Role | Thresholds |
|---|---|---|
| sub_60D650 | Optimization level → 109 unk_4D04* flags | Single integer parameter (O-level) |
| sub_60E7C0 | SM arch → 60 unk_4D04* feature flags | SM 75 (30399), SM 80 (40000), SM 90 (89999), SM 100 (109999), SM 120 (119999) |
Each flag is gated by a byte_4CF8* user-override check, preventing auto-configuration when the user explicitly sets a flag via CLI.
TileIR Backend
sub_5E3AD0 optionally loads libTileIRCompiler_shared.so via dlopen and looks up symbol "cudacc_back_end". A 17-entry function pointer table is passed. Gated by dword_4D045A0.
Diagnostic System
EDG's diagnostic system supports three output formats: human-readable terminal output (with ANSI color and word-wrapping), SARIF JSON for IDE integration, and a machine-readable log format for automated tooling. All three share the same diagnostic numbering scheme and severity classification. The terminal output handler alone is 37KB — it implements its own word-wrapping algorithm with configurable terminal width, recursive child diagnostic emission (for "note: see declaration of X" chains), and color coding by severity level.
Terminal Output — sub_681D20 (37KB)
Formats error/warning/remark messages with:
- Severity labels: remark (2), warning (4), caution (5), severe-warning (6), error (7–8), catastrophe (9–10), internal-error (11)
- Source location: file:line:col
- ANSI color escapes (gated by dword_4F073CC)
- Word-wrapping at dword_4D039D0 (terminal width)
- Recursive child diagnostic emission
SARIF JSON Output — sub_6837D0 (20KB)
Structured diagnostics for IDE integration, enabled by --diagnostics_format=sarif (CLI case 0x125, sets unk_4D04198 = 1). The output is a comma-separated stream of SARIF result objects -- NOT a complete SARIF envelope with $schema, runs[], etc. The caller or a post-processor is expected to wrap the stream in the standard SARIF container.
Each diagnostic emits one JSON object:
{
"ruleId": "EC<number>",
"level": "error",
"message": {"text": "<json-escaped message>"},
"locations": [
{
"physicalLocation": {
"artifactLocation": {"uri": "file://<path>"},
"region": {
"startLine": 42,
"startColumn": 17
}
}
}
],
"relatedLocations": [
{
"message": {"text": "see declaration of X"},
"physicalLocation": { ... }
}
]
}
Rule ID format: "EC" + decimal error number from the diagnostic record at offset +176. For example, EDG error 1234 becomes "EC1234".
Severity mapping (byte at diagnostic node +180):
| Severity | EDG Meaning | SARIF level |
|---|---|---|
| 4 | remark | "remark" |
| 5 | warning | "warning" |
| 7, 8 | error | "error" |
| 9 | catastrophe | "catastrophe" |
| 11 | internal error | "internal_error" |
Note that the SARIF specification only defines "none", "note", "warning", and "error" as standard levels. The "remark", "catastrophe", and "internal_error" values are EDG extensions -- consuming tools should treat unknown levels as "error".
Message text escaping: sub_683690 renders the diagnostic text into qword_4D039E8, then copies character-by-character into the output buffer, escaping " as \" and \ as \\. No other JSON escaping (e.g., control characters, Unicode) is applied.
Location resolution: sub_67C120 calls sub_729E00 to decompose the packed source location into (file-id, line, column), then sub_722DF0 to resolve the file-id to a filesystem path. The startColumn field is omitted when column is zero.
Related locations: the linked list at diagnostic node +72 chains "note" sub-diagnostics. Each is emitted as a relatedLocations array entry with its own message and physical location.
Filtering before emission: diagnostics pass through severity threshold check (byte_4F07481[0]), duplicate detection (byte_4CFFE80[4*errnum + 2] bit flags), pragma-based suppression (sub_67D520), and error limit check (unk_4F074B0 + unk_4F074B8 >= unk_4F07478). All filtering happens before the SARIF/text format branch.
Machine-Readable Log
Writes to qword_4D04908 in format: <severity-char> "<filename>" <line> <col> <message>\n. Severity chars from "RwweeccccCli": R=remark, w=warning, e=error, c=catastrophe.
Name Mangling (Itanium ABI)
EDG includes a complete implementation of the Itanium C++ ABI name mangling specification. NVIDIA extends the standard mangling with three proprietary prefixes (Unvdl, Unvdtl, Unvhdl) for device lambdas, device template lambdas, and host-device lambdas respectively. These extensions are necessary because CUDA's execution model requires distinguishing between host and device versions of the same lambda — they must have different mangled names to avoid linker collisions when both host and device code are linked into the same binary.
Address range 0x810000–0x8EFFFF:
| Function | Size | Role |
|---|---|---|
| sub_8E74B0 | 29KB | Primary mangling entry |
| sub_8E9FF0 | 26KB | Type mangling |
| sub_816460 | 24KB | Type component mangling |
| sub_813790 | 13KB | Expression mangling |
| sub_80E340 | 23KB | Builtin type mangling (incl. DF16_, DF16b, Cu6__bf16, u6__mfp8) |
| sub_80FE00 | 8KB | NVIDIA extension mangling (Unvdl, Unvdtl, Unvhdl) |
NVIDIA lambda mangling extensions (sub_80FE00): standard Itanium ABI uses Ul<params>E<index>_ for unnamed lambda types and Ut<index>_ for unnamed non-lambda types. NVIDIA adds three proprietary prefixes chosen based on flag byte +92 of the lambda's closure descriptor:
| Prefix | Meaning | Condition |
|---|---|---|
| Unvdl | __device__ lambda | flag_byte_92 & 0x20 set, not host-device, not template |
| Unvdtl | __device__ template lambda | flag_byte_92 & 0x20 set, flag_byte_92 & 4 set |
| Unvhdl | __host__ __device__ lambda | flag_byte_92 & 0x20 set, flag_byte_92 & 0x10 set |
The Unvhdl prefix carries three single-digit flags separated by underscores after the prefix: Unvhdl<index>_<has_explicit_return>_<is_host_device>_<has_template_params>_. Each flag is '0' or '1'. This is richer than the standard Ul which only encodes parameter types.
NVIDIA vendor type manglings (sub_80E340): the type mangler handles CUDA-specific types as Itanium vendor types (prefix u + length + name):
| Type | Mangling | Notes |
|---|---|---|
| __bf16 (bfloat16) | u6__bf16 or DF16b | ABI-gated: qword_4F077B4 lo32 selects vendor vs C++23 encoding |
| __mfp8 (FP8) | u6__mfp8 | NVIDIA micro-float 8-bit for transformer inference |
| __metainfo | U10__metainfo | Kernel parameter metadata type attribute |
| float80 | u7float80 | x87 extended precision (vendor type) |
The __bf16 mangling has a three-way gate reflecting the ongoing ABI transition: qword_4F077B4 lo32 != 0 selects "u6__bf16" (vendor type); hi32 == 0 selects "DF16b" (C++23 standardized P1467); otherwise qword_4F06A78 determines which encoding. The ABI version variable unk_4D04250 controls this and other encoding decisions, with known thresholds at 0x76BF (GCC 3.3 compat) and 0xC350 (GCC 12 compat).
Standard float types follow Itanium: _Float16 = "DF16_", __fp16 = "Dh", float = "f", double = "d", __float128 = "g", with the complex variants adding a 'C' prefix.
Key Global Variables
| Variable | Size | Role |
|---|---|---|
| dword_4F077C4 | 4 | Language mode: 0=neither, 1=C, 2=C++ |
| unk_4F07778 | 4 | C/C++ standard year (199711, 201103, 201402, 201703, 202002, 202310) |
| qword_4F077B4 | 8 | Dialect extension flags (lo=CUDA extensions, hi=GNU extensions) |
| dword_4F077BC | 4 | NVCC mode flag |
| dword_4F077C0 | 4 | GCC compatibility mode |
| qword_4F077A8 | 8 | SM architecture version (controls feature gates throughout) |
| word_4F06418 | 2 | Current parser token |
| qword_4F04C68 | 8 | Scope table base pointer (776-byte entries) |
| dword_4F04C64 | 4 | Current scope index |
| qword_4CF7CE0 | 8 | AST printer callback vtable |
| qword_4D03FF0 | 8 | Current translation unit pointer |
| qword_4D04908 | 8 | Machine-readable diagnostic log FILE* |
| qword_4F08028–qword_4F08040 | 48 | IL tree walker callback table |
| dword_4D045A0 | 4 | TileIR mode flag |
NVVM IR Generation
Between the EDG 6.6 frontend and the LLVM optimizer sits a layer that has no upstream LLVM equivalent: the NVVM IR generation subsystem. Its job is to translate the EDG intermediate language (IL) tree -- a C-level AST produced by EDG's source-to-source backend -- into LLVM IR suitable for the NVPTX target. This is cicc's equivalent of Clang's CodeGen library (lib/CodeGen/CGExpr.cpp, CGStmt.cpp, CGDecl.cpp, etc.), but it operates on EDG's proprietary IL node format rather than a Clang AST. Understanding this layer is essential because it determines every structural property of the LLVM IR that the optimizer and backend will see: address space annotations on pointers, alloca placement conventions, kernel metadata encoding, and the specific IR patterns used for CUDA-specific constructs like threadIdx.x or __shared__ memory.
The EDG frontend does not produce LLVM IR directly. Its backend mode (BACK_END_IS_C_GEN_BE = 1) emits transformed C code into .int.c, .device.c, and .stub.c files. A second compilation pass then parses these files back through EDG to produce an IL tree -- a typed, linked representation of every declaration, statement, and expression in the translation unit. The IR generation layer walks this IL tree recursively, creating LLVM BasicBlocks, Instructions, and GlobalVariables via a hand-rolled IR builder that directly manipulates LLVM's in-memory data structures. The result is a complete LLVM Module containing one function per device-side function definition, with kernel entry points annotated via nvvm.annotations metadata.
Dual-Path Architecture
One of the most distinctive features of cicc's IR generation is that two complete copies exist within the binary. This mirrors the dual-path design observed throughout cicc: Path A (LibNVVM API mode, 0x90xxxx) and Path B (standalone mode, 0x126xxxx).
| Component | Path A (LibNVVM) | Path B (Standalone) |
|---|---|---|
| Expression codegen | 0x91xxxx--0x94xxxx | 0x127xxxx--0x12Bxxxx |
| EmitExpr (master dispatch) | sub_91DF90 | sub_128D0F0 |
| EmitStmt (statement dispatch) | sub_9363D0 | (parallel at similar offset) |
| EmitFunction (entry block setup) | sub_946060 | (parallel) |
| GenerateFunctionProlog | sub_938240 | (parallel) |
| Builtin lowering mega-switch | sub_90AEE0 (109KB) | sub_12B3FD0 (103KB) |
| Bitfield load/store | sub_923780 / sub_925930 | sub_1282050 / sub_1284570 |
| Special variable codegen | sub_920430 / sub_922290 | sub_127F7A0 / sub_1285550 |
| Inline asm codegen | sub_932270 | sub_1292420 |
| Global variable codegen | sub_916430 | (parallel) |
| Type translation | sub_91AED0 | (parallel) |
| Kernel metadata emitter | sub_93AE30 | (parallel) |
These are not shared-library variations or template instantiations across different types. They are structurally identical copies of the same algorithms with the same string constants (e.g., "allocapt", "agg.result", "entry", "return", ".addr") and the same error messages (e.g., "unsupported expression!", "Argument mismatch in generation function prolog!"). The two copies use different calling conventions for their codegen context objects -- Path A passes codegen state through a flat struct with LLVM API vtable pointers, while Path B uses a pointer-to-pointer indirection scheme -- but the algorithmic logic and IR output are byte-for-byte identical.
The remainder of this page uses Path B addresses (the 0x12xxxxx range) as the primary reference because they correspond to the standalone compilation path that nvcc invokes, and because the B-series analysis reports provide the most detailed coverage of this path. Every function described here has a direct counterpart in Path A at the corresponding 0x9xxxxx address.
Address Map
| Address Range | Subsystem | Key Functions |
|---|---|---|
| 0x126A000--0x126BFFF | Volatile detection, alignment queries | sub_126A420 (IsVolatileAddress) |
| 0x1273000--0x1275FFF | Function attribute emission | sub_12735D0 (EmitFunctionAttrs), sub_1273F90 (AttributeReader) |
| 0x127A000--0x127CFFF | Type translation helpers | sub_127A030 (GetLLVMType), sub_127B390 (GetSMVersion), sub_127B420 (IsAddressOfExpr), sub_127B550 (FatalDiag) |
| 0x127D000--0x127FFFF | Constants, alloca creation, bool emission | sub_127D8B0 (EmitConstExpr), sub_127FC40 (CreateAlloca), sub_127FEC0 (EmitBoolExpr) |
| 0x1280000--0x1285FFF | Bitfield access, member loads, inline asm | sub_1282050 (EmitBitfieldStore), sub_1284570 (EmitBitfieldLoad), sub_1285290 (EmitAsmCall) |
| 0x1286000--0x128FFFF | L-value codegen, binary ops, expression dispatch | sub_1286D80 (EmitAddressOf), sub_128A450 (EmitCast), sub_128D0F0 (EmitExpr), sub_128F9F0 (EmitBinaryArithCmp) |
| 0x1290000--0x129AFFF | Control flow helpers, inline asm, printf lowering | sub_1290AF0 (SetInsertPoint), sub_1292420 (EmitInlineAsm), sub_12992B0 (LowerPrintfToVprintf) |
| 0x129B000--0x12AFFFF | Builtin helpers, atomic ops, surface/texture ops | sub_12A4D50 (CreateBasicBlock), sub_12A7DA0 (AtomicOps), sub_12ADE80 (SurfaceTexture) |
| 0x12B0000--0x12BFFFF | Builtin mega-switch | sub_12B3FD0 (BuiltinLowering, 103KB, 770 IDs) |
The IRGenState Object
Every codegen function receives a context object -- called IRGenState or CodeGenState in this wiki -- that carries all mutable state for the current function being compiled. Two distinct layouts exist depending on whether the context is accessed through the Path A flat struct or the Path B double-indirection pattern. Both layouts carry the same logical fields; the difference is structural.
Path B Layout (pointer-to-pointer pattern)
In Path B, the primary codegen context a1 is a CodeGenState** -- a pointer to a pointer. The outer pointer dereferences to a struct containing the core IR builder state, and sibling pointers at a1[1], a1[2], etc., reach related context objects:
| Access | Offset | Field | Purpose |
|---|---|---|---|
| *a1 | +0 | IRBuilder state | Current function, insert point, module |
| a1[1] | +8 | Insertion context | [0] = debug location, [1] = current BB, [2] = insertion sentinel |
| a1[2] | +16 | LLVM context/module | Module handle, LLVMContext |
| a1[4] | +32 | Module pointer | LLVM Module* |
| a1[5] | +40 | Type context | Type table for GetLLVMType, getIntNTy |
| a1[6] | +48 | Debug location | Current DebugLoc to attach to new instructions |
| a1[7] | +56 | Current BasicBlock | BB for instruction insertion |
| a1[8] | +64 | Insertion point | Iterator into BB's instruction list |
| a1[9] | +72 | Address space context | For alloca type creation |
| a1[19] | +152 | Cached printf alloca | Reused "tmp" alloca for vprintf buffer packing |
Path A Layout (flat struct, offsets from a1)
| Offset | Field | Purpose |
|---|---|---|
| +32 | Module pointer | LLVM Module* |
| +40 | IR builder | Current builder state |
| +48, +56 | Operand pair array | Base and count for metadata pairs |
| +96 | Current BasicBlock | Active BB |
| +104 | Insertion point | Iterator |
| +128 | Instruction creation vtable | Virtual dispatch for instruction emission |
| +136 | Emitter context | Vtable at [0], dispatch at vtable[2] |
| +192 | Current Function | LLVM Function* being populated |
| +200 | Return BB | The "return" basic block |
| +208 | Return value alloca | "retval" alloca or sret pointer |
| +240 | Has-cleanups flag | Nonzero when C++ destructors are pending |
| +344 | Module (kernel metadata) | Used by sub_93AE30 |
| +360/376 | In-kernel flag | Bit 0 set when compiling a __global__ function |
| +424 | Cleanup stack | Stack of pending destructor frames (24 bytes each) |
| +456 | Allocapt marker | The "allocapt" sentinel instruction |
The "allocapt" marker deserves special attention. When EmitFunction (sub_946060) creates the entry block, it inserts a dummy bitcast void to void instruction named "allocapt" as a sentinel. All subsequent alloca instructions created by CreateTmpAlloca (sub_921D70 / sub_127FC40) are inserted before this sentinel, ensuring that every alloca ends up clustered at the top of the entry block. This is a hard requirement for LLVM's mem2reg pass to promote stack slots to SSA registers. The allocapt marker is removed by a later cleanup pass.
EDG IL Node Layout
Every codegen function traverses EDG IL nodes -- linked structures that represent declarations, statements, and expressions from the parsed CUDA source. The node layout is consistent across all codegen paths:
Expression node (passed as a2 to EmitExpr):
| Offset | Field | Description |
|---|---|---|
| +0 | Type pointer | EDG type node (dereference for type info) |
| +18 | Qualifier word | 16-bit: bits 0--14 = qualifier ID, bit 15 = negation |
| +24 | Kind byte | Top-level expression category (1=operation, 2=literal, 3=member, 0x11=call, 0x14=decl-ref) |
| +25 | Flags byte | Bit 2 = assignment context (write-only) |
| +36 | Source location | Passed to debug info attachment |
| +56 | Sub-opcode / data | For kind=1: operator sub-opcode; for kind=2: literal data |
| +72 | Child/operand | Pointer to first child expression |
Type node (accessed via expression's type pointer):
| Offset | Field | Description |
|---|---|---|
| +8 | Type classification byte | 1--6 = float types, 11 = integer, 15 = pointer, 16 = vector |
| +128 | Byte size | Element count for arrays, byte size for scalars |
| +136 | Element size | Size in bits for non-typedef types |
| +140 | Type tag | 1=void, 8--11=aggregate (struct/union/class/array), 12=typedef alias, 16=__int128 |
| +144 | Flags | Bit 2 = is_bitfield, bit 3 = signed |
| +160 | Inner type / next | Followed when tag==12 (typedef stripping) |
| +176 | Element count | For array types |
The typedef-stripping idiom appears throughout every codegen function (15+ occurrences in EmitExpr alone):
for (t = *expr_type; *(BYTE*)(t + 140) == 12; t = *(QWORD*)(t + 160));
This walks through chains of typedef aliases (kind 12) until it reaches the canonical type.
Function Emission Pipeline
When cicc processes a device-side function, IR generation proceeds through a fixed sequence of stages. The entry point is EmitFunction (sub_946060), which sets up the function skeleton and then calls GenerateFunctionProlog (sub_938240) to emit parameter handling, followed by recursive statement emission.
Stage 1: Function skeleton (sub_946060).
Creates the LLVM Function* object, resolves the function type through the EDG typedef chain, and optionally sets a section name. Then creates two basic blocks: "entry" (the function entry point) and "return" (the single return block -- all return paths branch here). Inserts the "allocapt" sentinel into the entry block. For non-void functions, creates a "retval" alloca to hold the return value; for sret functions (returning aggregates), uses the first argument directly.
Stage 2: Function prolog (sub_938240).
Iterates the EDG parameter linked list (next pointer at offset +112, stride 40 bytes per LLVM argument slot) in lockstep with the LLVM function's argument list. For each parameter:
- If the first parameter has ABI kind 2 (sret), names it "agg.result" and advances.
- Unnamed parameters get the name "temp_param"; the implicit this parameter (flags bit 0 at offset +172) gets "this".
- Creates an alloca named <param_name>.addr via CreateTmpAlloca.
- Emits a store of the incoming SSA argument into the alloca.
- Registers the EDG declaration -> LLVM Value mapping in a hash table (open addressing, quadratic probing) for later lookup during expression codegen.
- Optionally emits "__val_param" temporaries for byval aggregate parameters.
Stage 3: Body emission (recursive emitStmt / EmitExpr).
Walks the IL tree for the function body, dispatching through the statement codegen switch and the expression codegen switch (detailed below).
Stage 4: Kernel metadata (sub_93AE30).
For __global__ functions, emits nvvm.annotations metadata: kernel flag, __launch_bounds__ parameters (nvvm.maxntid, nvvm.reqntid, nvvm.minctasm, nvvm.maxnreg), cluster dimensions (nvvm.cluster_dim, nvvm.blocksareclusters), and per-parameter metadata (alignment, grid_constant, hidden-parameter flags).
Stage 5: Function attributes (sub_12735D0).
Emits function-level metadata for CUDA-specific attributes: grid_constant (per-parameter), preserve_n_data / preserve_n_control / preserve_n_after (register preservation hints), and full_custom_abi (custom calling convention flag). These are later read back by sub_1273F90 and re-encoded as LLVM named metadata with MDString keys.
CUDA Semantic Mapping
The central task of this layer is mapping CUDA-specific semantics to LLVM IR constructs. The following table summarizes every CUDA concept and its IR representation:
| CUDA Concept | LLVM IR Representation | Codegen Function |
|---|---|---|
| threadIdx.x | call i32 @llvm.nvvm.read.ptx.sreg.tid.x() | sub_1286E40 (EmitSpecialVarMemberAccess) |
| blockIdx.y | call i32 @llvm.nvvm.read.ptx.sreg.ctaid.y() | same, category 2, component 1 |
| blockDim.z | call i32 @llvm.nvvm.read.ptx.sreg.ntid.z() | same, category 1, component 2 |
| gridDim.x | call i32 @llvm.nvvm.read.ptx.sreg.nctaid.x() | same, category 3, component 0 |
| warpSize | call i32 @llvm.nvvm.read.ptx.sreg.warpsize() | sub_1285550 (EmitSpecialVarAccess) |
| __shared__ variable | @var = addrspace(3) global ... | sub_916430 (address space = 3) |
| __constant__ variable | @var = addrspace(4) global ... | same (address space = 4) |
| __device__ variable | @var = addrspace(1) global ... | same (address space = 1) |
| __global__ function | define void @kern() #0 + !{ptr @kern, !"kernel", i32 1} in nvvm.annotations | sub_93AE30 |
| __launch_bounds__(N, M) | !{!"nvvm.maxntid", !"N,1,1"} + !{!"nvvm.minctasm", !"M"} | same |
| __cluster_dims__(x,y,z) | !{!"nvvm.cluster_dim", !"x,y,z"} + !{!"nvvm.blocksareclusters"} | same |
| __syncthreads() | Builtin ID dispatch -> llvm.nvvm.barrier0 | sub_12B3FD0 (cases 0xB5--0xCC) |
| atomicAdd(ptr, val) | Builtin dispatch -> atomicrmw add or llvm.nvvm.atomic.* | same (cases 0xBA--0xCC) |
| printf(fmt, ...) | Rewritten to vprintf(fmt, packed_buf) | sub_12992B0 (LowerPrintfToVprintf) |
| __asm__("ptx" : ...) | call void asm sideeffect "ptx", "=r,..."(...) | sub_1292420 (EmitInlineAsm) |
| Texture/surface ops | call @llvm.nvvm.tex.* / @llvm.nvvm.suld.* | sub_12ADE80, sub_12AA9B0 |
| __nv_float2int_rz | call i32 @__nv_float2int_rz(float %v) | sub_128A450 (EmitCast, NVIDIA intrinsic path) |
The special variable recognition pipeline (sub_127F7A0) checks five preconditions before treating a variable as a hardware register read: (1) the in-kernel flag at IRGenState+376 must be set, (2) the symbol must not be extern, (3) it must not be template-dependent, (4) its element count must be 1, and (5) its name must be non-null. The intrinsic IDs are stored in a static 5x3 table (unk_427F760): 5 categories (threadIdx, blockDim, blockIdx, gridDim, warpSize) times 3 components (x, y, z), with warpSize using only the first slot.
Common IR Emission Patterns
Alloca-at-entry
Every local variable and parameter copy uses the same pattern:
sub_127FC40(ctx, type, name, alignment, addrspace)
-> sub_921B80(ctx, type, name, arraySize=0)
-> insert AllocaInst BEFORE the allocapt sentinel
-> set alignment bits
-> return alloca pointer
The critical detail: when arraySize == 0 (the common case), the alloca is inserted at IRGenState+456+24 -- the position just before the allocapt marker. This ensures all allocas land at the top of the entry block regardless of where in the function body they are created.
Instruction insertion and debug location
After creating any instruction, the same 15-line pattern inserts it into the current basic block and attaches debug metadata:
bb       = ctx[1][1];            // current BB
sentinel = ctx[1][2];            // insertion sentinel
sub_157E9D0(bb + 40, inst);      // update BB instruction list
                                 // (doubly-linked list pointer surgery with 3-bit tag in low bits)
sub_164B780(inst, &name);        // set instruction name (e.g., "arraydecay")
debugLoc = *ctx_debug;
if (debugLoc) {
    sub_1623A60(&loc, debugLoc, 2);    // clone debug location
    *(inst + 48) = loc;                // attach at instruction offset +48
    sub_1623210(&loc, loc, inst + 48); // register in debug info list
}
The low 3 bits of list pointers carry tag/flags (alignment guarantees those bits are zero for valid pointers). Offset +24 is prev, +32 is parent block, +48 is debug location on each instruction node.
Constant vs instruction dispatch
Throughout expression codegen, a consistent threshold check determines whether to constant-fold or create an IR instruction:
if (*(BYTE*)(value + 16) > 0x10u) {
    // Real IR instruction -> emit IR-level operation
    result = sub_15FDBD0(opcode, value, destTy, &out, 0); // CastInst
} else {
    // Constant value -> constant-fold
    result = sub_15A46C0(opcode, value, destTy, 0);       // ConstantExpr
}
The byte at value+16 encodes the LLVM Value subclass kind. Values <= 0x10 are constants (ConstantInt, ConstantFP, ConstantPointerNull); values > 0x10 are Instruction subclasses. This avoids creating unnecessary instructions when both operands are compile-time constants.
Short-circuit boolean evaluation
Logical AND (&&) and OR (||) use the same short-circuit pattern with PHI merge:
; Logical AND (a && b):
%lhs = icmp ne i32 %a, 0
br i1 %lhs, label %land.rhs, label %land.end
land.rhs:
%rhs = icmp ne i32 %b, 0
br label %land.end
land.end:
%0 = phi i1 [ false, %entry ], [ %rhs, %land.rhs ]
%land.ext = zext i1 %0 to i32
Logical OR inverts the branch sense: TRUE goes to the end block (result is true), FALSE falls through to evaluate the RHS. Both share the same ZExt epilogue code via a merged tail at LABEL_162, selecting the name "land.ext" or "lor.ext" through a variable.
Printf lowering
Device-side printf cannot use C varargs. The compiler rewrites it to CUDA's vprintf(fmt, packed_buffer) ABI:
1. Look up or create `@vprintf` in the module via `Module::getOrInsertFunction`.
2. Allocate a stack buffer (the `"tmp"` alloca, cached at IRGenState+152 for reuse across multiple printf calls in the same function).
3. For each vararg: compute its byte size, round the running offset up to the argument's natural alignment, GEP into the buffer (`"buf.indexed"`), bitcast if needed (`"casted"`), and store.
4. Promote `float` arguments to `double` per the C variadic convention (fpext).
5. If the total packed size exceeds the current alloca size, patch the alloca's size operand in-place by manipulating the use-def chain.
6. Emit `call i32 @vprintf(ptr %fmt, ptr %buf)`.
The alloca in-place resize (step 5) is unusual -- most LLVM passes would create a new alloca. NVIDIA's motivation is to maintain a single alloca that dominates all printf pack sites within a function.
Type Translation System
The EDG-to-LLVM type translation (sub_91AED0 and its callees) is a worklist-driven fixed-point computation that runs before per-function codegen. It translates every EDG type node into an LLVM type, handling:
- Primitive types: Direct mapping (EDG `int` -> LLVM `i32`, EDG `float` -> LLVM `float`).
- Pointer types: Carry qualifier words at node+18 that encode CUDA address spaces (qualifier 1 = global/addrspace 1, qualifier 32 = shared/addrspace 3, qualifier 33 = constant/addrspace 4).
- Struct/union/class types: Recursive member-by-member translation with reference counting to handle shared sub-types and diamond inheritance.
- Typedef chains: Stripped by the standard `for (t = type; tag == 12; t = *(t+160))` idiom.
- Template specializations: Two-pass approach -- syntactic substitution (`sub_908040`) followed by semantic matching (`sub_910920`), gated by optimization flags.
- Mutually recursive types: Handled by the fixed-point iteration `do { changed = process_all(); } while (changed)`.
All hash tables in the type system use the standard DenseMap infrastructure with NVVM-layer sentinels (-8 / -16). See Hash Table and Collection Infrastructure for the common implementation.
Global Variable Codegen
Device-side globals (__device__, __constant__, __shared__, __managed__) are emitted by sub_916430 (determineAddressSpaceAndCreate) which reads EDG IL node attributes at offsets +0x88 (storage class), +0x9C, +0xAE, and +0xB0 to determine the NVPTX address space:
| EDG Attribute | NVPTX Address Space | PTX Qualifier |
|---|---|---|
__device__ | 1 (global) | .global |
__constant__ | 4 (constant) | .const |
__shared__ | 3 (shared) | .shared |
| Generic (default) | 0 (generic) | (none) |
After creating the GlobalVariable, sub_915400 (finalizeGlobals) orchestrates module-level metadata emission: nvvmir.version (IR version metadata), nvvm.annotations (kernel and parameter annotations), llvm.used (prevents dead-global elimination), Debug Info Version module flag (value 3), and optionally llvm.ident.
Naming Conventions
The IR generation layer produces named IR values that match Clang's naming conventions almost exactly, confirming that NVVM's codegen was closely modeled on Clang's IRGen:
| IR Name | Context | Source |
|---|---|---|
"entry" | Function entry basic block | sub_946060 |
"return" | Return basic block | sub_946060 |
"allocapt" | Sentinel instruction for alloca grouping | sub_946060 |
"retval" | Return value alloca | sub_946060 |
"agg.result" | Sret argument | sub_938240 |
<name>.addr | Parameter alloca | sub_938240 / sub_9446C0 |
"temp_param" | Unnamed parameter | sub_938240 |
"this" | Implicit C++ this parameter | sub_938240 |
"__val_param"<name> | Byval parameter copy | sub_938240 |
"arraydecay" | Array-to-pointer decay GEP | sub_128D0F0 (opcode 0x15) |
"lnot" / "lnot.ext" | Logical NOT + ZExt | sub_128D0F0 (opcode 0x1D) |
"land.rhs" / "land.end" / "land.ext" | Logical AND blocks + result | sub_128D0F0 (opcode 0x57) |
"lor.rhs" / "lor.end" / "lor.ext" | Logical OR blocks + result | sub_128D0F0 (opcode 0x58) |
"cond.true" / "cond.false" / "cond.end" | Ternary operator blocks | sub_128D0F0 (opcode 0x67) |
"tobool" / "conv" | Cast results | sub_128A450 |
"sub.ptr.lhs.cast" / "sub.ptr.rhs.cast" / "sub.ptr.sub" / "sub.ptr.div" | Pointer subtraction | sub_128D0F0 (opcode 0x34) |
"if.then" / "if.else" / "if.end" | If statement blocks | sub_937020 |
"while.cond" / "while.body" / "while.end" | While loop blocks | sub_937180 |
"for.cond" / "for.body" / "for.inc" / "for.end" | For loop blocks | sub_936D30 |
"do.body" / "do.cond" / "do.end" | Do-while loop blocks | sub_936B50 |
"bf.*" | Bitfield access temporaries (30+ variants) | sub_1282050 / sub_1284570 |
"predef_tmp_comp" | Special register read result | sub_1286E40 |
"buf.indexed" / "casted" | Printf buffer GEP and cast | sub_12992B0 |
"asmresult" | Inline asm extractvalue result | sub_1292420 |
Sub-Page Navigation
The IR generation subsystem is documented in detail across four sub-pages, each covering a major functional area:
- Expression & Constant Codegen -- The `EmitExpr` master dispatch (sub_128D0F0), its 40-operator inner switch, compile-time constant emission (sub_127D8B0), and the cast/conversion codegen (sub_128A450). Covers every C/C++ expression type from array decay to pointer subtraction to logical short-circuit.
- Statement & Control Flow Codegen -- The `emitStmt` dispatcher (sub_9363D0), basic block creation for if/while/do-while/for/switch, cleanup scope management for C++ destructors, label and goto handling, and `#pragma unroll` metadata attachment.
- Function, Call & Inline Asm Codegen -- Function skeleton creation (sub_946060), the parameter prolog (sub_938240), call instruction emission with ABI classification (sub_93CB50), inline asm template parsing and constraint construction (sub_1292420), printf-to-vprintf lowering (sub_12992B0), and the 770-entry builtin dispatch table (sub_12B3FD0).
- Type Translation, Globals & Special Vars -- The fixed-point type translation system (sub_91AED0), address space mapping for CUDA memory qualifiers, global variable creation (sub_916430), kernel metadata emission (sub_93AE30), function attribute handling (sub_12735D0), and special variable codegen for threadIdx/blockIdx/blockDim/gridDim/warpSize.
Expression & Constant Codegen
The central expression emitter sub_128D0F0 (56 KB, 1751 decompiled lines) is the single function responsible for translating every C/C++ expression in the EDG AST into LLVM IR. It is a large recursive two-level switch: the outer switch classifies the expression node kind (operation, literal, member access, call, etc.), and the inner switch dispatches across 40+ C operators to emit the corresponding LLVM IR instruction sequences. Every named temporary in the output (%arraydecay, %land.ext, %sub.ptr.div, %cond, etc.) originates from explicit SetValueName calls within this function, closely mirroring Clang's IRGen naming conventions.
Two companion subsystems handle specialized expression domains: bitfield codegen (sub_1282050 store, sub_1284570 load) lowers C bitfield accesses to shift/mask/or sequences, and constant expression codegen (sub_127D8B0, 1273 lines) produces llvm::Constant* values for compile-time evaluable expressions. Cast codegen (sub_128A450, 669 lines) maps every C cast category to the appropriate LLVM cast opcode.
| Master dispatcher | sub_128D0F0 — EmitExpr (56 KB, address 0x128D0F0) |
| Bitfield store | sub_1282050 — EmitBitfieldStore (15 args, R-M-W sequence) |
| Bitfield load | sub_1284570 — EmitBitfieldLoad (12 args, extract sequence) |
| Constant expressions | sub_127D8B0 — EmitConstExpr (1273 lines, recursive) |
| Cast/conversion | sub_128A450 — EmitCast (669 lines, 11 LLVM opcodes) |
| Bool conversion | sub_127FEC0 — EmitBoolExpr (expr to i1) |
| Literal emission | sub_127F650 — EmitLiteral (numeric/string constants) |
Master Expression Dispatcher
Reconstructed signature
// sub_128D0F0
llvm::Value *EmitExpr(CodeGenState **ctx, EDGExprNode *expr,
llvm::Type *destTy, unsigned flags, unsigned flags2);
The ctx parameter is a pointer-to-pointer hierarchy:
| Offset | Field |
|---|---|
*ctx | IRBuilder state (current function, insert point) |
ctx[1] | Debug info context: [0] = debug scope, [1] = current BB, [2] = insertion sentinel |
ctx[2] | LLVM module/context handle |
EDG expression node layout
Every expression node passed as expr has a fixed layout:
| Offset | Size | Field |
|---|---|---|
| +0x00 | 8 | Type pointer (EDG type node) |
| +0x18 | 1 | Outer opcode (expression kind byte) |
| +0x19 | 1 | Flags byte |
| +0x24 | 12 | Source location info |
| +0x38 | 1 | Inner opcode (operator sub-kind, for kind=1) |
| +0x48 | 8 | Child/operand pointer |
Type nodes carry a tag at offset +140: 12 = typedef alias (follow +160 to unwrap), 1 = void. The typedef-stripping idiom appears 15+ times throughout the function:
// Type unwrapping — strips typedef aliases to canonical type
for (Type *t = expr->type; *(uint8_t*)(t + 140) == 12; t = *(Type**)(t + 160))
;
Outer switch — expression categories
The byte at expr+0x18 selects the top-level expression category:
| Kind | Category | Handler |
|---|---|---|
0x01 | Operation expression | Inner switch on expr+0x38 (40+ C operators) |
0x02 | Literal constant | EmitLiteral (sub_127F650) |
0x03 | Member/field access | EmitAddressOf + EmitLoadFromAddress |
0x11 | Call expression | EmitCall (sub_1296570) |
0x13 | Init expression | EmitInitExpr (sub_1281220) |
0x14 | Declaration reference | EmitAddressOf + EmitLoadFromAddress |
| default | Fatal: "unsupported expression!" |
Inner switch — complete opcode reference
When the outer kind is 0x01 (operation), the byte at expr+0x38 selects which C operator to emit. The complete dispatch table follows: every opcode the binary handles is listed, and any value not shown falls through to the default fatal-diagnostic case.
| Opcode | C operator | Handler / delegate | LLVM pattern |
|---|---|---|---|
0x00 | Constant subexpr | sub_72B0F0 (evaluate) + sub_1286D80 (load) | Constant materialization |
0x03 | Compound special A | EmitCompoundAssign (sub_1287ED0) | Read-modify-write |
0x05 | Dereference (*p) | Elide if child is &: IsAddressOfExpr (sub_127B420). Otherwise: recursive EmitExpr + EmitLoad (sub_128B370) | %val = load T, ptr %p |
0x06 | Compound special B | EmitCompoundAssign (sub_1287ED0) | Read-modify-write |
0x08 | Compound special C | EmitCompoundAssign (sub_1287ED0) | Read-modify-write |
0x15 | Array decay | See Array decay | %arraydecay = getelementptr inbounds ... |
0x19 | Parenthesized (x) | Tail-call optimization: a2 = child, restart loop | (no IR emitted) |
0x1A | sizeof / alignof | EmitSizeofAlignof (sub_128FDE0) | Constant integer |
0x1C | Bitwise NOT (~x) | sub_15FB630 (xor with -1) | %not = xor i32 %x, -1 |
0x1D | Logical NOT (!x) | Two-phase: EmitBoolExpr + zext | %lnot = icmp eq ..., 0 / %lnot.ext = zext i1 ... to i32 |
0x1E | Type-level const | ConstantFromType (sub_127D2C0) | Compile-time constant |
0x1F | Type-level const | ConstantFromType (sub_127D2C0) | Compile-time constant |
0x23 | Pre-increment ++x | EmitIncDec (sub_128C390): prefix=1, inc=1 | %inc = add ... / %ptrincdec = getelementptr ... |
0x24 | Pre-decrement --x | EmitIncDec (sub_128C390): prefix=0, inc=0 | %dec = sub ... / %ptrincdec = getelementptr ... |
0x25 | Post-increment x++ | EmitIncDec (sub_128C390): prefix=1, inc=0 | Returns old value; %inc = add ... |
0x26 | Post-decrement x-- | EmitIncDec (sub_128C390): prefix=0, inc=1 | Returns old value; %dec = sub ... |
0x27-0x2B | +, -, *, /, % | EmitBinaryArithCmp (sub_128F9F0) | add/sub/mul/sdiv/srem (or u/f variants) |
0x32 | Comma (a, b) | Emit both sides; return RHS | (LHS discarded) |
0x33 | Subscript a[i] | EmitSubscriptOp (sub_128B750): GEP + load | %arrayidx = getelementptr ... + load |
0x34 | Pointer subtraction | See Pointer subtraction | %sub.ptr.div = sdiv exact ... |
0x35-0x39 | ==, !=, <, >, <=, >= | EmitBinaryArithCmp (sub_128F9F0) | icmp eq/ne/slt/sgt/sle/sge (or u/f variants) |
0x3A | << | EmitShiftOrBitwise (sub_128F580): triple (1, 32, 32) | shl |
0x3B | >> | EmitShiftOrBitwise (sub_128F580): triple (14, 33, 33) | ashr (signed) / lshr (unsigned) |
0x3C | & | EmitShiftOrBitwise (sub_128F580): triple (2, 38, 34) | and |
0x3D | ^ | EmitShiftOrBitwise (sub_128F580): triple (4, 40, 36) | xor |
0x3E | | | EmitShiftOrBitwise (sub_128F580): triple (3, 39, 35) | or |
0x3F | Rotate | EmitShiftOrBitwise (sub_128F580): triple (5, 41, 37) | llvm.fshl / llvm.fshr |
0x41-0x46 | Type-level consts | ConstantFromType (sub_127D2C0) | Compile-time constant |
0x49 | Member access ./-> | See Member access | getelementptr + load (or bitfield path) |
0x4A | += | EmitCompoundAssignWrapper (sub_12901D0) + sub_1288F60 | Load + add + store |
0x4B | -= | EmitCompoundAssignWrapper (sub_12901D0) + sub_1288370 | Load + sub + store |
0x4C | *= | EmitCompoundAssignWrapper (sub_12901D0) + sub_1288770 | Load + mul + store |
0x4D | /= | EmitCompoundAssignWrapper (sub_12901D0) + sub_1289D20 | Load + div + store |
0x4E | %= | EmitCompoundAssignWrapper (sub_12901D0) + sub_1288DC0 | Load + rem + store |
0x4F | &= | EmitCompoundAssignWrapper (sub_12901D0) + sub_1288B70 | Load + and + store |
0x50 | |= | EmitCompoundAssignWrapper (sub_12901D0) + sub_1289360 | Load + or + store |
0x51 | <<= | EmitCompoundAssignWrapper (sub_12901D0) + sub_1288090 | Load + shl + store |
0x52 | >>= | EmitCompoundAssignWrapper (sub_12901D0) + sub_1287F30 | Load + ashr/lshr + store |
0x53 | ^= | EmitCompoundAssignWrapper (sub_12901D0) + sub_1288230 | Load + xor + store |
0x54 | ,= (rare) | EmitCompoundAssignWrapper (sub_12901D0) + sub_128BE50 | Comma-compound |
0x55 | []= (subscript compound) | EmitCompoundAssignWrapper (sub_12901D0) + sub_128B750 | GEP + R-M-W |
0x56 | Bitfield assign | See Bitfield Codegen | R-M-W sequence |
0x57 | Logical AND && | See Logical AND | land.rhs/land.end + PHI |
0x58 | Logical OR || | See Logical OR | lor.rhs/lor.end + PHI |
0x59, 0x5A, 0x5D | Type-level consts | ConstantFromType (sub_127D2C0) | Compile-time constant |
0x5B | Statement expression ({...}) | EmitStmtExpr (sub_127FF60); create empty BB if (*a1)[7] == 0 | Body emission |
0x5C, 0x5E, 0x5F | Compound special | EmitCompoundAssign (sub_1287ED0) | Read-modify-write |
0x67 | Ternary ?: | See Ternary operator | cond.true/cond.false/cond.end + PHI |
0x68 | Type-level const | ConstantFromType (sub_127D2C0) | Compile-time constant |
0x69 | Special const | EmitSpecialConst (sub_1281200) | Constant materialization |
0x6F | Label address &&label | GCC extension: sub_12A4D00 (lookup) + sub_1285E30(builder, label, 1) | blockaddress(@fn, %label) |
0x70 | Label value | sub_12A4D00 + sub_12812E0(builder, label, type) | Indirect goto target |
0x71 | Computed goto goto *p | sub_12A4D00 + sub_1285E30(builder, label, 0) | indirectbr |
0x72 | va_arg | sub_12A4D00 on va_list child + sub_1286000 | va_arg lowering |
| default | FatalDiag (sub_127B550) | "unsupported operation expression!" |
Shift and bitwise triple encoding
The EmitShiftOrBitwise (sub_128F580) triple (signedOp, intOp, fpOp) encodes three things: signedOp controls signed-vs-unsigned selection for right shift (14 selects ashr for signed, lshr for unsigned), intOp is the LLVM integer opcode number, and fpOp is the floating-point variant (unused for shift/bitwise but present for uniformity).
Increment / decrement detail
EmitIncDec (sub_128C390, 16 KB) handles integer, floating-point, and pointer types. It reads the expression type to select the arithmetic operation:
- Integer path: `add/sub nsw i32 %x, 1` with name `"inc"` or `"dec"`. For prefix variants, the incremented value is returned; for postfix, the original value is returned and the increment is stored.
- Floating-point path: `fadd/fsub float %x, 1.0` with the same return-value semantics.
- Pointer path: `getelementptr inbounds T, ptr %p, i64 1` (or `i64 -1` for decrement) with name `"ptrincdec"`. The element type comes from the pointed-to type.
All paths load the current value, compute the new value, store back, and return either old or new depending on prefix/postfix.
Compound assignment wrapper mechanics
EmitCompoundAssignWrapper (sub_12901D0) implements the common load-compute-store pattern for all compound assignment operators (+=, -=, etc.):
// sub_12901D0 pseudocode
Value *EmitCompoundAssignWrapper(ctx, expr, impl_fn, flags) {
    Value *addr    = EmitAddressOf(ctx, expr->lhs);      // sub_1286D80
    Value *old_val = EmitLoadFromAddress(ctx, addr);     // sub_1287CD0
    Value *rhs_val = EmitExpr(ctx, expr->rhs);           // sub_128D0F0 (recursive)
    Value *new_val = impl_fn(ctx, old_val, rhs_val);     // per-operator function
    EmitStore(ctx, new_val, addr);                       // store back
    return new_val;
}
Each impl_fn is a small function (typically 200-400 lines) that handles integer/float type dispatch and signedness. For example, sub_1288F60 (AddAssign) selects between add, fadd, and pointer-GEP addition.
Member access multi-path handler
Opcode 0x49 handles struct field access (. and ->) through a multi-path dispatcher:
- Simple scalar field (field count == 1): Computes the field address via `EmitAddressOf` (sub_1286D80), checks the volatile bit (v349 & 1), copies 12 DWORDs of field descriptor into the local frame, then loads via `EmitLoadFromAddress` (sub_1287CD0).
- Bitfield field: If the field descriptor indicates a bitfield, routes to `EmitBitfieldAccess` (sub_1282050), which emits the shift/mask extraction sequence.
- Nested/union access (field count > 1): Calls `ComputeCompositeMemberAddr` (sub_1289860) for multi-level GEP computation, then `EmitComplexMemberLoad` (sub_12843D0).
- Write-only context: If the assignment bit (a2+25, bit 2) is set, returns null -- the caller only needs the address, not the loaded value.
Statement expression, label address, and va_arg
Statement expression (0x5B): Emits the compound statement body via EmitStmtExpr (sub_127FF60). If no return basic block exists yet ((*a1)[7] == 0), creates an anonymous empty BB via CreateBasicBlock + SetInsertPoint to serve as the fall-through target. The value of the last expression in the block is the statement expression's result.
Label address (0x6F): Implements the GCC &&label extension. Looks up the label via LookupLabel (sub_12A4D00), then creates a blockaddress(@current_fn, %label) constant via sub_1285E30(builder, label, 1). The second argument 1 distinguishes "take address" from "goto to".
Computed goto (0x71): The goto *ptr extension. Same LookupLabel call, but sub_1285E30(builder, label, 0) with flag 0 emits an indirectbr instruction targeting the resolved label.
va_arg (0x72): Extracts the va_list child node at +72, its sub-child at +16, resolves both via sub_12A4D00, then calls EmitVaArg (sub_1286000) which lowers to a va_arg LLVM instruction with the appropriate type.
Constant vs. instruction dispatch
Throughout all operator emission, a consistent pattern selects between constant folding and IR instruction creation. The byte at Value+16 encodes the LLVM Value subclass kind: values <= 0x10 are constants (ConstantInt, ConstantFP, etc.) and values > 0x10 are instructions. This check appears 20+ times throughout the function, always with the same structure:
// Constant-fold or emit IR? Decision pattern (appears 20+ times)
if (*(uint8_t*)(value + 16) > 0x10) {
    // Real IR instruction -- create via IR builder
    result = CreateCast(opcode, value, destTy, &out, 0);  // sub_15FDBD0
    // ...or, for binary operators:
    result = CreateBinOp(opcode, lhs, rhs, &out, 0);      // sub_15FB440
} else {
    // Compile-time constant -- constant-fold at LLVM ConstantExpr level
    result = ConstantExprCast(opcode, value, destTy, 0);  // sub_15A46C0
    // ...or, for binary operators:
    result = ConstantFoldBinOp(lhs, rhs, 0, 0);           // sub_15A2B60
}
The dispatch table for the constant-fold vs IR-instruction paths:
| Operation | IR path (Value > 0x10) | Constant path (Value <= 0x10) |
|---|---|---|
| Binary op | CreateBinOp (sub_15FB440) | ConstantFoldBinOp (sub_15A2B60) |
| Unary NOT | CreateUnaryOp (sub_15FB630) | ConstantFoldUnary (sub_15A2B00) |
| Cast | CreateCast (sub_15FDBD0) | ConstantExprCast (sub_15A46C0) |
| Int compare | sub_15FEC10(op=51, pred) | sub_15A37B0(pred, lhs, rhs) |
| Float compare | sub_15FEC10(op=52, pred) | sub_15A37B0(pred, lhs, rhs) |
| Sub (constant) | CreateBinOp(13=Sub) | ConstantFoldSub (sub_15A2B60) |
| SDiv exact | CreateBinOp(18=SDiv) + SetExactFlag | ConstantFoldSDiv (sub_15A2C90) |
When the constant path is taken, no LLVM instruction is created and no BB insertion occurs -- the result is a pure llvm::Constant* that can be used directly. This is critical for expressions like sizeof(int) + 4 where no runtime code should be emitted.
Key Expression Patterns
Array decay
Opcode 0x15. Converts an array lvalue to a pointer to its first element.
When IsArrayType (sub_8D23B0) confirms the source is an array type, the emitter creates an inbounds GEP with two zero indices. The GEP instruction is constructed manually: allocate 72 bytes for 3 operands via AllocateInstruction, compute the result element type, propagate address space qualifiers from the source, then fill operands (base, i64 0, i64 0) and mark inbounds:
%arraydecay = getelementptr inbounds [N x T], ptr %arr, i64 0, i64 0
If the source is already a pointer type (not an array), the function either passes through directly or inserts a ptrtoint / zext if the types differ.
Pointer subtraction
Opcode 0x34. The classic 5-step Clang pattern for (p1 - p2):
%sub.ptr.lhs.cast = ptrtoint ptr %p1 to i64
%sub.ptr.rhs.cast = ptrtoint ptr %p2 to i64
%sub.ptr.sub = sub i64 %sub.ptr.lhs.cast, %sub.ptr.rhs.cast
%sub.ptr.div = sdiv exact i64 %sub.ptr.sub, 4 ; element_size=4 for int*
Step 5 (the sdiv exact) is skipped entirely when the element size is 1 (i.e., char* arithmetic), since division by 1 is a no-op. The element size comes from the pointed-to type at offset +128. The exact flag on sdiv tells the optimizer that the division is known to produce no remainder -- a critical optimization hint.
Logical AND (short-circuit)
Opcode 0x57. Creates two basic blocks and a PHI node for C's short-circuit && evaluation:
entry:
%lhs = icmp ne i32 %a, 0
br i1 %lhs, label %land.rhs, label %land.end
land.rhs:
%rhs = icmp ne i32 %b, 0
br label %land.end
land.end:
%0 = phi i1 [ false, %entry ], [ %rhs, %land.rhs ]
%land.ext = zext i1 %0 to i32
The construction sequence:
1. Create blocks `land.end` and `land.rhs` via `CreateBasicBlock` (sub_12A4D50).
2. Emit the LHS as a boolean via `EmitBoolExpr` (sub_127FEC0).
3. Conditional branch: `br i1 %lhs, label %land.rhs, label %land.end`.
4. Switch the insertion point to `%land.rhs`.
5. Emit the RHS as a boolean.
6. Unconditional branch to `%land.end`.
7. Switch to `%land.end` and construct a PHI with 2 incoming edges.
8. Zero-extend the `i1` PHI result to the expression's declared type (typically `i32`) with the name `land.ext`.
The PHI node is allocated as 64 bytes via AllocatePHI (sub_1648B60), initialized with opcode 53 (PHI), and given a capacity of 2. Incoming values are stored in a compact layout: [val0, val1, ..., bb0, bb1, ...] where each value slot occupies 24 bytes (value pointer + use-list doubly-linked-list pointers), and basic block pointers form a parallel array after all value slots.
Logical OR (short-circuit)
Opcode 0x58. Identical structure to logical AND but with inverted branch sense: the TRUE outcome of the LHS branches to lor.end (short-circuits to true), and FALSE falls through to evaluate the RHS:
entry:
%lhs = icmp ne i32 %a, 0
br i1 %lhs, label %lor.end, label %lor.rhs
lor.rhs:
%rhs = icmp ne i32 %b, 0
br label %lor.end
lor.end:
%0 = phi i1 [ true, %entry ], [ %rhs, %lor.rhs ]
%lor.ext = zext i1 %0 to i32
Internally, the AND and OR paths share a common tail (merging at a single code point with a variable holding either "lor.ext" or "land.ext").
Ternary / conditional operator
Opcode 0x67. Constructs a full three-block diamond with PHI merge for a ? b : c:
entry:
%cond.bool = icmp ne i32 %test, 0
br i1 %cond.bool, label %cond.true, label %cond.false
cond.true:
%v1 = <emit true expr>
br label %cond.end
cond.false:
%v2 = <emit false expr>
br label %cond.end
cond.end:
%cond = phi i32 [ %v1, %cond.true ], [ %v2, %cond.false ]
The function creates three blocks (cond.true, cond.false, cond.end), records which basic block each arm finishes in (since the true/false expression emission might create additional blocks), and builds the PHI from those recorded blocks. When one arm is void, the PHI is omitted and whichever arm produced a value is returned directly.
Logical NOT and bitwise NOT
Logical NOT (opcode 0x1D) is a two-phase emit:
%lnot = icmp eq i32 %x, 0 ; Phase 1: convert to bool
%lnot.ext = zext i1 %lnot to i32 ; Phase 2: extend back to declared type
Phase 1 calls EmitBoolExpr which produces the icmp eq ... 0 comparison. Phase 2 zero-extends the i1 back to the expression's target type. If the value is already a compile-time constant, the constant folder handles it directly.
Bitwise NOT (opcode 0x1C) produces xor with all-ones:
%not = xor i32 %x, -1
Created via CreateUnaryOp (sub_15FB630) which synthesizes xor with -1 (all bits set). Optional zext follows if the result needs widening.
Dereference with address-of elision
Opcode 0x05. Before emitting a load for unary *, the function checks if the child is an address-of expression via IsAddressOfExpr (sub_127B420). If so, the dereference and address-of cancel out -- no IR is emitted, only a debug annotation is attached. This handles the common pattern *&x becoming just x.
Bitfield Codegen
Bitfield loads and stores are lowered to shift/mask/or sequences by two dedicated functions. A path selector CanUseFastBitfieldPath (sub_127F680) determines whether the bitfield fits within a single naturally-aligned container element (fast path) or must be processed byte-by-byte (general path).
EDG bitfield descriptor
The bitfield metadata object carries:
| Offset | Type | Field |
|---|---|---|
| +120 | qword | Container type node |
| +128 | qword | Byte offset within struct |
| +136 | byte | Bit offset within containing byte |
| +137 | byte | Bit width of the field |
| +140 | byte | Type tag (12 = array wrapper, walk chain) |
| +144 | byte | Flags (bit 3 = signed bitfield) |
| +160 | qword | Next/inner type pointer |
Fast path (single-container load)
When the bitfield plus its bit range fits within one container element, the fast path loads the entire container and extracts the field with a single shift and mask:
// Example: struct { unsigned a:3; unsigned b:5; } s;
// s.b: byte_offset=0, bit_offset=3, bit_width=5, container=i8
Load s.b (fast path):
%container = load i8, ptr %s
%shifted = lshr i8 %container, 3 ; "highclear" -- position field at bit 0
%result = and i8 %shifted, 31 ; "zeroext" -- mask to 5 bits (0x1F)
The shift amount is computed as 8 * elem_size - bit_width - bit_offset - 8 * (byte_offset % elem_size). When this evaluates to zero, the lshr is constant-folded away.
For signed bitfields, the zero-extend is replaced with an arithmetic sign extension via shift-left then arithmetic-shift-right:
%shifted = lshr i8 %container, 3 ; "highclear"
%signext = ashr i8 %shifted, 5 ; "signext" -- propagates sign bit
Store s.b = val (fast path read-modify-write):
%container = load i8, ptr %s
%bf.value = and i8 %val, 31 ; mask to 5 bits
%cleared = and i8 %container, 7 ; "bf.prev.cleared" -- clear bits [3:7]
%positioned = shl i8 %bf.value, 3 ; "bf.newval.positioned"
%merged = or i8 %cleared, %positioned ; "bf.finalcontainerval"
store i8 %merged, ptr %s
The clear mask is ~(((1 << bit_width) - 1) << bit_position). For containers wider than 64 bits, both the clear mask and the value mask are computed via APInt operations (sub_16A5260 to set bit range, sub_16A8F40 to invert).
Byte-by-byte path (spanning load)
When the bitfield spans multiple container elements, it is processed one byte at a time. Each iteration loads a byte, extracts the relevant bits, zero-extends to the accumulator width, shifts into position, and ORs into the running accumulator.
For example, a 20-bit field starting at byte 0, bit 0:
; Byte 0: bits [0:7]
%bf.base.i8ptr = bitcast ptr %s to ptr ; pointer cast
%byte0.ptr = getelementptr i8, ptr %bf.base.i8ptr, i64 0
%bf.curbyte.0 = load i8, ptr %byte0.ptr
%bf.byte_zext.0 = zext i8 %bf.curbyte.0 to i32
; accumulator = %bf.byte_zext.0 (shift=0 for first byte)
; Byte 1: bits [8:15]
%byte1.ptr = getelementptr i8, ptr %bf.base.i8ptr, i64 1
%bf.curbyte.1 = load i8, ptr %byte1.ptr
%bf.byte_zext.1 = zext i8 %bf.curbyte.1 to i32
%bf.position.1 = shl i32 %bf.byte_zext.1, 8 ; "bf.position"
%bf.merge.1 = or i32 %bf.byte_zext.0, %bf.position.1 ; "bf.merge"
; Byte 2: only 4 bits remain (20 - 16 = 4)
%byte2.ptr = getelementptr i8, ptr %bf.base.i8ptr, i64 2
%bf.curbyte.2 = load i8, ptr %byte2.ptr
%bf.end.highclear = lshr i8 %bf.curbyte.2, 4 ; "bf.end.highclear" -- clear top 4 bits
%bf.byte_zext.2 = zext i8 %bf.end.highclear to i32
%bf.position.2 = shl i32 %bf.byte_zext.2, 16
%bf.merge.2 = or i32 %bf.merge.1, %bf.position.2
The byte-by-byte store path mirrors this in reverse: for boundary bytes (first and last), it loads the existing byte, masks out the target bits with AND, positions the new bits with SHL, and merges with OR. Middle bytes that are entirely overwritten skip the read-modify-write and store directly.
The bf.* naming vocabulary
All bitfield IR values use a consistent naming scheme:
| Name | Path | Meaning |
|---|---|---|
| bf.base.i8ptr | Both | Pointer cast to i8* |
| bf.curbyte | Load | Current byte in iteration loop |
| bf.end.highclear | Load | lshr to clear unused high bits in last byte |
| bf.byte_zext | Load | zext of byte to accumulator width |
| bf.position | Both | shl to position byte/value within accumulator/container |
| bf.merge | Load | or to merge byte into accumulator |
| bf.highclear | Load | lshr before sign extension |
| bf.finalval | Load | ashr for sign extension |
| highclear | Load fast | Fast-path lshr to clear high bits |
| zeroext | Load fast | Fast-path zero-extend result |
| signext | Load fast | Fast-path ashr sign extension |
| bf.value | Store | and(input, width_mask) -- isolated field bits |
| bf.prev.cleared | Store fast | Container with old field bits cleared |
| bf.newval.positioned | Store fast | New value shifted to field position |
| bf.finalcontainerval | Store fast | or(cleared, positioned) -- final container |
| bf.reload.val | Store | Truncated value for compound assignment reload |
| bf.reload.sext | Store | Sign-extended reload via shift pair |
| bassign.tmp | Store | Alloca for temporary during bitfield assignment |
Wide bitfield support (> 64 bits)
Both load and store functions handle bitfields wider than 64 bits through APInt operations. The threshold check width > 0x40 (64) appears throughout: values <= 64 bits use inline uint64_t masks computed as 0xFFFFFFFFFFFFFFFF >> (64 - width), while wider values allocate heap-backed APInt word arrays. Every code path carefully frees heap APInts after use. This supports __int128 bitfields in CUDA.
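The inline mask computation for the fast (<= 64-bit) path can be written out directly. A minimal sketch, with the hypothetical helper name `bf_width_mask`; note that width 0 would be undefined behavior (a shift by 64), consistent with zero-width fields never reaching this path:

```c
#include <assert.h>
#include <stdint.h>

/* The <= 64-bit mask described above: all-ones shifted right by the
 * unused bit count. Valid for width in 1..64; width 0 would shift
 * by 64, which is UB in C and never occurs in this code path. */
static uint64_t bf_width_mask(unsigned width) {
    return 0xFFFFFFFFFFFFFFFFull >> (64u - width);
}
```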
Volatile and alignment
Volatile detection uses a global flag at unk_4D0463C. When set, sub_126A420 queries whether the GEP target address is in volatile memory, propagating the volatile bit to load/store instructions. The alignment parameter for bitfield container loads must be 1; the function asserts on other values with "error generating code for loading from bitfield!".
Duplicate implementations
Two additional copies exist at sub_923780 (store) and sub_925930 (load) -- identical algorithms with the same string names, same opcodes, same control flow. These likely correspond to different template instantiations or address-space variants in the original NVIDIA source. The 0x92xxxx copies are in the main NVVM frontend region while the 0x128xxxx copies are in the codegen helper region.
Constant Expression Codegen
EmitConstExpr (sub_127D8B0) converts EDG constant expression AST nodes into llvm::Constant* values. It is recursive: aggregate initializers call it for each element.
// sub_127D8B0
llvm::Constant *EmitConstExpr(CodeGenState *ctx, EDGConstExprNode *expr,
llvm::Type *arrayElemTyOverride);
The constant kind byte at expr[10].byte[13] is the primary dispatch:
| Kind | Category | Output type |
|---|---|---|
| 1 | Integer constant | ConstantInt |
| 2 | String literal | ConstantDataArray |
| 3 | Floating-point constant | ConstantFP |
| 6 | Address-of constant | GlobalVariable*, Function*, or string global |
| 0xA | Aggregate initializer | ConstantStruct, ConstantArray, or ConstantAggregateZero |
| 0xE | Null/empty | Returns 0 (no constant) |
| default | -- | Fatal: "unsupported constant variant!" |
Integer constants
For normal integers (up to 64 bits), the value is extracted via edg::GetSignedIntValue or edg::GetUnsignedIntValue depending on signedness, masked to the actual bit width, and passed to ConstantInt::get(context, APInt).
For __int128 (type size == 16 bytes), the EDG IL stores the value as a decimal string. The path is: edg::GetIntConstAsString(expr) returns the decimal text, then APInt::fromString(128, str, len, radix=10) parses it into a 128-bit APInt. This string-based transfer suggests the EDG IL uses text encoding for portability of wide integers.
APInt memory management follows the standard pattern: values > 64 bits use heap-allocated word arrays (checked via width > 0x40). Every path frees heap APInts after consumption.
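What APInt::fromString computes for this path can be sketched with the GCC/Clang unsigned __int128 extension standing in for APInt's word arrays. The helper name `parse_u128` is hypothetical, and this handles only the unsigned-digits case, not sign or overflow checking:

```c
#include <assert.h>

/* Sketch of the __int128 decimal-string path: what
 * APInt::fromString(128, str, len, 10) computes, using the
 * GCC/Clang unsigned __int128 extension in place of APInt. */
static unsigned __int128 parse_u128(const char *s) {
    unsigned __int128 v = 0;
    for (; *s; ++s)
        v = v * 10 + (unsigned)(*s - '0');
    return v;
}
```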
When the target LLVM type is a pointer (tag 15), the integer constant is first created, then ConstantExpr::getIntToPtr converts it.
String literals
The character width is determined from a lookup table qword_4F06B40 indexed by the encoding enum at expr[10].byte[8] & 7:
| Index | Width | C type |
|---|---|---|
| 0 | 1 byte | char / UTF-8 |
| 1 | platform | wchar_t |
| 2 | 1 byte | char8_t |
| 3 | from global | platform-dependent |
| 4 | from global | platform-dependent |
The raw byte buffer is built by copying byte_count bytes from the EDG node, reading each character through edg::ReadIntFromBuffer(src, width) -- an endian-aware read function (the EDG IL may store string data in a platform-independent byte order). The buffer is then passed to ConstantDataArray::getRaw(data, byte_count) to create the LLVM constant.
For each character width, the LLVM element type is selected: i8 for 1-byte, i16 for 2-byte, i32 for 4-byte, i64 for 8-byte. Empty strings create zero-element arrays. If the array type override (the arrayElemTyOverride parameter, a3 in the decompilation) specifies a larger size than the literal, the remaining bytes are zero-filled.
Floating-point constants
Raw bit patterns are extracted via edg::ExtractFloatBits(kind, data_ptr), then reinterpreted into native float or double values:
| EDG kind | C type | Conversion path |
|---|---|---|
| 2 | float | BitsToFloat -> APFloat(float) -> IEEEsingle semantics |
| 4 | double | BitsToDouble -> APFloat(double) -> IEEEdouble semantics |
| 6 | long double | Truncated to double (with warning 0xE51) |
| 7 | __float80 | Truncated to double (with warning 0xE51) |
| 8, 13 | __float128 | Truncated to double (with warning 0xE51) |
All extended-precision types (long double, __float80, __float128) are silently lowered through the double path. NVPTX has no hardware support for 80-bit or 128-bit floats, so CICC truncates them to 64-bit IEEE 754. When the compilation context has the appropriate flag (bit 4 at offset +198), a diagnostic warning is emitted identifying the specific type being truncated.
Address-of constants
Sub-dispatched by a byte at expr[11].byte[0]:
- Byte 0 -- Variable/global reference: calls GetOrCreateGlobalVariable (sub_1276020), returning a GlobalVariable* as a constant pointer. Debug info is optionally attached.
- Byte 1 -- Function reference: calls GetOrCreateFunction (sub_1277140). For static-linkage functions, resolves through LookupFunctionStaticVar.
- Byte 2 -- String literal reference (&"..."): validates the node kind is 2 (string), then calls CreateStringGlobalConstant (sub_126A1B0).
Post-processing applies a constant GEP offset if expr[12].qword[0] is nonzero, and performs pointer type cast if the produced type differs from the expected type. Same-address-space mismatches use ConstantExpr::getBitCast; cross-address-space mismatches use ConstantExpr::getAddrSpaceCast. Pointer-to-integer mismatches use ConstantExpr::getPtrToInt with address-space normalization to addrspace(0) first.
Aggregate initializers
The largest case (630+ lines). After stripping typedefs, dispatches on the canonical type tag at +140:
| Tag | Type | Output |
|---|---|---|
| 10 | Struct | ConstantStruct or ConstantAggregateZero |
| 11 | Union | Anonymous {member_type, [N x i8]} |
| 8 | Array | ConstantArray |
| 12 | Typedef | Strip and re-dispatch |
| other | -- | Fatal: "unsupported aggregate constant!" |
Struct (tag 10): Walks the EDG field list and initializer list in parallel. The field chain is traversed via +112 pointers; the initializer list via +120 next pointers.
- Padding/zero-width fields are skipped (flag byte at +146, bit 3).
- For each non-bitfield field, GetFieldIndex (sub_1277B60) returns the LLVM struct element index. If gaps exist between the previous and current index, intermediate slots are filled with Constant::getNullValue (sub_15A06D0).
- Each field's initializer is processed by a recursive EmitConstExpr call.
- Packed struct fields (flag at +145, bit 4) have their sub-elements extracted individually via ConstantExpr::extractvalue (sub_15A0A60).
- Missing trailing fields are padded with null values.
- If the struct has no fields and the initializer list is empty, returns ConstantAggregateZero::get (sub_1598F00) as a shortcut.
- Final assembly: ConstantStruct::get (sub_159F090) with type compatibility check via Type::isLayoutIdentical (sub_1643C60). If packed, StructType::get(elts, n, true) (sub_15943F0).
Struct bitfield packing (post-processing)
When any bitfield field is detected during the main walk (flag bit 2, &4 at +144), the function re-enters a post-processing phase after the main field loop. This packs bitfield constant values byte-by-byte into the struct's byte array:
// Bitfield packing pseudocode — sub_127D8B0, case 0xA post-processing
StructLayout *layout = DataLayout::getStructLayout(structTy); // sub_15A9930
for (each bitfield field where flag &4 at +144 && name at +8 is non-null) {
uint32_t byte_offset = field->byte_offset;
uint32_t elem_idx = StructLayout::getElementContainingOffset(layout, byte_offset);
// sub_15A8020
// Validate the target byte is zero
assert(elements[elem_idx] == ConstantInt::get(i8, 0),
"unexpected error while initializing bitfield!");
// Evaluate bitfield initializer
Constant *val = EmitConstExpr(ctx, init_expr, 0); // recursive
assert(val != NULL, "bit-field constant must have a known value at compile time!");
APInt bits = extractAPInt(val); // at constant+24, width at constant+32
uint8_t bit_width = field->bit_width; // at +137
if (bits.width > bit_width)
bits = APInt::trunc(bits, bit_width); // sub_16A5A50
// Pack into struct bytes, one byte at a time
uint8_t bit_offset = field->bit_offset; // at +136 (within first byte)
while (remaining_bits > 0) {
uint8_t available = (first_byte ? 8 - bit_offset : 8);
uint8_t take = min(remaining_bits, available);
APInt slice = bits;
if (slice.width > take)
slice = APInt::trunc(slice, take); // sub_16A5A50
if (take < 8)
slice = APInt::zext(slice, 8); // sub_16A5C50
slice = slice << bit_offset; // shl
existing_byte |= slice; // sub_16A89F0
elements[byte_index] = ConstantInt::get(ctx, existing_byte);
bits = bits >> take; // sub_16A7DC0
remaining_bits -= take;
bit_offset = 0; // subsequent bytes start at bit 0
byte_index++;
}
}
This implements the C standard's bitfield byte-packing model: bits are inserted starting at the field's bit_offset within its containing byte, potentially spanning multiple bytes. Values wider than 64 bits use heap-backed APInt word arrays.
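The packing loop above can be sketched as a plain-C helper for fields up to 64 bits, with uint64_t standing in for APInt. The function name `bf_pack` is hypothetical; it assumes a zeroed destination, matching the "target byte must be zero" assertion in the recovered code:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the constant bitfield byte-packing loop: insert `width`
 * bits of `value` into a zeroed byte array, starting at `bit_offset`
 * within bytes[byte_index]. Mirrors the take/shift/merge steps of the
 * pseudocode with plain uint64_t instead of heap-backed APInt. */
static void bf_pack(uint8_t *bytes, unsigned byte_index,
                    unsigned bit_offset, unsigned width, uint64_t value) {
    uint64_t bits = width < 64 ? (value & ((1ull << width) - 1)) : value;
    unsigned remaining = width;
    while (remaining > 0) {
        unsigned avail = 8 - bit_offset;       /* 8 except possibly the first byte */
        unsigned take = remaining < avail ? remaining : avail;
        uint8_t slice = (uint8_t)(bits & ((1u << take) - 1));
        bytes[byte_index] |= (uint8_t)(slice << bit_offset);
        bits >>= take;
        remaining -= take;
        bit_offset = 0;                        /* subsequent bytes start at bit 0 */
        byte_index++;
    }
}
```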
Union (tag 11): Finds the initialized member via two paths:
- Designated initializer (kind 13): *(init+184) is the designated field, *(init+120) is the actual value expression.
- Implicit: walk the field chain (type+160) looking for the first non-skip, non-bitfield field. Named bitfield members are explicitly rejected: "initialization of bit-field in union not supported!". If no field is found: "cannot find initialized union member!".
The member value is emitted recursively. Padding to the full union byte size is added as [N x i8] zeroinitializer. The result is an anonymous {member_type, [N x i8]} struct via ConstantStruct::getAnon (sub_159F090).
Array (tag 8): Resolves element type via GetArrayElementType (sub_8D4050), walks the initializer linked list via +120 next pointers, calls EmitConstExpr recursively for each element. Designated initializers (kind 11) are supported: *(node+176) gives the designated element index, *(node+184) gives the range count. Type mismatches are handled by sub_127D000 (resize constant to target type).
When the declared dimension exceeds the initializer count, remaining elements are filled with Constant::getNullValue. The result uses ConstantArray::get (sub_159DFD0) when all elements have the same LLVM type (the common case), or falls back to an anonymous struct via StructType::get + ConstantStruct::get for heterogeneous cases (which should not occur in well-formed C but is handled defensively).
Cast / Conversion Codegen
EmitCast (sub_128A450) handles every C-level cast category. The function first checks for early exits (skip flag, identity cast where source type equals destination type), then dispatches by source and destination type tags.
// sub_128A450
llvm::Value *EmitCast(CodeGenState **ctx, EDGCastNode *expr,
uint8_t is_unsigned, llvm::Type *destTy,
uint8_t is_unsigned2, char skip_flag,
DiagContext *diag);
Type classification
Type tags at *(type+8):
| Tag | Type |
|---|---|
| 1-6 | Floating-point (1=half, 2=float, 3=double, 4=fp80, 5=fp128, 6=bf16) |
| 11 | Integer (bit-width encoded in upper bits) |
| 15 | Pointer |
| 16 | Vector/aggregate |
The test (tag - 1) > 5 is an unsigned range check meaning "NOT a float" (tags 1-6 are the float types).
Tobool patterns
When the destination type is i1 (bool), the codegen produces comparison-against-zero:
Integer/float source (tags 1-6, 11):
%tobool = icmp ne i32 %val, 0 ; integer source
%tobool = fcmp une float %val, 0.0 ; float source
Float-to-bool uses fcmp une (unordered not-equal), which returns true for any non-zero value including NaN. Integer-to-bool uses icmp ne with a zero constant of matching type.
Pointer source (tag 15):
%tobool = icmp ne ptr %val, null
A shortcut exists: if the source expression is already a comparison result (opcode 61) and the source is already the bool type, the comparison result is returned directly without creating a new instruction.
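The une choice matters for NaN: an unordered comparison is true when either operand is NaN, so converting NaN to bool yields true. C's `!=` on floating-point operands has exactly these semantics, which the hypothetical helpers below demonstrate:

```c
#include <assert.h>
#include <math.h>

/* fcmp une is true for unordered operands, so (bool)NaN is true.
 * C's != on doubles compiles to the same une-shaped comparison. */
static int tobool_f(double v) { return v != 0.0; } /* fcmp une shape */
static int tobool_i(int v)    { return v != 0; }   /* icmp ne shape  */
```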
Integer-to-integer (trunc / zext / sext)
The helper sub_15FE0A0 internally selects the operation based on relative widths:
- dest_width < src_width -> trunc
- dest_width > src_width AND unsigned -> zext
- dest_width > src_width AND signed -> sext
All produce a value named "conv".
Pointer casts
Pointer-to-pointer: In LLVM opaque-pointer mode (which CICC v13 uses for modern SMs), same-address-space casts hit the identity return path and produce no IR. Cross-address-space casts use addrspacecast (opcode 47).
Pointer-to-integer: ptrtoint (opcode 45). Asserts that the destination is actually an integer type.
Integer-to-pointer: A two-step process. First, the integer is widened or narrowed to the pointer bit-width (32 or 64, obtained via sub_127B390). Then inttoptr (opcode 46) converts the properly-sized integer to a pointer:
%conv1 = zext i32 %val to i64 ; step 1: widen to pointer width
%conv = inttoptr i64 %conv1 to ptr ; step 2: int -> ptr
Float-to-integer and integer-to-float
Two paths exist for these conversions:
Standard path: Uses LLVM's native cast opcodes. Triggered when the global flag unk_4D04630 is set (relaxed rounding mode), or when the destination is 128-bit, or when the source is fp128:
| Direction | Signed opcode | Unsigned opcode |
|---|---|---|
| int -> float | sitofp (39) | uitofp (40) |
| float -> int | fptosi (41) | fptoui (42) |
NVIDIA intrinsic path: For SM targets that require round-to-zero semantics on float-int conversions. Constructs an intrinsic function name dynamically and emits it as a plain function call:
// Name construction pseudocode
char buf[64];
if (src_is_double) strcpy(buf, "__nv_double");
else strcpy(buf, "__nv_float");
strcat(buf, is_unsigned ? "2u" : "2");
if (dest_bits == 64) strcat(buf, "ll_rz");
else strcat(buf, "int_rz");
Producing names like:
| Intrinsic | Conversion |
|---|---|
__nv_float2int_rz | f32 -> i32, signed, round-to-zero |
__nv_float2uint_rz | f32 -> u32, unsigned, round-to-zero |
__nv_double2ll_rz | f64 -> i64, signed, round-to-zero |
__nv_double2ull_rz | f64 -> u64, unsigned, round-to-zero |
__nv_float2ll_rz | f32 -> i64, signed, round-to-zero |
These are emitted as plain LLVM function calls (call i32 @__nv_float2int_rz(float %val)), not as LLVM intrinsics. The NVIDIA PTX backend later pattern-matches these __nv_ calls to cvt.rz.* PTX instructions. The intrinsic call is created by sub_128A3C0, which builds a function type, looks up or creates the declaration in the module, and emits a CallInst with one argument.
If the source integer is 32-bit but the target needs 64-bit conversion, the function first converts i32 to i64, then recursively calls itself to convert i64 to the target float type.
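The name-construction pseudocode above can be made runnable directly. This sketch (hypothetical helper name `nv_conv_name`) reproduces the three concatenation steps and nothing else:

```c
#include <assert.h>
#include <string.h>

/* Runnable version of the __nv_* name-construction pseudocode:
 * source type prefix, signedness infix, destination-width suffix. */
static void nv_conv_name(char *buf, int src_is_double,
                         int is_unsigned, int dest_bits) {
    strcpy(buf, src_is_double ? "__nv_double" : "__nv_float");
    strcat(buf, is_unsigned ? "2u" : "2");
    strcat(buf, dest_bits == 64 ? "ll_rz" : "int_rz");
}
```

Running it over the four signedness/width combinations reproduces the table above exactly.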
Float-to-float (fptrunc / fpext)
The source and destination type tags are compared directly. If the destination tag is larger (wider float), opcode 44 (fpext) is used. If smaller, opcode 43 (fptrunc).
%conv = fpext float %val to double ; float -> double
%conv = fptrunc double %val to float ; double -> float
Cast control flow summary
EmitCast(ctx, expr, is_unsigned, destTy, is_unsigned2, skip, diag)
|
+-- skip_flag set --> return 0
+-- destTy == BoolType?
| +-- src is float --> fcmp une %val, 0.0 "tobool"
| +-- src is ptr/int --> icmp ne %val, null/0 "tobool"
+-- srcTy == destTy --> return expr (identity)
+-- ptr -> ptr --> bitcast(47) "conv"
+-- ptr -> int --> ptrtoint(45) "conv"
+-- int -> ptr --> resize + inttoptr(46) "conv"
+-- int -> int --> trunc/zext/sext "conv"
+-- int -> float
| +-- standard --> sitofp(39)/uitofp(40) "conv"
| +-- nvidia --> __nv_*2*_rz call "call"
+-- float -> int
| +-- standard --> fptosi(41)/fptoui(42) "conv"
| +-- nvidia --> __nv_*2*_rz call "call"
+-- float -> float
+-- wider --> fpext(44) "conv"
+-- narrower --> fptrunc(43) "conv"
IR Instruction Infrastructure
BB insertion linked list
After creating any LLVM instruction, it must be inserted into the current basic block. This appears ~30 times across the expression codegen functions as a doubly-linked intrusive list manipulation. The low 3 bits of list pointers carry tag/flag bits (alignment guarantees valid pointers have zero in those positions):
// Repeated BB insertion pattern
Value *tail = ctx[1][1]; // current BB's instruction list tail
if (tail) {
Value *sentinel = ctx[1][2]; // sentinel node
InsertIntoBB(tail + 40, inst); // sub_157E9D0
// Linked list fixup (doubly-linked with 3-bit tag):
inst->prev = (*sentinel & ~7) | (inst->prev & 7); // preserve tag bits
inst->parent = sentinel;
((*sentinel & ~7) + 8) = inst + 24; // old_tail.next = inst
*sentinel = (*sentinel & 7) | (inst + 24); // sentinel.head = inst
}
Instruction offsets: +24 = prev pointer, +32 = parent block, +48 = debug location metadata slot.
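The low-3-bit tagging trick relies on 8-byte allocation alignment leaving pointer bits [0:2] free. A minimal standalone sketch of that scheme (helper names are hypothetical, not recovered from the binary):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of low-3-bit pointer tagging: 8-byte-aligned node pointers
 * leave bits [0:2] free for flags. Flags must be masked off before
 * dereferencing and preserved across link updates, as in the
 * (ptr & ~7) / (ptr & 7) pattern above. */
static uintptr_t tag_ptr(void *p, unsigned tag) {
    assert(((uintptr_t)p & 7) == 0 && tag < 8);
    return (uintptr_t)p | tag;
}
static void *untag_ptr(uintptr_t v)  { return (void *)(v & ~(uintptr_t)7); }
static unsigned ptr_tag(uintptr_t v) { return (unsigned)(v & 7); }
```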
Debug metadata attachment
After every BB insertion, debug location metadata is cloned and attached:
SetValueName(inst, &name); // sub_164B780: e.g. "lnot.ext"
Value *debugLoc = *ctx_debug;
if (debugLoc) {
Value *cloned = CloneDebugLoc(debugLoc, 2); // sub_1623A60
if (inst->debugLoc)
ReleaseDebugLoc(inst + 48); // sub_161E7C0: free old
inst->debugLoc = cloned;
if (cloned)
RegisterDebugLoc(cloned, inst + 48); // sub_1623210
}
Global flags
| Address | Purpose |
|---|---|
| dword_4D04720 + dword_4D04658 | Debug info emission control. When both zero, source location is forwarded before dispatch |
| dword_4D04810 | Bitfield optimization flag. When set, enables bassign.tmp alloca path for bitfield assignments |
| unk_4D04630 | When set, forces standard LLVM casts (sitofp/fptosi) instead of __nv_*_rz intrinsics |
| unk_4D04700 | When set, marks tobool results as "potentially inexact" via flag bit |
| unk_4D0463C | Volatile detection flag. When set, queries address volatility |
Helper Function Reference
| Address | Recovered name | Role |
|---|---|---|
| sub_128D0F0 | EmitExpr | Master expression dispatcher (this page) |
| sub_128A450 | EmitCast | All C-level casts |
| sub_127D8B0 | EmitConstExpr | Compile-time constant expressions |
| sub_1282050 | EmitBitfieldStore | Bitfield write (R-M-W) |
| sub_1284570 | EmitBitfieldLoad | Bitfield read (extract) |
| sub_127FEC0 | EmitBoolExpr | Expression to i1 conversion |
| sub_127F650 | EmitLiteral | Numeric/string literal emission |
| sub_1286D80 | EmitAddressOf | Compute pointer to lvalue |
| sub_1287CD0 | EmitLoadFromAddress | Load via computed address |
| sub_1287ED0 | EmitCompoundAssign | Generic compound assignment |
| sub_128C390 | EmitIncDec | Pre/post increment/decrement |
| sub_128F9F0 | EmitBinaryArithCmp | Binary arithmetic and comparison |
| sub_128F580 | EmitShiftOrBitwise | Shift and bitwise operators |
| sub_128B750 | EmitSubscriptOp | Array subscript (GEP + load) |
| sub_128FDE0 | EmitSizeofAlignof | sizeof and alignof operators |
| sub_12901D0 | EmitCompoundAssignWrapper | Wrapper dispatching to per-operator impl |
| sub_1296570 | EmitCall | Function call emission |
| sub_12897E0 | EmitBitfieldStore (inner) | Actual bitfield store logic |
| sub_127A030 | GetLLVMType | EDG type to LLVM type translation |
| sub_127F680 | CanUseFastBitfieldPath | Bitfield path selector |
| sub_128A3C0 | EmitIntrinsicConvCall | __nv_*_rz intrinsic call helper |
| sub_12A4D50 | CreateBasicBlock | Create named BB |
| sub_12A4DB0 | EmitCondBranch | Conditional branch emission |
| sub_12909B0 | EmitUnconditionalBranch | Unconditional branch emission |
| sub_1290AF0 | SetInsertPoint | Switch current BB |
| sub_15FB440 | CreateBinOp | Binary instruction creation |
| sub_15FDBD0 | CreateCast | Cast instruction creation (IR path) |
| sub_15A46C0 | ConstantExprCast | Cast (constant-fold path) |
| sub_15A0680 | ConstantInt::get | Integer constant creation |
| sub_159C0E0 | ConstantInt::get (APInt) | Wide integer constant creation |
| sub_159CCF0 | ConstantFP::get | Float constant creation |
| sub_128B370 | EmitLoad | Load with volatile/type/srcloc |
| sub_128BE50 | EmitCommaOp | Comma operator RHS extraction |
| sub_1289860 | ComputeCompositeMemberAddr | Multi-level GEP for nested fields |
| sub_12843D0 | EmitComplexMemberLoad | Nested struct/union field load |
| sub_127FF60 | EmitStmtExpr | Statement expression body emission |
| sub_1281200 | EmitSpecialConst | Special constant materialization |
| sub_1281220 | EmitInitExpr | Init expression emission |
| sub_1285E30 | EmitBlockAddress | blockaddress / indirect branch |
| sub_1286000 | EmitVaArg | va_arg lowering |
| sub_127FC40 | CreateAlloca | Alloca with name and alignment |
| sub_127B420 | IsAddressOfExpr | Check if child is & (for elision) |
| sub_127B3A0 | IsVolatile | Volatile type query |
| sub_127B390 | GetSMVersion | Returns current SM target |
| sub_127B460 | IsPacked | Packed struct type query |
| sub_127B550 | FatalDiag | Fatal diagnostic (never returns) |
| sub_127C5E0 | AttachDebugLoc | Debug location attachment |
| sub_127D2C0 | ConstantFromType | Type-level constant (sizeof, etc.) |
| sub_12A4D00 | LookupLabel | Label resolution for goto/address |
| sub_1648A60 | AllocateInstruction | Raw instruction memory allocation |
| sub_1648B60 | AllocatePHI | PHI node memory allocation |
| sub_164B780 | SetValueName | Assigns %name to IR value |
| sub_157E9D0 | InsertIntoBasicBlock | BB instruction list insertion |
| sub_1623A60 | CloneDebugLoc | Debug location cloning |
| sub_1623210 | RegisterDebugLoc | Debug location list registration |
| sub_161E7C0 | ReleaseDebugLoc | Debug location list removal |
| sub_15F1EA0 | InitInstruction | Instruction field initialization |
| sub_15F1F50 | InitPHINode | PHI node initialization (opcode 53) |
| sub_15F2350 | SetExactFlag | Mark sdiv/udiv as exact |
| sub_15F55D0 | GrowOperandList | Realloc PHI operand array |
| sub_15FEC10 | CreateCmpInst | ICmp/FCmp instruction creation |
| sub_15FE0A0 | CreateIntResize | Trunc/zext/sext helper |
| sub_15FB630 | CreateUnaryOp | Unary NOT (xor -1) |
| sub_15F9CE0 | SetGEPOperands | GEP operand filling |
| sub_15FA2E0 | SetInBoundsFlag | Mark GEP as inbounds |
| sub_8D23B0 | IsArrayType | Array type check |
| sub_72B0F0 | EvaluateConstantExpr | EDG constant evaluation |
| sub_731770 | NeedsBitfieldTemp | Bitfield temp alloca check |
Constant expression helper functions
| Address | Recovered name | Role |
|---|---|---|
| sub_127D8B0 | EmitConstExpr | Master constant expression emitter |
| sub_127D000 | ResizeConstant | Resize constant to target type |
| sub_127D120 | DestroyAPFloatElement | APFloat cleanup in aggregate loop |
| sub_127D2E0 | PushElementBulk | Bulk push to element vector |
| sub_127D5D0 | PushElement | Single push to element vector |
| sub_1277B60 | GetFieldIndex | Struct field index query |
| sub_1276020 | GetOrCreateGlobalVar | Global variable creation/lookup |
| sub_1277140 | GetOrCreateFunction | Function creation/lookup |
| sub_1280350 | LookupFunctionStaticVar | Static local variable resolution |
| sub_126A1B0 | CreateStringGlobalConst | Global string constant creation |
| sub_1598F00 | ConstantAggregateZero::get | Zero-initialized aggregate |
| sub_15991C0 | ConstantDataArray::getRaw | Raw byte array constant |
| sub_159DFD0 | ConstantArray::get | Typed array constant |
| sub_159F090 | ConstantStruct::get | Struct constant |
| sub_15943F0 | StructType::get | Anonymous struct type |
| sub_15A06D0 | Constant::getNullValue | Zero constant for any type |
| sub_15A0A60 | ConstantExpr::extractvalue | Sub-element extraction |
| sub_15A2E80 | ConstantExpr::getGEP | Constant GEP expression |
| sub_15A4510 | ConstantExpr::getBitCast | Constant bitcast |
| sub_15A4A70 | ConstantExpr::getAddrSpaceCast | Constant addrspacecast |
| sub_15A4180 | ConstantExpr::getPtrToInt | Constant ptrtoint |
| sub_15A8020 | StructLayout::getElemContainingOffset | Bitfield byte lookup |
| sub_15A9930 | DataLayout::getStructLayout | Struct layout query |
| sub_620E90 | edg::IsSignedIntConst | Signedness query |
| sub_620FA0 | edg::GetSignedIntValue | Signed integer extraction |
| sub_620FD0 | edg::GetUnsignedIntValue | Unsigned integer extraction |
| sub_622850 | edg::GetIntConstAsString | __int128 decimal string extraction |
| sub_622920 | edg::ExtractFieldOffset | Field offset extraction |
| sub_709B30 | edg::ExtractFloatBits | Float raw bits extraction |
| sub_722AB0 | edg::ReadIntFromBuffer | Endian-aware integer read |
| sub_8D4050 | edg::GetArrayElementType | Array element type query |
| sub_8D4490 | edg::GetArrayElementCount | Array dimension query |
LLVM Opcode Constants
Numeric opcode constants used in CreateBinOp, CreateCast, and instruction creation calls throughout the expression codegen:
| Number | LLVM instruction | Used by |
|---|---|---|
| 13 | sub | Pointer subtraction step 4 |
| 18 | sdiv | Pointer subtraction step 5 (with exact flag) |
| 32 | shl | Left shift (<<) |
| 33 | ashr / lshr | Right shift (>>, signedness-dependent) |
| 34 | and (FP variant) | Bitwise AND |
| 35 | or (FP variant) | Bitwise OR |
| 36 | xor (FP variant) | Bitwise XOR |
| 37 | zext | Zero-extend (bool-to-int, lnot.ext, land.ext) |
| 38 | and | Bitwise AND (integer) |
| 39 | sitofp / or | Signed int-to-float / bitwise OR (integer) |
| 40 | uitofp / xor | Unsigned int-to-float / bitwise XOR (integer) |
| 41 | fptosi / funnel shift | Signed float-to-int / rotate |
| 42 | fptoui | Unsigned float-to-int |
| 43 | fptrunc | Float-to-float truncation |
| 44 | fpext | Float-to-float extension |
| 45 | ptrtoint | Pointer-to-integer cast |
| 46 | inttoptr | Integer-to-pointer cast |
| 47 | bitcast / addrspacecast | Pointer casts |
| 51 | ICmp instruction kind | Integer comparison creation |
| 52 | FCmp instruction kind | Float comparison creation |
| 53 | PHI node kind | PHI creation for &&, ||, ?: |
PHI Node Construction Detail
PHI nodes are used by three expression types: logical AND (0x57), logical OR (0x58), and ternary (0x67). The construction sequence is identical across all three:
- Allocate: AllocatePHI (sub_1648B60) with 64 bytes.
- Initialize: InitPHINode (sub_15F1F50) with opcode 53 (PHI), type, and zero for parent/count/incoming.
- Set capacity: *(phi+56) = 2 -- two incoming edges.
- Set name: SetValueName (sub_164B780) with "land.ext", "lor.ext", or "cond".
- Reserve slots: sub_1648880(phi, 2, 1) -- reserve 2 incoming at initial capacity 1.
Adding each incoming value:
count = *(phi+20) & 0xFFFFFFF; // current operand count
if (count == *(phi+56)) // capacity full?
GrowOperandList(phi); // sub_15F55D0: realloc
new_idx = (count + 1) & 0xFFFFFFF;
*(phi+20) = new_idx | (*(phi+20) & 0xF0000000); // update count, preserve flags
// Large-mode flag at *(phi+23) & 0x40 selects operand array location:
base = (*(phi+23) & 0x40) ? *(phi-8) : phi_alloc_base - 24*new_idx;
// Value slot: base + 24*(new_idx-1) — 24 bytes per slot (value ptr + use-list pointers)
slot = base + 24*(new_idx - 1);
*slot = value; // incoming value
slot[1] = value.use_next; // link into value's use-list
slot[2] = &value.use_head | (slot[2] & 3);
value.use_head = slot;
// Basic block slot: stored after all value slots as parallel array
bb_offset = base + 8*(new_idx-1) + 24*num_incoming + 8;
*bb_offset = incoming_bb;
The PHI operand layout is [val0, val1, ..., bb0, bb1, ...] where each value slot occupies 24 bytes (value pointer + doubly-linked use-list pointers), and basic block pointers form a parallel 8-byte array after all value slots.
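The layout can be restated as offset arithmetic. These helpers are hypothetical (not recovered names), assuming N incoming edges and offsets relative to the start of the operand array:

```c
#include <assert.h>
#include <stddef.h>

/* The [val0..valN-1, bb0..bbN-1] layout above: 24-byte value slots
 * (value pointer + two use-list links) followed by a parallel array
 * of 8-byte basic-block pointers. */
static size_t phi_value_slot(size_t i)                 { return 24 * i; }
static size_t phi_bb_slot(size_t i, size_t n_incoming) { return 24 * n_incoming + 8 * i; }
```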
Diagnostic String Index
| String | Origin function | Trigger |
|---|---|---|
"unsupported expression!" | EmitExpr (sub_128D0F0) | Default case in outer switch |
"unsupported operation expression!" | EmitExpr (sub_128D0F0) | Default case in inner switch |
"constant expressions are not supported!" | EmitConstExpr (sub_127D8B0) | Unsupported context kind (sub_6E9180 returns true) |
"unsupported constant variant!" | EmitConstExpr (sub_127D8B0) | Unknown constant kind in main switch; also byte != 0/1/2 in address-of |
"unsupported float variant!" | EmitConstExpr (sub_127D8B0) | Float kind 5, or kind < 2 |
"long double" / "__float80" / "__float128" | EmitConstExpr (sub_127D8B0) | Warning 0xE51: extended precision truncated to double on CUDA target |
"failed to lookup function static variable" | EmitConstExpr (sub_127D8B0) | Function static address with type tag > 0x10 |
"taking address of non-string constant is not supported!" | EmitConstExpr (sub_127D8B0) | &literal where literal kind != 2 (non-string) |
"unsupported cast from address constant!" | EmitConstExpr (sub_127D8B0) | Type mismatch that is not ptr-to-ptr or ptr-to-int |
"unsupported aggregate constant!" | EmitConstExpr (sub_127D8B0) | Type tag not in {8, 10, 11, 12} for aggregate case |
"initialization of bit-field in union not supported!" | EmitConstExpr (sub_127D8B0) | Union initializer targeting a named bitfield |
"cannot find initialized union member!" | EmitConstExpr (sub_127D8B0) | Union field chain exhausted without finding target |
"bit-field constant must have a known value at compile time!" | EmitConstExpr (sub_127D8B0) | Bitfield initializer evaluates to NULL |
"unexpected error while initializing bitfield!" | EmitConstExpr (sub_127D8B0) | Pre-existing byte in struct is not zero when packing |
"unexpected non-integer type for cast from pointer type!" | EmitCast (sub_128A450) | ptrtoint destination is not integer |
"unexpected destination type for cast from pointer type" | EmitCast (sub_128A450) | inttoptr source is not integer |
"error generating code for loading from bitfield!" | EmitBitfieldLoad (sub_1284570) | Alignment assertion failure |
"expected result type of bassign to be void!" | EmitExpr (sub_128D0F0) | Bitfield assign result type validation |
Cross-References
- IRGen Types -- type translation from EDG to LLVM
- Statement Codegen -- statement-level emission that calls into EmitExpr
- Cast Codegen detail -- EmitCast subsystem
- Diagnostics -- diagnostic emission infrastructure
- Address Spaces -- NVPTX address space model affecting pointer casts
Statement & Control Flow Codegen
The statement code generator converts EDG IL statement nodes into LLVM IR basic blocks and terminators. It is the control flow backbone of NVVM IR generation: every if, while, for, switch, goto, return, and compound block passes through a single recursive dispatcher (sub_9363D0) that reads a statement-kind byte and fans out to 17 specialized handlers. Each handler creates named basic blocks following a fixed naming convention, connects them with conditional or unconditional branches, and attaches metadata for branch prediction and loop optimization. Understanding this subsystem means understanding exactly how C/CUDA source-level control flow maps to the LLVM IR that downstream optimization passes will transform.
Binary coordinates: Handlers span 0x930000--0x948000 (~96 KB). The dispatcher itself is at 0x9363D0; the most complex handler (try/catch at sub_932270) is 57 KB alone.
Statement Dispatcher -- sub_9363D0 (emitStmt)
void emitStmt(CGModule *cg, StmtNode *stmt);
The dispatcher is the only entry point for statement lowering. All control flow handlers, compound statements, and even the top-level function body driver call emitStmt recursively.
Entry logic:
1. If cg->currentBB (offset +96) is NULL, create an anonymous unreachable basic block via createBB("") and insert it. This is the "dead code after return" safety net -- it ensures the IR builder always has an insertion point, even for unreachable code that follows a return or goto.
2. Read stmt->stmtKind (byte at StmtNode offset +40).
3. Special fast path: if kind == 8 (return), call setDebugLoc + pushScope + emitReturnStmt and return immediately. Returns get priority handling because they terminate the current BB and may trigger cleanup scope unwinding.
4. General path: setDebugLoc + pushScope, then dispatch on kind through a switch table.
Kind Dispatch Table
| Kind | Statement type | Handler | Address |
|---|---|---|---|
| 0 | Expression statement | emitExprStmt | sub_921EA0 |
| 1 | if statement | emitIfStmt | sub_937020 |
| 2 | if constexpr (C++17) | emitConstexprIf | sub_936F80 |
| 5 | while loop | emitWhile | sub_937180 |
| 6 | goto | emitGoto | sub_931270 |
| 7 | Label statement | emitLabel | sub_930570 |
| 8 | return | emitReturn | sub_9313C0 |
| 11 | Compound { ... } | emitCompound | sub_9365F0 |
| 12 | do-while loop | emitDoWhile | sub_936B50 |
| 13 | for loop | emitFor | sub_936D30 |
| 15 | case label | emitCase | sub_935670 |
| 16 | switch statement | emitSwitch | sub_9359B0 |
| 17 | Variable declaration | emitDeclStmt | sub_9303A0 |
| 18 | try/catch | emitTryCatch | sub_932270 |
| 20 | Cleanup/destructor scope | emitCleanupScope | sub_931670 |
| 24 | Null/empty statement | (return immediately) | -- |
| 25 | Expression statement (alt) | emitExprStmt | sub_921EA0 |
Kinds 0 and 25 share the same handler. The split likely distinguishes C expression-statements from GNU statement-expressions or a similar EDG internal distinction. Any unrecognized kind triggers fatal("unsupported statement type").
Gaps in the numbering (3, 4, 9, 10, 14, 19, 21--23) either correspond to statement types handled entirely in the EDG frontend (lowered before codegen sees them) or are reserved for future use.
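The dispatch logic can be modeled directly from the kind table. The following is a minimal Python sketch, not the decompiled code: handler names mirror the recovered symbols above, while the function signature and return values are illustrative.

```python
# Model of the emitStmt dispatch (sub_9363D0): kind byte -> handler,
# with the return fast path and the "unsupported statement type" fatal.
HANDLERS = {
    0: "emitExprStmt", 1: "emitIfStmt", 2: "emitConstexprIf",
    5: "emitWhile", 6: "emitGoto", 7: "emitLabel", 8: "emitReturn",
    11: "emitCompound", 12: "emitDoWhile", 13: "emitFor",
    15: "emitCase", 16: "emitSwitch", 17: "emitDeclStmt",
    18: "emitTryCatch", 20: "emitCleanupScope", 25: "emitExprStmt",
}

def emit_stmt(kind: int) -> str:
    if kind == 24:                 # null/empty statement: no codegen at all
        return "(nop)"
    if kind == 8:                  # fast path: return terminates the current BB
        return "emitReturn"
    handler = HANDLERS.get(kind)
    if handler is None:            # gaps 3, 4, 9, 10, 14, 19, 21-23 land here
        raise ValueError("unsupported statement type")
    return handler
```

Note that kinds 0 and 25 resolve to the same handler entry, matching the shared sub_921EA0 target in the table.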
If Statement -- sub_937020
Reads from the StmtNode: condition expression at offset +48, then-body at +72, else-body at +80 (may be NULL).
BB Layout: if/else
┌─────────────────────┐
│ current BB │
│ %cond = ... │
│ br i1 %cond, │
│ label %if.then, │
│ label %if.else │
└──┬──────────────┬────┘
│ │
▼ ▼
┌──────────┐ ┌──────────┐
│ if.then │ │ if.else │
│ <then> │ │ <else> │
│ br %end │ │ br %end │
└────┬─────┘ └────┬─────┘
│ │
▼ ▼
┌─────────────────────┐
│ if.end │
└─────────────────────┘
BB Layout: if without else
┌─────────────────────┐
│ current BB │
│ %cond = ... │
│ br i1 %cond, │
│ label %if.then, │
│ label %if.end │
└──┬──────────────┬────┘
│ │
▼ │
┌──────────┐ │
│ if.then │ │
│ <then> │ │
│ br %end │ │
└────┬─────┘ │
│ │
▼ ▼
┌─────────────────────┐
│ if.end │
└─────────────────────┘
LLVM IR pseudocode:
%cond = icmp ne i32 %x, 0 ; evalCondition: convert to i1
br i1 %cond, label %if.then, label %if.else, !prof !0
if.then:
; ... then-body codegen ...
br label %if.end
if.else:
; ... else-body codegen ...
br label %if.end
if.end:
; continues here
!0 = !{!"branch_weights", i32 2000, i32 1} ; if __builtin_expect(x, 1)
Branch Weight Metadata
sub_92F9D0 examines __builtin_expect annotations on branch bodies by checking bit flags at StmtNode offset +41:
| Flag | Source annotation | Weight encoding | Metadata attached |
|---|---|---|---|
| bit 0x10 | __builtin_expect(x, 1) -- likely | weightHint = 1 | !{!"branch_weights", i32 2000, i32 1} |
| bit 0x20 | __builtin_expect(x, 0) -- unlikely | weightHint = 2 | !{!"branch_weights", i32 1, i32 2000} |
| neither | no annotation | weightHint = 0 | (no metadata) |
The 2000:1 ratio represents 99.95% prediction confidence. For compound statements (kind 11), the function recurses into the compound's first child statement to find the annotation.
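The flag decoding in sub_92F9D0 reduces to a small amount of bit testing. This sketch reproduces the weight table above; the constant names and return shape are illustrative, only the bit values and metadata strings come from the recovered binary.

```python
# Decode __builtin_expect flags (StmtNode offset +41) into the weight hint
# and the !prof metadata string attached to the conditional branch.
LIKELY, UNLIKELY = 0x10, 0x20

def branch_weights(flags: int):
    if flags & LIKELY:             # __builtin_expect(x, 1)
        return 1, '!{!"branch_weights", i32 2000, i32 1}'
    if flags & UNLIKELY:           # __builtin_expect(x, 0)
        return 2, '!{!"branch_weights", i32 1, i32 2000}'
    return 0, None                 # no annotation: no metadata emitted
```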
Constexpr If -- sub_936F80
C++17 if constexpr is fully resolved during EDG frontend semantic analysis. By the time the codegen sees it, only the taken branch body survives. The handler reads a selection record from offset +72: a bit at +24 determines which of two fields contains the surviving body pointer. If non-null, it creates constexpr_if.body and constexpr_if.end BBs and emits the body with an unconditional branch to .end. If null (dead branch entirely eliminated), no codegen occurs at all.
While Loop -- sub_937180
┌─────────────────────┐
│ current BB │
│ br label %while.cond│
└─────────┬───────────┘
│
▼
┌─────────────────────┐◄──────────────┐
│ while.cond │ │
│ %c = ... │ │
│ br i1 %c, │ │
│ label %while.body, │ │
│ label %while.end │ │
└──┬──────────────┬────┘ │
│ │ │
▼ │ │
┌──────────────┐ │ │
│ while.body │ │ │
│ <body> │ │ │
│ br %cond ────┼─────┼────────────────────┘
└──────────────┘ │ backedge with
│ !llvm.loop metadata
▼
┌──────────────┐
│ while.end │
└──────────────┘
LLVM IR pseudocode:
br label %while.cond
while.cond:
%c = icmp slt i32 %i, %n
br i1 %c, label %while.body, label %while.end
while.body:
; ... body codegen ...
br label %while.cond, !llvm.loop !1
while.end:
; continues here
!1 = !{!1, !2} ; self-referential loop ID
!2 = !{!"llvm.loop.mustprogress"}
The backedge branch (br label %while.cond from while.body) always receives !llvm.loop metadata via emitLoopMustProgress (sub_930810). If the loop carries #pragma unroll, additional unroll metadata is merged into the same MDNode (see Loop Metadata below).
Do-While Loop -- sub_936B50
The key structural difference from while: the body executes before the condition. The condition BB follows the body.
┌─────────────────────┐
│ current BB │
│ br label %do.body │
└─────────┬───────────┘
│
▼
┌─────────────────────┐◄──────────────┐
│ do.body │ │
│ <body> │ │
│ br label %do.cond │ │
└─────────┬───────────┘ │
│ │
▼ │
┌─────────────────────┐ │
│ do.cond │ │
│ %c = ... │ │
│ br i1 %c, │ │
│ label %do.body, ──┼───────────────┘
│ label %do.end │ backedge
└──────────────┬──────┘
│
▼
┌──────────────┐
│ do.end │
└──────────────┘
LLVM IR pseudocode:
br label %do.body
do.body:
; ... body codegen ...
br label %do.cond
do.cond:
%c = icmp ne i32 %x, 0
br i1 %c, label %do.body, label %do.end, !llvm.loop !1
do.end:
; continues here
The backedge is the conditional branch in do.cond (true edge back to do.body). Debug location is set separately for the condition expression using the condition node's own source location (offset +36 from the condition expression node).
For Loop -- sub_936D30
The most complex loop handler. Reads four components from the StmtNode: init statement at offset +80 field [0], condition at +48, increment expression at +80 field [1], and body at +72. Any of init, condition, and increment may be NULL.
┌─────────────────────┐
│ current BB │
│ <init statement> │ ← emitted in current BB if non-null
│ br label %for.cond │
└─────────┬───────────┘
│
▼
┌─────────────────────┐◄──────────────┐
│ for.cond │ │
│ %c = ... or true │ │
│ br i1 %c, │ │
│ label %for.body, │ │
│ label %for.end │ │
└──┬──────────────┬────┘ │
│ │ │
▼ │ │
┌──────────────┐ │ │
│ for.body │ │ │
│ <body> │ │ │
│ br %for.inc │ │ │
└──────┬───────┘ │ │
│ │ │
▼ │ │
┌──────────────┐ │ │
│ for.inc │ │ │
│ <increment> │ │ │
│ br %for.cond┼─────┼────────────────────┘
└──────────────┘ │ backedge
▼
┌──────────────┐
│ for.end │
└──────────────┘
LLVM IR pseudocode:
; init: i = 0
store i32 0, ptr %i.addr, align 4
br label %for.cond
for.cond:
%i = load i32, ptr %i.addr, align 4
%cmp = icmp slt i32 %i, %n
br i1 %cmp, label %for.body, label %for.end
for.body:
; ... body codegen ...
br label %for.inc
for.inc:
%i1 = load i32, ptr %i.addr, align 4
%inc = add nsw i32 %i1, 1
store i32 %inc, ptr %i.addr, align 4
br label %for.cond, !llvm.loop !1
for.end:
; continues here
Special cases:
- Null condition: If the condition expression is NULL (e.g., for(;;)), the handler calls ConstantInt::getTrue (sub_ACD6D0) to create an unconditionally-true condition, producing an infinite loop.
- Volatile increment: If the increment expression operates on a volatile pointer (type descriptor & 0xFB == 8 and isVolatile() returns true), the store is marked volatile.
- Scope tracking: Outside "fast codegen" mode (dword_4D04658 == 0), pushes a DW_TAG_lexical_block debug scope at for-loop entry via sub_941230/sub_9415C0 and pops it at exit via sub_93FF00. This generates correct DWARF scoping so debuggers see for-local variables in the right scope.
The for.inc BB is only created when an increment expression exists. If omitted, the body branches directly back to for.cond.
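The block-skeleton rules for the for-loop handler can be summarized in a few lines. This is a structural sketch only (no IR emission); the function name and return shape are invented for illustration, the BB names and the two special cases come from the analysis above.

```python
# Sketch of sub_936D30's BB skeleton: for.inc exists only when an increment
# expression is present, and a NULL condition becomes a constant true.
def for_loop_blocks(has_cond: bool, has_inc: bool):
    blocks = ["for.cond", "for.body"]
    if has_inc:
        blocks.append("for.inc")   # body branches here, then back to for.cond
    blocks.append("for.end")
    backedge_from = "for.inc" if has_inc else "for.body"
    # ConstantInt::getTrue (sub_ACD6D0) for for(;;): infinite loop
    cond = "%c" if has_cond else "i1 true"
    return blocks, backedge_from, cond
```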
Switch Statement -- sub_9359B0
The largest control flow handler after try/catch (~550 decompiled lines). Uses a three-phase approach with an internal open-addressing hash table.
Phase 1: Build case-to-BB mapping
Iterates the case list (linked list at stmt[10]+16, next pointer at +32). For each case label, creates a switch_case.target BB. Also creates one switch_case.default_target BB for the default case. Stores the mapping in an open-addressing hash table at CGModule offsets +496 through +520.
Hash table layout (32-byte entries):
| CGModule offset | Field |
|---|---|
| +496 | numEntries |
| +504 | bucket array pointer |
| +512 | numOccupied |
| +516 | numTombstones |
| +520 | capacity |
Uses the standard DenseMap infrastructure with LLVM-layer sentinels (-4096 / -8192). See Hash Table and Collection Infrastructure for the hash function and growth policy.
Phase 2: Emit LLVM SwitchInst
Evaluates the switch condition via sub_92F410, then creates a SwitchInst via sub_B53A60 (SwitchInst::Create) with the case count, default target BB, and condition value. Each case constant is added via sub_B53E30 (SwitchInst::addCase).
Phase 3: Emit body
Creates a switch_child_entry BB, inserts it, and recursively emits the switch body. If the switch has no explicit default: case, emits a fallthrough to the switch_case.default_target BB.
LLVM IR pseudocode:
%val = load i32, ptr %x.addr
switch i32 %val, label %switch_case.default_target [
i32 0, label %switch_case.target
i32 1, label %switch_case.target1
i32 5, label %switch_case.target2
]
switch_case.target: ; case 0
; ...
br label %switch_child_entry ; fallthrough or break
switch_case.target1: ; case 1
; ...
switch_case.default_target: ; default
; ...
Note that cicc always emits an LLVM switch instruction. The decision to lower a switch into a jump table versus sequential comparisons is made later by the SelectionDAG backend (specifically NVPTXTargetLowering), not during IR generation. The codegen produces a clean, canonical switch and lets the backend optimize the dispatch strategy.
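The first two phases can be modeled compactly. This sketch uses a plain dict in place of the open-addressing table at CGModule +496..+520 and invents the function name and return shape; the switch_case.* BB naming and the always-emit-SwitchInst behavior follow the description above.

```python
# Phases 1-2 of switch lowering (sub_9359B0): pre-allocate one target BB per
# case label, then emit a canonical SwitchInst listing every case plus the
# default. The jump-table-vs-compare decision is left to the backend.
def lower_switch(case_values):
    # Phase 1: case-to-BB mapping (dict stands in for the hash table)
    mapping = {}
    for i, v in enumerate(case_values):
        suffix = "" if i == 0 else str(i)
        mapping[v] = f"switch_case.target{suffix}"
    default_bb = "switch_case.default_target"
    # Phase 2: canonical SwitchInst text, one entry per case
    cases = " ".join(f"i32 {v}, label %{bb}" for v, bb in mapping.items())
    ir = f"switch i32 %val, label %{default_bb} [ {cases} ]"
    return mapping, default_bb, ir
```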
Case Label -- sub_935670
When the recursive statement walk encounters a case label (kind 15), it looks up the parent switch node (asserts stmtKind == 16), finds the pre-allocated target BB from the hash table, and calls insertBB to make it the current insertion point. Fatal error "basic block for case statement not found!" if the hash table lookup fails.
For the default case (identified by a null value at +8), retrieves the last entry in the mapping vector.
Goto and Label Statements
Goto -- sub_931270
Reads the target label from stmt->auxData+128. Fatal error if null: "label for goto statement not found!".
Two code paths based on cleanup state:
Simple goto (no active cleanups, CGModule offset +240 == 0): Resolves the label to its BB via sub_946C80 and emits an unconditional branch.
Goto with cleanups (offset +240 != 0): Before branching, the handler must destroy all local variables whose scope is being exited. Calls sub_9310E0 to compute the destruction set, iterates each variable calling sub_9465D0 to emit destructor calls, resets the cleanup stack, then resolves and branches to the label BB.
; goto with cleanup: jumping out of scope with a std::string local
call void @_ZNSsD1Ev(ptr %str) ; ~string()
br label %label_target
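The two goto paths reduce to one branch on the cleanup state. A minimal sketch, assuming the destruction set has already been computed (the real handler derives it via sub_9310E0); the function name and the generic @dtor callee are illustrative.

```python
# Sketch of sub_931270: with no active cleanups, a bare branch; otherwise
# destructor calls for every local whose scope is exited, then the branch.
def emit_goto(label: str, locals_leaving_scope):
    ir = []
    for var in locals_leaving_scope:           # set computed by sub_9310E0
        ir.append(f"call void @dtor(ptr %{var})")  # emitted via sub_9465D0
    ir.append(f"br label %{label}")
    return ir
```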
Label -- sub_930570
Resolves the label to its BB via sub_946C80 and inserts it as the current basic block via insertBB. The BB name comes from the label's symbol name in the EDG IL.
Computed Goto (GCC &&label Extension)
Computed goto is handled in the expression codegen layer, not the statement dispatcher. Expression kind 0x71 at sub_921EA0 calls EmitBlockAddress (sub_1285E30) to produce an LLVM blockaddress constant, and expression kind 0x70 produces the label-as-value. The resulting indirectbr instruction is lowered later by IndirectBrExpandPass (pipeline parser index 247, "indirectbr-expand") because NVPTX does not natively support indirect branches -- they are expanded into a switch over all possible target labels.
Return Statement -- sub_9313C0
Reads the return expression from StmtNode offset +48. Dispatches on CGModule return-type information (offsets +208 and +216):
Path A -- Aggregate (struct) return: If the return type is aggregate (sub_91B770 returns true), emits a memcpy-like sequence into the sret pointer via sub_947E80. For multi-register returns (offset +216 > 0), uses bit-width analysis (_BitScanReverse64) to determine the return bit layout.
Path B -- Scalar return: Evaluates the expression, creates a ReturnInst via sub_B4D3C0, and may bitcast the value for ABI compliance via sub_AE5020.
Path C -- Void return with expression: Evaluates the expression for side effects only (calls emitExprStmt), then falls through to emit a void return.
Cleanup before return: If cg->hasCleanups (offset +240) is set, calls sub_9310E0 to compute the set of locals requiring destruction, emits destructor calls in reverse order, resets the cleanup stack, then emits an unconditional branch to the function's unified return block (offset +200).
; return with cleanup unwind
call void @_ZN3FooD1Ev(ptr %obj) ; ~Foo()
store i32 %retval, ptr %retval.addr
br label %return
return: ; unified return BB
%0 = load i32, ptr %retval.addr
ret i32 %0
The unified return block pattern means every return in a function branches to a single shared return BB rather than emitting ret directly. This is standard in compilers because it simplifies cleanup handling and produces cleaner IR for optimization.
Try/Catch -- sub_932270
The largest single statement handler at 2,225 decompiled lines (57 KB, 0x3B0 bytes of stack locals). Lowers C++ try/catch into LLVM's landingpad-based exception handling model.
High-level structure:
1. Collect catch handlers: Traverses the linked list at stmt->auxData+136 to build a vector of catch clause pointers.
2. Construct cleanup names: Builds a mangled cleanup function name from the function's symbol (reading the name range from symbol +184/+176). Single $ characters are doubled to $$ for LLVM compatibility.
3. Build dispatch mapping: Creates an outer dispatch vector mapping each catch clause to its target BB, stored in the same open-addressing hash table scheme used by switch.
4. Emit try body: Installs the landingpad/invoke mechanism so that throwing calls within the try body become invoke instructions rather than call instructions.
5. Emit catch handlers: For each catch clause, creates a BB, emits the handler body, and generates the cleanup/resume path.
Note that CUDA device code has exceptions disabled by default (EDG config DEFAULT_EXCEPTIONS_ENABLED = 0). This handler is exercised primarily for host-side code compiled through cicc, or for the rare case where exceptions are explicitly enabled via compiler flags. When exceptions are disabled, the EDG frontend strips try/catch entirely and the codegen never sees kind 18.
The NVVM IR verifier (sub_2C76F10) explicitly rejects landingpad, invoke, and resume instructions in device code, confirming that exception handling is a host-only feature.
Cleanup/Destructor Scope -- sub_931670
Handles statement kind 20. Only active when cg->hasCleanups (offset +240) is set.
Walks a linked list at StmtNode offset +72. For each entry where the byte at +8 equals 7 (indicating a variable with non-trivial destructor):
- Extracts the variable reference at entry[2] (offset +16).
- Checks visibility flags (bits 0x60 at +170, byte +177 != 5) to skip external and static symbols.
- Looks up the variable in the CGModule's var-lookup hash table (offsets +8 through +24) using the same hash function as the switch table.
- If the variable is already registered for cleanup (checked via sub_91CCF0), adds it to the pending cleanup list and emits an immediate destructor call via sub_9465D0.
- If not yet registered, just adds it to the pending list for later processing.
This mechanism ensures that C++ automatic variables with non-trivial destructors are properly destroyed when their scope exits -- whether by normal control flow, goto, return, or exception propagation.
Compound Statement -- sub_9365F0
Handles { ... } blocks (kind 11). This is the workhorse that ties everything together: the function body itself is a compound statement, and every block scope creates a nested compound.
Cleanup frame management: When cg->hasCleanups (offset +240) is set, pushes a new cleanup frame onto the cleanup stack (offset +424). Each frame is a 24-byte record: {pendingDestructors ptr, end, capacity}.
Variable declarations: Iterates local declarations at scope fields [14] and [15] (linked lists). For each local variable, emits an alloca or initializer as needed. If the variable has a non-trivial destructor, registers it in the cleanup set.
Statement iteration: Walks the child statement linked list starting at StmtNode offset +72, following nextStmt pointers at +16. For each child, calls emitStmt(cg, child) recursively. Between statements, checks whether pending cleanups need flushing (temporaries with non-trivial destructors). If new cleanup entries appeared since the last check, iterates them in reverse order and emits destructor calls.
Statement-expression support (GNU extension): For ({...}) expressions, if the last statement in the block is an expression (kind 0 or 25), treats its value as the compound's result. Fatal error: "unexpected: last statement in statement expression is not an expression!" if the last statement is not an expression type.
Scope tracking: Outside fast-codegen mode, pushes a DW_TAG_lexical_block debug scope at entry and pops at exit, so debuggers correctly associate variables with their lexical scope.
Variable Declaration -- sub_9303A0
Reads the variable descriptor from StmtNode offset +72, then the variable's symbol from descriptor +8.
Initialization dispatch (based on byte at symbol +177):
| Value | Meaning |
|---|---|
| 4 | Block-scope static -- fatal("block scope static variable initialization is not supported!") |
| 0, 3 | No dynamic init needed -- skip codegen |
| 2 | Dynamic initialization -- main path |
Dynamic init sub-dispatch (descriptor +48 byte):
| Sub-kind | Handler | Purpose |
|---|---|---|
| 1 | sub_91DAD0 | Load-address style init |
| 2 | sub_91FFE0 | Emit initializer expression |
| 3 | sub_92F410 | Direct expression evaluation |
| other | -- | fatal("unsupported dynamic initialization") |
After computing the initializer value, the handler checks for volatile store qualification, computes alignment via sub_91CB50, retrieves the alloca/global address via sub_9439D0, and emits the store via sub_923130.
%x = alloca i32, align 4 ; from function prologue
%init = call i32 @compute_value() ; dynamic initialization
store i32 %init, ptr %x, align 4 ; emitDeclStmt
Block-scope static variables (static int x = expr;) are explicitly unsupported and fatal. In CUDA device code, block-scope statics have no sensible semantics (no persistent thread-local storage across kernel invocations), so this restriction is intentional.
Loop Metadata: Pragma Unroll and Mustprogress
Pragma Unroll -- sub_9305A0
Called from while, do-while, and for handlers when StmtNode offset +64 (pragma annotation) is non-NULL. Parses "unroll %d" from the pragma string via sscanf.
| Count value | Metadata produced |
|---|---|
| 0x7FFFFFFF (INT_MAX) | !{!"llvm.loop.unroll.full"} |
| Specific N | !{!"llvm.loop.unroll.count", i32 N} |
| <= 0 | fatal("Unroll count must be positive.") |
| Parse failure | fatal("Parsing unroll count failed!") |
The metadata is wrapped in the standard LLVM loop-ID self-referential MDNode pattern:
br label %for.cond, !llvm.loop !3
!3 = !{!3, !4} ; self-ref loop ID
!4 = !{!"llvm.loop.unroll.count", i32 8}
Global flag dword_4D046B4 ("skip pragma" mode) gates this entirely -- when set, sub_9305A0 returns immediately.
Loop Mustprogress -- sub_930810
Called on every loop backedge (while, do-while, for). Creates !{!"llvm.loop.mustprogress"} and attaches it to the backedge branch. If the backedge already has !llvm.loop metadata (from pragma unroll), the existing operands are read and the mustprogress node is appended to create a combined MDNode:
br label %while.cond, !llvm.loop !5
!5 = !{!5, !6, !7} ; merged: self-ref + unroll + mustprogress
!6 = !{!"llvm.loop.unroll.count", i32 4}
!7 = !{!"llvm.loop.mustprogress"}
This metadata tells the LLVM optimizer that loops must make forward progress -- it is allowed to remove provably-infinite side-effect-free loops. This corresponds to the C++ forward progress guarantee required by the standard.
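The pragma parsing and metadata merge can be captured in one small function. This is a sketch of the combined behavior of sub_9305A0 and sub_930810, not a decompilation: the function name and return shape are invented, while the metadata strings, the INT_MAX full-unroll sentinel, and the fatal on non-positive counts come from the tables above.

```python
# Build the operand list of the self-referential loop-ID node attached to a
# loop backedge: optional unroll operand first, mustprogress always last.
def build_loop_md(unroll_count=None):
    ops = []                              # operands after the self-reference
    if unroll_count is not None:
        if unroll_count <= 0:             # fatal("Unroll count must be positive.")
            raise ValueError("Unroll count must be positive.")
        if unroll_count == 0x7FFFFFFF:    # INT_MAX => #pragma unroll (full)
            ops.append('!{!"llvm.loop.unroll.full"}')
        else:
            ops.append(f'!{{!"llvm.loop.unroll.count", i32 {unroll_count}}}')
    ops.append('!{!"llvm.loop.mustprogress"}')  # every backedge gets this
    return ops
```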
Infrastructure Functions
createBB -- sub_945CA0
Allocates an 80-byte BasicBlock object and initializes it with the LLVM context from CGModule offset +40. The name parameter produces the characteristic BB names visible throughout this page: "if.then", "while.cond", "for.inc", "switch_case.target", "constexpr_if.body", etc.
insertBB -- sub_92FEA0
void insertBB(CGModule *cg, BasicBlock *bb, int canDelete);
Finalizes the current BB (emits an implicit unconditional branch to bb if the current BB lacks a terminator), then inserts bb into the function's BB list. If canDelete is 1 and the BB has no predecessors, the BB is immediately freed -- this garbage-collects unreachable continuation blocks (e.g., if.end when both branches terminate, while.end when the loop is infinite).
The canDelete=1 flag is used for if.end, while.end, for.end, and do.end BBs.
finalizeBB / emitBr -- sub_92FD90
If the current BB exists and its last instruction is NOT a terminator (opcode check: opcode - 30 > 10 filters out br, ret, switch, etc.), creates a BranchInst to the target BB and inserts it. Then clears cg->currentBB and the insert point.
emitCondBr -- sub_945D00
Creates a conditional BranchInst with true/false targets and optional branch weight metadata. When weightHint != 0, attaches !prof branch_weights metadata via MDBuilder::createBranchWeights.
evalCondition -- sub_921E00
Evaluates a condition expression and converts the result to i1. Checks for aggregate types (fatal error if the condition is an aggregate), determines signedness, evaluates the expression, then emits icmp ne 0 (integer) or fcmp une 0.0 (floating point) to produce a boolean.
EDG StmtNode Layout
Reconstructed from usage patterns across all statement handlers:
| Offset | Size | Field |
|---|---|---|
| +0 | 4 | Source location: line number |
| +4 | 2 | Source location: column number |
| +16 | 8 | nextStmt -- linked list pointer |
| +40 | 1 | stmtKind -- enum value (0--25 observed) |
| +41 | 1 | Flags (bit 0x10 = likely, bit 0x20 = unlikely) |
| +48 | 8 | exprPayload / condition expression pointer |
| +64 | 8 | Pragma annotation (NULL or "unroll N" string) |
| +72 | 8 | auxData -- kind-specific (then-body, label, variable descriptor, etc.) |
| +80 | 8 | auxData2 -- kind-specific (else-body for if, init/increment for for, etc.) |
CGModule Offsets Used by Statement Codegen
| Offset | Size | Field |
|---|---|---|
| +8 | 8 | varLookupTable.buckets |
| +24 | 4 | varLookupTable.capacity |
| +40 | 8 | llvmContext |
| +96 | 8 | currentBB (BasicBlock pointer) |
| +104 | 8 | insertPoint |
| +192 | 8 | currentFunction (Function pointer) |
| +200 | 8 | returnBlock (unified return BB) |
| +208 | 8 | returnValue / sret pointer |
| +216 | 4 | returnAlignment |
| +240 | 1 | hasCleanups flag |
| +248 | -- | cleanupSet (DenseSet tracking which vars need cleanup) |
| +424 | 8 | cleanupStack pointer (24-byte frames) |
| +496 | 8 | switchHashTable.count |
| +504 | 8 | switchHashTable.buckets |
| +512 | 4 | switchHashTable.numOccupied |
| +516 | 4 | switchHashTable.numTombstones |
| +520 | 4 | switchHashTable.capacity |
| +528 | 8 | currentScope pointer |
Global Mode Flags
| Global | Purpose |
|---|---|
| dword_4D04658 | Fast codegen mode. Skips debug location emission, scope tracking, and some pragma processing. Corresponds to -G0 or an equivalent "no debug" mode. |
| dword_4D046B4 | Skip pragma mode. emitUnrollPragma returns immediately. Also gates some compound-statement declaration processing. |
| dword_4F077C4 | CUDA compilation mode. Value 2 triggers alternate volatile-qualification logic in for-loop increment and variable declaration codegen. |
Complete BB Naming Reference
Every basic block created by the statement codegen uses one of these exact names:
| Statement type | BB names created |
|---|---|
| if | if.then, if.else, if.end |
| if constexpr | constexpr_if.body, constexpr_if.end |
| while | while.cond, while.body, while.end |
| do-while | do.body, do.cond, do.end |
| for | for.cond, for.body, for.inc, for.end |
| switch | switch_case.target (per case), switch_case.default_target, switch_child_entry |
| goto / label | (named from label symbol) |
| return | (branch to unified return block) |
| compound { } | (no BBs unless cleanup) |
| dead code | "" (anonymous unreachable BB) |
These names survive into the final LLVM IR dump (-Xcuda-ptxas=-v) and are visible in optimization pass debug output. Recognizing them immediately tells you which source-level construct produced a given IR region.
Function, Call & Inline Asm Codegen
This page covers the four subsystems that together translate CUDA/C++ function definitions and call sites into LLVM IR: function prolog generation, call instruction emission, inline assembly compilation, and builtin lowering. The code lives in the 0x930000--0x960000 address range (Path A) with a parallel copy at 0x1270000--0x12D0000 (Path B).
| EmitFunction | sub_946060 (Path A) -- creates entry BB, allocapt sentinel, dispatches to prolog |
| GenerateFunctionProlog | sub_938240 (16 KB) -- parameter iteration, ABI dispatch, alloca emission |
| EmitCallExpr | sub_93CB50 (1,293 lines) -- type resolution, ABI classification, call emission |
| EmitInlineAsm | sub_1292420 (53 KB, 2,087 lines) -- 7-phase asm template-to-IR pipeline |
| BuiltinLowering | sub_12B3FD0 (103 KB, 3,409 lines) -- mega-switch over ~250 builtin IDs |
| EmitFunctionAttrs | sub_12735D0 / sub_1273F90 -- grid_constant, preserve_n, custom ABI metadata |
Function Prolog: Entry Block Setup
Every LLVM function produced by cicc starts with the same structural skeleton: an entry basic block containing a sentinel instruction, a cluster of alloca instructions for parameters and locals, and a return basic block for the unified exit path. The outer driver EmitFunction (sub_946060) builds this skeleton; the inner workhorse GenerateFunctionProlog (sub_938240) populates it with parameter handling code.
EmitFunction -- The Outer Driver
EmitFunction executes a fixed 10-step initialization sequence before tail-calling into the prolog generator:
EmitFunction(IRGenState *S, FunctionDecl *Decl, Function *F,
ParamList *Params, TypeInfoArray *TI, SourceLoc Loc, bool ByvalDemotion):
1. Resolve function type through typedef chain (kind==12 -> follow offset+160)
2. Call SetupFunctionMetadata(S, Decl)
3. Optionally set section name on F via Value::setSection
4. Create "entry" basic block:
entryBB = BasicBlock::Create(S, "entry", F, nullptr)
5. Create the "allocapt" sentinel instruction:
voidTy = Type::getVoidTy(ctx)
undef = UndefValue::get(voidTy)
allocapt = new BitCastInst(undef, voidTy) // void-to-void no-op
entryBB->getInstList().push_back(allocapt)
allocapt->setName("allocapt")
S->AllocaInsertPt = allocapt // stored at IRGenState+456
6. Create "return" basic block:
retBB = BasicBlock::Create(S, "return", nullptr, nullptr)
S->ReturnBlock = retBB // stored at IRGenState+200
7. Set up return value slot:
if returnType is void:
S->RetVal = nullptr
elif ABI kind == 2 (sret) AND isAggregate(returnType):
S->RetVal = F->arg_begin() // reuse the sret pointer
else:
S->RetVal = CreateTmpAlloca(S, returnType, "retval")
8. Store alignment of return type at S+216
9. Initialize insertion state: S->CurrentBB = entryBB
10. Tail-call GenerateFunctionProlog(S, Decl, F, Params, TI, Loc, ByvalDemotion)
The allocapt sentinel is the critical mechanism. It is a dead bitcast void undef to void instruction that serves as an insertion anchor. When CreateTmpAlloca (at sub_921D70) is called with no explicit array size -- the common case -- it inserts the new AllocaInst before the allocapt marker rather than at the current builder insertion point. This ensures that all alloca instructions cluster at the top of the entry block regardless of where in the function body they were requested, which is a hard requirement for LLVM's mem2reg pass to promote them to SSA registers.
The sentinel is eventually dead-code-eliminated in a later pass since it produces no usable value.
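The clustering trick is easy to demonstrate with a list-based model. This sketch is purely illustrative (the class and method names are invented); it shows why inserting before a fixed sentinel, rather than at the current builder point, keeps all allocas at the top of the entry block.

```python
# Model of the allocapt anchor: body instructions append at the end, but
# temp allocas insert before the sentinel, so they cluster at block top
# no matter when they are requested -- the property mem2reg requires.
class EntryBlock:
    def __init__(self):
        self.insns = ["allocapt"]   # the dead void-to-void bitcast sentinel

    def create_tmp_alloca(self, name: str):
        # insert before the sentinel, not at the current insertion point
        self.insns.insert(self.insns.index("allocapt"), f"%{name} = alloca")

    def append(self, insn: str):
        self.insns.append(insn)     # normal builder insertion at block end
```

Requesting an alloca after body code has already been emitted still places it above the sentinel, ahead of every non-alloca instruction.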
GenerateFunctionProlog -- Parameter Lowering
The prolog iterates four parallel data structures in lockstep:
| Cursor | Source | Stride | Termination |
|---|---|---|---|
| EDG parameter node | Linked list from Decl | next at offset +112 | nullptr |
| LLVM argument slot | F->arg_begin() | 40 bytes | F->arg_end() |
| Type info entry | From the ABI classifier | 40 bytes | (parallel with args) |
| Parameter index | 1-based counter | +1 | (parallel with params) |
A post-loop assertion validates that both cursors reached their end simultaneously: "Argument mismatch in generation function prolog!".
Struct Return: The agg.result Convention
Before entering the parameter loop, a helper (sub_938130) checks whether the first argument's ABI kind equals 2 (sret). When true, the prolog names the first LLVM argument "agg.result" and advances the argument cursor by one slot (+40 bytes), so that subsequent parameter processing starts at the second argument. This mirrors the standard LLVM sret convention where the caller pre-allocates space for a returned struct and passes a pointer as a hidden first parameter.
ABI Variant Dispatch
For each parameter, the ABI variant field at TypeInfo+12 selects one of four lowering paths:
Variant 0/1 -- Indirect/Aggregate Pass. The parameter arrives as a pointer to caller-allocated memory. If the type is an aggregate (struct/union/class/array -- type kinds 8--11 checked by IsAggregateType at sub_91B770), the prolog creates a local alloca named <param>.addr, stores the incoming argument into it, and registers the alloca in the declaration map via EmitParamDecl. If the type is a scalar, it goes directly to EmitParamDecl without an intermediate alloca.
Variant 2 -- Direct Pass (most common). The parameter is passed by value in a register or register pair. Two sub-paths exist:
- Byval demotion path. When the ByvalDemotion flag (parameter a7) is set and the parameter carries a byval attribute (TypeInfo+16 nonzero), the prolog consults a global name-set (dword_4D04688) to decide whether to create a __val_param temporary. If selected, it allocates a "tmp" alloca via CreateTmpAlloca, stores the argument into it, names the alloca "__val_param" + param_name, and falls through to EmitParamDecl. The __val_param prefix is NVIDIA-specific and marks parameters that have been demoted from byval to local copy for downstream optimization passes.
- Normal path. For non-byval scalars, calls EmitParamDecl directly. A guard validates that non-aggregate arguments are not marked indirect: "Non-aggregate arguments passed indirectly are not supported!".
Variant 3 -- Coercion. The parameter's LLVM type does not match the source type and requires a coercion cast. For aggregates, a "tmp" alloca is created. For scalars, the declaration is looked up and wrapped with a bitcast. The result is forwarded to EmitParamDecl.
EmitParamDecl -- Registration
EmitParamDecl (sub_9446C0) performs the final steps for each parameter:
- For scalar (non-aggregate, non-indirect) parameters: creates an alloca named <param>.addr, stores the incoming argument into it, and names the argument with the original parameter name.
- Inserts the mapping (EDG decl pointer -> LLVM Value*) into a hash map with open-addressing/quadratic-probing collision resolution. A duplicate check guards against re-declaration: "unexpected: declaration for variable already exists!".
- If debug info is enabled (dword_4D046B4), emits debug metadata for the parameter via sub_9433F0.
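The decl-to-Value map's collision scheme can be sketched as follows. This is a minimal illustration of open addressing with quadratic probing and the re-declaration guard, not the recovered implementation; table size, hash function, and probe sequence are assumptions:

```python
class DeclValueMap:
    """Open-addressed hash map with quadratic probing (illustrative sketch)."""

    def __init__(self, capacity=64):
        self.keys = [None] * capacity
        self.vals = [None] * capacity
        self.capacity = capacity

    def insert(self, decl_ptr, llvm_value):
        i = hash(decl_ptr) % self.capacity
        for step in range(self.capacity):
            slot = (i + step * step) % self.capacity  # quadratic probe
            if self.keys[slot] is None:
                self.keys[slot] = decl_ptr
                self.vals[slot] = llvm_value
                return
            if self.keys[slot] == decl_ptr:
                # mirrors the duplicate guard in EmitParamDecl
                raise RuntimeError(
                    "unexpected: declaration for variable already exists!")
        raise RuntimeError("table full")

    def lookup(self, decl_ptr):
        i = hash(decl_ptr) % self.capacity
        for step in range(self.capacity):
            slot = (i + step * step) % self.capacity
            if self.keys[slot] is None:
                return None
            if self.keys[slot] == decl_ptr:
                return self.vals[slot]
        return None
```
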
Naming Convention Table
| IR Value | Name Assigned |
|---|---|
| sret argument | "agg.result" |
| Unnamed parameter | "temp_param" |
| C++ this parameter | "this" (detected by bit 0 at EDG node offset +172) |
| Parameter alloca | <param_name> + ".addr" |
| Byval temp alloca | "__val_param" + <param_name> |
| Return value alloca | "retval" |
| Entry basic block | "entry" |
| Return basic block | "return" |
| Alloca sentinel | "allocapt" |
CreateTmpAlloca Internals
CreateTmpAlloca (sub_921D70) computes alignment from the type size using _BitScanReverse64 (effectively log2(size)), looks up or creates the pointer-to-type in the module's type system, then delegates to CreateAllocaInst (sub_921B80). The key detail: when no explicit array size is provided, the alloca is inserted at the allocapt marker position (IRGenState+456+24), not at the current builder insertion point.
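The alignment computation can be sketched as a highest-set-bit scan over the type size (the `_BitScanReverse64` equivalent). This is an illustrative model; whether the recovered code clamps the result to a maximum alignment is not established here:

```python
def alloca_alignment(type_size: int) -> int:
    """Alignment = 2**floor(log2(size)), via a highest-set-bit scan.

    Models the _BitScanReverse64-based computation described for
    CreateTmpAlloca; behavior for size 0 is an assumption.
    """
    if type_size <= 0:
        return 1
    msb = type_size.bit_length() - 1   # _BitScanReverse64 equivalent
    return 1 << msb
```

Note that non-power-of-two sizes round down: a 12-byte struct gets 8-byte alignment under this model.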
Call Codegen
Call emission (sub_93CB50) is a 1,293-line function that handles direct calls, indirect calls, builtins, special intrinsics, and printf interception. It receives the caller's codegen context, the EDG call expression node, and an optional pre-allocated destination for aggregate returns.
Phase 1: Type Resolution
The callee operand is extracted from the call node's first operand slot (offset +72). The function resolves the callee's declaration via sub_72B0F0, then peels through the type chain -- stripping typedef aliases (kind 12) by following offset +160 -- until it reaches a pointer-to-function type (kind 6) wrapping a function type (kind 7). Fatal assertions guard both steps: "Expected pointer to function!" and "unexpected: Callee does not have routine type!".
Phase 2: Builtin Dispatch
For direct calls (opcode 20), the resolved callee declaration is checked for the builtin flag: byte[199] & 2. When set, the entire normal call path is bypassed. Control transfers to sub_955A70 (or sub_12B3FD0 on Path B), the builtin lowering mega-switch described in a later section. If the builtin returns an aggregate, the call codegen allocates an "agg.tmp" stack slot and emits a store of the result into it.
Phase 3: Intrinsic Special Cases
If the callee is not a builtin but carries an intrinsic ID (word[176] != 0), a handful of intrinsic IDs receive special treatment:
| Intrinsic ID | Description |
|---|---|
| 10214 | Surface/texture primitive |
| 10219, 10227 | Warp-level primitives (detected via (id - 10219) & 0xFFF7 == 0) |
| 15752 | Special return convention intrinsic |
These dispatch to sub_939370, a dedicated handler that bypasses the normal ABI classification entirely.
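The warp-primitive detection mask can be verified mechanically: `(id - 10219) & 0xFFF7` clears only bit 3 of the offset, so exactly offsets 0 and 8 (IDs 10219 and 10227) pass. A small check, under the assumption that IDs stay within 16-bit range:

```python
def is_warp_primitive(intrinsic_id: int) -> bool:
    """True for the two warp-level primitive IDs 10219 and 10227.

    (id - 10219) & 0xFFF7 == 0 accepts only offsets 0 and 8, because
    0xFFF7 masks out bit 3 and keeps every other low bit.
    """
    return (intrinsic_id - 10219) & 0xFFF7 == 0
```
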
Phase 4: Argument Processing
Arguments are codegen'd by walking the argument linked list and calling sub_921F50 on each expression. Results are collected into a dynamically-growing array (24 bytes per entry, managed by sub_C8D5F0).
When bit 1 of the call node's flags byte (offset +60) is set -- indicating variadic or reversed-evaluation convention -- arguments are first collected into a temporary linked list and then written into the array in reverse order. This preserves the C right-to-left evaluation order for variadic calls.
Phase 5: ABI Classification
The ABI classifier (sub_9378E0) receives the return type, parameter types, and byval flags, and produces a calling-convention descriptor. Each parameter gets an ABI kind:
| ABI Kind | Meaning | Codegen Action |
|---|---|---|
| 0 | Direct (register) | Push value directly if scalar; alloca + store if byval aggregate |
| 1 | Indirect (pointer) | Push pointer directly (only valid for aggregates) |
| 2 | Indirect + byval | Push value directly (callee copies) |
| 3 | Coercion/expand | Multi-register split, handled by sub_923000 |
For the return value, ABI kind 2 means sret: a hidden first parameter is prepended to the argument list, pointing to a caller-allocated "tmp" alloca.
Phase 6: Callee Bitcast Folding
If the callee operand is a bitcast (byte[0] == 5), the optimizer walks back to the original function pointer and compares return types and parameter counts. If the signature matches exactly (pointer equality on type nodes, parameter-by-parameter comparison), the bitcast is folded out. This removes unnecessary bitcast wrappers that arise from C-style casts between compatible function pointer types.
Phase 7: Pre-Call Hooks and printf Interception
Debug location metadata is emitted via sub_92FD10. Then a special case: if the call is direct (opcode 20) and the callee name is literally "printf", control transfers to sub_939F40 which performs GPU printf lowering -- converting the printf call into a vprintf-style call that writes formatted output through the GPU's printf buffer mechanism.
Phase 8: preserve_n Operand Bundles
If the call node's preserve_data field (offset +64) is non-null, up to three operand bundles are attached to the call instruction:
preserve_data[0] >= 0 => "preserve_n_data" = ConstantInt(value)
preserve_data[1] >= 0 => "preserve_n_control" = ConstantInt(value)
preserve_data[2] >= 0 => "preserve_n_after" = ConstantInt(value)
These NVPTX-specific operand bundles are register-pressure hints consumed by the instruction scheduler and register allocator. The value -1 means "not specified" and suppresses the bundle.
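The bundle-construction rule above (emit a bundle per slot, suppress on -1) can be sketched directly:

```python
def build_preserve_bundles(preserve_data):
    """Build preserve_n operand bundles from the 3-slot record.

    preserve_data holds (data, control, after) values; -1 means
    'not specified' and suppresses that bundle. Returns the
    (tag, value) pairs to attach to the call instruction.
    """
    tags = ("preserve_n_data", "preserve_n_control", "preserve_n_after")
    return [(tag, val) for tag, val in zip(tags, preserve_data) if val >= 0]
```
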
Phase 9: Call Emission and Attribute Attachment
The LLVM CallInst is created by sub_921880, which takes the callee, the argument array, return type, and the optional operand bundle. Calling-convention attributes (sret, byval, alignment) are collected by sub_93AE30 and attached to the call. For indirect calls, the instruction is named "call" for readability; direct calls inherit the callee's name.
Phase 10: Return Value Handling
| Return ABI Kind | Handling |
|---|---|
| 0 or 1 (direct scalar) | Return the CallInst result directly |
| 0 or 1 (direct aggregate) | Allocate "agg.tmp", store the result, return the alloca |
| 2 (sret) | Return the sret pointer (aggregate) or load from it (scalar) |
| 3 (expanded/multi-register) | Call sub_923000 to split across multiple extracts |
For indirect calls, callalign metadata is constructed by querying the alignment requirement of the return type and each argument type, wrapping them in an MDTuple, and attaching it to the call instruction. This metadata is consumed by the NVPTX backend to generate correct alignment annotations in PTX.
Call Emission Pseudocode
EmitCallExpr(Result *Out, CodegenCtx *Ctx, CallNode *Call, u64 DestFlags, u32 Align):
callee_decl = ResolveCallee(Call->operand[0])
func_type = PeelTypedefs(callee_decl->type) // kind 6 -> kind 7
// ---- Builtin fast path ----
if Call->opcode == CALL_DIRECT AND callee_decl->flags[199] & 2:
result = BuiltinLowering(Ctx, Call)
if isAggregate(func_type->returnType):
dest = DestFlags.ptr OR CreateTmpAlloca("agg.tmp")
Store(result, dest, ComputeAlign(returnType))
Out = {dest, INDIRECT, sizeof(returnType)}
else:
Out = result
return
// ---- Special intrinsics ----
if callee_decl->intrinsicID in {10214, 10219, 10227, 15752}:
return SpecialIntrinsicHandler(Out, Ctx, callee_decl->intrinsicID, Call)
// ---- Normal call path ----
callee_val = CodegenCallee(Ctx, Call->operand[0])
args[] = CodegenArguments(Ctx, Call->argList)
if Call->flags & REVERSED_EVAL:
Reverse(args)
abi_desc = ClassifyABI(func_type->returnType, paramTypes, byvalFlags)
if abi_desc.returnIsSRet:
sret_ptr = DestFlags.ptr OR CreateTmpAlloca("tmp")
PrependArg(args, sret_ptr)
for each (arg, abi_entry) in zip(args, abi_desc.params):
if abi_entry.kind == DIRECT AND abi_entry.isByval:
tmp = CreateAllocaForAggregate(arg)
Store(arg, tmp)
arg = tmp
elif abi_entry.kind == INDIRECT:
assert isAggregate(arg.type)
callee_val = FoldCalleeBitcast(callee_val, func_type)
EmitDebugLoc(Ctx, Call->srcLoc)
if Call->opcode == CALL_DIRECT AND callee_name == "printf":
return PrintfExpansion(Ctx, abi_desc, args, Call->srcLoc)
bundle = BuildPreserveNBundle(Call->preserveData)
call_inst = EmitCall(func_type, callee_val, args, bundle)
AttachCCAttrs(call_inst, abi_desc)
Out = HandleReturnValue(call_inst, abi_desc, func_type->returnType)
Inline Assembly Codegen
The inline asm handler (sub_1292420, 53 KB) translates a CUDA __asm__() statement into an LLVM InlineAsm call instruction through a strict 7-phase pipeline. A nearly-identical duplicate exists at sub_932270 for the Path A codegen context -- same parsing logic, same constraint table, different diagnostic function pointers.
Phase 1: Template String Parsing
The raw PTX template string from the EDG AST is scanned character-by-character into a fragment array. Each fragment (48 bytes) is either a literal text chunk (kind=0) or an operand substitution reference (kind=1 with an operand index at offset +0x28).
The parser handles the CUDA-to-LLVM syntax translation:
| CUDA Syntax | LLVM IR Output | Parser Action |
|---|---|---|
| $ (literal dollar) | $$ | Escape doubling |
| %% | % | Literal percent |
| %N (operand ref) | Fragment kind=1, index=N | Multi-digit decimal parse |
| %= (unique ID) | ${:uid} | LLVM unique-identifier modifier |
| %[name] | -- | Fatal: "symbolic operand reference not supported!" |
| %cN (modifier+operand) | Fragment kind=1, modifier=c, index=N | Alpha char + decimal parse |
For operands referencing string literal constants (the C constraint), the parser resolves the constant through the EDG value chain, validates the type is array of char, extracts each byte, escapes any $ characters, strips the trailing NUL, and emits the entire string as a literal fragment.
Phase 2: Template Reconstruction
The fragment array is serialized into the final LLVM inline-asm template string:
- Literal fragments: appended verbatim.
- Operand references without modifier: converted to $N (e.g., operand 3 becomes $3).
- Operand references with modifier: converted to ${N:c} (e.g., operand 0 with modifier h becomes ${0:h}).
This is where the CUDA %N convention is translated to LLVM's $N convention. Literal % characters in PTX (like %tid.x) pass through unchanged because they were never parsed as operand references.
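The Phase 1/2 translation can be sketched end to end as a single string rewrite. This is an illustrative model covering the table above ($ doubling, %%, %N, %=, literal-% pass-through); modifier forms (%cN) and the C string-constraint path are omitted:

```python
def translate_asm_template(cuda_tmpl: str) -> str:
    """Rewrite a CUDA inline-asm template into LLVM's $N convention."""
    out, i = [], 0
    while i < len(cuda_tmpl):
        c = cuda_tmpl[i]
        if c == '$':                       # literal dollar: escape-double
            out.append('$$'); i += 1
        elif c == '%':
            nxt = cuda_tmpl[i + 1] if i + 1 < len(cuda_tmpl) else ''
            if nxt == '%':                 # %% -> literal percent
                out.append('%'); i += 2
            elif nxt == '=':               # %= -> unique-ID modifier
                out.append('${:uid}'); i += 2
            elif nxt.isdigit():            # %N -> $N (multi-digit parse)
                j = i + 1
                while j < len(cuda_tmpl) and cuda_tmpl[j].isdigit():
                    j += 1
                out.append('$' + cuda_tmpl[i + 1:j]); i = j
            else:                          # bare % (e.g. %tid.x) passes through
                out.append('%'); i += 1
        else:
            out.append(c); i += 1
    return ''.join(out)
```

Running it on the end-to-end example's input reproduces the reconstructed template: `"mov.u32 %0, %tid.x"` becomes `"mov.u32 $0, %tid.x"`.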
Phase 3: Constraint String Construction
The parser iterates the EDG operand linked list, building a comma-separated LLVM constraint string. Each EDG operand carries a constraint type-chain -- a linked list of tag bytes that map through a 256-byte global lookup table (aXg0123456789rh[]) to produce LLVM constraint letters.
Output operands (flags & 2 != 0):
- Pointer types: constraint prefix "=*" + letters (indirect output).
- Non-pointer types: constraint prefix "=" + letters (direct output).
- Read-write operands (byte at +24 == 3): a tied input operand is generated with the output's index as the constraint, linking them as a two-address pair.
Input operands:
- Same tag-to-letter mapping.
- Tags 10--19 are prohibited: "tied input/output operands not supported!" (GCC-style matching-digit constraints are not implemented).
- Tag 23 (the C constraint on inputs) creates an undef value -- the constant's value was already inlined into the template string during Phase 1.
Special tag handling:
| Tag | Effect |
|---|---|
| 8, 9 | Sets is_address + is_memory flags; tag 9 also emits "imr" composite constraint |
| 0x14, 0x15, 0x16, 0x18, 0x26, 0x2A | Pointer-through types: follow type chain, set is_address |
| 0x19, 0x1B, 0x1C | Memory constraints |
| 23 | Remapped to tag 20 before table lookup |
Phase 4: Clobber List
The EDG clobber linked list (at asmInfo+144) is iterated. Each clobber node has a tag byte selecting the clobber type:
- Tag 1: Memory clobber. Appends ",~{memory}" to the constraint string.
- Tag 58: Named register clobber. Uses the name string from the node. Appends ",~{<name>}".
- Other tags: Looks up the register name from a global table (off_4B6DCE0[tag]). Appends ",~{<name>}".
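The clobber-appending step can be sketched as follows. The register-name table `off_4B6DCE0` is stood in for by a small placeholder dict, and the tag values other than 1 and 58 shown in the test are illustrative:

```python
def append_clobbers(constraints: str, clobbers) -> str:
    """Append clobber entries to an LLVM constraint string (sketch).

    clobbers is a list of (tag, name) pairs. Tag 1 is a memory clobber,
    tag 58 a named register clobber; other tags index a register-name
    table (modeled here by REG_TABLE, a placeholder for off_4B6DCE0).
    """
    REG_TABLE = {2: "r0", 3: "r1"}        # illustrative stand-in
    parts = [constraints] if constraints else []
    for tag, name in clobbers:
        if tag == 1:
            parts.append("~{memory}")
        elif tag == 58:
            parts.append("~{%s}" % name)
        else:
            parts.append("~{%s}" % REG_TABLE[tag])
    return ",".join(parts)
```
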
Phase 5: InlineAsm Object Creation
The LLVM function type for the asm is constructed based on the output count:
- Zero outputs: void return type.
- One output: scalar return type matching the output operand.
- Multiple outputs: anonymous struct return type.
The volatile/sideeffect flag is read from asmInfo+128 (bit 2). A diagnostic (0xE9F) warns when outputs exist but the asm is not marked volatile, as this risks miscompilation.
The InlineAsm object is created via InlineAsm::get(funcType, asmString, constraintString, hasSideEffects, isAlignStack=0, dialect=0) and a CallInst is emitted to invoke it.
Phase 6: Result Extraction
For single-output asm, the CallInst result is used directly. For multiple outputs, each result is extracted with extractvalue instructions:
- Results with type size <= 16 bytes: a compact extractvalue path.
- Results with type size > 16 bytes: a full instruction node (88 bytes) is allocated, the extractvalue is constructed with explicit index arrays, linked into the basic block's instruction list, and named "asmresult".
Each extracted value is then stored into its output destination via sub_12843D0, which reads the output codegen-info records built during Phase 3.
Phase 7: Cleanup
All temporary vectors and strings are freed: the fragment array (with per-element string cleanup), constraint strings, operand/type/destination vectors, and tied-operand tracking arrays.
End-to-End Example
CUDA source: __asm__("mov.u32 %0, %tid.x" : "=r"(result));
Phase 1 parse: [literal("mov.u32 "), operand(idx=0), literal(", %tid.x")]
Phase 2 recon: "mov.u32 $0, %tid.x"
Phase 3 constr: "=r"
Phase 4 clobber: ""
Phase 5 create: InlineAsm::get("mov.u32 $0, %tid.x", "=r", sideeffects=true)
call i32 asm sideeffect "mov.u32 $0, %tid.x", "=r"()
Phase 6 extract: (single output -- use call result directly)
store i32 %asm_result, i32* %result.addr
Builtin Lowering
The builtin lowering mega-switch (sub_12B3FD0, 103 KB) is one of the largest single functions in the binary. It handles ~250 builtin IDs across ~130 case labels, dispatching CUDA intrinsic functions like __syncthreads(), __shfl_sync(), and __hmma_m16n16k16_mma_f16f16 into LLVM IR.
Entry Logic
The function extracts the callee from the call expression, validates the builtin bit (flags byte[199] & 2), then looks up the builtin ID by name via sub_12731E0. If the ID is 0 (name not in the builtin table), execution falls through to the LLVM intrinsic fallback path at line 3154.
Five Lowering Strategies
| Strategy | Usage (%) | Mechanism |
|---|---|---|
| Sub-handler delegation | 66% (~165 IDs) | Calls a specialized function for a family of builtins |
| Intrinsic call emission | 12% (~30 IDs) | 1:1 mapping to a single llvm.nvvm.* intrinsic via sub_1285290 |
| Inline IR generation | 10% (~25 IDs) | Builds IR nodes directly (alloca, load, store, cast, insertvalue) |
| Table-driven selection | 10% (~25 IDs) | Selects intrinsic ID from a table keyed by operand type/size |
| SM-gated conditional | 2% (~5 IDs) | Different lowering depending on target SM version |
Per-Category Dispatch
Atomics and synchronization (IDs 0xB5--0xCC, 181--204). Atomic operations delegate to sub_12A7DA0; fences and barriers to sub_12AB550. Cases 0xBA--0xBC map directly to LLVM intrinsic 6 (likely llvm.nvvm.atomic.*) with type-overloaded arguments. Case 0xCB is SM-gated: on SM <= 63 it emits an inline constant; on SM >= 70 it emits intrinsic 3769.
Warp shuffle (IDs 0x15F--0x166, 351--358). All eight variants delegate to sub_12ABB90 parameterized by shuffle mode (0=idx, 1=up, 2=down, 3=butterfly) and sync flag (0=legacy, 1=__shfl_sync_*). The clamp flag distinguishes butterfly from other modes.
Warp vote/ballot (IDs 0x12E--0x135, 0x152--0x159, 0x18B--0x192). Three groups of 8 IDs each, all delegating to sub_12B3540 with the builtin ID as a discriminator. This covers __ballot_sync, __all_sync, __any_sync across integer/float/predicate operand types.
Surface and texture operations (IDs 0xCF--0x113, 0x287--0x2A5, 207--275 + 647--677). The largest category at ~95 IDs (38%). Organized into pairs using two sub-handlers: sub_12ADE80(ctx, intrinsic_base, surface_type, variant, args) for individual load/store operations, and sub_12AA9B0(ctx, surface_type, expr) for combined operations. Surface types are encoded as integers (0=generic, 1=1D, 5=2D, 7=3D, 8=cubemap, 10=1D array, 11=2D array, 14=buffer). Intrinsic bases 3701/3702 are primary read/write; 3698/3699 are 2D-array variants.
The texture handler (case 0x287) is the most complex single case at ~230 lines. It walks the AST to extract the texture name string and return element type, constructs an intrinsic name as "<texname>_<typename>" using a type-name resolution switch (mapping integer subtypes 0--10 to strings like "uchar", "int", "ulonglong"), and emits the call. A global flag (dword_4F06B98) controls whether plain char maps to uchar or schar.
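The "<texname>_<typename>" name construction can be sketched as a table lookup. The specific subtype indices in this partial name table are illustrative assumptions (the source establishes only that integer subtypes 0--10 map to strings like "uchar", "int", "ulonglong"); the dword_4F06B98 flag is modeled as a boolean:

```python
def texture_intrinsic_name(tex_name: str, subtype: int,
                           plain_char_is_unsigned: bool = True) -> str:
    """Construct "<texname>_<typename>" as the texture case does (sketch).

    TYPE_NAMES is a partial, assumed mapping; plain_char_is_unsigned
    models the global flag that selects uchar vs schar for plain char.
    """
    TYPE_NAMES = {
        0: "uchar" if plain_char_is_unsigned else "schar",  # assumed index
        4: "int",                                           # assumed index
        10: "ulonglong",                                    # assumed index
    }
    return "%s_%s" % (tex_name, TYPE_NAMES[subtype])
```
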
Tensor core / WMMA (IDs 0x16E--0x1D9, 0x2A6--0x2E8, 366--473 + 678--744). The second-largest category at ~85 IDs (34%). Three sub-handlers partition the work: sub_12AC1A0 handles wmma::mma_sync with bias/scale flags (has_bias, has_scale) encoding four accumulator modes; sub_12AC5F0 handles store_matrix_sync; sub_12ACA80 handles load_matrix_sync. IDs group into triplets by matrix shape: m16n16k16, m32n8k16, m8n32k16, m16n16k8 (TF32), bf16, and fp8 (SM 89+) families.
WGMMA (IDs 0x2E9--0x302, 745--770). SM 90+ warpgroup MMA operations. Cases 0x2E9--0x2EE handle fence/commit/wait. Cases 0x2F1--0x2FC implement __wgmma_mma_async through a massive ~800-line handler that selects from a 144-entry intrinsic table spanning IDs 5304--5447. The table is indexed by a 5-dimensional grid: N-size (16/32/64/128), B-operand source (shared vs register), element type (s64 vs other), scale/negate flags, and case variant. Mode bits are packed into a single integer: bit0=accumulate | bit1=transpose | bit2=negate-C | bit4=negate-A.
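The WGMMA mode-bit packing described above is straightforward to express directly:

```python
def pack_wgmma_mode(accumulate=False, transpose=False,
                    negate_c=False, negate_a=False) -> int:
    """Pack WGMMA mode flags into one integer:
    bit0=accumulate | bit1=transpose | bit2=negate-C | bit4=negate-A.
    (Bit 3 is unused in the recovered layout.)"""
    return (int(accumulate)
            | int(transpose) << 1
            | int(negate_c) << 2
            | int(negate_a) << 4)
```
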
Memory copy (IDs 0x199, 0x291--0x299, 409 + 657--665). Memcpy variants encode alignment directly in the builtin ID: ID 658 = align 2, ID 659 = align 4, ID 660 = align 8, ID 661 = align 16. The actual emission delegates to sub_12897A0. Memset operations (IDs 410, 663, 665) delegate to sub_12A6DF0.
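The four listed alignment pairs (658 to 2, 659 to 4, 660 to 8, 661 to 16) follow the closed form align = 2^(ID - 657). Treating this as a formula is an inference from the table, not a recovered constant:

```python
def memcpy_builtin_alignment(builtin_id: int) -> int:
    """Alignment encoded by memcpy builtin IDs 658-661 (inferred formula)."""
    assert 658 <= builtin_id <= 661
    return 1 << (builtin_id - 657)
```
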
TMA bulk operations (IDs 0x19B--0x1A0, 411--416). Cases 0x19B and 0x19C are the largest individual handlers (~300 and ~450 lines respectively) for SM 90+ tensor memory access bulk copy/scatter operations. They build operand vectors iteratively and select from intrinsic tables indexed by element count (IDs 4218--4223 for stores, 4244--4250 for loads).
LLVM Intrinsic Fallback Path
When the builtin ID is 0, the default path (lines 3154--3407) looks up the LLVM intrinsic by name via sub_15E2770. If the intrinsic is type-overloaded, argument types are used to resolve the declaration. Each argument is lowered via sub_128F980, with type-mismatch bitcasts (opcode 47) and vector zexts (opcode 33) inserted as needed. Struct-return intrinsics are handled by iterating the return struct's fields with extractvalue.
Function Attributes
CUDA function attributes are lowered through a three-stage pipeline: EDG frontend parsing, attribute emission during IR generation, and a final metadata-attachment pass.
Stage 1: Frontend Parsing (sub_64F1A0)
The EDG parser scans the token stream for preserve_n_data, preserve_n_control, and preserve_n_after identifiers, parses each as an integer, and stores them in a 12-byte struct at offset +336 of the function declaration node:
struct preserve_reg_info {
int32_t preserve_n_data; // +0, -1 = not specified
int32_t preserve_n_control; // +4, -1 = not specified
int32_t preserve_n_after; // +8, -1 = not specified
};
Stage 2: Attribute Emission (sub_12735D0)
During IR generation, the attribute emitter checks declaration flags and writes attribute bundles:
- Bit 0x20 at decl+198 (kernel function): emits ("kernel", 1). Then iterates the parameter array (40-byte entries); for each parameter with byte[+33] != 0, emits ("grid_constant", param_index) where param_index is 1-based. This marks individual kernel parameters as grid-constant, enabling the backend to place them in constant memory.
- Bit 0x04 at decl+199 (custom ABI): emits ("full_custom_abi", 0xFFFFFFFF).
- Preserve-reg struct at decl+336: for each of the three fields, if the value is >= 0, emits the corresponding attribute and then writes -1 back (consumed pattern) to prevent double-emission.
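The consumed pattern for the preserve-reg struct can be sketched as follows, with a dict standing in for the declaration node:

```python
def emit_preserve_attrs(decl):
    """Emit preserve_n_* attributes with the consumed pattern (sketch).

    decl is a dict modeling the 12-byte struct at decl+336. Each field
    is emitted once if >= 0, then overwritten with -1 so a second pass
    over the same declaration emits nothing.
    """
    emitted = []
    for key in ("preserve_n_data", "preserve_n_control", "preserve_n_after"):
        if decl[key] >= 0:
            emitted.append((key, decl[key]))
            decl[key] = -1                 # consumed: prevent double-emission
    return emitted
```
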
Stage 3: Metadata Attachment (sub_1273F90)
The reader pass iterates all functions' attribute bundles and re-encodes them as LLVM named metadata:
grid_constant. Per-parameter type values are collected into a vector, then bundled under the MDString key "grid_constant" as an MDTuple. The downstream consumer sub_CE8660 queries this metadata to determine aliasing/readonly semantics for kernel parameters.
preserve_reg_abi. The three preserve_n values are collected with their MDString keys ("preserve_n_data", "preserve_n_control") into a vector, then bundled under the composite key "preserve_reg_abi" as an MDTuple. The register allocator and prologue-epilogue inserter query this via sub_314D260.
full_custom_abi. Emitted as a simple (MDString, MDNode(i32 0xFFFFFFFF)) pair. When a function has this attribute but NOT the full_custom_abi flag, the alternative "numParams" key records the explicit parameter count as a nested MDTuple.
Final Metadata Layout
For a __global__ kernel with grid_constant parameters and register preservation:
!kernel_attrs = !{
!MDString("kernel"), !MDNode(i32 1),
!MDString("grid_constant"), !MDTuple(
!MDNode(i32 <param1_type>), !MDNode(i32 <param2_type>), ...
),
!MDString("preserve_reg_abi"), !MDTuple(
!MDString("preserve_n_data"), !MDNode(i32 N),
!MDString("preserve_n_control"), !MDNode(i32 M),
!MDString("preserve_n_after"), !MDNode(i32 K)
)
}
Attribute Semantics
| Attribute | Meaning | Backend Effect |
|---|---|---|
| grid_constant | Kernel parameter is immutable across the grid | Place in constant memory; optimize loads |
| preserve_n_data | N data registers must be preserved across calls | Register allocator reserves R0--RN |
| preserve_n_control | N predicate registers to preserve | Prologue/epilogue saves predicates |
| preserve_n_after | N registers preserved after a call (callee-save count) | Adjusts spill/restore boundaries |
| full_custom_abi | Function bypasses standard CUDA calling convention | Parameter passing determined by explicit annotations |
| numParams | Explicit parameter count for non-full_custom_abi functions | Custom ABI parameter setup |
Cross-Reference
| Address | Function | Role |
|---|---|---|
| sub_946060 | EmitFunction | Creates entry BB, allocapt, return BB, dispatches to prolog |
| sub_938240 | GenerateFunctionProlog | Iterates parameters, ABI dispatch, alloca emission |
| sub_9446C0 | EmitParamDecl | Creates alloca+store, registers decl->Value mapping |
| sub_921D70 | CreateTmpAlloca | Alloca creation with alignment, inserted at allocapt |
| sub_921B80 | CreateAllocaInst | Low-level alloca IR emission |
| sub_938130 | IsSRetReturn | Checks ABI kind == 2 |
| sub_91B770 | IsAggregateType | Type kinds 8--11 (struct/union/class/array) |
| sub_93CB50 | EmitCallExpr | Full call instruction emission (1,293 lines) |
| sub_9378E0 | ClassifyABI | Return + parameter ABI classification |
| sub_939F40 | PrintfExpansion | GPU vprintf lowering for printf calls |
| sub_93AE30 | CollectCCAttrs | Builds sret/byval/align attribute list |
| sub_955A70 / sub_12B3FD0 | BuiltinLowering | Mega-switch over ~250 builtin IDs |
| sub_1292420 / sub_932270 | EmitInlineAsm | 7-phase asm template-to-IR pipeline |
| sub_12735D0 | EmitFunctionAttrs | Writes attribute bundles during IR gen |
| sub_1273F90 | ReadFunctionAttrs | Attaches LLVM named metadata from bundles |
| sub_64F1A0 | ParsePreserveAttrs | EDG parser for preserve_n_* tokens |
Type Translation, Globals & Special Variables
The type translation subsystem is one of the most algorithmically complex parts of NVVM IR generation. It converts the Edison Design Group (EDG) intermediate language type graph --- which can contain arbitrary mutual recursion, template-dependent types, and CUDA address-space qualifiers --- into a well-formed LLVM type system. The same IR generation phase also handles global variable materialization (with CUDA memory-space assignment), kernel metadata emission, and the translation of CUDA built-in variables (threadIdx, blockIdx, etc.) into LLVM intrinsic calls.
| Type translation entry | sub_91AED0 (640 bytes) |
| Fixed-point driver | sub_91AB30 (896 bytes) |
| Topological sort | sub_919CD0 (896 bytes, 10-level BFS) |
| Type-kind dispatch | sub_918E50 (2,400 bytes, 11+ categories) |
| Type-pair comparator | sub_911D10 (1,024 bytes) |
| Global var creation | sub_915C40 (2,018 bytes) |
| Address space logic | sub_916430 (482 bytes) |
| Annotation emitter | sub_914410 (3,524 bytes) |
| Kernel metadata | sub_93AE30 (~5,600 bytes) |
| Special var classifier | sub_920430 (old) / sub_127F7A0 (new) |
| Special var codegen | sub_922290 (old) / sub_1285550 (new) |
EDG-to-LLVM Type Translation
The Problem
EDG represents C++ types as a graph of IL nodes linked through child/parent pointers, member chains, and scope references. This graph can be arbitrarily cyclic: consider struct A { B* b; }; struct B { A* a; }; where translating A requires translating the pointee type B, which requires translating the pointee type A. Template instantiations add another dimension --- a template class body may reference types that cannot be resolved until the template arguments themselves are translated. The type translator must produce valid LLVM types from this graph without infinite recursion or stale mappings.
NVIDIA solves this with a fixed-point iteration scheme: translate every type, detect whether any translation changed a previously-emitted LLVM type, and if so, repeat the entire pass. The iteration terminates when a full pass produces no changes.
Context Object Layout
The type translation pass operates on a context structure initialized by sub_91AB30 and threaded through every function in the subsystem:
| Offset | Size | Field |
|---|---|---|
| +0x000 | 8 | debug_logger --- nullable, enables trace output when non-null |
| +0x008 | 8 | pass_list_ptr --- vector of (vtable_ptr, pass_instance) pairs |
| +0x010 | 8 | target_info |
| +0x018 | 8 | address_space_map --- qualifier-to-LLVM-AS translation table |
| +0x020 | 8 | llvm_context --- the LLVMContext* |
| +0x028 | 8 | module_ptr |
| +0x038 | 8 | edg_node_map --- hash table: EDG nodes to LLVM values |
| +0x038 | 16 | visited_set --- open-addressed hash set for dedup (at +0x38..+0x48) |
| +0x050 | 4 | iteration_counter |
| +0x060 | 12 | visited_set control (count, capacity, bucket_count) |
| +0x078 | 8 | processed_list --- vector of completed types |
| +0x090 | 16 | type_cache --- hash table: EDG type pointer to LLVM Type* |
| +0x0A0 | 8 | remap_list --- vector of type-remapping entries |
| +0x150 | 8 | alignment_table --- target-specific alignment data |
| +0x168 | 4 | threshold --- type index below which scope lookups are attempted |
| +0x2A0 | 16 | pending_replacements --- vector of (old_type, new_type) pairs |
| +0x310 | 1 | flags --- bit-packed control flags |
Fixed-Point Iteration Algorithm
The entry point sub_91AED0 recovers pass infrastructure objects by iterating a vector<pair<void*, void*>> at context+8. Each element is 16 bytes: a vtable pointer identifying the pass, and a pass instance pointer. The function compares vtable pointers against 8 known globals to extract the data layout, reflect pass, target transform info, module context, dominator tree, and alias analysis results. It then calls sub_91AB30, the actual iteration driver.
// sub_91AB30: TypeTranslationPass driver
fn translate_all_types(ctx: &mut TypeTransCtx, module: &EDGModule) {
// Optional pre-processing (gated by byte_3C34E60)
if PRE_PROCESS_FLAG {
pre_process_types(ctx, module); // sub_90F800
}
// Gather initial flags from all module members
for member in module.members() { // linked list from module+80
gather_initial_flags(member); // sub_AA3700
}
// MAIN FIXED-POINT LOOP
loop {
let changed = single_iteration(ctx, module); // sub_91AA50
if !changed { break; }
}
// Optional late fixup pass (gated by byte_3C35480)
if OPTIMIZATION_FLAG {
finalize_late_types(ctx, module); // sub_90F750
loop {
let changed = late_fixup(ctx, module); // sub_917E30
if !changed { break; }
}
}
// Optional cleanup (gated by dword_3C351E0)
if CLEANUP_FLAG {
cleanup_stale_types(ctx); // sub_90EB40
}
flush_and_finalize(ctx); // sub_909590
}
Each single iteration (sub_91AA50) performs three steps:
- Topological sort (sub_919CD0): Build a dependency ordering of all EDG type nodes reachable from the module root.
- Invalidate (sub_913880, for each type in reverse order): Remove stale cache entries for types whose dependencies have changed.
- Process (sub_9197C0, for each type in reverse order): Translate each type, returning whether any LLVM type was modified.
The iteration returns the logical OR of all sub_9197C0 results. If any type replacement occurred, the outer loop repeats.
10-Level Topological Sort
The function sub_919CD0 produces a dependency-ordered list of EDG types. Rather than a standard DFS-based topological sort, it uses a 10-level iterative BFS implemented with sorted sets at each level. This unusual depth accommodates deeply nested C++ class hierarchies with multiple inheritance, where types at depth N must be resolved before types at depth N+1 can be translated.
Each level maintains a sorted set (vector-backed, managed by sub_6CDA50 for initialization and sub_6CDC80 for merge/sort). Starting from the module's member list, the algorithm:
- Inserts root-level type declarations into level 0.
- For each level 0..9, discovers type dependencies and inserts them into the next level.
- After all 10 levels, concatenates the sets in reverse (leaf types first, composite types last).
The output is a vector of EDG type node pointers ordered so that leaf types precede the composite types that reference them.
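The leveled scheme can be sketched as follows. This is an illustrative model of the 10-level BFS with per-level sorted sets, not the recovered code; dependency discovery is abstracted into a plain mapping:

```python
def leveled_topo_sort(roots, deps, max_levels=10):
    """10-level BFS type ordering (sketch of sub_919CD0's scheme).

    deps maps a type to the types it references. Each level collects
    the dependencies of the previous level; concatenating the levels
    in reverse yields leaf types before the composites that use them.
    """
    levels = [sorted(set(roots))]
    for _ in range(max_levels - 1):
        nxt = sorted({d for t in levels[-1] for d in deps.get(t, ())})
        levels.append(nxt)
    order, seen = [], set()
    for level in reversed(levels):         # leaf-most level first
        for t in level:
            if t not in seen:
                seen.add(t)
                order.append(t)
    return order
```

For example, with struct S containing a P* whose pointee is i32, the ordering places i32 before P before S, matching the leaf-first contract.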
EDG Type Kind Dispatch
The core dispatcher sub_918E50 (2,400 bytes) reads the type-kind byte at edg_node+16 and routes to specialized handlers:
| Kind Byte | Value | Handler | Description |
|---|---|---|---|
| 0x00--0x10 | 0--16 | Primitive dispatch | void, bool, char, int, float, double, etc. |
| 0x11 | 17 | Void special | Void type with swap handling in comparator |
| 0x05 | 5 | sub_5FFE90 | Qualified type (const/volatile/restrict) --- carries address-space info |
| 0x0D | 13 | Enum path | Enum type bridging C/C++ enum constants to LLVM integers |
| 0x0E | 14 | Function path | Function type with parameter chain traversal |
| 0x1A | 26 | sub_915850 | Array type (subscript form with enumeration base) |
| 0x1B | 27 | Inline handler | Compound type (struct/union/class) --- multi-child with dedup hash |
| 0x32--0x33 | 50--51 | Union variants | Union type (two internal representations) |
| 0x36 | 54 | sub_918C40 | Typedef / using declaration --- chains through EDG resolution |
| 0x37 | 55 | Using variant | Using declaration variant |
| 0x4B--0x4C | 75--76 | Pointer/ref | Pointer and reference types --- carry qualifier words for address spaces |
| 0x4D | 77 | Member pointer | Pointer-to-member type |
| 0x4E | 78 | sub_914070 | Dependent/nested type --- requires scope resolution |
For types with kind > 23 that are not special-cased, a default handler applies a bitmask test: 0x100000100003FF >> (kind - 25). If the low bit is set, the type requires scope tracking; the mask's set bits (positions 0--9, 28, and 52) select kinds 25--34, plus kinds 53 and 77. The handler then looks up any existing LLVM type for this EDG type via the scope table, and if the mapping has changed, triggers a replacement plus metadata propagation.
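The bitmask test can be checked mechanically; this sketch enumerates which kinds pass it:

```python
def needs_scope_tracking(kind: int) -> bool:
    """Default-handler bitmask test for type kinds above 23.

    0x100000100003FF has bits 0-9, 28, and 52 set, so the kinds that
    pass are 25-34 plus 53 and 77 (kind = 25 + bit position).
    """
    MASK = 0x100000100003FF
    if kind < 25 or kind - 25 > 63:
        return False
    return (MASK >> (kind - 25)) & 1 == 1
```
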
Compound Type (Struct/Class) Translation
When kind 0x1B (27) is encountered, the dispatcher uses an inline handler that:
- Reads the child count from node+20 & 0xFFFFFFF and divides by 2 (children come in pairs: type descriptor + offset/alignment info).
- Builds a reference-counting hash table to detect shared sub-types. If a child type appears exactly once, it can be translated independently. If it appears multiple times, it indicates a shared base class or diamond inheritance pattern.
- For unique children, calls sub_911D10 (the type-pair comparator) with the parent scope to translate.
Diamond inheritance is detected by the reference count exceeding 1, which prevents the comparator from making conflicting replacements for the same sub-type.
Type-Pair Comparison Engine
The function sub_911D10 is the core workhorse for comparing and replacing type pairs. It takes (context, type_a, type_b, scope_pair, is_recursive_flag) and maintains a local worklist of (type_a, type_b) pairs:
fn compare_and_replace(ctx, type_a, type_b, scope, is_recursive) {
let mut worklist = vec![(type_a, type_b)];
while let Some((a, b)) = worklist.pop() {
if a == b { continue; }
// Normalize: larger type index = v15, smaller = v14
let (mut v14, mut v15) = if type_index(a) < type_index(b) { (a, b) } else { (b, a) };
// Primitive vs compound: record scope mapping
if v14.kind <= 0x17 && v15.kind > 0x17 {
record_scope_mapping(ctx, v14, v15);
}
// Check for UINT_MAX sentinel (incomplete type) -> swap
if scope_table_lookup(v15) == UINT_MAX {
swap(&mut v14, &mut v15);
}
// Perform actual replacement
replace_type(ctx, v14, v15, is_recursive);
// For pointer/reference types: propagate through children
if v15.kind == 75 || v15.kind == 76 {
let qualifier = v15.qualifier_word & 0x7FFF;
// Address space qualifiers trigger child propagation
if qualifier == 1 || qualifier == 32 || qualifier == 33 || qualifier == 14 {
worklist.push((v14.child, v15.child));
}
}
// For union types: push all variant children
if v15.kind == 50 || v15.kind == 51 {
for child in v15.children() { worklist.push((v14, child)); }
}
}
}
This worklist-based approach avoids stack overflow on deeply nested types while correctly propagating address-space information through pointer chains.
CUDA Address Space Propagation
CUDA memory-space qualifiers flow through the EDG type system via a 15-bit qualifier word stored at edg_node+18: the low 15 bits encode the qualifier ID, and bit 15 is a negation flag. During type translation, when the type-pair comparator encounters pointer or reference types (kinds 75/76), it reads the qualifier word and maps it to an LLVM address space. The conversion is performed by sub_5FFE90 (qualifier to LLVM address space number) and sub_5A3140 (which creates the appropriately qualified LLVM pointer type); sub_911CB0 combines the conversion with a type-index computation, taking (type_kind - 24) as a base and combining it with the qualifier to produce a unique index for the scope table.
Address-space propagation is transitive: if struct S contains a __shared__ int* field, the shared qualifier must be reflected in the LLVM type of the pointer field within S. The type-pair comparator achieves this by pushing child pairs onto its worklist whenever a pointer/reference type carries a non-zero qualifier. The full qualifier-to-address-space table appears under Address Space Annotations on Types below.
Five Caching Layers
To avoid redundant work, the translator maintains five distinct caches:
| Cache | Location | Key | Value | Purpose |
|---|---|---|---|---|
| Visited set | ctx+0x38..+0x48 | EDG node ptr | (presence only) | Prevents re-processing the same declaration |
| Type cache | ctx+0x70..+0x94 | EDG decl ptr | child type ptr | Tracks which LLVM type a declaration was previously translated to |
| Type-value map | Per-call in sub_913E90 | EDG type ptr | LLVM Type* | Caches enum/struct translations; supports inline mode (up to 4 entries) |
| Scope table | ctx+0x10, hash at +8/+24 | scope ID | type info | Maps scope identifiers to type information for type-pair comparison |
| Type index table | ctx+0x98+ | compound key | monotonic index | Linear ordering of processed types; Jenkins-like hash for compound keys |
All hash tables use the standard DenseMap infrastructure with NVVM-layer sentinels (-8 / -16). See Hash Table and Collection Infrastructure for the hash function, probing strategy, and growth policy.
Cache invalidation is handled by sub_913880, which walks a type's member list and removes stale entries. Invalidation cascades: if a struct type is invalidated, all member types that are non-trivial (not kind 54/55 typedef/using) are also removed from the cache.
Template Specialization
Template types are handled by sub_918790 (struct/class type translation with template instantiation support):
- sub_41F0F0 extracts template argument descriptions from the EDG IL into a 1,536-byte stack buffer (heap fallback for > 50 arguments).
- sub_908040 performs syntactic template argument substitution, producing two lists: substituted types and original types.
- If both lists are non-empty and the optimization flags byte_3C35480 + byte_3C353A0 are both set, sub_910920 performs semantic type matching using the full optimization infrastructure.
- Otherwise, sub_906590 creates the LLVM type directly from the substitution result.
The two-pass approach (syntactic substitution then semantic matching) handles cases like template<typename T> struct Wrapper { T* data; } where Wrapper<__shared__ int> must produce a pointer in address space 3 --- the syntactic pass substitutes T = __shared__ int, and the semantic pass verifies the LLVM type is correct.
Template specialization support is entirely optional and gated behind configuration flags, allowing it to be disabled for faster compilation when not needed.
Primitive Type Translation Table
The dispatcher sub_918E50 handles kinds 0x00--0x10 (values 0--16) as primitive/scalar types. These map directly from EDG internal type representation to LLVM IR types. The correspondence between the three type-tag namespaces used across cicc is:
| EDG Type Kind | EDG Printer type_kind | Cast Codegen Tag (*(type+8)) | LLVM IR Type | Width |
|---|---|---|---|---|
| 0x00 | 0x00 error | --- | <error> | --- |
| 0x01 | 0x01 void | 3 | void | 0 |
| 0x02 | 0x02 scalar/integer | 17 | iN | N bits |
| 0x03 | 0x03 float | 1 (half), 2 (float), 3 (double), 4 (fp80), 5 (fp128), 6 (bf16) | see FP table | varies |
| 0x04 | 0x04 imaginary | --- | emulated | varies |
| 0x05 | 0x05 complex | --- | { fN, fN } struct | 2x float |
| 0x06 | 0x06 pointer/ref | 18 | ptr (opaque) or ptr addrspace(N) | 32/64 |
| 0x07 | 0x07 function | 15 (function), 16 (ptr-to-fn) | function type | --- |
| 0x08 | 0x08 array | 20 | [N x elem] | N * elem |
| 0x09--0x0B | 0x09--0x0B class/struct/union/enum | 21 (struct) | %struct.Name = type { ... } | layout |
| 0x0C | 0x0C elaborated/typedef | --- | resolved target | --- |
| 0x0D | 0x0D pointer-to-member | --- | { ptr, i64 } or i64 | 64/128 |
| 0x0E | 0x0E template param | --- | deduced | --- |
| 0x0F | 0x0F vector | 16 | <N x elem> | N * elem |
| 0x10 | 0x10 scalable vector | 16 | <vscale x N x elem> | runtime |
The integer type (EDG kind 0x02) carries its bit-width in the upper bytes of the type word. The cast codegen subsystem (sub_128A450) classifies types by the tag byte at *(type+8): tags 1--6 are floating-point (see next section), tag 11 is integer, tag 15 is pointer, and tag 16 is vector/aggregate. The key dispatch idiom (tag - 1) > 5u tests "is NOT a float"; (tag & 0xFB) != 0xB tests "is NOT integer-like" (the 0xFB mask folds the pointer tag 15 onto the integer tag 11, so the predicate matches exactly those two tags).
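The two dispatch idioms can be verified with a few lines. The helper names are ours, and the 0xFB mask is the value (an assumption from the analysis) that makes the integer-like predicate match exactly tags 11 and 15:

```rust
// Sketch of the cast-codegen tag classification idioms described above.
fn is_float(tag: u32) -> bool {
    // (tag - 1) > 5 in unsigned arithmetic matches everything EXCEPT 1..=6,
    // so the float test is its negation.
    tag.wrapping_sub(1) <= 5
}

fn is_integer_like(tag: u32) -> bool {
    // (tag & 0xFB) == 0xB matches tag 11 (0b1011, integer) and
    // tag 15 (0b1111, pointer) -- the mask clears bit 2.
    (tag & 0xFB) == 0xB
}

fn main() {
    assert!(is_float(1) && is_float(6));
    assert!(!is_float(0) && !is_float(11));
    assert!(is_integer_like(11) && is_integer_like(15));
    assert!(!is_integer_like(16));
    println!("tag classification idioms verified");
}
```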
Floating-Point Type Encoding
Floating-point types use a sub-kind byte stored in the EDG type node at v3[10].m128i_i8[0] (type printer) or equivalently the cast codegen tag at *(type+8). The complete mapping including all NVIDIA-extended formats:
| Cast Tag | EDG FP Sub-kind | Mangling | C++ Type | LLVM Type | Width | SM Minimum |
|---|---|---|---|---|---|---|
| 1 | 0 / 0xA | DF16_ | _Float16 / __half | half | 16 | SM 53 (scalar), SM 70 (packed) |
| 1 | 1 | Dh | __fp16 | half | 16 | SM 53 |
| 2 | 2 | f | float | float | 32 | all |
| --- | 3 | DF32x | _Float32x | double (promoted) | 64 | all |
| 3 | 4 | d | double | double | 64 | all |
| --- | 5 | DF64x | _Float64x | fp128 (emulated) | 128 | all |
| --- | 6 | (single) | long double | platform-dependent | arch | --- |
| --- | 7 | u7float80 | float80 | x86_fp80 | 80 | N/A on GPU |
| --- | 8 | g | __float128 | fp128 | 128 | emulated |
| 6 | 9 | u6__bf16 or DF16b | __bf16 / __nv_bfloat16 | bfloat | 16 | SM 80 |
| --- | 0xB | DF32_ | _Float32 | float | 32 | all |
| --- | 0xC | DF64_ | _Float64 | double | 64 | all |
| --- | 0xD | DF128_ | _Float128 | fp128 | 128 | emulated |
The bf16 mangling has a three-way ABI gate controlled by qword_4F077B4 (low 32 = use_new_bf16_mangling, high 32 = bf16_abi_version) and qword_4F06A78 (secondary selector). Old ABI emits u6__bf16 (Itanium vendor-extended); C++23 ABI emits DF16b (P1467 standard). The __nv_bool type (EDG printer case 0x02, bit 4 of +162) is a CUDA-specific boolean that emits "__nv_bool" when sub_5D76E0 (CUDA mode check) returns true, or "_Bool" / "bool" otherwise.
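The mangling gate reduces to a flag test. This is a simplified two-way sketch (the binary's gate also consults the secondary selector qword_4F06A78, which is omitted here); the function name and the struct of the check are ours:

```rust
// Minimal sketch of the bf16 mangling selection described above: the low 32
// bits of the gate qword (use_new_bf16_mangling) choose between the Itanium
// vendor-extended form and the C++23 P1467 form.
fn bf16_mangling(gate_qword: u64) -> &'static str {
    let use_new_mangling = (gate_qword & 0xFFFF_FFFF) != 0; // low 32 bits
    if use_new_mangling {
        "DF16b" // C++23 (P1467) standard mangling
    } else {
        "u6__bf16" // Itanium vendor-extended mangling
    }
}

fn main() {
    assert_eq!(bf16_mangling(0), "u6__bf16");
    assert_eq!(bf16_mangling(1), "DF16b");
    println!("old ABI: {}, new ABI: {}", bf16_mangling(0), bf16_mangling(1));
}
```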
Two additional NVIDIA-specific types have dedicated mangling:
| EDG Type Code | Mangling | C++ Type | Purpose |
|---|---|---|---|
| 17 | u11__SVCount_t | __SVCount_t | ARM SVE predicate count |
| 18 | u6__mfp8 | __mfp8 | 8-bit minifloat (FP8 E4M3/E5M2 base) |
On the LLVM side, the __mfp8 type maps to i8 storage with metadata annotations indicating the floating-point interpretation.
CUDA FP8/FP6/FP4 Extended Type Keywords
CUDA 12.x+ introduces narrow floating-point types for transformer inference and tensor core operations. The EDG parser (sub_691320) recognizes these as token values 236 and 339--354, all resolved through sub_6911B0 (CUDA type-token resolver):
| Token | Keyword | Format | Width | Packed Variant | SM Requirement |
|---|---|---|---|---|---|
| 236 | __nv_fp8_e4m3 | E4M3 (4-bit exponent, 3-bit mantissa) | 8 | --- | SM 89 |
| 339 | __nv_fp8_e5m2 | E5M2 (5-bit exponent, 2-bit mantissa) | 8 | --- | SM 89 |
| 340 | __nv_fp8x2_e4m3 | E4M3 packed pair | 16 | 2 elements | SM 89 |
| 341 | __nv_fp8x2_e5m2 | E5M2 packed pair | 16 | 2 elements | SM 89 |
| 342 | __nv_fp8x4_e4m3 | E4M3 packed quad | 32 | 4 elements | SM 89 |
| 343 | __nv_fp8x4_e5m2 | E5M2 packed quad | 32 | 4 elements | SM 89 |
| 344 | __nv_fp6_e2m3 | E2M3 (2-bit exponent, 3-bit mantissa) | 6 | --- | SM 100 |
| 345 | __nv_fp6_e3m2 | E3M2 (3-bit exponent, 2-bit mantissa) | 6 | --- | SM 100 |
| 346 | __nv_fp6x2_e2m3 | E2M3 packed pair | 12 | 2 elements | SM 100 |
| 347 | __nv_fp6x2_e3m2 | E3M2 packed pair | 12 | 2 elements | SM 100 |
| 348 | __nv_mxfp8_e4m3 | MX-format E4M3 | 8 | --- | SM 100 |
| 349 | __nv_mxfp8_e5m2 | MX-format E5M2 | 8 | --- | SM 100 |
| 350 | __nv_mxfp6_e2m3 | MX-format E2M3 | 6 | --- | SM 100 |
| 351 | __nv_mxfp6_e3m2 | MX-format E3M2 | 6 | --- | SM 100 |
| 352 | __nv_mxfp4_e2m1 | MX-format E2M1 (FP4) | 4 | --- | SM 100 |
| 353 | __nv_satfinite | Saturation-to-finite modifier | --- | --- | SM 89 |
| 354 | __nv_e8m0 | E8M0 exponent-only scale format | 8 | --- | SM 100 |
The resolver sub_6911B0 follows the field_140 == 12 (qualified/elaborated type) chain to find the base type node, then sets v325 = 20 (typename). At the LLVM level, these narrow types are lowered to integer storage types (i8, i16, i32) with type metadata or intrinsic-based interpretation. The cvt_packfloat intrinsic family handles conversion to and from these formats with explicit format specifiers:
| cvt_packfloat Case | PTX Suffix | Format |
|---|---|---|
| 2 | .e4m3x2 | FP8 E4M3 pair |
| 3 | .e5m2x2 | FP8 E5M2 pair |
| 4 | .bf16x2 | BFloat16 pair |
| 5 | .e2m1x2 | FP4 E2M1 pair (SM 100+) |
| 6 | .e2m3x2 | FP6 E2M3 pair (SM 100+) |
| 7 | .e3m2x2 | FP6 E3M2 pair (SM 100+) |
| 8 | .ue8m0x2 | UE8M0 scale pair (SM 100+) |
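The case-to-suffix dispatch above is a straight table lookup; a checkable transcription (the function name is ours):

```rust
// The cvt_packfloat case -> PTX suffix mapping from the table above.
fn pack_suffix(case_id: u32) -> Option<&'static str> {
    Some(match case_id {
        2 => ".e4m3x2",  // FP8 E4M3 pair
        3 => ".e5m2x2",  // FP8 E5M2 pair
        4 => ".bf16x2",  // BFloat16 pair
        5 => ".e2m1x2",  // FP4 E2M1 pair (SM 100+)
        6 => ".e2m3x2",  // FP6 E2M3 pair (SM 100+)
        7 => ".e3m2x2",  // FP6 E3M2 pair (SM 100+)
        8 => ".ue8m0x2", // UE8M0 scale pair (SM 100+)
        _ => return None,
    })
}

fn main() {
    assert_eq!(pack_suffix(2), Some(".e4m3x2"));
    assert_eq!(pack_suffix(8), Some(".ue8m0x2"));
    assert_eq!(pack_suffix(1), None);
    println!("cvt_packfloat suffix table covered");
}
```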
Address Space Annotations on Types
CUDA memory-space qualifiers propagate through the EDG type system via a 15-bit qualifier word at edg_node+18. The low 15 bits encode a qualifier ID; bit 15 is a negation flag. The qualifier word is the single mechanism through which __device__, __shared__, __constant__, and __managed__ semantics reach the LLVM type system.
EDG qualifier word to LLVM address space mapping (performed by sub_5FFE90):
| Qualifier Word (node+18 & 0x7FFF) | LLVM Address Space | CUDA Source | Notes |
|---|---|---|---|
| 0 | 0 | (default/generic) | Unqualified pointers |
| 1 | 1 | __device__ / global | Explicit global annotation |
| 9 | 0 (with flag check via sub_5F3280) | (generic variant) | Conditional on context |
| 14 | --- | __host__ / method qualifier | Not an address space --- function qualifier |
| 26 | --- | (array subscript context A) | Internal, not an address space |
| 27 | --- | (array subscript context B) | Internal, not an address space |
| 32 | 3 | __shared__ | Per-block shared memory |
| 33 | 4 | __constant__ | Read-only constant memory |
The function sub_5A3140 creates the appropriately address-space-qualified LLVM pointer type given the qualifier output from sub_5FFE90. The helper sub_911CB0 combines address space information with the type kind to produce a unique scope-table index: it computes (type_kind - 24) as a base and combines it with the qualifier to produce a monotonic key.
EDG frontend encoding (from sub_691320 parser, tokens 133--136, and sub_667B60):
| Parser Token | CUDA Keyword | v305 Value | EDG memory_space_code | Target AS |
|---|---|---|---|---|
| 133 | __shared__ | 4 | 2 | 3 |
| 134 | __device__ | 5 | 1 | 1 |
| 135 | __constant__ | 6 | 3 | 4 |
| 136 | __managed__ | 7 | (special) | 0 + "managed" annotation |
| 273 | __global__ (addr-space attr) | --- | 0 | 0 |
| 274 | __shared__ (addr-space attr) | --- | 2 | 3 |
| 275 | __constant__ (addr-space attr) | --- | 3 | 4 |
| 276 | __generic__ (addr-space attr) | --- | (parsed) | (parsed) |
Address-space propagation through types is transitive: if struct S contains a __shared__ int* field, the shared qualifier flows through the pointer type and is preserved in the LLVM ptr addrspace(3) type of that field. The type-pair comparator sub_911D10 achieves this by pushing child pairs onto its worklist whenever a pointer/reference type (kinds 75/76) carries a non-zero qualifier. The qualifier-word masks 1, 14, 32, and 33 are the four values that trigger this child propagation.
For a full cross-reference of all 10 address spaces (including AS 5 local, AS 6 tensor memory, AS 7 shared cluster, AS 25 internal device, AS 53 MemorySpaceOpt annotation, AS 101 param), see Address Spaces.
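The mapping table above can be expressed as a small checkable function. The name is ours; qualifier 9 (conditional on context via sub_5F3280) is folded into the None arm for simplicity:

```rust
// The EDG qualifier-word to LLVM address-space mapping from the table above.
// Qualifiers 14, 26, and 27 are internal markers, not address spaces; the
// context-dependent qualifier 9 is omitted from this sketch.
fn qualifier_to_addrspace(qual_word: u16) -> Option<u32> {
    match qual_word & 0x7FFF { // bit 15 is the negation flag
        0 => Some(0),  // generic (default)
        1 => Some(1),  // __device__ / global
        32 => Some(3), // __shared__
        33 => Some(4), // __constant__
        _ => None,     // 14 (method qualifier), 26/27 (array contexts), 9, ...
    }
}

fn main() {
    assert_eq!(qualifier_to_addrspace(32), Some(3));
    assert_eq!(qualifier_to_addrspace(33), Some(4));
    assert_eq!(qualifier_to_addrspace(14), None);
    // Bit 15 (negation) is masked off before the lookup.
    assert_eq!(qualifier_to_addrspace(0x8000 | 32), Some(3));
    println!("qualifier mapping verified");
}
```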
Vector Type Handling
NVPTX has a highly constrained vector type model. Only four vector types are legal --- all packed into 32-bit Int32HalfRegs (%hh prefix in PTX):
| Legal Vector Type | LLVM MVT | PTX Register Class | PTX Suffix | SM Minimum |
|---|---|---|---|---|
| v2f16 | v2f16 | Int32HalfRegs | .f16x2 | SM 70 (arith), SM 53 (ld/st) |
| v2bf16 | v2bf16 | Int32HalfRegs | .bf16x2 | SM 80 |
| v2i16 | v2i16 | Int32HalfRegs | .s16x2 | SM 70 |
| v4i8 | v4i8 | Int32HalfRegs | (packed bytes) | SM 70 |
All wider vector types are illegal and undergo recursive split/scalarize during type legalization. The split depth for common CUDA vector types:
| CUDA Type | LLVM Type | Split Chain | Final Form |
|---|---|---|---|
| float4 | v4f32 | v4f32 -> 2x v2f32 -> 4x f32 | 4 scalar float ops |
| float2 | v2f32 | v2f32 -> 2x f32 | 2 scalar float ops |
| int4 | v4i32 | v4i32 -> 2x v2i32 -> 4x i32 | 4 scalar i32 ops |
| double2 | v2f64 | v2f64 -> 2x f64 | 2 scalar double ops |
| half2 | v2f16 | legal (no split) | single .f16x2 packed op |
| __nv_bfloat162 | v2bf16 | legal (no split, SM 80+) | single .bf16x2 packed op |
| short2 | v2i16 | legal (no split) | single .s16x2 packed op |
| char4 / uchar4 | v4i8 | legal (no split) | single packed-byte op |
| half (4 elements) | v4f16 | v4f16 -> 2x v2f16 | 2 packed .f16x2 ops |
| half (8 elements) | v8f16 | v8f16 -> v4f16 -> 2x v2f16 | 4 packed .f16x2 ops |
The critical architectural insight: v2f32 is NOT legal on NVPTX (no 64-bit packed float register class exists), so float4 always fully scalarizes to four independent f32 operations. In contrast, half2 stays packed throughout the pipeline, delivering 2x throughput via add.f16x2, mul.f16x2, and fma.rn.f16x2 PTX instructions.
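The split chains in the table follow one rule: halve until the type is legal or scalar. A sketch under the legal set above (function and type-name spellings are ours, not the legalizer's):

```rust
// Sketch of recursive split/scalarize type legalization: the four legal
// NVPTX vector types stay packed; everything else halves until legal or
// fully scalar.
fn legalize(lanes: u32, elem: &'static str, out: &mut Vec<String>) {
    let legal = matches!(
        (lanes, elem),
        (2, "f16") | (2, "bf16") | (2, "i16") | (4, "i8")
    );
    if lanes == 1 {
        out.push(elem.to_string()); // fully scalarized
    } else if legal {
        out.push(format!("v{lanes}{elem}")); // stays packed
    } else {
        // illegal vector: split in half and recurse on both halves
        legalize(lanes / 2, elem, out);
        legalize(lanes / 2, elem, out);
    }
}

fn main() {
    let mut float4 = Vec::new();
    legalize(4, "f32", &mut float4); // v4f32 -> 2x v2f32 -> 4x f32
    assert_eq!(float4, vec!["f32"; 4]);

    let mut half8 = Vec::new();
    legalize(8, "f16", &mut half8); // v8f16 -> v4f16 -> 2x v2f16
    assert_eq!(half8, vec!["v2f16"; 4]);
    println!("float4 scalarizes, half8 stays packed as .f16x2 ops");
}
```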
SM-version gating affects which types are legal at which pipeline stage:
- SM < 53: No legal vector types; v2f16 must be scalarized, and scalar f16 is promoted to f32.
- SM 53--69: Scalar f16 is legal; v2f16 is legal for load/store, but packed arithmetic may be Custom or Expand.
- SM 70+: v2f16 fully legal with packed arithmetic. i128 scalar register class added.
- SM 80+: v2bf16 added as a legal vector type.
- SM 100+: Additional packed FP types for cvt_packfloat --- e2m1x2, e2m3x2, e3m2x2, ue8m0x2.
Tensor core matrix fragments bypass vector legalization entirely. WMMA and WGMMA intrinsics represent matrix data as individual scalar registers or {f16, f16, ...} struct aggregates, not as LLVM vector types. See MMA Codegen for the tensor-core lowering path.
Cast Codegen Type Tags
The cast emission function sub_128A450 uses a distinct type-tag namespace at *(type+8). This tag drives all cast instruction selection and must be clearly distinguished from the EDG type-kind byte at edg_node+16:
| Tag | LLVM Type | Cast Behavior |
|---|---|---|
| 1 | half (f16) | Float family; float-to-float casts use fpext/fptrunc |
| 2 | float (f32) | Float family |
| 3 | double (f64) | Float family |
| 4 | x86_fp80 | Float family (not used on GPU) |
| 5 | fp128 | Float family; triggers standard LLVM cast path (no __nv_*_rz intrinsic) |
| 6 | bfloat (bf16) | Float family |
| 11 | iN (integer) | Integer family; width at *(type+8) >> 8 |
| 15 | ptr | Pointer family |
| 16 | <N x elem> (vector) | Vector/aggregate; address-space extraction via sub_16463B0 |
Integer-to-float conversions (tags 11 -> 1..6) default to sitofp/uitofp but can route through NVIDIA-specific __nv_*_rz round-to-zero intrinsics when unk_4D04630 is clear. These intrinsics (__nv_float2int_rz, __nv_double2ll_rz, etc.) are emitted as plain function calls and later pattern-matched by the PTX backend to cvt.rz.* instructions. The fp128 path always uses standard LLVM casts because 128-bit floating point is emulated via FP128/I128 library calls.
SelectionDAG SimpleVT Encoding
After IR generation, types enter the SelectionDAG type system where they are encoded as single-byte SimpleVT values for the legality table lookup at NVPTXTargetLowering + 2422:
| SimpleVT | LLVM Type | Bitwidth |
|---|---|---|
| 0 | extended/custom | computed via sub_1F58D40 |
| 1 | i1 | 1 |
| 2 | i2 | 2 |
| 3 | i8 | 8 |
| 4 | i16 | 16 |
| 5 | i32 | 32 |
| 6 | i64 | 64 |
| 7 | i128 | 128 |
| 8 | f16 / bf16 | 16 |
| 9 | f32 | 32 |
| 10 | f64 | 64 |
| 14--55 | fixed-width vector types | vector of above |
| 56--109 | scalable vector types | scalable vector of above |
The bitwidth-to-SimpleVT conversion pattern appears 11 times in the 348KB DAGTypeLegalizer::run monolith (sub_20019C0), and the vector-to-scalar-element switch table (cases 14--109 mapping back to scalar VT 2--10) appears 6 times. This redundancy is an artifact of the monolithic inlining --- upstream LLVM factors these into per-category files (LegalizeIntegerTypes.cpp, LegalizeFloatTypes.cpp, etc.).
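The repeated scalar-width conversion pattern amounts to the following lookup (function name is ours; the table is transcribed from above):

```rust
// The scalar integer bitwidth -> SimpleVT mapping from the table above,
// mirroring the conversion pattern inlined 11 times in the legalizer.
fn int_bitwidth_to_simplevt(bits: u32) -> Option<u8> {
    Some(match bits {
        1 => 1,   // i1
        2 => 2,   // i2
        8 => 3,   // i8
        16 => 4,  // i16
        32 => 5,  // i32
        64 => 6,  // i64
        128 => 7, // i128
        _ => return None, // odd widths fall back to extended VT (0)
    })
}

fn main() {
    assert_eq!(int_bitwidth_to_simplevt(32), Some(5));
    assert_eq!(int_bitwidth_to_simplevt(128), Some(7));
    assert_eq!(int_bitwidth_to_simplevt(24), None);
    println!("SimpleVT mapping verified");
}
```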
Global Variable Code Generation
Module-Level Driver
Global variable codegen is driven by sub_915990 (~2,700 bytes), which iterates all EDG IL global declarations and categorizes them into sorted sets:
- Regular device globals
__constant__globals__shared__globals__managed__globals- Texture references
- Surface references
- Grid constants
After categorization, a topological sort (using the same sub_3FEBB0/sub_3FED60 graph primitives as the type translator) determines the order in which globals must be materialized. If global A's initializer references global B, then B must be code-generated first. The transitive dependency discovery is performed by sub_914960, a BFS that walks EDG IL linkage chains, filtering nodes with kind byte in range [25..34] (variable, function, and template declarations).
Address Space Determination
The function sub_916430 (482 bytes) examines EDG IL node attribute bytes to determine the NVPTX address space for a global variable:
fn determine_address_space(edg_node: &EDGNode) -> u32 {
let storage_class = edg_node[0x88];
let flags_9c = edg_node[0x9C];
let flags_b0 = edg_node[0xB0];
let flags_ae = edg_node[0xAE];
let flags_a8 = edg_node[0xA8] as u64;
// __constant__: storage class 2
if storage_class == 2 {
return 4; // constant address space
}
// __shared__: bit 7 of flags_9c
if flags_9c & 0x80 != 0 {
if flags_ae & 1 != 0 {
return 3; // extern __shared__
}
if flags_b0 & 0x20 != 0 {
return 5; // local memory (stack-local shared variant)
}
return 3; // __shared__
}
// Bit 6 of flags_9c: device-side memory
if flags_9c & 0x40 != 0 {
if edg_node[0xF0] != 0 {
return 3; // template-instantiated shared variable
}
return 0; // generic device
}
// Extended attribute flags
if flags_a8 & 0x2000100000 != 0 {
return 3; // shared-like semantics
}
if storage_class > 2 {
emit_diagnostic("unsupported storage class!");
}
return 0; // default: generic device memory
}
NVPTX Address Space Assignment
See Address Spaces for the complete master table mapping LLVM AS numbers to PTX qualifiers, hardware, and pointer widths.
In the IR generation context: address space 0 (generic) is the default for __device__ variables. Address space 1 (global) appears in pointer types when the global qualifier is explicit in the type annotation (as opposed to being inferred from the variable declaration). __managed__ variables use address space 0 (same as regular device globals) but receive a "managed" annotation in nvvm.annotations that the runtime uses to set up Unified Virtual Memory mappings.
GlobalVariable Object Creation
The function sub_915C40 (2,018 bytes) materializes an LLVM GlobalVariable:
- Hash table lookup: Checks whether the EDG node has already been materialized. The table at ctx+0x178..0x190 maps EDG node pointers to GlobalVariable*. If found with a different type, calls GlobalVariable::mutateType to reconcile.
- Allocation: Allocates 88 bytes (0x58) via operator new, then calls the GlobalVariable constructor with module, type, isConstant flag, linkage, initializer (null for declarations), name, and address space.
- Alignment: Computes alignment via sub_91CB50 (a DataLayout wrapper), then converts to log2 via BSR (bit-scan-reverse) for LLVM's MaybeAlign representation. Alignment is always explicitly set, even for naturally-aligned types.
- Initializer: If edg_node[0xB0] & 0x20 is set and the variable is not extern (edg_node[0x88] != 1), calls sub_916690 to generate the initializer IR. The initializer handler dispatches on a variant byte: variant 0/3 for constant expressions, variant 1/2 for aggregate initializers.
- __managed__ annotation: If edg_node[0x9D] & 1 is set, emits ("managed", 1) to the annotation list via sub_913680.
- Texture/surface detection: If the mode flag at ctx+0x168 has bit 0 set, calls sub_91C2A0 (isTextureType) and sub_91C2D0 (isSurfaceType). Matching variables get "texture" or "surface" annotations and are inserted into a red-black tree at ctx+0x200 for ordered tracking during annotation emission.
- Registration: The new GlobalVariable* is stored into the hash table for future lookups.
Finalization: Metadata and @llvm.used
After all globals are materialized, sub_915400 calls four finalization functions in sequence:
sub_9151E0 --- emit nvvmir.version: Creates a named metadata node "nvvmir.version" containing version operands as ConstantInt values wrapped in ConstantAsMetadata. When debug info is present (ctx+0x170 non-null), the tuple has 4 operands including address-space-qualified indices; otherwise 2 operands.
sub_914410 --- emit nvvm.annotations: Iterates the annotation list at ctx+0x1B0..0x1B8 and creates MDTuple entries under the named metadata "nvvm.annotations". Each annotation record produces a {GlobalValue*, MDString-key, ConstantInt-value} triple. Three annotation categories receive special batching: "grid_constant", "preserve_n_data", and "preserve_reg_abi" --- these are collected into compound MDTuples rather than emitting one per parameter, reducing metadata size in kernels with many annotated parameters.
sub_90A560 --- emit @llvm.used: Builds the @llvm.used global array that prevents LLVM from dead-stripping texture references, surface references, and managed variables. The function iterates the registered global triples at ctx+0x198..0x1A0 (24-byte records, hence the 0xAAAAAAAAAAAAAAAB magic divisor for dividing by 3), bitcasts each GlobalValue* to i8*, constructs a ConstantArray of type [N x i8*], and creates a global with name "llvm.used", appending linkage, and section "llvm.metadata".
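The 0xAAAAAAAAAAAAAAAB constant is the standard compiler idiom for exact division by 3; a short demonstration of why it works (the helper name is ours):

```rust
// 0xAAAAAAAAAAAAAAAB is the multiplicative inverse of 3 modulo 2^64, so
// for any n that is an exact multiple of 3, the wrapping product equals
// n / 3. Compilers emit this (plus a shift for the power-of-two factor of
// the 24-byte stride) instead of a divide instruction.
const INV3: u64 = 0xAAAA_AAAA_AAAA_AAAB;

fn record_count(byte_span: u64) -> u64 {
    // 24-byte records: divide by 8 with a shift, then by 3 via the inverse.
    (byte_span >> 3).wrapping_mul(INV3)
}

fn main() {
    // The defining property: 3 * INV3 == 1 (mod 2^64).
    assert_eq!(3u64.wrapping_mul(INV3), 1);
    assert_eq!(record_count(7 * 24), 7);
    assert_eq!(record_count(0), 0);
    println!("magic divide-by-3 verified");
}
```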
Conditional: If debug info is present, emits a "Debug Info Version" module flag with value 3 via Module::addModuleFlag. If enabled, also emits "llvm.ident" metadata identifying the compiler.
Kernel Metadata
Annotation Emitter (sub_93AE30)
After a kernel's function body has been code-generated, sub_93AE30 translates EDG-level kernel attributes (__launch_bounds__, __cluster_dims__) into LLVM named metadata under "nvvm.annotations". The function signature:
void emitKernelAnnotationMetadata(
NVVMContext *ctx, // ctx->module at offset +344
FuncDecl *funcDecl, // EDG function declaration, params at +16, count at +8
LaunchAttr *launch, // __launch_bounds__/cluster attrs, NULL if none
MDNodeVec *out // output vector of metadata nodes
);
Parameter Metadata
For each function parameter (stride 40 bytes, iterated from funcDecl+16):
- Visibility check: If launch attributes exist and bit 0x20 of launch+198 is clear, or param+32 != 0, emits opcode 22 (hidden/implicit parameter). If dword_4D04628 is set and the launch bit is set, calls sub_8D2E30 to check for special types and emits opcode 40.
- Type dispatch:
  - Type 1 (pointer): Checks sub_91B6F0 for read-only image/sampler (opcode 54) and sub_91B730 for surface reference (opcode 79).
  - Type 2 (value): Computes alignment metadata via sub_91A390, then log2 via BSR, and emits a packed (log2, hasValue) pair. Checks for alignment attribute tag 92 via sub_A74D20.
- MDNode creation: sub_A7B020(module, paramIndex, &attrAccum) creates the MDNode for each parameter.
Cluster Metadata
Triggered when launch is non-null and *(launch+328) points to a valid cluster config. The cluster config struct:
| Offset | Field | Used As |
|---|---|---|
| +20 | [5] | reqntid.x (cluster) |
| +24 | [6] | reqntid.y (cluster) |
| +28 | [7] | reqntid.z (cluster) |
| +40 | [10] | cluster_dim.z (also presence flag: > 0 triggers emission) |
| +44 | [11] | cluster_dim.y |
| +48 | [12] | cluster_dim.x |
When cluster_config[10] > 0, three metadata entries are emitted in order:
- nvvm.blocksareclusters --- boolean flag, no value string. Emitted unconditionally.
- nvvm.reqntid --- the three cluster dimension fields [12], [11], [10] are converted to decimal strings and concatenated with commas: "{x},{y},{z}". Uses SSO std::string objects with a two-digit lookup table ("00", "01", ..., "99") for fast integer-to-string conversion. A 0x3FFFFFFFFFFFFFFF sentinel triggers a fatal "basic_string::append" error on overflow.
- nvvm.cluster_dim --- the three fields [7], [6], [5] are similarly concatenated.
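The two-digit lookup-table conversion is a standard itoa optimization; a sketch assuming that layout (the function names are ours, and the table is built at startup rather than baked into the binary):

```rust
// Two-digit lookup-table integer-to-string conversion, as used for the
// "x,y,z" metadata strings: each division step peels off two decimal
// digits via one table lookup instead of two modulo operations.
fn two_digit_table() -> Vec<u8> {
    // "000102...9899" -- 100 two-character entries
    (0..100u8).flat_map(|n| [b'0' + n / 10, b'0' + n % 10]).collect()
}

fn u32_to_decimal(mut n: u32, table: &[u8]) -> String {
    let mut out = Vec::new();
    while n >= 100 {
        let r = (n % 100) as usize;
        out.push(table[2 * r + 1]); // low digit first; reversed at the end
        out.push(table[2 * r]);
        n /= 100;
    }
    if n >= 10 {
        out.push(table[2 * n as usize + 1]);
        out.push(table[2 * n as usize]);
    } else {
        out.push(b'0' + n as u8);
    }
    out.reverse();
    String::from_utf8(out).unwrap()
}

fn main() {
    let t = two_digit_table();
    let s = format!(
        "{},{},{}",
        u32_to_decimal(256, &t),
        u32_to_decimal(1, &t),
        u32_to_decimal(1, &t)
    );
    assert_eq!(s, "256,1,1"); // the nvvm.reqntid string from the example
    println!("{s}");
}
```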
Function-Level Metadata Node
After all per-parameter and cluster metadata is accumulated, if the accumulator is non-empty, sub_A7B020(module, 0xFFFFFFFF, &attrAccum) creates a function-level MDNode with parameter index -1 (sentinel). This node carries all function-level annotations combined.
Annotation Reader (sub_A84F90)
The inverse of the emitter. Reads "nvvm.annotations" named metadata from an LLVM Module and populates internal structures. For each {function_ref, key_string, value} operand tuple, the key is matched via raw integer comparisons (not strcmp):
| Key String | Match Method | Handler |
|---|---|---|
| "kernel" | 6-byte i32+i16 compare | sub_CE8040: set/clear nvvm.kernel flag |
| "maxntidx/y/z" | 7-byte prefix + suffix char | sub_A7C1C0 with "nvvm.maxntid" |
| "reqntidx/y/z" | 7-byte prefix + suffix char | sub_A7C1C0 with "nvvm.reqntid" |
| "cluster_dimx/y/z" | 12-byte qword+i32 + suffix | sub_A7C1C0 with "nvvm.cluster_dim" |
| "maxnreg" | 7-byte qword + byte 'g' | sub_B2CD60 with "nvvm.maxnreg" |
| "minctasm" | 8-byte single qword compare | sub_B2CD60 with "nvvm.minctasm" |
| "maxclusterrank" | 14-byte multi-width compare | sub_B2CD60 with "nvvm.maxclusterrank" |
| "cluster_max_blocks" | 18 bytes | Same handler as maxclusterrank |
| "align" | 5 bytes | sub_B2CCF0: BSR-based log2 alignment |
The raw integer comparison technique avoids strcmp overhead by loading the key bytes as i32/i64 values and comparing in a single instruction. For example, "kernel" is checked as two loads: *(uint32_t*)key == 0x6E72656B and *(uint16_t*)(key+4) == 0x6C65.
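The "kernel" check can be reproduced exactly from the constants given above (the safe-Rust slice form stands in for the binary's raw pointer loads; the function name is ours):

```rust
// Raw-integer key comparison: load "kernel" as a 4-byte word plus a 2-byte
// word and compare against little-endian constants, instead of strcmp.
fn is_kernel_key(key: &[u8]) -> bool {
    key.len() == 6
        && u32::from_le_bytes([key[0], key[1], key[2], key[3]]) == 0x6E72656B // "kern"
        && u16::from_le_bytes([key[4], key[5]]) == 0x6C65 // "el"
}

fn main() {
    assert!(is_kernel_key(b"kernel"));
    assert!(!is_kernel_key(b"kernub"));
    assert!(!is_kernel_key(b"maxnreg"));
    println!("raw-integer compare matches \"kernel\"");
}
```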
Complete Metadata String Catalog
Module-level named metadata:
| Key | Purpose |
|---|---|
| nvvm.annotations | Container for all kernel and global annotations |
| nvvm.annotations_transplanted | Flag: annotations already migrated to function-level |
| nvvm.reflection | Compile-time reflection constants |
| nvvmir.version | NVVM IR version (2 or 4 operands) |
| llvm.used | Array preventing dead-stripping of annotated globals |
| llvm.ident | Compiler identification string |
Function-level metadata keys:
| Key | Value Format | Source |
|---|---|---|
| nvvm.kernel | (boolean presence) | __global__ qualifier or calling convention 0x47 |
| nvvm.maxntid | "x,y,z" | __launch_bounds__(maxThreads) |
| nvvm.reqntid | "x,y,z" | __launch_bounds__ or cluster config |
| nvvm.maxnreg | decimal string | __launch_bounds__(..., ..., maxRegs) |
| nvvm.minctasm | decimal string | __launch_bounds__(..., minCTAs) |
| nvvm.maxclusterrank | decimal string | SM >= 90 cluster rank limit |
| nvvm.blocksareclusters | (boolean presence) | __cluster_dims__ present |
| nvvm.cluster_dim | "x,y,z" | __cluster_dims__(x,y,z) |
Global variable annotations (emitted as {GlobalValue*, MDString, i32} triples in nvvm.annotations):
| Annotation | Value | Trigger |
|---|---|---|
"managed" | 1 | __managed__ qualifier |
"texture" | 1 | Texture reference type detected |
"surface" | 1 | Surface reference type detected |
"grid_constant" | (batched) | __grid_constant__ parameter attribute |
"preserve_n_data" | (batched) | NVIDIA-internal preservation hint |
"preserve_reg_abi" | (batched) | NVIDIA-internal register ABI hint |
Metadata Accessor Functions
The backend reads metadata through typed accessor functions in the 0xCE7xxx--0xCE9xxx range:
| Address | Reconstructed Name | Returns |
|---|---|---|
| sub_CE9220 | isKernel(func) | true if calling convention == 0x47 OR nvvm.kernel present |
| sub_CE8D40 | getMaxNtid(out, func) | Parses "nvvm.maxntid" as (x,y,z) triple |
| sub_CE8DF0 | getReqNtid(out, func) | Parses "nvvm.reqntid" as (x,y,z) triple |
| sub_CE8EA0 | getClusterDim(out, func) | Parses "nvvm.cluster_dim" as (x,y,z) triple |
| sub_CE9030 | getMaxClusterRank(func) | Checks "cluster_max_blocks" then "nvvm.maxclusterrank" |
| sub_CE90E0 | getMinCtaSM(func) | Checks "minctasm" then "nvvm.minctasm" |
| sub_CE9180 | getMaxNReg(func) | Checks "maxnreg" then "nvvm.maxnreg" |
Each accessor first checks the function-level metadata (post-transplant), then falls back to the raw nvvm.annotations tuples (pre-transplant). The isKernel check is especially important: it recognizes kernels either by calling convention 0x47 or by the nvvm.kernel metadata presence, ensuring compatibility with both the EDG frontend path and bitcode loaded through LibNVVM.
Metadata Lifecycle
The complete flow from CUDA source to PTX directives:
CUDA: __global__ void kern() __launch_bounds__(256, 2) __cluster_dims__(2, 1, 1)
EDG: LaunchAttr { cluster_config[12]=256, [11]=1, [10]=1, [7]=1, [6]=1, [5]=2 }
sub_93AE30:
-> nvvm.blocksareclusters (presence flag)
-> nvvm.reqntid = "256,1,1"
-> nvvm.cluster_dim = "2,1,1"
-> function-level MDNode (index -1)
sub_A84F90: reads back on bitcode load
Backend accessors (CE8xxx): typed access
PTX emitter (sub_3022E70):
.blocksareclusters
.reqntid 256, 1, 1
.reqnctapercluster 2, 1, 1
Special Variables: threadIdx, blockIdx, blockDim, gridDim, warpSize
Recognition Pipeline
CUDA built-in variables (threadIdx, blockIdx, blockDim, gridDim, warpSize) are not stored in memory --- they map directly to PTX special registers accessed via LLVM intrinsics. Two parallel codegen paths exist: an older one in the 0x920xxx range and a newer one in the 0x1285xxx range. Both share the same logic structure.
The classifier function isSpecialRegisterVar (sub_920430 / sub_127F7A0) checks five preconditions before recognizing a variable:
- Inside kernel: (ctx->flags_at_360 & 1) != 0 --- only valid in __global__ function context.
- Not extern: (sym->byte_89 & 1) == 0.
- Not template-dependent: *(signed char*)(sym+169) >= 0.
- Element count == 1: sym->elem_count_at_136 == 1.
- Name non-null: sym->name_at_8 != NULL.
If all five pass, the name is compared via strcmp against the five known strings to determine the output category:
| Category | Name | Type |
|---|---|---|
| 0 | threadIdx | dim3 (3-component struct) |
| 1 | blockDim | dim3 |
| 2 | blockIdx | dim3 |
| 3 | gridDim | dim3 |
| 4 | warpSize | scalar int |
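A minimal sketch of that final name comparison, assuming the five preconditions have already passed (the category numbering follows the table above):

```c
#include <string.h>

/* Sketch of the isSpecialRegisterVar category lookup. The real classifier
 * also tests kernel context, extern linkage, template dependence, and
 * element count before reaching this strcmp stage. */
int special_var_category(const char *name)
{
    static const char *names[5] = {
        "threadIdx", "blockDim", "blockIdx", "gridDim", "warpSize"
    };
    for (int i = 0; i < 5; i++)
        if (name && strcmp(name, names[i]) == 0)
            return i;        /* category 0-4 as in the table above */
    return -1;               /* not a special register variable */
}
```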
Intrinsic ID Table
A static 2D array int intrinsicIDs[5][3] maps (category, component) to LLVM intrinsic IDs:
| CUDA Variable | .x | .y | .z |
|---|---|---|---|
| threadIdx | @llvm.nvvm.read.ptx.sreg.tid.x | .tid.y | .tid.z |
| blockDim | @llvm.nvvm.read.ptx.sreg.ntid.x | .ntid.y | .ntid.z |
| blockIdx | @llvm.nvvm.read.ptx.sreg.ctaid.x | .ctaid.y | .ctaid.z |
| gridDim | @llvm.nvvm.read.ptx.sreg.nctaid.x | .nctaid.y | .nctaid.z |
| warpSize | @llvm.nvvm.read.ptx.sreg.warpsize | --- | --- |
Each intrinsic is a zero-argument call returning i32. The old codegen path uses intrinsic ID 9374 for warpSize; the new path uses 4348.
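The table lookup can be sketched with intrinsic names standing in for the numeric IDs (which, as noted, differ between the two codegen paths):

```c
#include <stddef.h>

/* Sketch of the (category, component) -> intrinsic lookup. Strings stand
 * in for the int intrinsicIDs[5][3] table in the binary. */
static const char *sreg_intrinsic[5][3] = {
    { "llvm.nvvm.read.ptx.sreg.tid.x",
      "llvm.nvvm.read.ptx.sreg.tid.y",
      "llvm.nvvm.read.ptx.sreg.tid.z" },
    { "llvm.nvvm.read.ptx.sreg.ntid.x",
      "llvm.nvvm.read.ptx.sreg.ntid.y",
      "llvm.nvvm.read.ptx.sreg.ntid.z" },
    { "llvm.nvvm.read.ptx.sreg.ctaid.x",
      "llvm.nvvm.read.ptx.sreg.ctaid.y",
      "llvm.nvvm.read.ptx.sreg.ctaid.z" },
    { "llvm.nvvm.read.ptx.sreg.nctaid.x",
      "llvm.nvvm.read.ptx.sreg.nctaid.y",
      "llvm.nvvm.read.ptx.sreg.nctaid.z" },
    { "llvm.nvvm.read.ptx.sreg.warpsize", NULL, NULL },
};

const char *lookup_sreg(int category, int component)
{
    if (category < 0 || category > 4 || component < 0 || component > 2)
        return NULL;
    return sreg_intrinsic[category][component]; /* NULL for warpSize .y/.z */
}
```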
dim3 Member Access Codegen
Two functions handle the code generation, depending on whether the access is a full dim3 struct or a single component:
Full struct access (sub_922290 / sub_1285550): For threadIdx as a whole (all three components), loops 3 times:
for (component = 0; component < 3; component++) {
intrinsicID = intrinsicIDs[category][component];
decl = Module::getOrInsertIntrinsic(intrinsicID);
callInst = CallInst::Create(decl); // zero-arg, returns i32
// Insert into struct via InsertValue
}
The three call results are composed into the struct type via CreateInsertValue. The IR value is named "predef_tmp".
Single component access (sub_9268C0 / sub_1286E40): For threadIdx.x specifically, the member name's first character is extracted from member_symbol+56+8:
- 'x' (0x78), with null terminator '\0' at the next byte -> component 0
- 'y' (0x79) -> component 1
- 'z' (0x7A) -> component 2
The null-terminator check prevents false matches on member names like "xy". A single intrinsic call is emitted, named "predef_tmp_comp":
%predef_tmp_comp = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
Both paths compute alignment from the return type's bit-width via BSR and handle sign extension: if the type tag byte at +140 satisfies (tag & 0xFB) == 8 (signed int), the result is marked as signed.
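The first-character-plus-null-terminator check can be sketched as follows (the member_symbol+56+8 field offsets are omitted; a plain string stands in for the symbol):

```c
/* Sketch of the dim3 member classifier: the first character selects the
 * component, and the next byte must be NUL to reject names like "xy". */
int dim3_component(const char *member_name)
{
    if (!member_name || member_name[0] == '\0' || member_name[1] != '\0')
        return -1;            /* empty or longer than one character */
    switch (member_name[0]) {
    case 'x': return 0;       /* 0x78 */
    case 'y': return 1;       /* 0x79 */
    case 'z': return 2;       /* 0x7A */
    default:  return -1;
    }
}
```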
PTX Backend Mapping
The NVPTX backend (sub_21E86B0) maps internal register encodings (single-byte case labels using ASCII character codes) to PTX special register names:
| Code | ASCII | PTX Register |
|---|---|---|
| 0x26 | & | %tid.x |
| 0x27 | ' | %tid.y |
| 0x28 | ( | %tid.z |
| 0x29 | ) | %ntid.x |
| 0x2A | * | %ntid.y |
| 0x2B | + | %ntid.z |
| 0x2C | , | %ctaid.x |
| 0x2D | - | %ctaid.y |
| 0x2E | . | %ctaid.z |
| 0x2F | / | %nctaid.x |
| 0x30 | 0 | %nctaid.y |
| 0x31 | 1 | %nctaid.z |
Codes 0x5E (^) and 0x5F (_) are delegated to sub_3958DA0 for cluster and warp-level registers. Any unhandled code triggers a fatal "Unhandled special register" error. Register names are written via optimized memcpy of 6--9 bytes directly to the output stream.
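A sketch of this case-label dispatch; the delegated (0x5E/0x5F) and error paths are collapsed to a NULL return here, whereas the real emitter calls sub_3958DA0 or raises the fatal error:

```c
#include <stddef.h>

/* Sketch of the single-byte code -> PTX special register mapping from
 * the table above. Codes form a dense run, so an array indexed by
 * (code - 0x26) is equivalent to the switch in sub_21E86B0. */
const char *ptx_sreg_name(unsigned char code)
{
    static const char *names[] = {
        "%tid.x",    "%tid.y",    "%tid.z",     /* 0x26-0x28 */
        "%ntid.x",   "%ntid.y",   "%ntid.z",    /* 0x29-0x2B */
        "%ctaid.x",  "%ctaid.y",  "%ctaid.z",   /* 0x2C-0x2E */
        "%nctaid.x", "%nctaid.y", "%nctaid.z",  /* 0x2F-0x31 */
    };
    if (code >= 0x26 && code <= 0x31)
        return names[code - 0x26];
    return NULL;  /* 0x5E/0x5F: cluster/warp handler; otherwise fatal */
}
```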
ISel Lowering
The instruction selector (sub_36E4040) validates that the intrinsic declaration returns i32 (type code 7 at offset +48 of the overload descriptor). If the type does not match, it emits a fatal error: "Unsupported overloaded declaration of llvm.nvvm.read.sreg intrinsic". It then creates a MachineSDNode with NVPTX target opcode 3457.
EDG Frontend Diagnostic
The EDG frontend includes a diagnostic at sub_6A49A0 that detects writes to predefined read-only variables. When a store target matches any of the five built-in names, it emits diagnostic 0xDD0:
error: cannot assign to variable 'threadIdx' with predefined meaning in CUDA
This diagnostic fires during semantic analysis, long before IR generation. It ensures that CUDA programs cannot accidentally (or intentionally) write to hardware register proxies.
Libdevice Linking
NVIDIA embeds a complete copy of the libdevice math library -- 455,876 bytes of LLVM bitcode -- directly inside the cicc binary. This library provides GPU-optimized implementations of ~350 mathematical intrinsics (trigonometric, exponential, rounding, Bessel functions, error functions, type conversions, and integer utilities) that are linked into every CUDA compilation during the LNK pipeline stage. The linker (sub_12C06E0, 63KB) validates bitcode magic bytes, enforces the nvptx64- target triple prefix, checks NVVM IR version metadata for cross-release compatibility, and performs symbol-size matching across all modules before producing a single merged module. Two identical copies of the embedded bitcode exist in the binary -- one for each compilation path -- ensuring the library is always available without filesystem access.
Upstream LLVM has no equivalent of this embedded-library mechanism. Clang relies on external libdevice.10.bc files discovered through --cuda-path at driver level. NVIDIA's approach eliminates the file-lookup step entirely, making cicc self-contained: the entire math library ships inside the compiler binary itself.
| Embedded size | 455,876 bytes (445 KB) per copy |
| Copies in binary | 2: unk_3EA0080 (Path A), unk_420FD80 (Path B) |
| Function count | 352 defined (349 __nv_* public + 3 __internal_* helper) |
| __nvvm_reflect calls | 2,016 (architecture/precision dispatch) |
| Target triple | nvptx64-nvidia-gpulibs |
| NVVM IR version | !nvvmir.version = !{i32 2, i32 0} (always-compatible sentinel) |
| Attribute group | #0 = { alwaysinline nounwind } on all public functions |
| Module linker | sub_12C06E0 (63KB, 2,154 lines) |
| Version checker | sub_12BFF60 (9KB, 362 lines) |
| Pipeline stage | LNK (first stage, before OPT) |
| Override | -nvvmir-library <path> CLI flag substitutes an external file |
| Version bypass | NVVM_IR_VER_CHK=0 disables IR version validation |
Embedded Bitcode Layout
The cicc binary contains two byte-identical copies of the libdevice bitcode at different virtual addresses. Each compilation path uses its own copy, avoiding any shared-state coordination between Path A (nvcc-invoked) and Path B (standalone/LibNVVM):
Binary offset Path Referenced by Size
─────────────────────────────────────────────────────────────
unk_3EA0080 A sub_905EE0 (43KB) 455,876 bytes
unk_420FD80 B sub_1265970 (48KB) 455,876 bytes
Both copies contain identical LLVM bitcode with:
- Data layout: e-i64:64-v16:16-v32:32-n16:32:64
- Target triple: nvptx64-nvidia-gpulibs (note: gpulibs, not cuda)
- Producer: clang version 3.8.0 (tags/RELEASE_380/final) -- the bitcode was originally compiled with an ancient Clang but has been maintained through bitcode format upgrades across CUDA toolkit releases
- Version metadata: !nvvmir.version = !{i32 2, i32 0} -- this specific version tuple (2, 0) is hard-coded in the version checker as an always-compatible sentinel
The duplication exists because the two compilation paths (sub_905EE0 for Path A, sub_1265970 for Path B) are entirely independent code paths with no shared module state. Deduplicating the data would require introducing a shared pointer, which NVIDIA apparently considered not worth the ~445KB savings in a 60MB binary.
Loading the Embedded Bitcode
In both paths, the embedded bitcode is passed to sub_12BCB00 (the nvvmCUAddModuleFromBuffer API wrapper) with a hardcoded size constant:
// Path A (sub_905EE0, line ~167):
v19 = sub_12BCB00(compilation_unit, &unk_3EA0080, 455876, 0);
// Path B (sub_1265970, line ~448):
v19 = sub_12BCB00(compilation_unit, &unk_420FD80, 455876, 0);
When the -nvvmir-library <path> flag is provided, the corresponding path opens the file, reads its contents into memory, and passes that buffer to sub_12BCB00 instead of the embedded pointer. This override is used primarily for testing custom libdevice builds.
Libdevice Function Inventory
The library defines 352 functions across 10 categories. All 349 public functions carry alwaysinline nounwind attributes, meaning they will be unconditionally inlined during the OPT stage after linking. Three internal helper functions (__internal_trig_reduction_slowpathd, __internal_accurate_pow, __internal_lgamma_pos) use noinline nounwind to avoid code size explosion in their callers.
| Category | Count | Examples |
|---|---|---|
| Type conversions | 75 | __nv_float2int_rn, __nv_double2ull_rz, __nv_int2float_rd, __nv_half2float |
| Rounded arithmetic | 74 | __nv_fmaf_rn, __nv_fdiv_rz, __nv_dsqrt_rd, __nv_dadd_ru, __nv_fmul_rn |
| Trigonometric | 34 | __nv_sinf, __nv_cos, __nv_tanf, __nv_asinf, __nv_atan2, __nv_sincospi |
| Special functions | 30 | __nv_erff, __nv_lgamma, __nv_j0, __nv_y1, __nv_cyl_bessel_i0, __nv_normcdf |
| Roots and norms | 28 | __nv_sqrtf, __nv_rsqrt, __nv_cbrt, __nv_hypot, __nv_norm3d, __nv_rnorm4d |
| Exponential/logarithmic | 28 | __nv_expf, __nv_log2, __nv_exp10, __nv_log1p, __nv_ldexp, __nv_frexp |
| Integer utilities | 27 | __nv_clz, __nv_popc, __nv_brev, __nv_mulhi, __nv_abs, __nv_byte_perm |
| Float utilities | 20 | __nv_fabsf, __nv_fminf, __nv_copysign, __nv_fmod, __nv_nextafter, __nv_nan |
| Rounding | 14 | __nv_floorf, __nv_ceil, __nv_truncf, __nv_roundf, __nv_nearbyintf, __nv_rint |
| Classification | 11 | __nv_isinff, __nv_isnand, __nv_isfinited, __nv_signbitf, __nv_ilogb, __nv_logb |
| Internal helpers | 3 | __internal_trig_reduction_slowpathd, __internal_accurate_pow, __internal_lgamma_pos |
Every public function body contains calls to @__nvvm_reflect with query strings (__CUDA_FTZ, __CUDA_ARCH, __CUDA_PREC_SQRT) that are resolved by the NVVMReflect pass during optimization. This is how the same bitcode adapts to different precision modes and SM architectures -- see NVVMReflect for details on the reflection mechanism. The 2,016 reflect calls across 352 functions mean an average of ~5.7 architecture/precision branch points per function.
Struct Types
The bitcode defines five aggregate types used by multi-return functions:
%struct.uint2 = type { i32, i32 }
%struct.float2 = type { float, float }
%struct.trig_reduction_return = type { double, i32 }
%struct.ulonglong2 = type { i64, i64 }
%struct.double2 = type { double, double }
trig_reduction_return is used by the internal trigonometric range reduction helper. The float2/double2 types appear in sincos/sincospi which return both sine and cosine through output pointers.
Constant Tables
The bitcode contains precomputed coefficient tables in address space 1 (global memory):
| Global | Type | Purpose |
|---|---|---|
@__cudart_i2opi_f | [6 x i32] | Float-precision inverse-of-pi table for trig reduction |
@__cudart_i2opi_d | [18 x i64] | Double-precision inverse-of-pi table for trig reduction |
@__cudart_sin_cos_coeffs | [16 x double] | Chebyshev coefficients for sin/cos polynomial approximation |
Module Linker Algorithm
sub_12C06E0 (63KB) is the central module linker that operates during the LNK pipeline stage. It receives a list of user modules and a list of builtin modules (which includes libdevice), validates them, and produces a single merged LLVM module. The algorithm proceeds in seven phases:
Phase A: Module Iteration and Bitcode Validation
For each module in the input list (from a1[0] to a1[1], stepping by 4 qwords per entry), the linker:
- Opens and reads the module data via sub_16C2450
- Validates LLVM bitcode magic bytes -- accepts two formats:
  - Raw bitcode: bytes 0xDE 0xC0 0x17 0x0B (little-endian 0x0B17C0DE)
  - Bitcode wrapper: bytes 0x42 0x43 0xC0 0xDE (ASCII "BC" prefix)
- Determines the buffer name (falls back to "Unknown buffer" if the vtable function is sub_12BCB10)
- Parses bitcode into an LLVM Module via sub_15099C0
for each entry in modules[a1[0] .. a1[1]]:
buffer = open_and_read(entry.data, entry.size, entry.name)
magic = read_4_bytes(buffer)
if magic != 0x0B17C0DE and magic != 0xDEC04342:
*error_code = 9 // invalid bitcode
return NULL
name = (entry.vtable_func == sub_12BCB10)
? "Unknown buffer"
: entry.vtable_func(entry)
module = parse_bitcode(buffer, llvm_ctx, name)
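The magic-byte test in the pseudocode above can be written concretely; note that a little-endian 4-byte read of the on-disk bytes 0xDE 0xC0 0x17 0x0B yields 0x0B17C0DE:

```c
#include <stdint.h>
#include <string.h>

/* Sketch of the Phase A magic-byte check: raw bitcode or the "BC\xC0\xDE"
 * wrapper are accepted; anything else is rejected (error code 9 in cicc). */
int valid_bitcode_magic(const unsigned char *buf, size_t len)
{
    if (len < 4)
        return 0;
    uint32_t magic;
    memcpy(&magic, buf, 4);          /* little-endian read, as on x86-64 */
    return magic == 0x0B17C0DEu      /* raw: DE C0 17 0B on disk */
        || magic == 0xDEC04342u;     /* wrapper: 42 43 C0 DE ("BC" prefix) */
}
```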
Phase B: Triple Validation
After parsing all modules, the linker enforces that every module's target triple starts with nvptx64-. The comparison uses a prefix match against the global string at off_4CD49B0:
for each parsed_module:
triple = get_triple(parsed_module) // offset +240
if triple.length == 0:
error: "Module does not contain a triple, should be 'nvptx64-'"
*error_code = 9
else if !starts_with(triple, "nvptx64-"):
error: "<module_name>: Module does not contain a triple, should be 'nvptx64-'"
*error_code = 9
The libdevice bitcode has triple nvptx64-nvidia-gpulibs, which passes this prefix check. User modules typically have nvptx64-nvidia-cuda.
Phase C: IR Version Check
For each module, the linker calls sub_12BFF60 (the version checker -- see next section). If the check fails, the linker emits a diagnostic and returns error code 3:
for each parsed_module:
result = NVVMIRVersionCheck(modules, parsed_module, flags)
if result != 0:
error: "<name>: error: incompatible IR detected. "
"Possible mix of compiler/IR from different releases."
*error_code = 3
return NULL
Phase D: Single-Module Fast Path
When only one module exists (no linking needed), the linker returns it directly via sub_1C3DFC0 without invoking any linking machinery. This fast path avoids the overhead of LLVM's Linker::linkModules for the common case of a single translation unit without libdevice.
Phase E: Multi-Module User Linking
For N > 1 user modules, the linker:
- Selects one module as the "primary" (index v57)
- Copies the primary module's triple and data layout to all secondary modules (ensuring consistency)
- Calls sub_12F5610 -- NVIDIA's wrapper around LLVM's Linker::linkModules -- to merge all user modules into a single module
if module_count > 1:
primary = modules[v57]
for each secondary in modules where index != v57:
set_triple(secondary, get_triple(primary))
set_data_layout(secondary, get_data_layout(primary))
result = LinkModules(&modules, linking_state, &error_str, &warnings, options)
if result != 0:
error: "<module_name>: link error: <details>"
*error_code = 9
Phase F: Builtin Linking
After user modules are merged, the linker processes builtin modules from a1[3] to a1[4] (this is where libdevice lives). Each builtin module goes through the same bitcode validation and parsing as user modules, then is linked into the main module using sub_1CCEBE0 -- a different linking function than the user-module linker, likely Linker::linkModules with Linker::OverrideFromSrc flags for builtin definitions:
for each builtin in modules[a1[3] .. a1[4]]:
validate_and_parse(builtin)
set_triple(builtin, get_triple(main_module))
result = LinkBuiltinModule(main_module, builtin, &error_string)
if result != 0:
error: "builtins: link error: <details>"
// continues -- does not abort on builtin link failure
post_link_cleanup(main_module, target_features)
The post-link cleanup sequence (sub_1611EE0 through sub_160FE50) configures target features on the merged module and finalizes symbol resolution.
Phase G: Symbol Size Matching
The final validation phase walks every global symbol in the linked module and checks that declarations and definitions agree on type sizes. The linker maintains a binary search tree keyed by symbol name and computes type sizes using a recursive size calculator:
| Type code | Type | Size formula |
|---|---|---|
| 1 | half | 16 bits |
| 2 | float | 32 bits |
| 3, 9 | double, i64 | 64 bits |
| 4 | fp80 | 80 bits |
| 5, 6 | fp128 | 128 bits |
| 7 | pointer | 8 * pointer_size |
| 0xB | integer | bits >> 8 |
| 0xD | struct | sum of member sizes |
| 0xE | array | alignment * count * ceil(element_bits / (8 * alignment)) |
| 0xF | named type | resolved recursively |
| 0x10 | vector | element_size * count |
for each global_symbol in linked_module:
name = get_name(global_symbol)
if name in size_tree:
existing_size = size_tree[name].size
new_size = compute_type_size(global_symbol.type)
if existing_size != new_size:
error: "Size does not match for <name> in <module_A> "
"with size X specified in <module_B> with size Y."
size_mismatch = true
else:
size_tree.insert(name, compute_type_size(global_symbol.type))
if size_mismatch:
*error_code = 9
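A simplified sketch of the recursive size calculator from the table above (the tagged-union layout is invented; the fp80/fp128 cases are direct, while the array and named-type cases, plus alignment handling, are elided):

```c
#include <stdint.h>

/* Hypothetical type node: code follows the table above; for integers the
 * bit width is stored shifted left by 8, matching the "bits >> 8" rule. */
typedef struct Type Type;
struct Type {
    int          code;      /* 1=half, 2=float, 3=double, 0xB=integer, ... */
    uint64_t     bits;      /* integer width << 8 */
    const Type **members;   /* struct members, or element type for vectors */
    uint64_t     count;     /* member count / vector lane count */
};

uint64_t type_size_bits(const Type *t)
{
    switch (t->code) {
    case 1:  return 16;                     /* half */
    case 2:  return 32;                     /* float */
    case 3: case 9: return 64;              /* double, i64 */
    case 4:  return 80;                     /* fp80 */
    case 5: case 6: return 128;             /* fp128 */
    case 0xB: return t->bits >> 8;          /* integer */
    case 0xD: {                             /* struct: sum of member sizes */
        uint64_t sum = 0;
        for (uint64_t i = 0; i < t->count; i++)
            sum += type_size_bits(t->members[i]);
        return sum;
    }
    case 0x10:                              /* vector: element_size * count */
        return type_size_bits(t->members[0]) * t->count;
    default: return 0;                      /* array/named cases elided */
    }
}
```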
Triple and Version Validation
NVVM IR Version Checker (sub_12BFF60)
The version checker validates the nvvmir.version metadata node that every NVVM-produced bitcode module carries. It ensures that modules compiled by different CUDA toolkit versions are not accidentally mixed.
Metadata lookup: The checker searches for two named metadata nodes:
"nvvmir.version"-- the IR version tuple"llvm.dbg.cu"-- debug compile unit (presence indicates debug info exists)
Both are looked up via sub_1632310 (named metadata search on the module).
Version tuple format: The metadata node contains either 2 or 4 constant integer operands:
| Format | Operands | Meaning |
|---|---|---|
| 2-element | {major, minor} | IR version only |
| 4-element | {major, minor, dbg_major, dbg_minor} | IR version + debug IR version |
Compatibility check: For the IR version, sub_12BDA30 performs the actual comparison. The special case (major=2, minor=0) always passes -- this is exactly the version carried by the embedded libdevice, ensuring it is compatible with any user module regardless of toolkit version.
For the debug version, sub_12BD890 checks compatibility with a similar special case: (debug_major=3, debug_minor<=2) always passes.
Unique node deduplication: The checker builds a hash set of unique metadata nodes using the standard DenseMap infrastructure with NVVM-layer sentinels (-8 / -16). See Hash Table and Collection Infrastructure for the hash function and probing strategy. This deduplication handles the case where multiple source files within a compilation unit carry identical version metadata -- each unique version is checked exactly once.
Final gate: if debug info is present in the module (setting the debug mode flag) but no debug version was validated (because the metadata lacked elements 2-3), the checker returns 3 (incompatible). This catches the case where a debug-compiled user module is linked against a non-debug library that lacks debug version metadata.
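The two compatibility predicates can be sketched as follows. The sentinel cases ({2,0} and {3, <=2}) are documented above; the non-sentinel comparison shown here (same major, minor no newer than the current toolkit's) is an assumption, not confirmed from the binary:

```c
/* Sketch of sub_12BDA30-style IR version compatibility. The {2,0} sentinel
 * is the tuple carried by the embedded libdevice. cur_major/cur_minor stand
 * in for whatever version this toolkit emits. */
int ir_version_compatible(int major, int minor, int cur_major, int cur_minor)
{
    if (major == 2 && minor == 0)
        return 1;                    /* always-compatible sentinel */
    return major == cur_major && minor <= cur_minor;  /* assumed rule */
}

/* Sketch of the sub_12BD890-style debug IR version check. */
int dbg_version_compatible(int major, int minor, int cur_major, int cur_minor)
{
    if (major == 3 && minor <= 2)
        return 1;                    /* debug sentinel */
    return major == cur_major && minor <= cur_minor;  /* assumed rule */
}
```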
Symbol Resolution During LNK
The LNK stage processes libdevice functions through LLVM's standard symbol resolution mechanism. Because all 349 public libdevice functions carry the alwaysinline attribute, the resolution and inlining follow a specific sequence:
1. Declaration matching: User code that calls __nv_sinf(x) contains an external declaration declare float @__nv_sinf(float). The linker resolves this declaration against the define float @__nv_sinf(float) in libdevice.
2. __nvvm_reflect remains unresolved: After linking, libdevice function bodies contain calls to @__nvvm_reflect which are still unresolved declarations. These are handled during the OPT stage by the NVVMReflect pass, not during linking.
3. Dead function elimination: Functions from libdevice that are never called by user code are eliminated by GlobalDCE during the OPT stage. Since libdevice provides 352 functions but a typical kernel uses only a handful, the vast majority are stripped.
4. alwaysinline enforcement: During the OPT stage, the AlwaysInliner pass processes all libdevice functions. After inlining, the original function bodies become dead (no remaining callers) and are removed by subsequent DCE.
The net effect: a kernel calling __nv_sinf ends up with the sinf implementation inlined directly into the kernel body, with __nvvm_reflect calls already resolved to constants by NVVMReflect, and all unused branches from precision/architecture dispatch eliminated by SimplifyCFG.
Constant Folding Interaction
The constant folding engine (sub_14D90D0, 27KB) has special knowledge of libdevice functions. When a libdevice intrinsic is called with constant arguments, the fold eligibility checker determines whether the call can be evaluated at compile time -- before the libdevice function is inlined.
This creates an important ordering constraint:
LNK stage: link libdevice → user module now has __nv_sinf definitions
OPT stage: NVVMReflect → resolve __CUDA_FTZ, __CUDA_ARCH queries
ConstantFold → fold __nv_sinf(0.0) → 0.0 (if eligible)
AlwaysInline → inline remaining __nv_sinf calls
SimplifyCFG → remove dead reflect branches
GlobalDCE → remove unused libdevice functions
The fold eligibility checker (sub_14D90D0) uses three dispatch mechanisms to identify foldable functions:
LLVM intrinsic ID switch (IDs 0-211): Covers standard LLVM intrinsics like llvm.sin, llvm.cos, llvm.sqrt, llvm.fma, llvm.floor, llvm.ceil, llvm.exp, llvm.log, llvm.pow, llvm.fabs, llvm.bswap, llvm.ctlz, llvm.ctpop, and overflow arithmetic.
NVVM intrinsic ID ranges (IDs > 211): Covers NVIDIA-specific intrinsics organized as binary-search ranges with bitmask dispatch:
| Range | IDs | Examples |
|---|---|---|
| 0xEB4-0xEE3 | 3764-3811 | nvvm.ceil.f, nvvm.ctlz.i, nvvm.cos.approx.ftz.f |
| 0xF1E-0xF72 | 3870-3954 | nvvm.exp2.approx, nvvm.fabs.f, nvvm.floor.f, nvvm.sqrt.f |
| 0xFE8-0xFEA | 4072-4074 | nvvm.sin.approx.ftz.f and similar |
| 0x1012-0x104C | 4114-4172 | nvvm.max.i, nvvm.min.ui, nvvm.min.ll |
| 0x1086-0x1087 | 4230-4231 | nvvm.mul.hi.* |
| 0x117B-0x1184 | 4475-4484 | nvvm.sqrt.rn.d, nvvm.sqrt.approx.ftz.f |
| 0x1C80-0x1CAC | 7296-7340 | nvvm.fmax.f, nvvm.fmin.ftz.nan.f |
Name-based matching (ID = 0): When the call target is not a recognized LLVM or NVVM intrinsic, the checker falls back to string matching on the function name. It dispatches on the first character, then uses DWORD integer comparisons for 4-byte names and memcmp for longer names:
Foldable C library names:
sin, sinf, cos, cosf, tan, tanf, acos, acosf, asin, asinf,
atan, atanf, atan2, atan2f, ceil, ceilf, cosh, coshf,
exp, expf, exp2, exp2f, fabs, fabsf, floor, floorf,
fmod, fmodf, log, logf, log10, log10f, pow, powf,
round, roundf, sinh, sinhf, sqrt, sqrtf, tanh, tanhf
Convergent gate: Before any folding, the checker verifies that the callee does not carry the convergent attribute (kind 0x34). Convergent functions have warp-synchronous semantics and must not be speculatively constant-folded, even if all arguments are constants.
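A sketch of the name-based fallback, with a truncated name list; plain strcmp stands in for the first-character dispatch plus DWORD/memcmp comparison in the binary, which is an optimization rather than a semantic difference:

```c
#include <string.h>

/* Subset of the foldable C library names listed above. */
static const char *foldable_names[] = {
    "sin", "sinf", "cos", "cosf", "sqrt", "sqrtf", "pow", "powf",
    "exp", "expf", "log", "logf", "floor", "floorf", "fabs", "fabsf",
};

/* Sketch of the ID == 0 path: no recognized LLVM/NVVM intrinsic ID, so
 * fall back to matching the callee name. The first-character test mirrors
 * the dispatch structure; strcmp does the rest. */
int is_foldable_libm_name(const char *name)
{
    for (size_t i = 0; i < sizeof foldable_names / sizeof *foldable_names; i++) {
        const char *cand = foldable_names[i];
        if (name[0] == cand[0] && strcmp(name, cand) == 0)
            return 1;
    }
    return 0;
}
```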
Configuration
Environment Variables
| Variable | Effect |
|---|---|
| NVVM_IR_VER_CHK | Set to "0" to disable IR version validation. Any other value or unset = enabled (default). Checked in sub_12BFF60 at 0x12BFF60 and in the duplicate verifier at 0x2259720. |
CLI Flags
| Flag | Effect |
|---|---|
| -nvvmir-library <path> | Override the embedded libdevice with an external bitcode file. The file is opened, read into memory, and passed to the linker in place of the embedded unk_3EA0080/unk_420FD80 pointer. |
| -opt / -llc | When passed as the first extra argument, skips builtin linking entirely (jumps past the libdevice linking code to direct pipeline stage invocation). |
| -keep | Preserves the .lnk.bc intermediate file showing the linked module (user + libdevice) before optimization. |
Intermediate Files
When -keep is active, the LNK stage serializes its output to a .lnk.bc file alongside the input:
input.cu → input.lnk.bc (linked: user + libdevice)
→ input.opt.bc (optimized: after OPT stage)
→ input.ptx (final: after LLC stage)
The .lnk.bc file is useful for verifying which libdevice functions survived linking and how __nvvm_reflect calls appear before the NVVMReflect pass resolves them.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| ModuleLinker | sub_12C06E0 | 63KB | Main bitcode linker: validates magic, triple, version; links user modules, then builtins |
| NVVMIRVersionCheck | sub_12BFF60 | 9KB | Reads nvvmir.version metadata, checks compatibility via sub_12BDA30/sub_12BD890 |
| CheckIRVersion | sub_12BDA30 | ~2KB | IR version compatibility predicate (special-cases {2,0} as always-compatible) |
| CheckDebugVersion | sub_12BD890 | ~2KB | Debug IR version compatibility predicate (special-cases {3, <=2}) |
| PipelineOrchestrator | sub_12C35D0 | 41KB | 4-stage pipeline driver; calls sub_12C06E0 during LNK stage |
| LibNVVMPipelineA | sub_905EE0 | 43KB | Path A pipeline driver; references unk_3EA0080 for embedded libdevice |
| LibNVVMPipelineB | sub_1265970 | 48KB | Path B pipeline driver; references unk_420FD80 for embedded libdevice |
| nvvmCUAddModuleFromBuffer | sub_12BCB00 | ~1KB | API wrapper that adds a bitcode buffer to the compilation unit |
| LibNVVM API dispatch | sub_12BC0F0 | 3KB | Resolves LibNVVM API function pointers by hash ID |
| ParseBitcodeFile | sub_15099C0 | ~8KB | LLVM bitcode parser entry point |
| LinkBuiltinModule | sub_1CCEBE0 | ~4KB | Links a single builtin module into the main module (Linker::linkModules with OverrideFromSrc [MEDIUM confidence] -- inferred from the override-from-source semantics of builtin linking and the 4KB size matching a thin wrapper around LLVM's linker API, but no diagnostic string confirms the exact LLVM API call) |
| LinkUserModules | sub_12F5610 | ~4KB | Links multiple user modules (Linker::linkModules [MEDIUM confidence] -- same reasoning as above; wrapper size and call pattern match, but unconfirmed by string evidence) |
| CanFoldIntrinsic | sub_14D90D0 | 27KB | Constant-fold eligibility checker for math intrinsics |
| embedded libdevice (Path A) | unk_3EA0080 | 455,876B | Raw LLVM bitcode blob |
| embedded libdevice (Path B) | unk_420FD80 | 455,876B | Raw LLVM bitcode blob (identical copy) |
Reimplementation Checklist
- Embedded bitcode storage and loading. Embed the libdevice bitcode blob (455,876 bytes) directly in the compiler binary, provide two independent copies for dual-path compilation (Path A / Path B), and implement the nvvmCUAddModuleFromBuffer API wrapper to load the embedded blob or an external override file via -nvvmir-library.
- Bitcode magic validation. Accept two bitcode formats: raw bitcode (0xDE 0xC0 0x17 0x0B, little-endian 0x0B17C0DE) and bitcode wrapper (0x42 0x43 0xC0 0xDE, ASCII "BC" prefix). Reject anything else with error code 9.
- Target triple and IR version validation. Enforce the nvptx64- prefix on all module triples. Implement the NVVM IR version checker that reads nvvmir.version metadata (2-element or 4-element tuples), special-cases version {2,0} as always-compatible (the libdevice sentinel), and checks debug IR version compatibility for {3, <=2}.
- Multi-module linking pipeline. Implement the phased linker: (A) module iteration with bitcode validation, (B) triple validation, (C) IR version check, (D) single-module fast path, (E) multi-module user linking with primary module selection and triple/data-layout propagation, (F) builtin linking with OverrideFromSrc semantics.
- Symbol size matching. Walk all global symbols in the linked module, compute type sizes recursively (handling half/float/double/pointer/integer/struct/array/vector types), and verify that declarations and definitions agree on type sizes using a binary search tree keyed by symbol name.
- Constant folding integration. Implement the fold eligibility checker for libdevice functions with three dispatch mechanisms (LLVM intrinsic ID switch for IDs 0-211, NVVM intrinsic ID ranges for IDs >211, name-based matching for C library names), gated by the convergent attribute check to prevent folding warp-synchronous functions.
Cross-References
- Entry Point & CLI -- dual-path architecture, -nvvmir-library flag handling
- NVVMReflect -- resolution of __nvvm_reflect calls embedded in libdevice functions
- Optimizer Pipeline -- OPT stage where inlining and DCE process linked libdevice
- Environment Variables -- NVVM_IR_VER_CHK documentation
- Bitcode I/O -- bitcode reader/writer infrastructure used by the linker
LLVM Optimizer
NVIDIA's LLVM optimizer in cicc v13.0 is not a straightforward invocation of the upstream LLVM opt pipeline. Instead, it implements a proprietary two-phase compilation model where the same 49.8KB pipeline assembly function (sub_12E54A0) is called twice with different phase counters, allowing analysis passes to run in Phase I and codegen-oriented passes in Phase II. Individual passes read a TLS variable (qword_4FBB3B0) to determine which phase is active and skip themselves accordingly.
The optimizer also supports concurrent per-function compilation: after Phase I completes on the whole module, Phase II can be parallelized across functions using a thread pool sized to get_nprocs() or a GNU Jobserver token count. This is a significant departure from upstream LLVM, which processes functions sequentially within a single pass manager invocation.
The entire optimization behavior is controlled by the NVVMPassOptions system — a 4,512-byte struct with 221 option slots (114 string + 100 boolean + 6 integer + 1 string-pointer) that provides per-pass enable/disable toggles and parametric knobs. This system is completely proprietary and has no upstream equivalent.
Address range 0x12D0000–0x16FFFFF (~4.2 MB of code).
| Pipeline assembler | sub_12E54A0 (49.8KB, 1,553 lines, ~150 pass insertions) |
| Phase orchestrator | sub_12E7E70 (9.4KB, Phase I / Phase II) |
| Concurrent entry | sub_12E1EF0 (51.3KB, jobserver + split-module + thread pool) |
| PassOptions init | sub_12D6300 (125KB, 4,786 lines, 221 option slots) |
| New PM registration | sub_2342890 (2,816 lines, 35 NVIDIA + ~350 LLVM passes) |
| Target creation | sub_12EA530 (4.1KB, "nvptx" / "nvptx64") |
| AddPass | sub_12DE0B0 (3.5KB, hash-table-based pass insertion) |
| Tier 0 sub-pipeline | sub_12DE330 (4.8KB, ~40 passes) |
| Tier 1/2/3 sub-pipeline | sub_12DE8F0 (17.9KB, phase-conditional) |
| Codegen dispatch | sub_12DFE00 (20.7KB) |
| LTO pipeline | sub_12F5F30 (37.8KB, dead kernel elimination) |
| jemalloc | 5.3.x statically linked (~400 functions at 0x12FC000) |
Architecture
sub_12E1EF0 (51KB, concurrent compilation entry)
│
├─ GNU Jobserver init (sub_16832F0, --jobserver-auth=R,W from MAKEFLAGS)
├─ Bitcode reading + verification (sub_153BF40)
├─ Function sorting by priority (sub_12E0CA0)
├─ Thread pool creation (sub_16D4AB0, min(requested, num_functions) threads)
│
└─ sub_12E7E70 (9.4KB, two-phase orchestrator)
│
├─ Phase I: qword_4FBB3B0 = 1
│ └─ sub_12E54A0 (whole-module analysis + early optimization)
│
├─ Concurrency check: sub_12D4250 (>1 defined function?)
│ ├─ Yes, threads>1 → per-function Phase II via thread pool
│ │ └─ sub_12E86C0 per function (qword_4FBB3B0 = 2)
│ └─ No → sequential Phase II
│ └─ sub_12E54A0 (qword_4FBB3B0 = 2)
│
└─ qword_4FBB3B0 = 3 (done)
sub_12E54A0 (49.8KB, MASTER PIPELINE ASSEMBLY)
│
├─ Top branch: a4[4384] → Pipeline B (fast/codegen-only)
│ else → Pipeline A (normal LLVM)
│
├─ Target machine setup
│ ├─ Triple: "nvptx64" or "nvptx" (based on pointer size)
│ ├─ sub_16D3AC0 → TargetRegistry::lookupTarget()
│ ├─ TargetOptions: PIC=1, CodeModel=8, OptLevel=1, ThreadModel=1
│ └─ DataLayout from qword_4FBB430
│
├─ Phase 0: Infrastructure (TLI, TTI, Verifier, AssumptionCache, ProfileSummary)
├─ Phase 1: Language dispatch (a4[3648]: "ptx"/"mid"/default)
├─ Phase 2: Pre-optimization passes
├─ Phase 3: Main optimization loop (tier threshold dispatch)
├─ Phase 4: Post-opt language-specific pipelines
├─ Phase 5: Finalization (NVVMLowerBarriers, BreakCriticalEdges, codegen)
├─ Phase 6: Phase 2 codegen check (qword_4FBB3B0 == 2)
├─ Phase 7: PassManager::run
└─ Phase 8: Basic block naming ("F%d_B%d" for debug)
Two-Phase Compilation — sub_12E7E70
| Field | Value |
|---|---|
| Address | 0x12E7E70 |
| Size | 9.4KB |
| Strings | "Phase I", "Phase II", "Concurrent=Yes/No" |
The two-phase model exists because certain optimization passes (e.g., inter-procedural memory space propagation, global inlining decisions) require whole-module visibility, while others (register pressure-driven rematerialization, instruction scheduling) operate per-function and benefit from parallelization. Phase I runs the whole-module analysis and early optimization passes; Phase II runs the per-function backend-oriented passes.
Both phases call the same sub_12E54A0. The difference: qword_4FBB3B0 (TLS variable) is set to 1 or 2 before each call. Individual passes read this counter and skip themselves if the current phase doesn't match their intended execution phase. When the module contains only a single defined function, the phase mechanism is bypassed entirely — a single unphased call handles everything.
Phase State Machine:
START → [phase=1] → sub_12E54A0 (Phase I)
│
error? → RETURN
│
count_functions()
├─ 1 func → [phase=2] → sub_12E54A0 → [phase=3] → DONE
├─ N funcs, threads>1 → per-function Phase II (thread pool) → [phase=3] → DONE
└─ N funcs, threads≤1 → [phase=2] → sub_12E54A0 → [phase=3] → DONE
GNU Jobserver Integration
When cicc is invoked from a parallel make -jN build, it can participate in the GNU Jobserver protocol, limiting its own thread count to the available parallelism tokens. This prevents oversubscription — without it, a -j16 build could spawn 16 cicc processes, each creating its own thread pool, resulting in hundreds of threads competing for CPU time. cicc reads the --jobserver-auth=R,W pipe file descriptors from the MAKEFLAGS environment variable.
In sub_12E1EF0 (lines 833–866), when a4+3288 is set:
v184 = sub_16832F0(&state, 0);   // parse MAKEFLAGS for --jobserver-auth=R,W
if (v184 == 5 || v184 == 6)      // pipe issues
    warning("jobserver pipe problem");
else if (v184 != 0)
    fatal("GNU Jobserver support requested, but an error occurred");
sub_16832F0 allocates a 296-byte state structure, parses MAKEFLAGS, creates a pipe for token management, and spawns a pthread to manage tokens. This throttles concurrent per-function compilations to match the build's -j level.
Split-Module Compilation
Split-module compilation is NVIDIA's mechanism for the -split-compile=N flag. It decomposes a multi-function module into individual per-function bitcode blobs, compiles each independently (potentially in parallel), then re-links the results. This trades away inter-procedural optimization opportunities for compilation speed and reduced peak memory usage — a worthwhile tradeoff for large CUDA kernels during development iteration.
When optimization level (a4+4104) is negative, enters split-module mode:
- Each function's bitcode is extracted via `sub_1AB9F40` with filter callback `sub_12D4BD0`
- Module name: `"<split-module>"` (14 chars)
- After the thread pool completes, split modules are re-linked via `sub_12F5610`
- Linkage attributes are restored from a hash table (external linkage types: bits 0–5, dso_local: bit 6 of byte+33)
Pipeline Assembly — sub_12E54A0
The pipeline assembly function is the heart of the optimizer. At 49.8KB with ~150 AddPass calls, it constructs the complete LLVM pass pipeline at runtime rather than using a static pipeline description. The function first sets up target machine infrastructure (triple, data layout, subtarget features), then dispatches into one of three language-specific paths that determine which passes run and in what order. After the language-specific path completes, a shared finalization phase runs barriers, critical edge breaking, and codegen preparation.
A distinguishing feature of NVIDIA's pipeline is the tier system: passes are organized into Tiers 0–3, each gated by a threshold counter. As compilation progresses through the main loop (which iterates over external plugin/extension pass entries), tiers fire when the accumulated pass count exceeds their threshold. This allows NVIDIA to precisely control where in the pipeline their custom passes interleave with standard LLVM passes.
Language-Specific Paths
The pipeline branches based on a4[3648] (language string). The three paths represent different optimization strategies for different IR maturity levels:
| String | Path | Pass Count | Key Difference |
|---|---|---|---|
"ptx" | Path A | ~15 | Light: NVVMPeephole → LLVM standard → DCE → MemorySpaceOpt |
"mid" | Path B | ~45 | Full: SROA → GVN → LICM → LoopIndexSplit → Remat → all NVIDIA passes |
| (default) | Path C | ~40 | General: 4 LLVM standard passes + NVIDIA interleaving |
Tier System
The main loop iterates over entries at a4[4488] (16-byte stride: vtable + phase_id):
if (opt_enabled && phase_id > opt_threshold) → sub_12DE330 // Tier 0 (full)
if (tier1_flag && phase_id > tier1_threshold) → sub_12DE8F0(1) // Tier 1
if (tier2_flag && phase_id > tier2_threshold) → sub_12DE8F0(2) // Tier 2
if (tier3_flag && phase_id > tier3_threshold) → sub_12DE8F0(3) // Tier 3
Each tier fires once (flag cleared after execution). Remaining tiers fire unconditionally after the loop.
Tier 0 — Full Optimization (sub_12DE330)
Tier 0 is the most aggressive optimization sub-pipeline. It runs ~40 passes in a carefully ordered sequence that interleaves standard LLVM passes with NVIDIA-specific ones. The ordering reveals NVIDIA's optimization strategy: start with GVN and SCCP for value simplification, then run NVIDIA's custom NVVMReflect and NVVMVerifier to clean up NVVM-specific constructs, followed by aggressive loop transformations (LoopIndexSplit, LoopUnroll, LoopUnswitch), and finally register-pressure-sensitive passes (Rematerialization, DSE, DCE) to prepare for codegen.
~40 passes in order:
Confidence note: Pass identifications are based on diagnostic strings, factory signatures, and pipeline ordering. Most are HIGH confidence. Entries marked `[MEDIUM confidence]` are inferred from code structure rather than direct string evidence.
| # | Factory | Likely Pass | Guarded By |
|---|---|---|---|
| 1 | sub_1654860(1) | BreakCriticalEdges | — |
| 2 | sub_1A62BF0(1,...) | LLVM standard pipeline #1 | — |
| 3 | sub_1B26330 | MemCpyOpt | — |
| 4 | sub_185D600 | IPConstantPropagation | — |
| 5 | sub_1C6E800 | GVN | — |
| 6 | sub_1C6E560 | NewGVN/GVNHoist [MEDIUM confidence] | — |
| 7 | sub_1857160 | NVVMReflect | — |
| 8 | sub_1842BC0 | SCCP | — |
| 9 | sub_12D4560 | NVVMVerifier | — |
| 10 | sub_18A3090 | NVVMPredicateOpt | — |
| 11 | sub_184CD60 | ConstantMerge | — |
| 12 | sub_1869C50(1,0,1) | Sink/MemSSA [MEDIUM confidence] | !opts[1040] |
| 13 | sub_1833EB0(3) | TailCallElim/JumpThreading [MEDIUM confidence] | — |
| 14 | sub_1952F90(-1) | LoopIndexSplit | — |
| 15 | sub_1A62BF0(1,...) | LLVM standard pipeline #1 | — |
| 16 | sub_1A223D0 | NVVMIRVerification | — |
| 17 | sub_1A7A9F0 | InstructionSimplify | — |
| 18 | sub_1A62BF0(1,...) | LLVM standard pipeline #1 | — |
| 19 | sub_1A02540 | GenericToNVVM | — |
| 20 | sub_198DF00(-1) | LoopSimplify | — |
| 21 | sub_1C76260 | ADCE | !opts[1320] |
| 22 | sub_195E880(0) | LICM | opts[2880] |
| 23 | sub_19C1680(0,1) | LoopUnroll | !opts[1360] |
| 24 | sub_19401A0 | InstCombine | — |
| 25 | sub_1968390 | SROA | — |
| 26 | sub_196A2B0 | EarlyCSE | — |
| 27 | sub_19B73C0(2,...) | LoopUnswitch | — |
| 28 | sub_190BB10(0,0) | SimplifyCFG | — |
| 29 | sub_1A13320 | NVVMRematerialization | — |
| 30 | sub_18F5480 | DSE | — |
| 31 | sub_18DEFF0 | DCE | — |
| 32 | sub_1A62BF0(1,...) | LLVM standard pipeline #1 | — |
| 33 | sub_18B1DE0 | NVVMLoopPass [MEDIUM confidence] | — |
| 34 | sub_1841180 | FunctionAttrs | — |
"mid" Path — Complete Pass Ordering
The "mid" path is the primary optimization pipeline for standard CUDA compilation. At ~45 passes, it is the most comprehensive of the three paths. The key pattern is repeated interleaving of NVIDIA custom passes with standard LLVM passes: NVVMIntrinsicLowering runs 4 times at different points, NVVMReflect runs 3 times, and NVVMIRVerification runs after each major transformation to catch correctness regressions early. The MemorySpaceOpt pass appears once in this sequence (gated by !opts[1760]) — it runs again later via the parameterized <second-time> invocation in Tier 1/2/3.
ConstantMerge → NVVMIntrinsicLowering → MemCpyOpt → SROA → NVVMPeephole → NVVMAnnotations → LoopSimplify → GVN → NVVMIRVerification → SimplifyCFG → InstCombine → LLVM standard #5 → NVVMIntrinsicLowering → DeadArgElim → FunctionAttrs → DCE → ConstantMerge → LICM → NVVMLowerBarriers → MemorySpaceOpt → Reassociate → LLVM standard #8 → NVVMReflect → ADCE → InstructionSimplify → DeadArgElim → TailCallElim → DeadArgElim → CVP → Sink → SimplifyCFG → DSE → NVVMSinking2 → NVVMIRVerification → EarlyCSE → NVVMReflect → LLVM standard #8 → NVVMIntrinsicLowering → IPConstProp → LICM → NVVMIntrinsicLowering → NVVMBranchDist → NVVMRemat
NVVMPassOptions — sub_12D6300
NVVMPassOptions is NVIDIA's proprietary mechanism for fine-grained control over every optimization pass. Unlike LLVM's cl::opt system (which uses global command-line options), NVVMPassOptions stores per-pass configuration in a flat struct that is allocated once and passed through the pipeline by pointer. This design avoids the global-state problems of cl::opt and allows different compilation units to have different pass configurations within the same process — critical for the concurrent per-function compilation model.
The 125KB initialization function is the largest in the optimizer range. Its size comes from the sheer number of option slots: each of the 221 slots requires a hash-table lookup, a default-value resolution, and a type-specific store, with most slots organized in pairs (a string parameter + a boolean enable flag).
| Field | Value |
|---|---|
| Address | 0x12D6300 |
| Size | 125KB (4,786 lines) |
| Output struct | 4,512 bytes (allocated via sub_22077B0(4512)) |
| Slot count | 221 (indices 1–221) |
| Slot types | 114 string + 100 boolean + 6 integer + 1 string-pointer |
Struct Layout
| Region | Offset | Content |
|---|---|---|
| Header | 0–7 | int opt_level (from a2+112) |
| Registry ptr | 8–15 | Pointer to PassOptionRegistry |
| Slot pairs | 16–4479 | 221 option slots (string/bool/int pairs) |
| Sentinel | 4480–4511 | 4 qwords zeroed |
Option Slot Types
| Type | Size | Writer | Count |
|---|---|---|---|
| String | 24B | sub_12D6090 | 114 |
| Bool (compact) | 16B | sub_12D6100 | 83 |
| Bool (inline) | 16B | direct byte write | 17 |
| Integer | 16B | sub_16D2BB0 (parseInt) | 6 |
| String pointer | 28B | direct qword write (slot 181 only) | 1 |
Pair Organization
Slots are organized in pairs: even = string parameter (the pass's configuration value or name), odd = boolean enable/disable toggle (the do-X flag). This consistent pairing means each "pass knob" has both a parametric value and an on/off switch, allowing passes to be individually disabled without removing their configuration — useful for A/B testing optimizations.
Exceptions to the pair pattern: slots 160–162 (3 consecutive strings — a pass with 3 string parameters), slots 192–193 (2 consecutive bools — a pair of binary flags), slot 181 (the only string-pointer type, storing a char* + length directly — likely a file path or regex pattern).
Defaults Enabled (14 of 100 booleans)
Slots: 19, 25, 93, 95, 117, 141, 143, 151, 155, 157, 159, 165, 211, 219. These are passes that run by default and must be explicitly disabled.
Integer Defaults
| Slot | Default | Likely Purpose |
|---|---|---|
| 9 | 1 | Iteration count / threshold |
| 197 | 20 | Limit (e.g., unroll count) |
| 203 | -1 | Sentinel (unlimited/auto) |
| 205 | -1 | Sentinel |
| 207 | -1 | Sentinel |
| 215 | 0 | Disabled counter |
Known Option Names
Boolean toggles (do-X / no-X):
do-ip-msp, do-licm, do-remat, do-clone-for-ip-msp, do-cssa, do-scev-cgp, do-function-scev-cgp, do-scev-cgp-aggresively, do-base-address-strength-reduce, do-base-address-strength-reduce-chain, do-comdat-renaming, do-counter-promotion, do-lsr-64-bit, do-sign-ext-expand, do-sign-ext-simplify
Parametric knobs:
remat-for-occ, remat-gep-cost, remat-max-live-limit, remat-maxreg-ceiling, remat-move, remat-single-cost-limit, remat-use-limit, branch-dist-block-limit, branch-dist-func-limit, branch-dist-norm, scev-cgp-check-latency, scev-cgp-control, scev-cgp-cross-block-limit, scev-cgp-idom-level-limit, scev-cgp-inst-limit, scev-cgp-norm, cssa-coalesce, cssa-verbosity, base-address-strength-reduce-iv-limit
Dump flags:
dump-ip-msp, dump-remat, dump-branch-dist, dump-scev-cgp, dump-sink2, dump-before-cssa, dump-normalize-gep, dump-simplify-live-out
New PM Pass Registration — sub_2342890
NVIDIA maintains both the Legacy Pass Manager and the New Pass Manager in cicc v13.0. The New PM registration lives in a single 2,816-line function that registers every analysis, pass, and printer by calling sub_E41FB0(pm, class_name, len, pass_name, len) for each. Standard LLVM passes use the llvm:: prefix (stripped during registration), while NVIDIA custom passes use their own class names.
The registration function also handles parameterized pass parsing: when the pipeline text parser encounters a pass name with angle-bracket parameters (e.g., memory-space-opt<first-time;warnings>), it calls a registered parameter-parsing callback that returns a configured pass options struct. This is how MemorySpaceOpt can run twice with different configurations in the same pipeline.
NVIDIA Custom Passes (35 total)
Module passes (12): check-gep-index, check-kernel-functions, cnp-launch-check, ipmsp, nv-early-inliner, nv-inline-must, nvvm-pretreat, nvvm-verify, printf-lowering, select-kernels, lower-ops*, set-global-array-alignment*
Function passes (20): basic-dbe, branch-dist, byval-mem2reg, bypass-slow-division, normalize-gep, nvvm-reflect-pp, nvvm-peephole-optimizer, old-load-store-vectorizer, remat, propagate-alignment, reuse-local-memory, set-local-array-alignment, sinking2, d2ir-scalarizer, sink<rp-aware>, memory-space-opt*, lower-aggr-copies*, lower-struct-args*, process-restrict*
Loop pass (1): loop-index-split
Analyses (2): rpa (RegisterPressureAnalysis), merge-sets (MergeSetsAnalysis)
* = parameterized
Key Discoveries
- `nvvm-reflect-pp` is actually `SimplifyConstantConditionalsPass`, not a reflection pass. It runs after NVVMReflect resolves `__nvvm_reflect()` calls to constants, cleaning up the resulting dead branches and unreachable code. The misleading name ("pp" = post-processing) obscures what is essentially a targeted dead-code-elimination pass.
- `memory-space-opt` runs twice in the pipeline with different parameterizations: `<first-time>` early in optimization (conservative, uses available alias information) and `<second-time>` late (aggressive, benefits from earlier optimizations having simplified the IR). This two-pass approach is necessary because address space resolution depends on pointer analysis quality, which improves as other passes simplify the code.
- `d2ir-scalarizer` reuses LLVM's `ScalarizerPass` class under a different name, suggesting NVIDIA added a custom registration point to control when scalarization happens in the NVPTX pipeline without modifying the upstream pass.
- Legacy PM co-existence: both Legacy PM and New PM registrations exist for the same passes, with slightly different names (e.g., `"memory-space-opt-pass"` vs `"memory-space-opt"`). This dual registration is necessary during the LLVM Legacy→New PM migration — cicc v13.0 appears to be in the middle of this transition.
Key Global Variables
| Variable | Purpose |
|---|---|
| qword_4FBB3B0 | Phase counter TLS: 1=Phase I, 2=Phase II, 3=done |
| qword_4FBB370 | Feature flag register (value 6 = barrier opt + memspace opt) |
| qword_4FBB410 | Tier execution tracker |
| qword_4FBB430 | Optimization level store |
| qword_4FBB510 | Debug/trace verbosity level |
| byte_3F871B3 | NVIDIA global flag byte (empty/null string in .rodata) |
| byte_4F99740 | CUTLASS optimization enable flag |
NVVMPassOptions Deep Dive
Memory Layout
The 4,512-byte NVVMPassOptions struct is allocated on the heap via sub_22077B0(4512) at the start of each compilation. The layout divides into four regions:
Offset 0x000 [8B] : int32 opt_level (from config+112) + 4B padding
Offset 0x008 [8B] : qword ptr to PassOptionRegistry (hash table source)
Offset 0x010 [4464B]: 221 option slots (indices 1-221)
Offset 0x1180[32B] : 4 qwords zeroed (sentinel/trailer)
The slots start at offset 16 and are packed contiguously. Each slot occupies a fixed size depending on its type, but the stride varies: string options take 24 bytes, boolean options take 16 bytes, integer options take 16 bytes, and the single string-pointer option (slot 181) takes 28 bytes. The overall packing is not uniform-stride; the offset of each slot must be computed from the cumulative widths of all preceding slots.
Slot Type Formats
Five distinct slot types exist, each written by a dedicated helper:
// TYPE A: String option (114 instances)
// Written by sub_12D6090 (writeStringOption)
struct StringSlot { // 24 bytes
char* value_ptr; // +0: pointer to string value
int32_t option_index; // +8: 1-based slot index
int32_t flags; // +12: from PassDef byte+40
int32_t opt_level; // +16: optimization level context
int32_t pass_id; // +20: resolved via sub_1691920
};
// TYPE B: Boolean compact (83 instances)
// Written by sub_12D6100 (writeBoolOption)
struct BoolCompactSlot { // 16 bytes
uint8_t value; // +0: 0 or 1
uint8_t pad[3]; // +1: padding
int32_t option_index; // +4
int32_t flags; // +8
int32_t pass_id; // +12
};
// TYPE C: Boolean inline (17 instances)
// Written directly as byte + int32 fields
struct BoolInlineSlot { // 16 bytes
uint8_t value; // +0: 0 or 1
uint8_t pad[3]; // +1
int32_t option_index; // +4: from sub_12D6240 return hi32
int32_t opt_level; // +8
int32_t pass_id; // +12: resolved inline
};
// TYPE D: Integer (6 instances)
// Value parsed by sub_16D2BB0 (parseInt)
struct IntegerSlot { // 16 bytes
int32_t value; // +0: parsed integer
int32_t option_index; // +4
int32_t opt_level; // +8
int32_t pass_id; // +12
};
// TYPE E: String pointer (1 instance, slot 181 only)
struct StringPtrSlot { // 28 bytes
char* char_ptr; // +0: raw string data pointer
int64_t str_length; // +8: length of string
int32_t option_index; // +16
int32_t opt_level; // +20
int32_t pass_id; // +24
};
Helper Function Chain
The initialization function sub_12D6300 populates the struct by iterating all 221 slot indices and calling a chain of helpers for each:
- `sub_12D6170` (PassOptionRegistry::lookupOption) -- looks up a slot index in the hash table at `registry+120`. Returns a pointer to an `OptionNode` struct: `[+40] int16 flags`, `[+48] qword* value_array_ptr`, `[+56] int value_count`. Returns null if the option was not set on the command line.
- `sub_12D6240` (getBoolOption) -- resolves a boolean option. Calls `sub_12D6170` to find the option, then, if a string value exists, lowercases it via `sub_16D2060` and tests whether the first char is `'1'` (0x31) or `'t'` (0x74). If the option was not found, it defaults to true (enabled). Returns the boolean packed with the flags in the low 40 bits.
- `sub_1691920` (PassDefTable::getPassDef) -- looks up a PassDef entry in a table where each entry is 64 bytes. Computes: `table[0] + (index - 1) * 64`. The PassDef holds the pass_id at `[+32]`, a `has_overrides` flag at `[+36]`, and an override index at `[+40]`.
Initial Slots (1-6): Global Configuration
The first six slots are all string types at a uniform 24-byte stride, starting at offset 16. They do not follow the pair pattern and represent global pipeline parameters rather than per-pass knobs:
| Slot | Offset | Likely Content |
|---|---|---|
| 1 | 16 | ftz (flush-to-zero mode string) |
| 2 | 40 | prec-div (precise division setting) |
| 3 | 64 | prec-sqrt (precise square root setting) |
| 4 | 88 | fmad (fused multiply-add policy) |
| 5 | 112 | opt-level (optimization level string) |
| 6 | 136 | sm-arch (target SM architecture string) |
CLI Interface
Users interact with NVVMPassOptions via the -opt flag, which appends key=value pairs to the PassOptionRegistry before sub_12D6300 flattens them:
cicc -opt "-do-ip-msp=0" # disable memory space propagation
cicc -opt "-do-licm=0" # disable LICM
cicc -opt "-remat-max-live-limit=50" # set rematerialization threshold
cicc -opt "-dump-remat" # enable remat dump output
The registry is a hash table populated from these CLI strings. Each -opt argument is parsed into a key (the option name) and value (the string after =). When sub_12D6300 runs, it queries the registry for each of the 221 slot indices. If a CLI override exists, it takes precedence; otherwise the compiled-in default is used.
Option Anomalies
Several regions break the standard string/boolean pair pattern:
- Slots 160-162: Three consecutive string slots with no interleaved boolean. [LOW confidence] This represents a pass (likely MemorySpaceOpt or the CSSA pass) that takes three string configuration parameters followed by a single boolean enable flag at slot 163. The pass identity is uncertain because neither MemorySpaceOpt nor CSSA has been confirmed to consume three string parameters; the association is based on pipeline-position proximity only.
- Slots 192-193: Two consecutive boolean slots. One is the main enable toggle; the other appears to be a sub-feature flag (both default to disabled).
- Slot 181 (offset 3648): The only `STRING_PTR` type. Its default is `byte_3F871B3` (an empty string in `.rodata`). The raw pointer + length storage suggests this holds a file path or regex pattern for pass filtering.
- Slots 196-207: Alternating string + integer slots instead of string + boolean. [LOW confidence] This high-numbered region contains all six integer options, likely controlling late-pipeline passes with numeric thresholds (unroll counts, live-variable limits, iteration bounds). The specific pass-to-slot associations are unconfirmed; this interpretation is based on typical LLVM integer-valued pass options, not direct evidence.
Complete Slot-to-Offset Map with Known Consumers
The following table maps NVVMPassOptions slot indices to struct byte offsets, types, defaults, and -- where the cross-reference to the pipeline assembler's a4[offset] guards could be established -- the consuming pass(es). Offsets marked with * are confirmed by cross-referencing a4[offset] guards in sub_12E54A0 and sub_12DE8F0.
| Slot | Offset | Type | Default | Known Knob Name | Consuming Pass |
|---|---|---|---|---|---|
| 1 | 16 | STRING | | ftz | Global: flush-to-zero mode |
| 2 | 40 | STRING | | prec-div | Global: precise division |
| 3 | 64 | STRING | | prec-sqrt | Global: precise sqrt |
| 4 | 88 | STRING | | fmad | Global: fused multiply-add |
| 5 | 112 | STRING | | opt-level | Global: optimization level |
| 6 | 136 | STRING | | sm-arch | Global: target SM architecture |
| 7 | 160 | BOOL | 0 | ||
| 8 | 176 | STRING | |||
| 9 | 200* | INTEGER | 1 | Opt level for sub_12DFE00 codegen | |
| 10 | 216 | STRING | |||
| 11 | 240 | BOOL | 0 | ||
| 13 | 280* | BOOL | 0 | no-dce | sub_18DEFF0 (DCE) |
| 15 | 320* | BOOL | 0 | no-tailcallelim | sub_1833EB0 (TailCallElim) |
| 17 | 360* | BOOL | 0 | no-late-opt | sub_1C46000 (NVVMLateOpt) |
| 19 | 400* | BOOL | 1 | no-inline-a | Inlining variant A |
| 21 | 440* | BOOL | 0 | no-inline-b | sub_1C4B6F0 (AlwaysInliner) |
| 23 | 480* | BOOL | 0 | no-inline-c | sub_1C4B6F0 in sub_12DE8F0 |
| 25 | 520* | BOOL | 1 | sub_1AAC510 (NVIDIA pass A) | |
| 27 | 560* | BOOL | 0 | sub_1AAC510 (NVIDIA pass B) | |
| 29 | 600* | BOOL | 0 | no-nvvm-verify | sub_12D4560 (NVVMVerifier) |
| 33 | 680* | BOOL | 0 | no-func-attrs | sub_1841180 (FunctionAttrs) |
| 35 | 720* | BOOL | 0 | no-sccp | sub_1842BC0 (SCCP) |
| 37 | 760* | BOOL | 0 | no-dse | sub_18F5480 (DSE) |
| 43 | 880* | BOOL | 0 | no-nvvm-reflect | sub_1857160 (NVVMReflect) |
| 45 | 920* | BOOL | 0 | no-ipconst | sub_185D600 (IPConstProp) |
| 47 | 960* | BOOL | 0 | no-simplifycfg | sub_190BB10 (SimplifyCFG) |
| 49 | 1000* | BOOL | 0 | no-instcombine | sub_19401A0 (InstCombine) |
| 51 | 1040* | BOOL | 0 | no-sink | sub_1869C50 (Sink/MemSSA) |
| 53 | 1080* | BOOL | 0 | no-dump | sub_17060B0 (PrintModulePass) |
| 55 | 1120* | BOOL | 0 | no-predopt | sub_18A3430 (NVVMPredicateOpt) |
| 57 | 1160* | BOOL | 0 | no-loopindexsplit | sub_1952F90 (LoopIndexSplit) |
| 59 | 1200* | BOOL | 0 | no-simplifycfg-b | SimplifyCFG variant B |
| 61 | 1240* | BOOL | 0 | do-licm (inverted) | sub_195E880 (LICM) |
| 63 | 1280* | BOOL | 0 | no-reassoc | sub_1B7FDF0 (Reassociate) |
| 65 | 1320* | BOOL | 0 | no-adce-a | sub_1C76260 (ADCE variant) |
| 67 | 1360* | BOOL | 0 | no-loopunroll | sub_19C1680 (LoopUnroll) |
| 69 | 1400* | BOOL | 0 | no-sroa | sub_1968390 (SROA) |
| 71 | 1440* | BOOL | 0 | no-earlycse | sub_196A2B0 (EarlyCSE) |
| 73 | 1480* | BOOL | 0 | no-adce-b | ADCE variant B |
| 75 | 1520* | BOOL | 0 | no-loopsimplify | sub_198DF00 (LoopSimplify) |
| 83 | 1680* | BOOL | 0 | sub_19CE990 (NVIDIA pass) | |
| 87 | 1760* | BOOL | 0 | do-ip-msp (inverted) | sub_1C8E680 (MemorySpaceOpt) |
| 91 | 1840* | BOOL | 0 | no-adce-c | sub_1C6FCA0 (ADCE) |
| 93 | 1880 | BOOL | 1 | NVVMReduction param A | |
| 95 | 1920 | BOOL | 1 | NVVMReduction param B | |
| 97 | 1960* | BOOL | 0 | no-constmerge | sub_184CD60 (ConstantMerge) |
| 99 | 2000* | BOOL | 0 | no-intrin-lower | sub_1CB4E40 (NVVMIntrinsicLowering) |
| 101 | 2040* | BOOL | 0 | no-memcpyopt | sub_1B26330 (MemCpyOpt) |
| 105 | 2120* | BOOL | 0 | no-branchdist-b | sub_1CB73C0 (NVVMBranchDist B) |
| 109 | 2200* | BOOL | 0 | no-generic2nvvm | sub_1A02540 (GenericToNVVM) |
| 113 | 2280* | BOOL | 0 | no-loweralloca-b | NVVMLowerAlloca B |
| 115 | 2320* | BOOL | 0 | do-remat (inverted) | sub_1A13320 (NVVMRemat) |
| 117 | 2360 | BOOL | 1 | sub_1CC3990 (NVVMUnreachBlockElim) | |
| 121 | 2440* | BOOL | 0 | no-sinking2 | sub_1CC60B0 (NVVMSinking2) |
| 127 | 2560* | BOOL | 0 | no-genericaddropt | sub_1CC71E0 (NVVMGenericAddrOpt) |
| 129 | 2600* | BOOL | 0 | no-irverify | sub_1A223D0 (NVVMIRVerification) |
| 131 | 2640* | BOOL | 0 | no-loopopt | sub_18B1DE0 (NVVMLoopOpt) |
| 133 | 2680* | BOOL | 0 | no-memspaceopt-b | MemorySpaceOpt in sub_12DE8F0 |
| 135 | 2720* | BOOL | 0 | no-instsimplify | sub_1A7A9F0 (InstructionSimplify) |
| 141 | 2840* | BOOL | 1 | Enable ADCE (sub_1C6FCA0, reversed) | |
| 143 | 2880* | BOOL | 1 | do-licm | Enable LICM (reversed logic) |
| 149 | 3000* | BOOL | 0 | Extra DeadArgElim trigger | |
| 151 | 3040 | BOOL | 1 | Enable CorrelatedValuePropagation | |
| 155 | 3120* | BOOL | 1 | Address space optimization flag | |
| 157 | 3160* | BOOL | 1 | dump-* master | Debug dump mode (PrintModulePass) |
| 159 | 3200* | BOOL | 1 | Enable advanced NVIDIA passes group | |
| 165 | 3328* | BOOL | 1 | Enable SM-specific warp/reduction/sinking | |
| 173 | 3488* | BOOL | 0 | Enable barrier optimization | |
| 175 | 3528* | BOOL | 0 | Tier 1 optimization enable | |
| 177 | 3568* | BOOL | 0 | Tier 2 optimization enable | |
| 179 | 3608* | BOOL | 0 | Tier 3 optimization enable | |
| 181 | 3648* | STR_PTR | "" | Language string ("ptx"/"mid"/"idn") | |
| 183 | 3704* | BOOL | 0 | Late optimization / address-space mode | |
| 193 | 3904* | BOOL | 0 | Debug: verify after each plugin pass | |
| 195 | 3944* | BOOL | 0 | Debug: rename BBs to "F%d_B%d" | |
| 197 | 3984 | INTEGER | 20 | Limit/threshold (e.g., unroll count) | |
| 203 | 4104 | INTEGER | -1 | Sentinel: unlimited/auto | |
| 205 | 4144 | INTEGER | -1 | Sentinel: unlimited/auto | |
| 207 | 4184 | INTEGER | -1 | Sentinel: unlimited/auto | |
| 209 | 4224* | BOOL | 0 | Master optimization switch | |
| 211 | 4264 | BOOL | 1 | ||
| 213 | 4304* | BOOL | 0 | Device-code / separate-compilation | |
| 215 | 4344 | INTEGER | 0 | Disabled counter | |
| 217 | 4384* | BOOL | 0 | Fast-compile / bypass LLVM pipeline | |
| 219 | 4424 | BOOL | 1 | ||
| 221 | 4464* | BOOL | 0 | Disable late CFG cleanup variant B |
Slots not listed have no confirmed cross-reference to pipeline assembler guards. The full 221-slot table is in the NVVMPassOptions Reference.
Complete Option Name Inventory
The following option names were extracted from binary string references in .rodata. They are set via -opt "-name=value" on the cicc command line (requires NVVMCCWIZ=553282 in non-release builds).
Boolean toggles (do-X / no-X):
| Name | Effect |
|---|---|
| do-ip-msp | Enable inter-procedural memory space propagation |
| do-licm | Enable LICM (loop-invariant code motion) |
| do-remat | Enable NVVMRematerialization |
| do-clone-for-ip-msp | Enable function cloning for IPMSP |
| do-cssa | Enable Conventional SSA construction |
| do-scev-cgp | Enable SCEV-based CodeGenPrepare |
| do-function-scev-cgp | Enable function-level SCEV-CGP |
| do-scev-cgp-aggresively | Aggressive SCEV-CGP mode [sic] |
| do-base-address-strength-reduce | Enable base address strength reduction |
| do-base-address-strength-reduce-chain | Enable chained base address SR |
| do-comdat-renaming | Enable COMDAT group renaming |
| do-counter-promotion | Enable counter promotion |
| do-lsr-64-bit | Enable 64-bit loop strength reduction |
| do-sign-ext-expand | Enable sign extension expansion |
| do-sign-ext-simplify | Enable sign extension simplification |
Parametric knobs:
| Name | Type | Default | Purpose |
|---|---|---|---|
| remat-for-occ | string | | Rematerialization occupancy target |
| remat-gep-cost | string | | GEP rematerialization cost |
| remat-ignore-single-cost | string | | Skip single-use cost analysis |
| remat-lli-factor | string | | Live-interval factor |
| remat-load-param | string | | Parameter load remat policy |
| remat-loop-trip | string | | Loop trip count for remat decisions |
| remat-max-live-limit | string | | Maximum live variable count |
| remat-maxreg-ceiling | string | | Register ceiling for remat |
| remat-move | string | | Rematerialization move policy |
| remat-single-cost-limit | string | | Single-value cost limit |
| remat-use-limit | string | | Use count limit for remat |
| branch-dist-block-limit | string | | Block count limit for branch distribution |
| branch-dist-func-limit | string | | Function-level branch dist limit |
| branch-dist-norm | string | | Normalization factor |
| scev-cgp-check-latency | string | | Latency check threshold |
| scev-cgp-control | string | | CGP control mode |
| scev-cgp-cross-block-limit | string | | Cross-block analysis limit |
| scev-cgp-idom-level-limit | string | | Immediate dominator depth limit |
| scev-cgp-inst-limit | string | | Instruction count limit |
| scev-cgp-norm | string | | Normalization factor |
| scev-cgp-old-base | string | | Legacy base address mode |
| scev-cgp-tid-max-value | string | | Thread ID maximum value |
| base-address-strength-reduce-iv-limit | string | | IV count limit for base addr SR |
| base-address-strength-reduce-max-iv | string | | Maximum IV for base addr SR |
| cssa-coalesce | string | | CSSA coalescing mode |
| cssa-verbosity | string | | CSSA debug verbosity |
Dump/debug flags:
| Name | Purpose |
|---|---|
| dump-ip-msp | Dump IPMSP analysis results |
| dump-ir-before-memory-space-opt | Dump IR before MemorySpaceOpt |
| dump-ir-after-memory-space-opt | Dump IR after MemorySpaceOpt |
| dump-memory-space-warnings | Dump address space warnings |
| dump-remat | Dump rematerialization decisions |
| dump-remat-add | Dump remat additions |
| dump-remat-iv | Dump remat induction variables |
| dump-remat-load | Dump remat load decisions |
| dump-branch-dist | Dump branch distribution analysis |
| dump-scev-cgp | Dump SCEV-CGP analysis |
| dump-base-address-strength-reduce | Dump base address SR |
| dump-sink2 | Dump Sinking2 pass output |
| dump-before-cssa | Dump IR before CSSA |
| dump-phi-remove | Dump PHI node removal |
| dump-normalize-gep | Dump GEP normalization |
| dump-simplify-live-out | Dump live-out simplification |
| dump-process-restrict | Dump restrict processing |
| dump-process-builtin-assume | Dump builtin assume processing |
| dump-conv-dot | Dump convergence as DOT graph |
| dump-conv-func | Dump convergence per function |
| dump-conv-text | Dump convergence as text |
| dump-nvvmir | Dump NVVM IR |
| dump-va | Dump value analysis |
Tier-Based Pass Ordering
The Threshold Dispatch Mechanism
NVIDIA's tier system is a priority-driven scheduling mechanism that interleaves optimization sub-pipelines with external plugin passes. The master pipeline function sub_12E54A0 iterates over a pass registration array at a4[4488] (16-byte stride entries: [+0] vtable_ptr, [+8] phase_id). As it processes each entry, it checks whether the entry's phase_id exceeds a threshold. When it does, the corresponding tier sub-pipeline fires once:
// Pseudocode for the main loop in sub_12E54A0
for (entry = a4[4488]; entry < a4[4496]; entry += 16) {
int phase_id = *(int*)(entry + 8);
if (opt_enabled && phase_id > opt_threshold) {
sub_12DE330(PM, opts); // Tier 0: full optimization
opt_enabled = 0; // fire once
}
if (tier1_flag && phase_id > tier1_threshold) {
sub_12DE8F0(PM, 1, opts); // Tier 1
tier1_flag = 0;
}
if (tier2_flag && phase_id > tier2_threshold) {
sub_12DE8F0(PM, 2, opts); // Tier 2
tier2_flag = 0;
}
if (tier3_flag && phase_id > tier3_threshold) {
sub_12DE8F0(PM, 3, opts); // Tier 3
tier3_flag = 0;
}
// Insert the plugin/external pass itself
pass = vtable_call(entry, +72); // entry->createPass()
AddPass(PM, pass, 1, 0);
}
// Any tier that didn't fire during the loop fires now
if (opt_enabled) sub_12DE330(PM, opts);
if (tier1_flag) sub_12DE8F0(PM, 1, opts);
if (tier2_flag) sub_12DE8F0(PM, 2, opts);
if (tier3_flag) sub_12DE8F0(PM, 3, opts);
This design means tier placement is data-driven: the thresholds stored at config offsets 4224/4228 (Tier 0), 3528/3532 (Tier 1), 3568/3572 (Tier 2), and 3608/3612 (Tier 3) determine exactly where in the plugin pass sequence each tier's sub-pipeline gets inserted. Changing the threshold shifts an entire tier of ~40 passes to a different position relative to the external passes. After each tier fires, its flag is cleared so it cannot fire again.
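The fire-once threshold mechanism can be modeled in isolation. The sketch below is a simplified, hypothetical model (names and phase_id values are illustrative, not recovered data); it shows how moving the threshold shifts the point in the plugin sequence at which a tier sub-pipeline is inserted, and how a never-crossed threshold defers the tier to the post-loop flush:

```c
#include <assert.h>

/* Hypothetical model of the fire-once tier dispatch in sub_12E54A0.
 * Each entry stands in for one 16-byte plugin registration record. */
typedef struct { int phase_id; } Entry;

/* Returns the 0-based plugin index before which the tier sub-pipeline
 * fires, or n if it only fires in the post-loop flush. */
int tier_fire_position(const Entry *entries, int n, int threshold) {
    int armed = 1;                        /* the tier's enable flag */
    for (int i = 0; i < n; i++) {
        if (armed && entries[i].phase_id > threshold) {
            armed = 0;                    /* fire once, then stay cleared */
            return i;
        }
    }
    return n;                             /* unconditional post-loop fire */
}
```

With plugin phase_ids {10, 20, 30}, a threshold of 5 fires the tier before the first plugin, 15 fires it between the first and second, and 99 defers it to the flush after the loop.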
Tier 0 Ordering Strategy
Tier 0 (sub_12DE330) is the most comprehensive sub-pipeline at ~40 passes. Its ordering reflects NVIDIA's optimization philosophy for GPU code:
Phase A -- Value Simplification (passes 1-8): BreakCriticalEdges normalizes the CFG, then the CGSCC inliner framework runs first to create optimization opportunities. NVVMReflect resolves __nvvm_reflect() calls to compile-time constants (GPU architecture queries), and SCCP propagates those constants. GVN and NewGVN/GVNHoist eliminate redundant computations.
Phase B -- NVIDIA-Specific Cleanup (passes 9-12): NVVMVerifier catches NVVM-specific IR errors early. NVVMPredicateOpt optimizes predicate expressions. ConstantMerge reduces module size.
Phase C -- Loop Transformations (passes 13-27): This is the core loop optimization sequence. Sink/MemSSA moves code out of hot paths. LoopIndexSplit divides loops at index boundaries. LICM hoists invariants. LoopUnroll with factor 3 expands small loops. LoopUnswitch moves conditionals out of loops. ADCE removes dead code exposed by loop transformations.
Phase D -- Register Pressure Management (passes 28-40): InstCombine and SROA simplify the IR further. NVVMRematerialization recomputes values to reduce register pressure -- critical for GPU occupancy. DSE and DCE clean up dead stores and code. The final CGSCC pass and FunctionAttrs prepare for per-function Phase II processing.
Tier 1/2/3 Incremental Additions -- sub_12DE8F0
| Address | 0x12DE8F0 |
| Size | 17,904 bytes |
| Signature | int64 sub_12DE8F0(int64 passMgr, int tier, int64 opts) |
sub_12DE8F0 adds passes incrementally based on the tier value (1, 2, or 3). Its first action stores the tier into qword_4FBB410 (the tier tracker global), then checks qword_4FBB3B0 (phase counter) for phase-dependent behavior. Nearly every pass insertion is gated by a boolean in the NVVMPassOptions struct.
The full pass list for sub_12DE8F0 (all tiers combined, with tier-specific gates):
sub_1CB4E40(1) [!opts[2000]] NVVMIntrinsicLowering (level=1)
sub_1A223D0() [!opts[2600]] NVVMIRVerification
sub_1CB4E40(1) [!opts[2000]] NVVMIntrinsicLowering (barrier=1)
sub_18E4A00() [opts[3488]] NVVMBarrierAnalysis
sub_1C98160(0) [opts[3488]] NVVMLowerBarriers
sub_12D4560() [!opts[600]] NVVMVerifier
sub_185D600() [opts[3200]&&!opts[920]] IPConstPropagation [advanced group]
sub_1857160() [opts[3200]&&!opts[880]] NVVMReflect [advanced group]
sub_18A3430() [opts[3200]&&!opts[1120]] NVVMPredicateOpt [advanced group]
sub_1842BC0() [opts[3200]&&!opts[720]] SCCP [advanced group]
sub_12D4560() [!opts[600]] NVVMVerifier
sub_18A3090() [opts[3200]&&!opts[2160]] NVVMPredicateOpt variant [advanced group]
sub_184CD60() [opts[3200]&&!opts[1960]] ConstantMerge [advanced group]
sub_190BB10(1,0)[tier!=1 && guards] SimplifyCFG [TIER 2/3 ONLY]
sub_1952F90(-1)[tier!=1 && guards] LoopIndexSplit [TIER 2/3 ONLY]
sub_12D4560() [tier!=1 && !opts[600]] NVVMVerifier [TIER 2/3 ONLY]
sub_195E880(0) [opts[3704]&&opts[2880]] LICM
sub_1C8A4D0(v) [v=1 if opts[3704]] EarlyCSE
sub_1869C50(1,0,1)[tier!=1&&!opts[1040]] Sink [TIER 2/3 ONLY]
sub_1833EB0(3) [tier==3 && !opts[320]] TailCallElim [TIER 3 ONLY]
sub_1CC3990() [!opts[2360]] NVVMUnreachableBlockElim
sub_18EEA90() [opts[3040]] CorrelatedValuePropagation
sub_12D4560() [!opts[600]] NVVMVerifier
sub_1A223D0() [!opts[2600]] NVVMIRVerification
sub_1CB4E40(1) [!opts[2000]] NVVMIntrinsicLowering
sub_1C4B6F0() [!opts[440]&&!opts[480]] Inliner
sub_1A7A9F0() [!opts[2720]] InstructionSimplify
sub_12D4560() [!opts[600]] NVVMVerifier
sub_1A02540() [!opts[2200]] GenericToNVVM
sub_198DF00(-1)[!opts[1520]] LoopSimplify
sub_1C76260() [!opts[1320]&&!opts[1480]] ADCE
sub_195E880(0) [opts[2880]&&!opts[1240]] LICM
sub_1C98160(v) [opts[3488]] NVVMLowerBarriers
sub_19C1680(0,1)[!opts[1360]] LoopUnroll
sub_19401A0() [!opts[1000]] InstCombine
sub_196A2B0() [!opts[1440]] EarlyCSE
sub_1968390() [!opts[1400]] SROA
sub_19B73C0(t,...)[tier!=1] LoopUnswitch (SM-dependent) [TIER 2/3 ONLY]
sub_1A62BF0(1,...)[!opts[600]] LLVM standard pipeline #1
sub_1A223D0() [!opts[2600]] NVVMIRVerification
sub_1CB4E40(1) [!opts[2000]] NVVMIntrinsicLowering
sub_190BB10(0,0)[!opts[960]] SimplifyCFG
sub_1922F90() [opts[3080]] NVIDIA-specific loop pass
sub_195E880(0) [opts[2880]&&!opts[1240]] LICM
sub_1A13320() [!opts[2320]] NVVMRematerialization
sub_1968390() [!opts[1400]] SROA
sub_18EEA90() [opts[3040]] CorrelatedValuePropagation
sub_18F5480() [!opts[760]] DSE
sub_18DEFF0() [!opts[280]] DCE
sub_1A62BF0(1,...)[!opts[600]] LLVM standard pipeline #1
sub_1AAC510() [!opts[520]&&!opts[560]] NVIDIA-specific pass
sub_1A223D0() [!opts[2600]] NVVMIRVerification
sub_1CB4E40(1) [!opts[2000]] NVVMIntrinsicLowering
sub_1C8E680() [!opts[2680]] MemorySpaceOpt (from opts[3120])
sub_1CC71E0() [!opts[2560]] NVVMGenericAddrOpt
sub_1C98270(1,v)[opts[3488]] NVVMLowerBarriers variant
sub_1C6FCA0() [opts[2840]&&!opts[1840]] ADCE
sub_18B1DE0() [opts[3200]&&!opts[2640]] LoopOpt/BarrierOpt [advanced group]
sub_1857160() [opts[3200]&&tier==3] NVVMReflect [TIER 3 ONLY]
sub_1841180() [opts[3200]&&!opts[680]] FunctionAttrs [advanced group]
sub_1C46000() [tier==3&&!opts[360]] NVVMLateOpt [TIER 3 ONLY]
sub_1841180() [opts[3200]&&!opts[680]] FunctionAttrs (2nd call) [advanced group]
sub_1CBC480() [!opts[2240]&&!opts[2280]] NVVMLowerAlloca
sub_1CB73C0() [!opts[2080]&&!opts[2120]] NVVMBranchDist
sub_1C7F370(1) [opts[3328]&&!opts[1640]] NVVMWarpShuffle [SM-specific]
sub_1CC5E00() [opts[3328]&&!opts[2400]] NVVMReduction [SM-specific]
sub_1CC60B0() [opts[3328]&&!opts[2440]] NVVMSinking2 [SM-specific]
sub_1CB73C0() [opts[3328]&&guards] BranchDist (2nd call) [SM-specific]
sub_1B7FDF0(3) [opts[3328]&&!opts[1280]] Reassociate [SM-specific]
Tier 1 (baseline) adds the passes above EXCEPT those gated by tier!=1: SimplifyCFG, LoopIndexSplit, Sink, and LoopUnswitch are all skipped. This is a conservative set focused on NVIDIA-specific cleanup without expensive LLVM optimization.
Tier 2 adds everything Tier 1 has plus the tier!=1-gated passes. The LoopUnswitch parameters are SM-architecture-dependent: sub_19B73C0 receives different vector widths based on the target subtarget.
Tier 3 adds TailCallElim (gated tier==3), NVVMReflect at a late position (gated tier==3), and NVVMLateOpt (gated tier==3). Critically, it also triggers feature flag escalation (see below).
Feature Flag Escalation
A notable pattern occurs only in Tier 3: if BYTE4(qword_4FBB370[2]) is zero (no advanced features enabled), the tier handler allocates a new integer with value 6 and stores it via sub_16D40E0. The value 6 (binary 110) enables two feature gates used by later passes: barrier optimization and memory-space optimization. This means Tier 3 (O3) automatically enables optimization features that lower tiers leave disabled, without requiring explicit CLI flags.
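A minimal sketch of the two-bit gate encoding, assuming the two feature bits occupy positions 1 and 2 (inferred from "6 (binary 110) enables two feature gates"; the enum names are hypothetical, not recovered symbols):

```c
#include <assert.h>

/* Value 6 = 0b110: two gate bits set, bit 0 left clear.
 * Bit assignments are assumptions based on the escalation description. */
enum {
    FEAT_BARRIER_OPT  = 1u << 1,   /* 0b010: barrier optimization      */
    FEAT_MEMSPACE_OPT = 1u << 2,   /* 0b100: memory-space optimization */
};

int barrier_opt_enabled(unsigned flags)  { return (flags & FEAT_BARRIER_OPT)  != 0; }
int memspace_opt_enabled(unsigned flags) { return (flags & FEAT_MEMSPACE_OPT) != 0; }
```

Under this reading, a Tier 3 store of 6 enables both gates at once, while lower tiers leave the field at 0 and both gates closed.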
O-Level Pipeline Comparison
Pipeline Selection
The new-PM driver sub_226C400 selects pipeline name strings based on config flags:
byte[888] set → "nvopt<O0>"
byte[928] set → "nvopt<O1>"
byte[968] set → "nvopt<O2>"
byte[1008] set → "nvopt<O3>"
These strings are passed to sub_2277440 (the new-PM text pipeline parser). The nvopt prefix is registered as a pipeline element in both sub_225D540 (new PM) and sub_12C35D0 (legacy PM), with vtables at 0x4A08350 and 0x49E6A58 respectively.
O0: No Optimization
O0 skips the full pipeline entirely. The code falls through to LABEL_159 which calls only sub_1C8A4D0(0) (NVVMFinalCleanup), then proceeds directly to finalization. No Tier 0/1/2/3 sub-pipelines fire. The result is ~5-8 passes total: TargetLibraryInfo, TargetTransformInfo, Verifier, AssumptionCache, ProfileSummary, NVVMFinalCleanup, and codegen setup.
O1/O2/O3: Full Pipeline with Tier Differentiation
All three levels call sub_12DE330 for the same ~40-pass Tier 0 sub-pipeline. The differences manifest through four mechanisms:
1. Tier sub-pipeline gating. sub_12DE8F0 is called with the tier number corresponding to the O-level. O1 gets tier=1 (conservative, skips several passes). O2 gets tier=2 (full set). O3 gets tier=3 (aggressive + feature flag escalation).
2. CGSCC iteration counts. The CGSCC pass manager wrapper sub_1A62BF0 takes an iteration count as its first argument. In the O1/O2/O3 base pipeline, it is called with 1 (single inliner pass). In the "mid" fast-compile path, it is called with 5 iterations. In the default path, it varies from 1 to 8 depending on pipeline position, allowing more aggressive devirtualization and inlining at higher optimization levels.
3. Loop unroll factor. sub_1833EB0 is called with factor 3 in the standard pipeline. Tier 3 adds an additional call to TailCallElim and more aggressive LoopUnswitch parameters (the sub_19B73C0 call receives SM-arch-dependent vector widths at Tier 2/3).
4. Vectorizer parameters. sub_19B73C0 receives different arguments based on tier:
- Tier 0: (2, -1, -1, -1, -1, -1, -1) -- conservative vector width 2, all thresholds unlimited
- "mid" path: (3, -1, -1, 0, 0, -1, 0) -- vector width 3, some thresholds zeroed (disabled)
- Tier 2/3: parameters vary by SM architecture via config struct lookups
Fast-Compile Levels vs O-Levels
| Pipeline | Entry Path | Passes | LSA | MemSpaceOpt | Key Difference |
|---|---|---|---|---|---|
| nvopt<O0> | LABEL_159 | ~5-8 | off | off | No optimization |
| nvopt<Ofcmax> | LABEL_196 | ~12-15 | forced 0 | forced 0 | Sinking2(fast) + minimal canonicalization |
| nvopt<Ofcmid> | LABEL_297 | ~25-30 | normal | enabled | CGSCC(5), LoopVectorize(conservative) |
| nvopt<Ofcmin> | LABEL_297 | ~30-35 | normal | enabled | Like Ofcmid but more aggressive loop settings |
| nvopt<O1> | sub_12DE330 | ~35 | normal | enabled | Tier 1: conservative set |
| nvopt<O2> | sub_12DE330 | ~35+ | normal | enabled | Tier 2: full optimization set |
| nvopt<O3> | sub_12DE330 | ~35+ | normal | enabled | Tier 3: aggressive + feature escalation |
Ofcmax is architecturally distinct: it forces -lsa-opt=0 and -memory-space-opt=0 in the optimizer flags (confirmed in both sub_9624D0 line 1358 and sub_12CC750 line 2025). This means two of NVIDIA's most important proprietary passes -- LSA optimization and MemorySpaceOpt -- are unconditionally disabled regardless of what the user requests.
Pipeline Text Strings and nvopt<> Dispatch
The nvopt<> Naming Convention
NVIDIA replaces LLVM's standard default<O2> pipeline naming with a proprietary nvopt<> prefix. The new-PM driver sub_226C400 (35KB, at 0x226C400) selects one of exactly seven pipeline name strings based on optimization level and fast-compile flags. These strings are passed verbatim to sub_2277440 (60KB, at 0x2277440) -- NVIDIA's equivalent of LLVM's PassBuilder::buildDefaultPipeline().
nvopt<O0> Optimization disabled. ~5-8 infrastructure passes only.
nvopt<O1> Standard optimization, Tier 1 (conservative).
nvopt<O2> Standard optimization, Tier 2 (full).
nvopt<O3> Standard optimization, Tier 3 (aggressive + feature escalation).
nvopt<Ofcmax> Fast-compile maximum speed. Forces -lsa-opt=0, -memory-space-opt=0.
nvopt<Ofcmid> Fast-compile medium. MemorySpaceOpt enabled, CGSCC(5) iterations.
nvopt<Ofcmin> Fast-compile minimum. Like Ofcmid but more aggressive loop settings.
Selection Algorithm (sub_226C400)
The config struct encodes O-level flags at fixed byte offsets. The fast-compile level string (if present) is at qwords 131/132 (offset 1048/1056), encoded as a 3-byte sequence compared via 2-byte word + 1-byte suffix:
// sub_226C400, lines 828-874 (pseudocode)
char* select_pipeline_name(Config* cfg) {
if (cfg->byte[928]) return "nvopt<O1>"; // 9 chars
if (cfg->byte[968]) return "nvopt<O2>"; // 9 chars
if (cfg->byte[1008]) return "nvopt<O3>"; // 9 chars
char* fc = cfg->qword[131];
int fc_len = cfg->qword[132];
if (fc_len == 3) {
// Word comparison: *(uint16_t*)fc, then byte fc[2]
if (*(uint16_t*)fc == 24941 && fc[2] == 120) // 0x616D = "ma" (little-endian), 'x'
return "nvopt<Ofcmax>"; // 14 chars
if (*(uint16_t*)fc == 26989 && fc[2] == 100) // 0x696D = "mi" (little-endian), 'd'
return "nvopt<Ofcmid>"; // 14 chars
if (*(uint16_t*)fc == 26989 && fc[2] == 110) // 0x696D = "mi" (little-endian), 'n'
return "nvopt<Ofcmin>"; // 14 chars
}
return "nvopt<O0>"; // 9 chars
}
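The word constants in the decompiled comparison can be verified directly on a little-endian host (as the x86-64 binary assumes). This small check recomputes them; `word_of` is a helper written for this illustration, not a recovered function:

```c
#include <stdint.h>
#include <string.h>

/* Reads the first two bytes of a string as a little-endian uint16_t,
 * mirroring the *(uint16_t*)fc access in the decompilation.
 * memcpy avoids unaligned-access and strict-aliasing pitfalls. */
static uint16_t word_of(const char *s) {
    uint16_t w;
    memcpy(&w, s, 2);
    return w;
}
```

On x86-64, `word_of("max")` is 0x616D = 24941 and `word_of("mid")` and `word_of("min")` are both 0x696D = 26989, which is why the third byte (120 = 'x', 100 = 'd', 110 = 'n') is needed to disambiguate.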
The nvopt prefix is registered as a pipeline element in sub_225D540 (new PM, vtable 0x4A08350) and sub_12C35D0 (legacy PM, vtable 0x49E6A58). Both route into an nvopt pipeline builder class that creates a 512-byte pipeline object via sub_12EC960.
Mutual Exclusion
Combining -O# with --passes= or --foo-pass is an error:
Cannot specify -O#/-Ofast-compile=<min,mid,max> and --passes=/--foo-pass,
use -passes='default<O#>,other-pass' or -passes='default<Ofcmax>,other-pass'
Pipeline Text Parser (sub_2277440)
sub_2277440 (60KB) is the new-PM buildDefaultPipeline() equivalent. It tokenizes the pipeline name string via sub_2352D90, then dispatches to the appropriate pipeline builder based on the nvopt<> parameter. NVIDIA custom passes are injected via extension point callbacks at [PassBuilder+2208] (stride 32 bytes per entry, count at [PassBuilder+2216]). Each callback entry has a guard pointer at [+16] and a callback function at [+24].
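The recovered offsets imply a fixed layout for each extension-point callback entry. The struct below is an assumed reconstruction from those offsets alone (32-byte stride, guard at +16, callback at +24); the field names and the contents of the first 16 bytes are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout of one callback entry in the array at [PassBuilder+2208],
 * count at [PassBuilder+2216]. Only the +16/+24 slots are documented by
 * the analysis; the leading 16 bytes are opaque here. */
typedef struct {
    uint64_t reserved0;           /* +0:  unknown contents */
    uint64_t reserved1;           /* +8:  unknown contents */
    void    *guard;               /* +16: guard pointer */
    void   (*callback)(void *);   /* +24: callback function */
} ExtensionEntry;                 /* sizeof == 32, matching the stride */
```

Iterating the array is then pointer arithmetic over `ExtensionEntry[count]`, invoking `callback` only when `guard` permits.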
Fast-Compile Level Encoding
In the libnvvm config struct, offset 1640 holds an integer encoding:
| Value | CLI Source | Pipeline Name | Notes |
|---|---|---|---|
| 0 | (no -Ofast-compile) | normal O-level | Default |
| 1 | -Ofast-compile=0 | reset to 0 | Treated as "off" |
| 2 | -Ofc=max | nvopt<Ofcmax> | Forces -lsa-opt=0, -memory-space-opt=0 |
| 3 | -Ofc=mid | nvopt<Ofcmid> | MemorySpaceOpt enabled |
| 4 | -Ofc=min | nvopt<Ofcmin> | Closest to full optimization |
Any other value produces: "libnvvm : error: -Ofast-compile called with unsupported level, only supports 0, min, mid, or max".
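The table above reduces to a small switch. This is a sketch of that mapping under the documented encoding; the function name is invented for illustration and does not exist in the binary:

```c
#include <string.h>

/* Maps the integer at libnvvm config offset 1640 to its pipeline name.
 * Values 0 and 1 fall back to normal O-level selection (returned as NULL);
 * anything above 4 is the documented unsupported-level error. */
const char *fast_compile_pipeline(int level) {
    switch (level) {
        case 0:                              /* no -Ofast-compile */
        case 1:  return 0;                   /* -Ofast-compile=0, treated as off */
        case 2:  return "nvopt<Ofcmax>";
        case 3:  return "nvopt<Ofcmid>";
        case 4:  return "nvopt<Ofcmin>";
        default: return "error";             /* unsupported level */
    }
}
```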
Pass Registration Architecture
Dual Pass Manager Support
cicc v13.0 maintains registrations for both the Legacy Pass Manager and the New Pass Manager simultaneously. This dual support is necessary during the LLVM Legacy-to-New PM migration. The Legacy PM path is taken when a4[4384] != 0 (the fast-compile/bypass flag), while the New PM path handles normal compilation.
Legacy PM registration occurs in pass constructor functions scattered throughout the binary. For example, MemorySpaceOpt registers as "memory-space-opt-pass" via sub_1C97F80. Each Legacy PM pass calls RegisterPass<> with a pass ID and description string.
New PM registration is centralized in sub_2342890 -- a single 2,816-line function that registers every analysis, pass, and printer. It calls sub_E41FB0(pm, class_name, len, pass_name, len) for each pass, inserting into a StringMap with open-addressing and linear probing.
New PM Registration Structure
sub_2342890 registers passes in a strict ordering by pipeline level:
| Section | Lines | Count | Content |
|---|---|---|---|
| Module analyses | 514-596 | ~18 | CallGraph, ProfileSummary, LazyCallGraph, etc. |
| Module passes | 599-1153 | ~95 | AlwaysInline, GlobalOpt, NVIDIA module passes |
| CGSCC analyses | 1155-1163 | ~5 | FunctionAnalysisManagerCGSCC, etc. |
| CGSCC passes | 1170-1206 | ~15 | Inliner, Attributor, ArgumentPromotion |
| Function analyses | 1208-1415 | ~65 | DominatorTree, LoopInfo, MemorySSA, rpa, merge-sets |
| Function passes | 1420-2319 | ~185 | SROA, GVN, LICM, all NVIDIA function passes |
| LoopNest passes | 2320-2339 | ~8 | LoopInterchange, LoopFlatten |
| Loop analyses | 2340-2362 | ~10 | LoopAccessAnalysis, IVUsers |
| Loop passes | 2367-2482 | ~40 | IndVarSimplify, LICM, LoopUnroll, loop-index-split |
| Machine analyses | 2483-2580 | ~30 | LiveIntervals, SlotIndexes |
| Machine passes | 2581-2815 | ~80 | ExpandPostRAPseudos, BranchFolding |
Parameterized Pass Parsing
When the pipeline text parser encounters a pass name with angle-bracket parameters (e.g., memory-space-opt<first-time;warnings>), a registered callback parses the parameter string. The parsing flow:
1. sub_2337DE0 matches the pass name via a starts_with comparison
2. sub_234CEE0 extracts the <...> parameter string
3. The parameter-parsing callback (e.g., sub_23331A0 for MemorySpaceOpt) is invoked
4. The parser splits on ; and matches each token against known parameter names
5. A configured pass options struct is returned and used to construct the pass
For MemorySpaceOpt, the parameter parser (sub_23331A0) recognizes four tokens:
| Token | Length | Effect |
|---|---|---|
| first-time | 10 | Sets first_time = true (default) |
| second-time | 11 | Sets first_time = false |
| warnings | 8 | Enables address-space warnings |
| no-warnings | 11 | Disables warnings |
Invalid parameters produce: "invalid MemorySpaceOpt pass parameter '{0}'".
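A minimal sketch of the four-token parser, assuming a simple options struct (the struct and function names are hypothetical; the real sub_23331A0 returns its options through LLVM's Expected machinery):

```c
#include <string.h>

/* Hypothetical options mirroring the four recognized tokens. */
typedef struct { int first_time; int warnings; int ok; } MSOOpts;

/* Splits a "first-time;warnings"-style parameter string on ';' and
 * matches each token, as the MemorySpaceOpt parser is described to do. */
MSOOpts parse_mso_params(const char *params) {
    MSOOpts o = { 1, 0, 1 };              /* first-time is the default */
    char buf[128];
    strncpy(buf, params, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';
    for (char *tok = strtok(buf, ";"); tok; tok = strtok(NULL, ";")) {
        if      (strcmp(tok, "first-time")  == 0) o.first_time = 1;
        else if (strcmp(tok, "second-time") == 0) o.first_time = 0;
        else if (strcmp(tok, "warnings")    == 0) o.warnings   = 1;
        else if (strcmp(tok, "no-warnings") == 0) o.warnings   = 0;
        else o.ok = 0;                    /* -> "invalid ... pass parameter" */
    }
    return o;
}
```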
Pass Serialization
Each parameterized NVIDIA pass also registers a serializer for pipeline text output (used by --print-pipeline-passes). The serializers write the pass class name followed by the current parameter state:
| Pass | Serializer | Output Format |
|---|---|---|
| MemorySpaceOpt | sub_2CE0440 | MemorySpaceOptPass]<first-time;...> |
| BranchDist | sub_2311040 | BranchDistPass] |
| Sinking2 | sub_2315E20 | llvm::Sinking2Pass] |
| Remat | sub_2311820 | RematerializationPass] |
| NVVMPeephole | sub_2314DA0 | NVVMPeepholeOptimizerPass] |
| LoopIndexSplit | sub_2312380 | LoopIndexSplitPass] |
Pipeline Construction Flow
The AddPass Mechanism -- sub_12DE0B0
| Address | 0x12DE0B0 |
| Size | 3,458 bytes |
| Signature | int64 sub_12DE0B0(int64 passMgr, int64 passObj, uint8 flags, char barrier) |
| Call count | ~137 direct calls from sub_12E54A0, ~40 from sub_12DE330, ~50+ per tier |
sub_12DE0B0 is the sole entry point for adding passes to the pipeline. Every pass factory call in the entire pipeline assembler funnels through this function. It performs three operations atomically: hash-table insertion for O(1) lookup, flag encoding for the pass scheduler, and append to the ordered pass array.
// Detailed pseudocode for sub_12DE0B0
int64 AddPass(PassManager* PM, Pass* pass, uint8_t flags, char barrier) {
// --- Step 1: Hash the pass pointer ---
// Uses a custom shift-XOR hash, NOT a standard hash function.
// The two shifts (9 and 4) spread pointer bits across the table.
uint64_t hash = ((uint64_t)pass >> 9) ^ ((uint64_t)pass >> 4);
// --- Step 2: Open-addressing insert into hash table at PM+80 ---
// The hash table is a flat array of 16-byte entries at PM+80:
// [+0] uint64 pass_pointer (0 = empty slot)
// [+8] uint8 combined_flags
// Table capacity is stored at PM+72 (initial: derived from 0x800000000 mask).
// Collision resolution: linear probing with step 1.
uint8_t combined = flags | (barrier ? 2 : 0);
// Bit 0 (0x01): 1 = FunctionPass, 0 = ModulePass/AnalysisPass
// Bit 1 (0x02): 1 = barrier (scheduling fence)
// Remaining bits: reserved
size_t capacity = PM->ht_capacity; // at PM+72
size_t idx = hash & (capacity - 1); // power-of-2 masking
Entry* table = (Entry*)(PM + 80);
while (table[idx].pass != 0) {
if (table[idx].pass == pass) {
// Pass already inserted -- update flags only
table[idx].flags = combined;
return 0; // dedup: no second insertion
}
idx = (idx + 1) & (capacity - 1); // linear probe
}
table[idx].pass = pass;
table[idx].flags = combined;
// --- Step 3: Append to ordered pass array at PM[0] ---
// PM[0] = pointer to dynamic array of 8-byte pass pointers
// PM[1] = count of passes (PM+8)
// Growth: geometric reallocation (not shown here)
uint64_t* array = (uint64_t*)PM->passes; // PM[0]
array[PM->count] = (uint64_t)pass;
PM->count++; // PM+8
return 0;
}
The flags parameter encodes the pass type: 0 for module/analysis passes, 1 for function passes. The barrier parameter (bit 1) is a scheduling fence that tells the pass manager all preceding passes must complete before this pass runs -- used for passes that require the module in a globally consistent state (e.g., after whole-module inlining).
The hash table serves two purposes: (a) deduplication -- if the same pass factory is called twice (which happens for NVVMReflect, NVVMIntrinsicLowering, etc.), the second call updates flags rather than inserting a duplicate; and (b) O(1) flag lookup during the codegen dispatch phase (sub_12DFE00), where each pass's type and barrier status must be queried efficiently.
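The dedup behavior can be demonstrated with a standalone model of the table. This sketch uses the recovered shift-XOR hash and linear probing; the capacity of 16 is illustrative (the real table starts from inline stack storage, as noted below):

```c
#include <stdint.h>
#include <stddef.h>

#define CAP 16   /* illustrative power-of-2 capacity */
typedef struct { uint64_t pass; uint8_t flags; } Slot;

/* Returns 1 on a fresh insert, 0 when an existing entry's flags were
 * updated instead -- the dedup path taken when a pass factory address
 * is added a second time. Table slots must start zeroed. */
int add_pass(Slot table[CAP], uint64_t pass, uint8_t flags) {
    uint64_t h = (pass >> 9) ^ (pass >> 4);   /* recovered shift-XOR hash */
    size_t idx = h & (CAP - 1);               /* power-of-2 masking */
    while (table[idx].pass != 0) {
        if (table[idx].pass == pass) {
            table[idx].flags = flags;         /* update, no duplicate */
            return 0;
        }
        idx = (idx + 1) & (CAP - 1);          /* linear probe, step 1 */
    }
    table[idx].pass = pass;
    table[idx].flags = flags;
    return 1;
}
```

Inserting the same factory address twice (as happens for NVVMReflect or NVVMIntrinsicLowering) takes the update path on the second call, so the ordered pass array would only be appended once per unique pass object.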
The pass manager container is initialized at line 390 of sub_12E54A0 with inline storage: v270 = v272 (stack buffer), v271 = 0x800000000 (capacity/flags encoding with 33-bit sentinel).
Complete 8-Phase Construction Algorithm
The full pipeline construction in sub_12E54A0 proceeds through eight phases. The pseudocode below is reconstructed from the decompiled 49.8KB function at lines 300-757 of the decompilation output. All a4 offsets refer to the CompilerOptions struct (parameter 4, ~4500 bytes).
Phase 0: Infrastructure (lines 396-420, always runs)
// Phase 0: Analysis infrastructure required by all subsequent passes
#01 TLI = sub_149CCE0(malloc(368), sub_14A04B0(triple));
AddPass(PM, TLI, 0, 0); // TargetLibraryInfoWrapperPass [Module]
#02 TTI = sub_1BFB520(malloc(208), sub_1BFB9A0(dataLayout));
AddPass(PM, TTI, 1, 0); // TargetTransformInfoWrapperPass [Function]
#03 verifier = sub_14A7550();
AddPass(PM, verifier, 0, 0); // VerifierPass / BasicAliasAnalysis [Module]
#04 assumptions = sub_1361950();
AddPass(PM, assumptions, 0, 0); // AssumptionCacheTracker [Module]
#05 profile = sub_1CB0F50();
AddPass(PM, profile, 1, 0); // ProfileSummaryInfoWrapperPass [Function]
NVIDIA always adds these five analysis passes first, in this exact order, regardless of optimization level, language, or fast-compile mode; upstream LLVM has no equivalent fixed initialization ordering.
Phase 1: Language Dispatch (lines 421-488)
Phase 1 reads the language string at a4[3648] (pointer) with length at a4[3656]. Three language paths exist; each produces a fundamentally different pass sequence. See the Language Path Differences section below for the complete per-path pass lists.
// Phase 1: Language-based pipeline branching
char* lang = *(char**)(a4 + 3648);
int lang_len = *(int*)(a4 + 3656);
bool opt_enabled = *(bool*)(a4 + 4224);
bool fc_max = false, fc_mid = false;
int v238 = *(int*)(a4 + 4304); // device-code / additional-opt flag
if (lang_len == 3) {
uint16_t w = *(uint16_t*)lang;
if (w == 0x7470 && lang[2] == 0x78) { // "ptx"
goto PATH_A_PTX;
}
if (w == 0x696D && lang[2] == 0x64) { // "mid"
goto PATH_B_MID;
}
// "min" (w == 0x696D && lang[2] == 0x6E) shares the default path
}
// Fall through to PATH_C_DEFAULT
// Fast-compile dispatch (within the language check):
// fc="max" AND !v238 → v244=1, v238=1, goto LABEL_191 (minimal + O0)
// fc="max" AND v238 → goto LABEL_196 → LABEL_188 (Sinking2 + common)
// fc="mid" → goto LABEL_297 (mid pipeline)
// fc="min" → goto LABEL_297 (min pipeline, differs via v238)
// no fc, no O-level → LABEL_159 (O0 minimal pipeline)
// O-level set → LABEL_38 → LABEL_39 (process pass list + tiers)
Phase 2: Pre-Optimization (lines 442-480)
Only when optimization is not completely skipped. Each pass is gated by a per-pass disable flag in the NVVMPassOptions struct.
// Phase 2: Early passes before the main optimization loop
if (!a4[1960] || a4[3000]) // not disabled OR extra trigger
AddPass(PM, sub_1857160(), 1, 0); // NVVMReflect
if (a4[3000]) // extra DeadArgElim trigger
AddPass(PM, sub_18FD350(0), 1, 0); // DeadArgElimination
if (!a4[1680]) // NVIDIA pass not disabled
AddPass(PM, sub_19CE990(), 1, 0); // LoopStrengthReduce (NVIDIA)
AddPass(PM, sub_1CB4E40(0), 1, 0); // NVVMIntrinsicLowering(level=0)
if (!a4[2040])
AddPass(PM, sub_1B26330(), 1, 0); // MemCpyOpt
AddPass(PM, sub_12D4560(), 1, 0); // NVVMVerifier
if (!a4[1960])
AddPass(PM, sub_184CD60(), 1, 0); // ConstantMerge
if (!a4[440] && !a4[400])
AddPass(PM, sub_1C4B6F0(), 1, 0); // AlwaysInliner
if (a4[3160]) // debug dump enabled
AddPass(PM, sub_17060B0(1, 0), 1, 0); // PrintModulePass
Phase 3: Main Optimization Loop (lines 481-553)
The tier-threshold-driven loop iterates over the plugin/external pass array at a4[4488]. Each entry is 16 bytes (vtable pointer + phase_id). When a threshold is crossed, the corresponding tier sub-pipeline fires once and never again.
// Phase 3: Tier dispatch within the main plugin pass loop
uint64_t* entry = *(uint64_t**)(a4 + 4488);
uint64_t* end = *(uint64_t**)(a4 + 4496);
while (entry < end) {
int phase_id = *(int*)((char*)entry + 8);
// Tier 0: full optimization sub-pipeline
if (*(bool*)(a4+4224) && phase_id > *(int*)(a4+4228)) {
sub_12DE330(PM, opts); // ~40 passes
*(bool*)(a4+4224) = false; // fire once
}
// Tier 1: conservative
if (*(bool*)(a4+3528) && phase_id > *(int*)(a4+3532)) {
sub_12DE8F0(PM, 1, opts);
*(bool*)(a4+3528) = false;
}
// Tier 2: full
if (*(bool*)(a4+3568) && phase_id > *(int*)(a4+3572)) {
sub_12DE8F0(PM, 2, opts);
*(bool*)(a4+3568) = false;
}
// Tier 3: aggressive
if (*(bool*)(a4+3608) && phase_id > *(int*)(a4+3612)) {
sub_12DE8F0(PM, 3, opts);
*(bool*)(a4+3608) = false;
}
// Insert the plugin/external pass itself
Pass* plugin = vtable_call(entry, +72); // entry->createPass()
AddPass(PM, plugin, 1, 0);
// Optional debug verification after each plugin pass
if (a4[3904]) {
sub_12D3E60(); // insert verification/print pass
sub_16E8CB0();
sub_15E9F00();
}
entry = (uint64_t*)((char*)entry + 16); // next entry (16-byte stride)
}
// Any tier that didn't fire during the loop fires unconditionally now
if (*(bool*)(a4+4224)) sub_12DE330(PM, opts);
if (*(bool*)(a4+3528)) sub_12DE8F0(PM, 1, opts);
if (*(bool*)(a4+3568)) sub_12DE8F0(PM, 2, opts);
if (*(bool*)(a4+3608)) sub_12DE8F0(PM, 3, opts);
Phase 4: Post-Optimization Language Paths (lines 580-1371)
After the main loop, language-specific post-optimization runs. This is where the three paths diverge most significantly. Each path ends by falling through to LABEL_84 (Phase 5). See Language Path Differences for complete pass lists per path.
Phase 5: Finalization (LABEL_84, lines 640-653)
Always runs after the language-specific optimization path completes.
// Phase 5: Finalization -- barriers, cleanup, codegen
if (!v244 && a4[3488]) // barrier optimization enabled
AddPass(PM, sub_1C98160(a4[2920]!=0), 1, 0); // NVVMLowerBarriers
AddPass(PM, sub_1CEBD10(), 1, 0); // NVVMFinalLowering (cleanup)
if (!a4[2800] && !a4[4464]) // late CFG cleanup not disabled
AddPass(PM, sub_1654860(1), 1, 0); // BreakCriticalEdges
sub_12DFE00(PM, subtargetInfo, a4); // Codegen dispatch (see below)
Phase 6: Phase II Codegen Check (lines 654-693)
Reads the compilation phase counter and conditionally enters a special codegen extension block for multi-phase compilation.
// Phase 6: Phase-II-specific codegen extensions
int phase = sub_16D40F0(qword_4FBB3B0); // read cl::opt<int> phase counter
if (phase == 2 && (*(int*)(a4 + 4480) & 4)) {
// Enter special Phase II codegen block
// Calls vtable at v245+56 (TargetMachine::addPhaseIIPasses)
// Passes SubtargetInfo (v253) and CodeGenOpt config (v262)
target_machine->addPhaseIIPasses(subtarget, codegen_config);
}
Phase 7: Pipeline Execution (lines 694-698)
// Phase 7: Run the assembled pipeline
sub_160FB70(PM, *output, output[1]); // PassManager::run(Module, outputs)
sub_1619BD0(PM, module); // PassManager::finalize(Module)
free(v274); // cleanup allocations
sub_160FE50(PM); // PassManager::destroy()
Phase 8: Basic Block Naming (lines 700-757)
Only when a4[3944] (debug/naming mode) is set. Produces deterministic block names for debugging.
// Phase 8: Debug block naming for IR dump readability
if (a4[3944]) {
int funcIdx = 0;
for (Function* F = module->functions; F; F = F->next) {
if (sub_15E4F60(F)) continue; // skip declarations
funcIdx++;
int blockIdx = 0;
for (BasicBlock* BB = F->blocks; BB; BB = BB->next) {
blockIdx++;
char name[32];
sprintf(name, "F%d_B%d", funcIdx, blockIdx);
sub_164B780(BB, &name); // BB->setName()
}
}
}
Language Path Differences
The three language paths in Phase 1/4 represent fundamentally different IR maturity levels. The a4[3648] string pointer determines which path is taken, with length at a4[3656].
Path A: "ptx" -- Light Pipeline (~15 passes)
PTX text input has already been lowered by an earlier compilation stage. This path applies only light cleanup and canonicalization:
sub_1CEF8F0() NVVMPeephole
sub_215D9D0() NVVMAnnotationsProcessor
sub_1857160() [!a4[880]] NVVMReflect
sub_1A62BF0(1,0,0,1,0,0,1) LLVM standard pipeline #1
sub_1B26330() [!a4[2040]] MemCpyOpt
sub_17060B0(0,0) PrintModulePass (debug)
sub_18DEFF0() [!a4[280]] DCE
sub_1A62BF0(1,0,0,1,0,0,1) LLVM standard pipeline #1 (repeat)
sub_18B1DE0() [!a4[2640]] LoopPass / BarrierOpt
sub_1C8E680(0) [!a4[1760]] MemorySpaceOptimization
--> LABEL_84 (finalization)
Key difference: no SROA, no GVN, no loop transformations, no CGSCC inlining. The PTX path trusts that the earlier compilation already optimized the code.
Path B: "mid" -- Full Optimization (~45 passes)
The primary path for standard CUDA compilation. The IR comes from the EDG frontend through IR generation and is at "mid-level" maturity (high-level constructs lowered, but not yet optimized).
sub_184CD60() [!a4[1960]] ConstantMerge
sub_1CB4E40(0) [!a4[2000]] NVVMIntrinsicLowering (1st of 4)
sub_1B26330() [!a4[2040]] MemCpyOpt
sub_198E2A0() SROA
sub_1CEF8F0() NVVMPeephole
sub_215D9D0() NVVMAnnotationsProcessor
sub_198DF00(-1)[!a4[1520]] LoopSimplify
sub_1C6E800() GVN
sub_1A223D0() [!a4[2600]] NVVMIRVerification (1st of 5+)
sub_190BB10(0,0) SimplifyCFG
sub_1832270(1) InstructionCombining
sub_1A62BF0(5,0,0,1,0,0,1) CGSCC pipeline (5 iterations)
sub_1CB4E40(0) [!a4[2000]] NVVMIntrinsicLowering (2nd)
sub_18FD350(0) DeadArgElim
sub_1841180() [!a4[680]] FunctionAttrs
sub_18DEFF0() [!a4[280]] DCE
sub_184CD60() [!a4[1960]] ConstantMerge
sub_195E880(0) [!a4[1240]] LICM
sub_1C98160(0) NVVMLowerBarriers
sub_1C8E680(0) [!a4[1760]] MemorySpaceOpt (1st invocation)
sub_1B7FDF0(3) [!a4[1280]] Reassociate
sub_1A62BF0(8,0,0,1,1,0,1) CGSCC pipeline (8 iterations)
sub_1857160() [!a4[880]] NVVMReflect (2nd of 3)
sub_1C6FCA0() [!a4[1840]] ADCE
sub_1A7A9F0() [!a4[2720]] InstructionSimplify
sub_18FD350(0) DeadArgElim
sub_1833EB0(3) [!a4[320]] TailCallElim
sub_18FD350(0) DeadArgElim
sub_18EEA90() CorrelatedValuePropagation
sub_1869C50(1,0,1) Sink (MemorySSA-based)
sub_190BB10(0,0)[!a4[960]] SimplifyCFG
sub_18F5480() [!a4[760]] DSE
sub_1CC60B0() [!a4[2440]] NVVMSinking2
sub_1A223D0() [!a4[2600]] NVVMIRVerification
sub_1C8A4D0(0) EarlyCSE
sub_1857160() [!a4[880]] NVVMReflect (3rd)
sub_1A62BF0(8,0,0,1,1,0,1) CGSCC pipeline (8 iterations)
sub_1CB4E40(0) [!a4[2000]] NVVMIntrinsicLowering (3rd)
sub_185D600() [!a4[920]] IPConstPropagation
sub_195E880(0) [!a4[1240]] LICM
sub_1CB4E40(0) [!a4[2000]] NVVMIntrinsicLowering (4th)
sub_1CB73C0() [!a4[2120]] NVVMBranchDist
sub_1A13320() [!a4[2320]] NVVMRematerialization
--> LABEL_84 (finalization)
Key pattern: NVVMIntrinsicLowering runs 4 times, NVVMReflect runs 3 times, NVVMIRVerification runs 5+ times. The CGSCC pipeline is called with 5 and 8 iteration counts (aggressive devirtualization).
Path C: Default -- General Pipeline (~40 passes)
Used for bitcode from external sources (not marked as "ptx" or "mid"). Balances optimization breadth with conservative assumptions about IR maturity.
sub_1A62BF0(4,0,0,1,0,0,1) LLVM standard pipeline #4
sub_1857160() [!a4[880]] NVVMReflect (1st)
sub_1CB4E40(0) [!a4[2000]] NVVMIntrinsicLowering
sub_1857160() [!a4[880]] NVVMReflect (2nd)
sub_1CEF8F0() NVVMPeephole
sub_215D9D0() NVVMAnnotationsProcessor
sub_1A7A9F0() [!a4[2720]] InstructionSimplify
sub_1A62BF0(5,0,0,1,0,0,1) LLVM standard pipeline #5
sub_185D600() [!a4[920]] IPConstPropagation
sub_1B26330() [!a4[2040]] MemCpyOpt
sub_184CD60() [!a4[1960]] ConstantMerge
sub_1A13320() [!a4[2320]] NVVMRematerialization
sub_1833EB0(3) [!a4[320]] TailCallElim
sub_1C6E800() GVN
sub_1842BC0() [!a4[720]] SCCP
sub_18DEFF0() [!a4[280]] DCE
sub_184CD60() [!a4[1960]] ConstantMerge
sub_18FD350(0) DeadArgElim
sub_18EEA90() CorrelatedValuePropagation
sub_1A62BF0(1,0,0,1,0,0,1) LLVM standard pipeline #1
sub_197E720() LoopUnroll
sub_19401A0() [!a4[1000]] InstCombine
sub_1857160() [!a4[880]] NVVMReflect (3rd)
sub_1A62BF0(7,0,0,1,0,0,1) LLVM standard pipeline #7
sub_1C8A4D0(0) EarlyCSE
sub_1A223D0() [!a4[2600]] NVVMIRVerification
sub_1832270(1) InstructionCombining
sub_1869C50(1,0,1) Sink
sub_1A68E70() LoopIdiomRecognize
sub_198DF00(-1)[!a4[1520]] LoopSimplify
sub_195E880(0) [!a4[1240]] LICM
sub_190BB10(0,0)[!a4[960]] SimplifyCFG
sub_19B73C0(3,-1,-1,0,0,-1,0) LoopUnswitch
sub_1A223D0() [!a4[2600]] NVVMIRVerification
sub_1C98160(0) NVVMLowerBarriers
sub_1C8E680(0) [!a4[1760]] MemorySpaceOpt
sub_1B7FDF0(3) [!a4[1280]] Reassociate
sub_18B1DE0() [!a4[2640]] LoopPass
sub_1952F90(-1)[!a4[1160]] LoopIndexSplit
sub_18FD350(0) DeadArgElim
sub_1CC60B0() [!a4[2440]] NVVMSinking2
sub_1A62BF0(2,0,0,1,0,0,1) LLVM standard pipeline #2
sub_1A223D0() [!a4[2600]] NVVMIRVerification
sub_18A3430() [!a4[1120]] NVVMPredicateOpt
sub_1A62BF0(4,0,0,1,1,0,1) LLVM standard pipeline #4 (inlining)
--> LABEL_84 (finalization)
Key difference from "mid": default path uses LLVM standard pipeline wrappers (IDs 1,2,4,5,7) more heavily, runs SCCP explicitly, includes LoopIdiomRecognize, and uses a conservative LoopUnswitch with zeroed thresholds (3,-1,-1,0,0,-1,0).
Codegen Dispatch -- sub_12DFE00
| Address | 0x12DFE00 |
| Size | 20,729 bytes |
| Signature | int64 sub_12DFE00(int64 passMgr, int64 subtargetInfo, int64 opts) |
| Called from | Phase 5 of sub_12E54A0 (LABEL_84, line 640) |
The codegen dispatch does not simply append passes to the pipeline. It performs a full dependency analysis over every pass already inserted, constructs an ordering graph, and then emits codegen passes in topologically-sorted order. This is necessary because machine-level passes (register allocation, instruction scheduling, frame lowering) have strict ordering dependencies that the flat AddPass model cannot express.
// Pseudocode for sub_12DFE00 (codegen dispatch with dependency analysis)
void CodegenDispatch(PassManager* PM, SubtargetInfo* STI, CompilerOpts* opts) {
// Step 1: Read optimization level to determine analysis depth
int opt_level = *(int*)(opts + 200); // opts[200] = optimization level
bool do_deps = (opt_level > 1); // dependency tracking for O2+
// Step 2: Classify existing passes
// Iterates PM->passes[0..PM->count], calling two vtable methods per pass
HashTable dep_graph; // secondary hash table for dependencies (v134..v137)
init_hashtable(&dep_graph);
for (int i = 0; i < PM->count; i++) {
Pass* p = PM->passes[i];
// 2a. Check if pass is codegen-only (vtable+112)
bool is_codegen = p->vtable->isCodeGenOnly(p); // vtable offset +112
if (is_codegen)
continue; // already classified, skip
// 2b. Check registration status
int status = sub_163A1D0(p); // pass registry check
sub_163A340(p, &status); // update status
// 2c. If pass needs codegen support, mark it in the hash table
if (pass_needs_codegen(p)) {
// Set flag |= 2 in the AddPass hash table entry
// This marks the pass as "codegen-interacting"
Entry* e = hashtable_find(PM + 80, p);
if (e) e->flags |= 2;
}
// 2d. Build dependency edges (getAnalysisUsage)
if (do_deps) {
AnalysisUsage AU;
p->vtable->getAnalysisUsage(p, &AU); // vtable offset +16
// For each required analysis, create an ordering edge
// in the dependency hash table
for (AnalysisID* req = AU.required; req; req = req->next) {
dep_graph_add_edge(&dep_graph, p, req->pass);
}
}
}
// Step 3: Emit codegen passes in dependency-respecting order
// Calls the SubtargetInfo hook to get the ordered codegen pass list
// vtable+16 at STI -> STI->emitCodeGenPasses(PM, dep_graph)
STI->vtable->emitCodeGenPasses(STI, PM, &dep_graph);
// Each emitted pass gets a flag:
// 0 = normal pass (no special ordering)
// 1 = pass with codegen requirement (flag bit 0 from AddPass)
}
The dependency graph construction is what makes this function 20KB: it must handle the full LLVM analysis dependency model, including transitive dependencies and analysis preservation. The getAnalysisUsage calls return Required, RequiredTransitive, and Preserved sets that define the ordering constraints between passes.
For O0 compilation (opt_level == 0), the dependency tracking is skipped entirely -- codegen passes are emitted in a fixed default order since no optimization passes exist that could create ordering conflicts.
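The dependency-respecting emission order described above amounts to a topological sort over the pass/analysis edges collected from getAnalysisUsage. A minimal Python sketch of that ordering step, using the standard library's graph utilities (the pass names and the `requires` map are invented for illustration and do not correspond to cicc internals):

```python
from graphlib import TopologicalSorter

def emit_codegen_order(passes, requires):
    """Order passes so each runs after the analyses it requires.

    `passes` is a list of pass names; `requires` maps a pass name to the
    set of passes/analyses that must be scheduled before it (the edges
    a getAnalysisUsage-style query would produce).
    """
    ts = TopologicalSorter()
    for p in passes:
        ts.add(p, *requires.get(p, ()))
    return list(ts.static_order())

# Hypothetical machine-level ordering constraints:
passes = ["RegAlloc", "LiveIntervals", "SlotIndexes", "PrologEpilog"]
requires = {
    "RegAlloc": {"LiveIntervals"},
    "LiveIntervals": {"SlotIndexes"},
    "PrologEpilog": {"RegAlloc"},
}
order = emit_codegen_order(passes, requires)
# Every dependency precedes its dependent:
assert order.index("SlotIndexes") < order.index("LiveIntervals")
assert order.index("LiveIntervals") < order.index("RegAlloc")
assert order.index("RegAlloc") < order.index("PrologEpilog")
```

The real function additionally handles transitive dependencies and preservation sets, which is where most of its 20 KB goes; the sketch captures only the core ordering idea.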
Pass Iteration and Convergence
CGSCC Fixed-Point Iteration
The CGSCC (Call Graph Strongly Connected Component) pass manager sub_1A62BF0 wraps a standard LLVM InlinerWrapper with a configurable iteration count. The first parameter controls how many times the CGSCC pipeline iterates over the call graph:
| Pipeline Position | Iteration Count | Context |
|---|---|---|
| O1/O2/O3 base (sub_12DE330) | 1 | Standard inlining: one pass over the call graph |
| "mid" path (Ofcmid/Ofcmin) | 5 | Aggressive: 5 iterations to resolve indirect calls |
| Default path (general IR) | 1, 2, 4, 5, 7, or 8 | Varies by position in pipeline |
Higher iteration counts allow the CGSCC framework to resolve more indirect calls through devirtualization. After each iteration, newly-inlined code may expose new call targets, which the next iteration can inline. The diminishing returns typically plateau after 3-5 iterations, which explains NVIDIA's choice of 5 for the "mid" fast-compile path (balancing compile time against code quality).
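The effect of the iteration count can be modeled with a toy simulation: treat each kernel as having a chain of indirect calls, where each CGSCC iteration inlines one level and thereby exposes the next call as a direct, inlinable one. Everything below (function names, depths) is invented to illustrate why deeper chains need more iterations; it is not cicc code:

```python
def run_cgscc(call_chains, iterations):
    """Simulate fixed-count CGSCC iteration.

    `call_chains` maps a function name to the depth of its
    indirect-call chain. Each iteration inlines one level per
    function; a chain is 'resolved' (fully devirtualized) once
    its depth reaches zero. Returns the number resolved.
    """
    resolved = 0
    depth = dict(call_chains)
    for _ in range(iterations):
        for fn in depth:
            if depth[fn] > 0:
                depth[fn] -= 1          # one level inlined this round
                if depth[fn] == 0:
                    resolved += 1
    return resolved

chains = {"kernel_a": 2, "kernel_b": 5, "kernel_c": 7}
assert run_cgscc(chains, 1) == 0    # nothing fully resolved yet
assert run_cgscc(chains, 5) == 2    # the "mid" path's 5 iterations
assert run_cgscc(chains, 8) == 3    # 8 iterations catch the deepest chain
```

The diminishing-returns shape falls out directly: each extra iteration only helps chains deeper than the iterations already spent.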
NVVMReflect Multi-Run Pattern
NVVMReflect (sub_1857160) runs multiple times in the pipeline because NVVM IR may contain __nvvm_reflect("__CUDA_ARCH") calls at different nesting depths. The first run resolves top-level reflect calls to constants. Subsequent optimization passes (inlining, constant propagation, loop unrolling) may expose new reflect calls that were hidden inside inlined functions or unrolled loop bodies. Running NVVMReflect again after these transformations catches these newly-exposed calls.
In the "mid" path, NVVMReflect appears at three distinct positions:
- Early (before GVN) -- resolves top-level architecture queries
- Mid (after CGSCC inlining and DeadArgElim) -- catches reflect calls exposed by inlining
- Late (after LoopSimplify and second CGSCC) -- catches reflect calls exposed by loop transformations
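The exposure mechanism can be shown with a small simulation: a reflect run only rewrites the calls it can currently see, and inlining splices hidden reflect calls into view for the next run. The list-of-instructions model, the constant 750, and the helper names below are all invented for illustration:

```python
ARCH = 750  # stand-in constant for a resolved __CUDA_ARCH query (sm_75)

def nvvm_reflect(body):
    """Replace every currently visible reflect query with a constant."""
    return [ARCH if instr == "reflect(__CUDA_ARCH)" else instr
            for instr in body]

def inline(body, functions):
    """Splice callee bodies into the caller, exposing hidden calls."""
    out = []
    for instr in body:
        out.extend(functions[instr] if instr in functions else [instr])
    return out

functions = {"helper": ["reflect(__CUDA_ARCH)", "use"]}
kernel = ["reflect(__CUDA_ARCH)", "helper"]

kernel = nvvm_reflect(kernel)        # 1st run: resolves the top-level query
assert kernel == [750, "helper"]     # the query inside helper is untouched
kernel = inline(kernel, functions)   # inlining exposes the hidden query
kernel = nvvm_reflect(kernel)        # 2nd run: catches the exposed query
assert kernel == [750, 750, "use"]
```

A single early run would have left the query inside `helper` unresolved, which is exactly why the pass is re-run after each transformation group that can expose new calls.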
NVVMIntrinsicLowering Repetition
Similarly, NVVMIntrinsicLowering (sub_1CB4E40) runs 4 times in the "mid" path. Each invocation lowers a different subset of NVVM intrinsics based on what the preceding optimization passes have simplified. The pass takes a level parameter (0 or 1) that controls which lowering rules are active. Level 0 handles basic intrinsic lowering; level 1 handles barrier-related lowering that only becomes safe after certain control flow transformations.
NVVMIRVerification as a Convergence Check
NVVMIRVerification (sub_1A223D0) runs after every major transformation group -- not for optimization, but as a correctness invariant check. In the "mid" path it appears at 5+ positions. In the tier 1/2/3 sub-pipeline it appears 4 times (after NVVMIntrinsicLowering, after barrier lowering, after GenericToNVVM, and after the late optimization sequence). If any transformation violates NVVM IR constraints (invalid address space usage, malformed intrinsic signatures, broken metadata), this pass reports the error immediately rather than allowing it to propagate to codegen where diagnosis would be much harder.
The Repeat-Until-Clean Philosophy
NVIDIA's pipeline does not use explicit fixed-point loops (run passes until IR stops changing). Instead, it achieves convergence through strategic repetition: the same pass appears at multiple carefully-chosen pipeline positions, with different optimization passes running between repetitions. This is more predictable than a true fixed-point approach because compilation time is bounded by the static pipeline length rather than by how many iterations are needed for convergence. The tradeoff is that the pipeline may not reach a true fixed point -- some optimization opportunities exposed by late passes may not be caught -- but in practice, the multi-position placement catches the vast majority of cases.
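The compile-time tradeoff can be made concrete with a toy pass run both ways: a statically repeated pipeline costs exactly its scheduled length, while a true fixed-point driver costs however many rounds convergence takes (including a final no-change round). The `fold` pass and the driver functions are illustrative inventions, not cicc internals:

```python
def run_static(pipeline, ir):
    """Strategic repetition: cost bounded by the static pipeline length."""
    steps = 0
    for p in pipeline:
        ir = p(ir)
        steps += 1
    return ir, steps

def run_fixed_point(passes, ir):
    """True fixed point: iterate until the IR stops changing."""
    steps = 0
    changed = True
    while changed:
        changed = False
        for p in passes:
            new = p(ir)
            steps += 1
            if new != ir:
                ir, changed = new, True
    return ir, steps

def fold(ir):
    """Toy pass: fold one pair of adjacent constants per run."""
    for i in range(len(ir) - 1):
        if isinstance(ir[i], int) and isinstance(ir[i + 1], int):
            return ir[:i] + [ir[i] + ir[i + 1]] + ir[i + 2:]
    return ir

ir = [1, 2, 3, "use"]
static_ir, static_steps = run_static([fold, fold], ir)  # two scheduled repeats
fp_ir, fp_steps = run_fixed_point([fold], ir)
assert static_ir == fp_ir == [6, "use"]   # same result here...
assert static_steps == 2                  # ...but the cost is known up front
assert fp_steps == 3                      # fixed point pays a no-change round
```

If the input needed three folds, the static schedule of two repeats would miss one — the "may not reach a true fixed point" tradeoff described above.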
LLVM Standard Pass Pipeline Factory -- sub_1A62BF0
The LLVM standard pass pipeline is invoked multiple times throughout the optimizer via sub_1A62BF0. The first parameter is a pipeline ID that selects which LLVM extension point to inject passes at:
| Pipeline ID | LLVM Extension Point | Usage Context |
|---|---|---|
| 1 | EP_EarlyAsPossible / basic cleanup | Tier 0, default path |
| 2 | EP_LoopOptimizerEnd | Default path late |
| 4 | EP_ScalarOptimizerLate | Default path, Tier sub-pipeline |
| 5 | EP_VectorizerStart | "mid" path, default path |
| 7 | EP_OptimizerLast | Default path |
| 8 | EP_CGSCCOptimizerLate | "mid" path (with opt flag = 1 for inlining) |
The signature is sub_1A62BF0(pipelineID, 0, 0, 1, optFlag, 0, 1, outBuf), where optFlag at position 5 enables inlining within the CGSCC sub-pipeline (observed as 1 for pipeline IDs 4 and 8 in the "mid" path: sub_1A62BF0(8,0,0,1,1,0,1)).
Each call potentially returns a cleanup callback stored in v298, invoked as v298[0](s, s, 3) for destructor/finalization. The factory is called 9+ times across the three language paths.
CompilerOptions Struct Flag Map
The a4 parameter to sub_12E54A0 is the ~4,512-byte CompilerOptions struct. The following offsets have been confirmed by cross-referencing guards in the pipeline assembler and tier sub-pipelines.
| Offset | Type | Purpose | Cross-Reference |
|---|---|---|---|
| +200 | int | Optimization level (0-3) | sub_12DFE00 codegen depth |
| +280 | bool | Disable DCE | sub_18DEFF0 guard |
| +320 | bool | Disable TailCallElim | sub_1833EB0 guard |
| +360 | bool | Disable NVVMLateOpt | sub_1C46000 guard |
| +400 | bool | Disable inlining variant A | |
| +440 | bool | Disable inlining variant B | sub_1C4B6F0 guard |
| +480 | bool | Disable inlining variant C | sub_12DE8F0 guard |
| +520 | bool | Disable NVIDIA pass A | sub_1AAC510 guard |
| +560 | bool | Disable NVIDIA pass B | sub_1AAC510 guard |
| +600 | bool | Disable NVVMVerifier | sub_12D4560 guard |
| +680 | bool | Disable FunctionAttrs | sub_1841180 guard |
| +720 | bool | Disable SCCP | sub_1842BC0 guard |
| +760 | bool | Disable DSE | sub_18F5480 guard |
| +880 | bool | Disable NVVMReflect | sub_1857160 guard |
| +920 | bool | Disable IPConstPropagation | sub_185D600 guard |
| +960 | bool | Disable SimplifyCFG | sub_190BB10 guard |
| +1000 | bool | Disable InstCombine | sub_19401A0 guard |
| +1040 | bool | Disable Sink/MemSSA | sub_1869C50 guard |
| +1080 | bool | Disable PrintModulePass | sub_17060B0 guard |
| +1120 | bool | Disable NVVMPredicateOpt | sub_18A3430 guard |
| +1160 | bool | Disable LoopIndexSplit | sub_1952F90 guard |
| +1240 | bool | Disable LICM | sub_195E880 guard |
| +1280 | bool | Disable Reassociate | sub_1B7FDF0 guard |
| +1320 | bool | Disable ADCE variant A | sub_1C76260 guard |
| +1360 | bool | Disable LoopUnroll | sub_19C1680 guard |
| +1400 | bool | Disable SROA | sub_1968390 guard |
| +1440 | bool | Disable EarlyCSE | sub_196A2B0 guard |
| +1520 | bool | Disable LoopSimplify | sub_198DF00 guard |
| +1680 | bool | Disable NVIDIA pass | sub_19CE990 guard |
| +1760 | bool | Disable MemorySpaceOpt | sub_1C8E680 guard |
| +1840 | bool | Disable ADCE C | sub_1C6FCA0 guard |
| +1960 | bool | Disable ConstantMerge | sub_184CD60 guard |
| +2000 | bool | Disable NVVMIntrinsicLowering | sub_1CB4E40 guard |
| +2040 | bool | Disable MemCpyOpt | sub_1B26330 guard |
| +2120 | bool | Disable NVVMBranchDist B | sub_1CB73C0 guard |
| +2200 | bool | Disable GenericToNVVM | sub_1A02540 guard |
| +2320 | bool | Disable NVVMRematerialization | sub_1A13320 guard |
| +2440 | bool | Disable NVVMSinking2 | sub_1CC60B0 guard |
| +2560 | bool | Disable NVVMGenericAddrOpt | sub_1CC71E0 guard |
| +2600 | bool | Disable NVVMIRVerification | sub_1A223D0 guard |
| +2640 | bool | Disable NVVMLoopOpt | sub_18B1DE0 guard |
| +2720 | bool | Disable InstructionSimplify | sub_1A7A9F0 guard |
| +2840 | bool | Enable ADCE (reversed logic) | sub_1C6FCA0 |
| +2880 | bool | Enable LICM (reversed logic) | sub_195E880 |
| +2920 | bool | NVVMLowerBarriers param | sub_1C98160 |
| +3000 | bool | Extra DeadArgElim trigger | sub_18FD350 |
| +3040 | bool | Enable CVP | sub_18EEA90 |
| +3080 | bool | Enable NVIDIA loop pass | sub_1922F90 |
| +3120 | bool | Address space optimization flag | sub_1C8E680 param |
| +3160 | bool | Debug dump mode | sub_17060B0 enable |
| +3200 | bool | Enable advanced NVIDIA group | IPConst/Reflect/SCCP/etc. |
| +3328 | bool | Enable SM-specific passes | Warp/Reduction/Sinking2 |
| +3488 | bool | Enable barrier optimization | sub_1C98160, sub_18E4A00 |
| +3528 | bool | Tier 1 enable | Phase 3 loop |
| +3532 | int | Tier 1 phase threshold | Phase 3 loop |
| +3568 | bool | Tier 2 enable | Phase 3 loop |
| +3572 | int | Tier 2 phase threshold | Phase 3 loop |
| +3608 | bool | Tier 3 enable | Phase 3 loop |
| +3612 | int | Tier 3 phase threshold | Phase 3 loop |
| +3648 | ptr | Language string ("ptx"/"mid"/"idn") | Phase 1 dispatch |
| +3656 | int | Language string length | Phase 1 dispatch |
| +3704 | bool | Late optimization mode | sub_195E880, sub_1C8A4D0 |
| +3904 | bool | Debug: verify after plugins | Phase 3 loop |
| +3944 | bool | Debug: BB naming "F%d_B%d" | Phase 8 |
| +4224 | bool | Optimization master switch | Tier 0 gate |
| +4228 | int | Optimization phase threshold | Tier 0 gate |
| +4304 | bool | Device-code flag | Phase 1 v238 |
| +4384 | bool | Fast-compile / bypass pipeline | Top branch Pipeline A vs B |
| +4464 | bool | Disable late CFG cleanup B | Phase 5 sub_1654860 |
| +4480 | ptr | SM feature capability | Phase 6: & 4 = codegen ext |
| +4488 | ptr | Plugin pass array start | Phase 3 loop |
| +4496 | ptr | Plugin pass array end | Phase 3 loop |
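The guard pattern this table implies — a pipeline assembler testing a byte at a fixed offset before inserting a pass, as in the `[!a4[2000]]` annotations above — can be sketched with a packed options blob. The offsets below come from the table; the helper names (`make_options`, `should_add_pass`) and the default values are invented for illustration:

```python
import struct

OPT_LEVEL_OFF   = 200    # int  : optimization level (0-3)
DISABLE_DCE_OFF = 280    # bool : disable DCE (sub_18DEFF0 guard)
DISABLE_IL_OFF  = 2000   # bool : disable NVVMIntrinsicLowering (sub_1CB4E40 guard)

def make_options(opt_level=3, disable_dce=False, disable_intrinsic_lowering=False):
    """Build a ~4,512-byte CompilerOptions-style blob with a few fields set."""
    buf = bytearray(4512)
    struct.pack_into("<i", buf, OPT_LEVEL_OFF, opt_level)
    buf[DISABLE_DCE_OFF] = int(disable_dce)
    buf[DISABLE_IL_OFF] = int(disable_intrinsic_lowering)
    return bytes(buf)

def should_add_pass(opts, disable_off):
    """Mirror of the decompiled guard shape: if (!a4[off]) AddPass(...)."""
    return opts[disable_off] == 0

opts = make_options(disable_dce=True)
assert struct.unpack_from("<i", opts, OPT_LEVEL_OFF)[0] == 3
assert not should_add_pass(opts, DISABLE_DCE_OFF)   # DCE insertion skipped
assert should_add_pass(opts, DISABLE_IL_OFF)        # intrinsic lowering still runs
```

Note the mixed polarity recorded in the table: most slots are "disable" flags tested with `!a4[off]`, while a few late slots (+2840, +2880, ...) use reversed "enable" logic.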
Pass Factory Address Inventory
All unique pass factory addresses called from the pipeline assembler and tier sub-pipelines:
| Pass | Factory Address | Call Sites |
|---|---|---|
| NVVMVerifier | sub_12D4560 | many (tiers) |
| AssumptionCacheTracker | sub_1361950 | 1 |
| TargetLibraryInfoWrapperPass | sub_149CCE0 | 1 |
| VerifierPass / BasicAA | sub_14A7550 | 1 |
| BreakCriticalEdges | sub_1654860 | 2 |
| PrintModulePass (debug dump) | sub_17060B0 | ~30+ |
| InstructionCombining | sub_1832270 | 2 |
| TailCallElim / JumpThreading | sub_1833EB0 | 3 |
| FunctionAttrs | sub_1841180 | 3 |
| SCCP | sub_1842BC0 | 2 |
| NVVMReflect | sub_1857160 | ~8 |
| IPConstantPropagation | sub_185D600 | 3 |
| Sink (MemorySSA-based) | sub_1869C50 | 3 |
| NVVMPredicateOpt variant | sub_18A3090 | 2 |
| NVVMPredicateOpt / SelectionOpt | sub_18A3430 | 2 |
| NVVMLoopOpt / BarrierOpt | sub_18B1DE0 | 3 |
| Sinking2Pass (fast=1 for fc mode) | sub_18B3080 | 1 |
| DCE | sub_18DEFF0 | 4 |
| NVVMBarrierAnalysis | sub_18E4A00 | 1 |
| CorrelatedValuePropagation | sub_18EEA90 | 3 |
| DSE | sub_18F5480 | 2 |
| DeadArgElimination | sub_18FD350 | 5 |
| SimplifyCFG | sub_190BB10 | 4 |
| NVIDIA-specific loop pass | sub_1922F90 | 1 |
| LoopIndexSplit | sub_1952F90 | 3 |
| LICM / LoopRotate | sub_195E880 | 4 |
| SROA | sub_1968390 | 2 |
| EarlyCSE | sub_196A2B0 | 2 |
| LoopUnroll | sub_197E720 | 1 |
| LoopSimplify | sub_198DF00 | 3 |
| SROA (variant) | sub_198E2A0 | 1 |
| InstCombine | sub_19401A0 | 2 |
| LoopUnswitch (7 params) | sub_19B73C0 | 3 |
| LoopUnroll variant | sub_19C1680 | 2 |
| NVIDIA custom pass | sub_19CE990 | 1 |
| GenericToNVVM | sub_1A02540 | 1 |
| NVVMRematerialization | sub_1A13320 | 3 |
| NVVMIRVerification | sub_1A223D0 | 5+ |
| LLVM StandardPassPipeline | sub_1A62BF0 | ~9 |
| LoopIdiomRecognize | sub_1A68E70 | 1 |
| InstructionSimplify | sub_1A7A9F0 | 3 |
| NVIDIA-specific pass | sub_1AAC510 | 1 |
| MemCpyOpt | sub_1B26330 | 4 |
| Reassociate | sub_1B7FDF0 | 3 |
| TTIWrapperPass | sub_1BFB520 | 1 |
| NVVMLateOpt | sub_1C46000 | 1 |
| Inliner / AlwaysInline | sub_1C4B6F0 | 2 |
| NewGVN / GVNHoist | sub_1C6E560 | 1 |
| GVN | sub_1C6E800 | 2 |
| ADCE | sub_1C6FCA0 | 2 |
| ADCE variant | sub_1C76260 | 2 |
| NVVMWarpShuffle | sub_1C7F370 | 1 |
| EarlyCSE / GVN variant | sub_1C8A4D0 | 3 |
| MemorySpaceOptimization | sub_1C8E680 | 4 |
| NVVMLowerBarriers | sub_1C98160 | 4 |
| NVVMLowerBarriers variant | sub_1C98270 | 1 |
| ProfileSummaryInfo | sub_1CB0F50 | 1 |
| NVVMIntrinsicLowering | sub_1CB4E40 | ~10 |
| NVVMBranchDist | sub_1CB73C0 | 3 |
| NVVMLowerAlloca | sub_1CBC480 | 1 |
| NVVMUnreachableBlockElim | sub_1CC3990 | 1 |
| NVVMReduction | sub_1CC5E00 | 1 |
| NVVMSinking2 | sub_1CC60B0 | 3 |
| NVVMGenericAddrOpt | sub_1CC71E0 | 1 |
| NVVMFinalLowering | sub_1CEBD10 | 1 |
| NVVMPeephole | sub_1CEF8F0 | 2 |
| NVVMAnnotationsProcessor | sub_215D9D0 | 2 |
Total unique pass factory addresses: 67.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| NVVMPassOptions::init | sub_12D6300 | 125KB | Populates 4,512-byte options struct |
| writeStringOption | sub_12D6090 | ~100B | Writes 24-byte string slot |
| writeBoolOption | sub_12D6100 | ~80B | Writes 16-byte boolean slot |
| PassOptionRegistry::lookupOption | sub_12D6170 | ~200B | Hash table lookup |
| getBoolOption | sub_12D6240 | ~300B | Boolean resolution with default |
| PassDefTable::getPassDef | sub_1691920 | ~50B | 64-byte stride table lookup |
| parseInt | sub_16D2BB0 | ~100B | String to int64 |
| Pipeline assembler (master) | sub_12E54A0 | 49.8KB | 8-phase pipeline construction |
| AddPass | sub_12DE0B0 | 3.5KB | Hash-table-based insertion |
| Tier 0 sub-pipeline | sub_12DE330 | 4.8KB | ~40 passes, full optimization |
| Tier 1/2/3 sub-pipeline | sub_12DE8F0 | 17.9KB | Phase-conditional, incremental |
| Codegen dispatch | sub_12DFE00 | 20.7KB | Dependency-ordered codegen |
| Phase I/II orchestrator | sub_12E7E70 | 9.4KB | Two-phase state machine |
| New PM registration | sub_2342890 | ~50KB | 2,816 lines, 35 NVIDIA + ~350 LLVM |
| registerPass (hash insert) | sub_E41FB0 | ~300B | StringMap insertion |
| Pass name prefix matcher | sub_2337DE0 | ~100B | starts_with comparison |
| Parameterized pass parser | sub_234CEE0 | ~200B | Extracts <params> |
| MemorySpaceOpt param parser | sub_23331A0 | ~300B | first-time/second-time/warnings |
| New PM pipeline driver | sub_226C400 | 35KB | nvopt<O0/O1/O2/O3/Ofcmax/Ofcmid/Ofcmin> selection |
| New PM text parser (buildDefaultPipeline) | sub_2277440 | 60KB | Parses pipeline name strings |
| nvopt registration (new PM) | sub_225D540 | ~32KB | Pipeline element vtable at 0x4A08350 |
| nvopt registration (legacy PM) | sub_12C35D0 | ~500B | Pipeline element vtable at 0x49E6A58 |
| nvopt object initializer | sub_12EC960 | ~100B | Creates 512-byte pipeline object |
| LLVM standard pipeline factory | sub_1A62BF0 | varies | Pipeline IDs 1,2,4,5,7,8 |
| Pass registry check | sub_163A1D0 | ~100B | Pass registration status |
| Pass status update | sub_163A340 | ~100B | Used in codegen dispatch |
| Pipeline text tokenizer | sub_2352D90 | ~200B | Tokenizes nvopt<> strings |
Reimplementation Checklist
- Two-phase compilation model. Implement a TLS phase variable (values 1=Phase I, 2=Phase II, 3=done) read by individual passes to skip themselves when the current phase does not match their intended execution phase. Phase I runs whole-module analysis; Phase II runs per-function codegen-oriented passes.
- Pipeline assembly function (~150 AddPass calls). Build the master pipeline at runtime using hash-table-based pass insertion (AddPass), with language-specific dispatch (paths for "ptx", "mid", and default), tier-based interleaving (Tiers 0--3 fired by accumulated pass-count thresholds), and phase-conditional pass inclusion.
- NVVMPassOptions system (4,512-byte struct, 221 slots). Implement the proprietary per-pass enable/disable and parametric knob system with 114 string + 100 boolean + 6 integer + 1 string-pointer option slots, parsed from CLI flags and routed to individual passes.
- Concurrent per-function compilation. After Phase I completes on the whole module, split Phase II across a thread pool sized to get_nprocs() or the GNU Jobserver token count, with per-function bitcode extraction, independent compilation, and re-linking of results.
- GNU Jobserver integration. Parse --jobserver-auth=R,W from the MAKEFLAGS environment variable, create a token management pipe, and spawn a pthread to throttle concurrent compilations to the build system's -j level.
- Split-module compilation. Implement the -split-compile=N mechanism: decompose multi-function modules into per-function bitcode blobs via filter callbacks, compile each independently (potentially in parallel), re-link results, and restore linkage attributes from a hash table.
- Tier 0 full optimization sub-pipeline. Assemble the ~40-pass Tier 0 sequence: BreakCriticalEdges, GVN, NVVMReflect, SCCP, NVVMVerifier, LoopIndexSplit, ADCE, LICM, LoopUnroll, InstCombine, SROA, EarlyCSE, LoopUnswitch, SimplifyCFG, NVVMRematerialization, DSE, DCE, with per-pass NVVMPassOptions gating.
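The hash-table-backed AddPass model from the checklist can be sketched as follows. This is a minimal Python sketch under stated assumptions: pass names stand in for Pass* pointers, and flag bit 2 mirrors the "codegen-interacting" mark described for the codegen dispatch. None of these Python names exist in cicc:

```python
class PassManager:
    """Sketch of the AddPass model: the pipeline itself is an ordered
    list (repeats are legal -- the same pass can run at several
    positions), while a side hash table keeps per-pass flag bits."""

    def __init__(self):
        self.pipeline = []
        self.flags = {}            # pass name -> flag bits

    def add_pass(self, name, flag=0):
        self.pipeline.append(name)
        self.flags[name] = self.flags.get(name, 0) | flag

    def mark_codegen_interacting(self, name):
        """Mirror of the `flags |= 2` update in the codegen dispatch."""
        self.flags[name] = self.flags.get(name, 0) | 2

pm = PassManager()
for p in ["NVVMReflect", "GVN", "NVVMReflect"]:   # repetition is intentional
    pm.add_pass(p)
pm.mark_codegen_interacting("GVN")

assert pm.pipeline == ["NVVMReflect", "GVN", "NVVMReflect"]
assert pm.flags["GVN"] & 2                        # codegen-interacting
assert pm.flags["NVVMReflect"] == 0
```

The key design point the sketch preserves: pipeline order and per-pass metadata live in separate structures, which is what lets the codegen dispatch later annotate passes without disturbing their positions.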
Cross-References
- Optimization Levels -- detailed O0/O1/O2/O3 and fast-compile pipeline construction
- Memory Space Optimization -- the MemorySpaceOpt pass (first-time/second-time parameterization)
- Rematerialization -- NVVMRematerialization pass and its register-pressure knobs
- Loop Strength Reduction -- NVIDIA's custom LSR overlay with 11 GPU-specific knobs
- Sinking2 -- NVIDIA's enhanced sinking pass
- CGSCC & LazyCallGraph -- the inliner framework and iteration model
- Pipeline Entry -- top-level compilation entry and two-phase orchestration
- SROA, EarlyCSE, JumpThreading -- scalar pass details (hub: scalar-passes)
OptiX IR Generation
When cicc receives the --emit-optix-ir flag, it activates an alternate compilation path that produces OptiX IR instead of PTX. OptiX IR is the intermediate representation consumed by NVIDIA's OptiX ray tracing engine, which uses a continuation-based execution model fundamentally different from the standard CUDA kernel launch model. Rather than compiling all the way down to PTX machine code, the OPTIXIR pipeline stage serializes the optimized LLVM module in a form that the OptiX runtime can later JIT-compile, link with ray tracing shaders, and schedule across the RT cores' hardware intersection pipeline.
The OptiX path is the third of four stages in cicc's internal pipeline (LNK -> OPT -> OPTIXIR -> LLC), but it is mutually exclusive with LLC in practice: when OptiX mode is active, the pipeline bitmask enables OPTIXIR (0x40) and disables certain optimizations that would be incorrect for continuation-based code. The flag also forces the EDG frontend to emit lifetime intrinsics (--emit-lifetime-intrinsics, EDG option id 132), which mark the live ranges of local variables -- essential information for the OptiX runtime's continuation frame layout.
| Pipeline stage | OPTIXIR (stage 3 of 4) |
| Stage bit | Bit 6 (0x40) in pipeline bitmask |
| Mode bitmask | 0x43, set via a13 = (a13 & 0x300) OR 0x43 |
| Core function | sub_12F9270 (~6 KB) |
| Timer name | "OPTIXIR" / "LibNVVM Optix IR step." |
| Container IR level | NVVM_IR_LEVEL_OPTIX (value 2) |
| CLI flag | --emit-optix-ir (15 bytes, inline-matched) |
| Input extension | .optixir (recognized at 0x8FC001) |
| Callback slot | CompilationState+144 (function), +152 (user data) |
| Availability | CUDA (0xABBA) and OpenCL (0xDEED) modes only |
Flag Processing
--emit-optix-ir in Real Main (sub_8F9C90)
In the standalone entry point, --emit-optix-ir is matched at 0x8FAD00 by a 15-byte inline comparison (split across three immediate compares: "--emit-o" + "ptix" + "-ir"). When matched, it performs three actions:
1. Pushes three strings to the v266 pass-through vector:
   - "--emit-optix-ir" (literal, 15 bytes via explicit strcpy)
   - An 18-byte target string from xmmword_3C23B30 + "28" (likely target-related configuration)
   - A 20-byte GPU name string from xmmword_3C23B40 + "t128" (likely target capability)
2. Sets v243 = 1 (the OptiX IR mode flag)
3. Sets v258 = 1 (the NVC flag, also set by -nvc)
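The chunked immediate-compare idiom (8 + 4 + 3 = 15 bytes, the pattern compilers emit instead of calling strcmp) can be reproduced directly. The function name below is invented; the chunk boundaries match the ones recovered from the binary:

```python
def match_emit_optix_ir(arg):
    """Chunked comparison mirroring the inlined immediate compares:
    an 8-byte, a 4-byte, and a 3-byte chunk ("--emit-o" + "ptix" + "-ir")."""
    b = arg.encode()
    return (len(b) == 15
            and b[0:8]   == b"--emit-o"
            and b[8:12]  == b"ptix"
            and b[12:15] == b"-ir")

assert match_emit_optix_ir("--emit-optix-ir")
assert not match_emit_optix_ir("--emit-optix")      # too short
assert not match_emit_optix_ir("--emit-optix-xx")   # last chunk differs
```

In the binary, each chunk is a single immediate compare against a 64-, 32-, or mixed-width constant, which is why the flag shows up in the disassembly as three magic numbers rather than a string reference.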
--emit-optix-ir in Flag Catalog (sub_9624D0)
In the 3-column flag fan-out system, --emit-optix-ir is processed at line 2415 of the decompiled flag catalog. Its behavior:
// Only valid when a4 == 0xDEED (OpenCL) or a4 == 0xABBA (CUDA)
if (a4 == 0xDEED || a4 == 0xABBA) {
// Route to optimizer: disable IP-MSP and LICM
append_to_opt_vector("-do-ip-msp=0");
append_to_opt_vector("-do-licm=0");
// Set mode bitmask: preserve 64/32-bit mode bits, set OptiX mode
a13 = (a13 & 0x300) | 0x43;
}
The 0x43 value decomposes to:
- Bits [1:0] = 0x03 -- all standard phases enabled (LNK + LLC)
- Bit 6 = 0x40 -- OPTIXIR stage enabled
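The mask-and-merge in `a13 = (a13 & 0x300) | 0x43` preserves the 64/32-bit mode bits while overwriting every stage bit. A small sketch of that update (constant names are illustrative; only the numeric values come from the decompilation):

```python
OPTIXIR_BIT = 0x40    # bit 6: OPTIXIR stage
PHASE_BITS  = 0x03    # bits [1:0]: standard phases
MODE_MASK   = 0x300   # 64/32-bit mode bits to preserve

def enable_optix(a13):
    """Mirror of the decompiled update: a13 = (a13 & 0x300) | 0x43."""
    return (a13 & MODE_MASK) | 0x43

a13 = 0x105                          # one mode bit set, plus stale stage bits
new = enable_optix(a13)
assert new == 0x143
assert new & OPTIXIR_BIT             # OPTIXIR stage enabled
assert new & PHASE_BITS == 0x03      # standard phase bits set
assert new & MODE_MASK == 0x100      # mode bits carried over unchanged
```

Any stage bit that was previously set outside 0x300 (e.g. a stale 0x80 OPT bit) is cleared by the mask before the new stage configuration is ORed in.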
3-Column Fan-Out
The flag translation table maps --emit-optix-ir across all three compilation columns:
| Column | Forwarded As |
|---|---|
| nvcc -> EDG | --emit-lifetime-intrinsics |
| nvcc -> cicc (optimizer) | --emit-optix-ir + -do-ip-msp=0 + -do-licm=0 |
| cicc internal | Mode bitmask 0x43 |
This is notable because a single user-facing flag triggers a different flag in the EDG frontend (--emit-lifetime-intrinsics, EDG option id 132) while also routing the OptiX flag itself to the cicc optimizer. The EDG side-effect ensures that lifetime markers (llvm.lifetime.start / llvm.lifetime.end) are present in the generated LLVM IR, which the OptiX runtime needs to compute continuation frame sizes.
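The fan-out is essentially a per-flag translation table keyed by user-facing flag, with one output per column. A hypothetical mirror of that table for --emit-optix-ir (the dict layout and function name are inventions; the flag strings and the 0x43 mask come from the text above):

```python
# Hypothetical mirror of the 3-column flag translation: one user-facing
# flag fans out to the EDG frontend, the optimizer, and internal state.
FANOUT = {
    "--emit-optix-ir": {
        "edg":  ["--emit-lifetime-intrinsics"],
        "opt":  ["--emit-optix-ir", "-do-ip-msp=0", "-do-licm=0"],
        "mode": 0x43,
    },
}

def translate(flag, a13=0):
    """Return (EDG flags, optimizer flags, updated mode bitmask)."""
    cols = FANOUT[flag]
    return cols["edg"], cols["opt"], (a13 & 0x300) | cols["mode"]

edg, opt, mode = translate("--emit-optix-ir")
assert "--emit-lifetime-intrinsics" in edg   # EDG side-effect
assert "-do-licm=0" in opt                   # optimizer suppression
assert mode & 0x40                           # OPTIXIR stage bit set
```

The point the sketch makes explicit: one lookup produces three independent outputs, so the EDG lifetime-intrinsics side-effect cannot be forgotten when the flag is handled.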
Pipeline Stage
Bitmask and Gating
The pipeline orchestrator sub_12C35D0 (41 KB, the nvvmCompileProgram internal) reads the pipeline stage bitmask from sub_12D2AA0 during initialization. This function parses the architecture code and options into four stage descriptors:
| Stage | Descriptor Pair | Bitmask Bit |
|---|---|---|
| LNK | (&v195, &v200) | Bit 0 (0x01) |
| OPT | (&v196, &v201) | Bit 7 (0x80) |
| OPTIXIR | (&v197, &v202) | Bit 6 (0x40) |
| LLC | (&v198, &v203) | Bit 2 (0x04) |
The OPTIXIR stage executes at lines 1093--1150 of the decompiled orchestrator, after OPT and before LLC:
// STAGE 3 -- OPTIXIR
if (v87 & 0x40) {
// Start timer
sub_16D8B50(timer_ctx, "OPTIXIR", 7,
"LibNVVM Optix IR step.", 22, ...);
// Generate OptiX IR from the optimized LLVM module
err = sub_12F9270(arch_code, // a3: SM architecture code
llvm_ctx, // a4: LLVM context
module, // current LLVM Module*
state + 6, // output buffer for OptiX IR
&error_str); // error string out
if (err) {
// Append error to state[10] error log
...
}
// Close timer
sub_16D7950(timer_ctx);
}
Callback Mechanism
Like the other three stages, OPTIXIR has a callback slot in the CompilationState structure:
| Offset | Field |
|---|---|
| +112 | LNK callback function pointer |
| +120 | LNK callback user data |
| +128 | OPT callback function pointer |
| +136 | OPT callback user data |
| +144 | OPTIXIR callback function pointer |
| +152 | OPTIXIR callback user data |
| +160 | LLC callback function pointer |
| +168 | LLC callback user data |
In the standalone pipeline entry (sub_1265970), the OPTIXIR callback is registered when both verbose and keep-temps modes are active (the logical AND of -v and -keep, which requires wizard mode). The callback ID is 64222, registered via sub_1268040 through sub_12BC0F0.
sub_12F9270 -- OptiX IR Generator
| Field | Value |
|---|---|
| Address | 0x12F9270 |
| Size | ~6 KB |
| Parameters | (uint arch_code, LLVMContext *ctx, Module *module, OutputBuffer *out, char **error_str) |
| Return | unsigned int (0 = success) |
This function takes the fully optimized LLVM module and serializes it into OptiX IR format. The output goes into the state+6 output buffer in the CompilationState, not into the PTX output buffer at state+80. The architecture code and LLVM context are passed through from the pipeline orchestrator's arguments.
The function is relatively small (~6 KB) compared to the LLC stage (sub_12F5100, ~12 KB), consistent with it being primarily a serialization step rather than a full code generation pipeline. It does not run SelectionDAG, register allocation, or instruction scheduling -- those are the domain of the LLC stage, which is typically skipped when OptiX mode is active.
IR Level and Container Marking
When the NVVM container format wraps an OptiX IR payload, the IRLevel field in the binary header is set to NVVM_IR_LEVEL_OPTIX (value 2):
| IRLevel Value | Enum Name | Meaning |
|---|---|---|
| 0 | NVVM_IR_LEVEL_UNIFIED_AFTER_DCI | Default: IR after Device-Code-Interface unification |
| 1 | NVVM_IR_LEVEL_LTO | Link-Time Optimization IR (partially optimized) |
| 2 | NVVM_IR_LEVEL_OPTIX | OptiX pipeline IR |
In the binary header, this is stored as a uint16_t at offset 0x0C:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| IRLevel = 0x0002 (OPTIX) | 0x0C in NvvmContainerBinaryHeader
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
In the XML serialization path (used for debugging), this appears as the "IRLevel" element with the symbolic name "NVVM_IR_LEVEL_OPTIX".
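Reading and writing that header field is a fixed-offset little-endian uint16 access. A sketch of both directions (the enum values and offset 0x0C come from the tables above; the 32-byte blob size and helper names are illustrative, not the real NvvmContainerBinaryHeader layout):

```python
import struct

NVVM_IR_LEVEL_UNIFIED_AFTER_DCI = 0
NVVM_IR_LEVEL_LTO = 1
NVVM_IR_LEVEL_OPTIX = 2
IR_LEVEL_OFFSET = 0x0C          # uint16_t field in the binary header

def read_ir_level(header):
    """Read the IRLevel field from a header blob."""
    return struct.unpack_from("<H", header, IR_LEVEL_OFFSET)[0]

header = bytearray(32)          # hypothetical minimal header blob
struct.pack_into("<H", header, IR_LEVEL_OFFSET, NVVM_IR_LEVEL_OPTIX)
assert read_ir_level(header) == NVVM_IR_LEVEL_OPTIX   # == 2
```

A zero-initialized header reads back as NVVM_IR_LEVEL_UNIFIED_AFTER_DCI, consistent with that value being the default.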
The .optixir file extension is recognized as an input format by cicc's argument parser (matched at 0x8FC001 by comparing the last 8 characters of the filename). This allows round-tripping: cicc can both produce and consume OptiX IR files.
Optimization Pipeline Differences
When OptiX mode is active, the flag catalog forces two critical optimizer changes via the pass-through vector to the OPT stage:
LICM Disabled (-do-licm=0)
Loop Invariant Code Motion is completely disabled when compiling for OptiX. The do-licm NVVMPassOption (at a known offset in the 4,512-byte options struct) gates the LICM pass insertion in the pipeline assembler sub_12E54A0. When set to 0, the sub_195E880(0) LICM pass at position 22 of the Tier 0 pipeline is skipped entirely.
The rationale is that OptiX uses a continuation-based execution model where functions can be suspended and resumed at hardware-defined continuation points (ray-surface intersection, any-hit shader invocation, etc.). LICM hoisting moves computations out of loops and into dominating blocks, which can move them across implicit continuation boundaries. If a hoisted value is live across a continuation point, the OptiX runtime must save it to the continuation frame -- potentially increasing frame size and reducing performance. Worse, the hoisting may move side-effecting operations across points where the program could be suspended, violating the continuation semantics. Disabling LICM avoids these correctness and performance hazards entirely.
IP-MSP Disabled (-do-ip-msp=0)
Interprocedural Memory Space Propagation is also disabled. IP-MSP (sub_12E6160, the NVVMMemorySpacePropagation pass) propagates memory space annotations (generic -> shared/local/global) across function boundaries. This optimization is meaningless for OptiX IR because the OptiX runtime performs its own memory space analysis during JIT compilation, and the intermediate representation must remain generic to allow runtime binding of hit attributes, payload data, and SBT (Shader Binding Table) records to their final memory spaces.
Forced Inlining (nv-inline-all)
The nv-inline-all knob (registered by constructor ctor_186_0 at 0x4DBEC0 in the NVIDIA custom inliner) bypasses cost analysis entirely and forces inlining of every call. This mode is used for OptiX compilation, where the entire call graph must be flattened for the hardware intersection pipeline. The OptiX runtime requires monolithic shader functions because the RT core hardware executes individual ray tracing programs as atomic units -- there is no call stack during hardware intersection traversal.
From the inliner cost model (sub_1864060, 75 KB):
The nv-inline-all knob bypasses cost analysis entirely and forces inlining of every call. This is used for specific compilation modes (e.g., OptiX ray tracing, where the entire call graph must be flattened for the hardware intersection pipeline).
The standard inline-budget (default 20,000) and inline-total-budget are irrelevant when nv-inline-all is active -- every call site is inlined unconditionally regardless of cost.
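The gating logic described above reduces to a short-circuit ahead of the cost computation. A minimal sketch, assuming hypothetical names for the helper and its parameters (the recovered binary uses no such symbols):

```cpp
#include <cassert>

// Hedged sketch of the nv-inline-all short-circuit: when the knob is set,
// the cost/budget comparison is never consulted. Names are illustrative.
bool shouldInline(long cost, long budget, bool nvInlineAll) {
    if (nvInlineAll)
        return true;          // bypass: every call site is inlined
    return cost <= budget;    // normal path; inline-budget defaults to 20,000
}
```

Under nv-inline-all, even a call site whose cost dwarfs the 20,000 default budget is inlined unconditionally.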
Continuation-Based Execution Model
OptiX IR exists because NVIDIA's ray tracing hardware uses a fundamentally different execution model than standard CUDA kernels. Understanding this model explains every design decision in the OPTIXIR pipeline stage.
Standard CUDA vs. OptiX Execution
In standard CUDA, a kernel is a single function that runs to completion on an SM. The compiler produces PTX, which ptxas assembles into SASS machine code. The entire call graph is resolved at compile time, and the GPU executes instructions sequentially (modulo warp divergence and memory latency hiding).
In OptiX, a ray tracing pipeline consists of multiple programs (ray generation, closest-hit, any-hit, miss, intersection, callable) that are compiled separately and linked at runtime by the OptiX driver. When a ray-surface intersection occurs, the hardware suspends the current program, saves its live state to a continuation frame in device memory, and launches the appropriate hit shader. When the hit shader completes, execution resumes from the continuation point.
This model has several consequences for compilation:
- No cross-function calls during intersection. The RT core hardware does not support a general call stack. All function calls within a single program must be fully inlined before the OptiX runtime receives the IR -- hence nv-inline-all.
- Lifetime intrinsics are critical. The OptiX runtime uses llvm.lifetime.start/llvm.lifetime.end markers to determine which local variables are live at each potential continuation point. Variables that are provably dead at a continuation point do not need to be saved to the continuation frame. Without these markers, the runtime must conservatively assume all locals are live, inflating frame sizes and reducing performance.
- LICM is unsafe. Hoisting computations out of loops can move them across implicit continuation points, creating live ranges that span suspension/resumption boundaries. The OptiX runtime cannot reconstruct the hoisted value after resumption unless it is saved, but the compiler does not know where the continuation points will be (they are determined at runtime by the ray tracing pipeline topology).
- Memory space must remain generic. OptiX IR is JIT-compiled at runtime with knowledge of the full pipeline configuration. Memory space decisions that depend on the pipeline topology (shared memory for hit attributes, global memory for payload) cannot be made at cicc compile time.
- The output is IR, not machine code. Unlike the LLC stage, which produces PTX text, the OPTIXIR stage serializes the LLVM module in a form suitable for the OptiX JIT. This is why sub_12F9270 is only ~6 KB -- it is a serializer, not a code generator.
Configuration
CLI Activation
# Standard OptiX compilation via nvcc
nvcc --emit-optix-ir -arch=sm_89 -o kernel.optixir kernel.cu
# Direct cicc invocation
cicc --emit-optix-ir -arch sm_89 -o kernel.optixir kernel.bc
# The flag also accepts .optixir input files for round-tripping
cicc -arch sm_89 -o kernel.ptx kernel.optixir
Effective Configuration When Active
When --emit-optix-ir is specified, the following configuration is implicitly applied:
| Setting | Value | Source |
|---|---|---|
| v243 (OptiX flag) | 1 | Real main sub_8F9C90 |
| v258 (NVC flag) | 1 | Real main sub_8F9C90 |
| Pipeline bitmask | 0x43 | Flag catalog sub_9624D0 |
| do-licm | 0 | Flag catalog, routed to OPT |
| do-ip-msp | 0 | Flag catalog, routed to OPT |
| EDG: emit-lifetime-intrinsics (id 132) | enabled | 3-column fan-out |
| Container IRLevel | 2 (NVVM_IR_LEVEL_OPTIX) | Container serializer |
| nv-inline-all | true | OptiX mode forces all inlining |
Bitmask Decomposition
The 0x43 mode value preserves the 64/32-bit mode bits (mask 0x300) from any previously-set a13 value:
a13 = (a13 & 0x300) | 0x43
Bit field:
[9:8] = preserved (0x100 = 64-bit, 0x200 = 32-bit)
[7] = 0 (OPT stage -- controlled separately)
[6] = 1 (OPTIXIR stage enabled)
[5:3] = 0 (no LTO, no verification override)
[2] = 0 (LLC stage -- typically not run in OptiX mode)
[1:0] = 11 (LNK + base phase control)
Note that bit 2 (LLC) is 0 in the 0x43 bitmask, confirming that the LLC stage is not activated when OptiX mode is the primary output. The pipeline runs LNK -> OPT -> OPTIXIR and stops.
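The mask-and-merge above can be checked directly. A small sketch (the helper name is ours, not a recovered symbol):

```cpp
#include <cassert>

// Sketch of the OptiX mode-flag update: preserve bits [9:8] (the 64/32-bit
// address-width mode), then force the 0x43 OPTIXIR pipeline value.
unsigned applyOptixMode(unsigned a13) {
    return (a13 & 0x300u) | 0x43u;
}
```

For a prior a13 of 0x107 (64-bit mode plus default phases), the result is 0x143: bit 6 (OPTIXIR) set, bit 2 (LLC) clear, bits [1:0] both set.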
Diagnostic Strings
| String | Length | Context |
|---|---|---|
| "OPTIXIR" | 7 | Timer phase name (passed to sub_16D8B50) |
| "LibNVVM Optix IR step." | 22 | Timer description string |
| "--emit-optix-ir" | 15 | CLI flag literal (inline-matched in real main) |
| "--emit-lifetime-intrinsics" | 26 | EDG flag routed from --emit-optix-ir |
| ".optixir" | 8 | Input file extension (matched at 0x8FC001) |
| "-do-ip-msp=0" | 12 | Optimizer option routed when OptiX active |
| "-do-licm=0" | 10 | Optimizer option routed when OptiX active |
Function Map
| Role | Function | Size |
|---|---|---|
| OptiX IR generator (core OPTIXIR stage) | sub_12F9270 | ~6 KB |
| Pipeline orchestrator (nvvmCompileProgram internal) | sub_12C35D0 | ~41 KB |
| Bitmask / stage descriptor parser | sub_12D2AA0 | — |
| Flag catalog (routes --emit-optix-ir) | sub_9624D0 | ~75 KB |
| Real main (matches --emit-optix-ir at 0x8FAD00) | sub_8F9C90 | ~10 KB |
| OPTIXIR callback registration (callback ID 64222) | sub_1268040 | — |
| Pipeline callback dispatcher | sub_12BC0F0 | — |
| Inliner cost model (nv-inline-all bypass) | sub_1864060 | ~75 KB |
| CGSCC inliner core (inlineCallsImpl) | sub_186CA00 | ~61 KB |
| Timer start (receives "OPTIXIR" phase name) | sub_16D8B50 | — |
| Timer close | sub_16D7950 | — |
| Pipeline assembler (skips LICM when do-licm=0) | sub_12E54A0 | ~49.8 KB |
Cross-References
- Entry Point & CLI -- --emit-optix-ir flag parsing and v243 variable
- LLVM Optimizer -- do-licm and do-ip-msp NVVMPassOptions, pipeline assembler
- NVVM Container Binary Format -- NVVM_IR_LEVEL_OPTIX (value 2) in IRLevel enum
- EDG 6.6 Frontend -- --emit-lifetime-intrinsics (EDG option id 132)
- Code Generation -- LLC stage that is skipped in OptiX mode
- LICM -- the pass disabled by OptiX mode
Code Generation
NVPTX backend: SelectionDAG lowering, instruction selection, register allocation, and machine-level passes. Address range 0x1700000–0x35EFFFF (~37 MB of code) -- the largest address range in the binary. This page is the hub for the entire code generation pipeline; each stage has a dedicated deep-dive page linked below.
| SelectionDAG pipeline | SelectionDAG & ISel — build, legalize, combine, select |
| Type legalization | Type Legalization — 348KB monolithic dispatch |
| ISel patterns | ISel Pattern Matching — three-level dispatch, 900KB |
| Register allocation | Register Allocation — pressure-driven greedy RA |
| Register classes | NVPTX Register Classes — nine classes, ID map |
| Scheduling | Instruction Scheduling — MRPA, pipeliner, post-RA |
| Machine passes | Machine-Level Passes — MRPA, remat, LDG, peephole |
| StructurizeCFG | StructurizeCFG — mandatory structured control flow |
| CodeGenPrepare | CodeGenPrepare & SCEV-CGP — IR-level backend prep |
| KnownBits | KnownBits & DemandedBits — fused analysis with GPU SR oracle |
| Tensor core codegen | MMA Code Generation — HMMA/IMMA/WGMMA/tcgen05 lowering pipeline |
| Tensor core builtins | Tensor / MMA Builtins — per-ID reference, validation rules |
| Atomics | Atomic Builtins — scope-aware atom lowering |
| Target infrastructure | NVPTX Target Infrastructure — TargetMachine, TTI, SubtargetFeatures |
| Live range calc | LiveRangeCalc — dual-bitvector liveness |
| Rematerialization | Rematerialization — IR-level + machine-level remat |
| InstrEmitter | InstrEmitter — DAG-to-MachineInstr conversion |
| DAG node layout | SelectionDAG Node Structure — 104-byte SDNode |
Architecture
The code generation pipeline runs after the LLVM optimizer and produces MachineIR that the PTX emission stage serializes to text. The pipeline follows upstream LLVM's SelectionDAG architecture with NVIDIA-specific passes inserted at key points.
LLVM IR
│
├─ CodeGenPrepare (IR-level backend prep)
│ sub_1D70000-1D7FFFF: sunkaddr, sunk_phi, block splitting
│
├─ SelectionDAG Build
│ sub_2065D30 (visit dispatcher)
│ sub_2056920 (major worker, 69KB)
│ sub_2077400 (NVVM tex/surf handle lowering) ★ NVIDIA
│ sub_2072590 (NVPTX argument passing, 38KB) ★ NVIDIA
│
├─ LegalizeTypes
│ sub_20019C0 (348KB main loop)
│ sub_201E5F0 (opcode dispatch, 81KB)
│ sub_201BB90 (expand integer, 75KB)
│
├─ LegalizeOp
│ sub_1FFB890 (169KB, type action dispatch)
│ sub_1FF6F70 (43KB, atomic target-specific lowering) ★ NVIDIA
│
├─ DAG Combining
│ sub_F681E0 (65KB, top-level orchestrator)
│ sub_F20C20 (64KB, visitNode main)
│
├─ Instruction Selection
│ sub_3090F90 (91KB, NVPTXDAGToDAGISel::Select) ★ NVIDIA
│ sub_33D4EF0 (complex addressing, calls sub_969240 399×)
│
├─ Instruction Scheduling
│ sub_355F610 (64KB, ScheduleDAGMILive post-RA)
│ sub_3563190 (58KB, MachinePipeliner)
│
├─ Register Allocation
│ sub_2F49070 (82KB, RAGreedy::selectOrSplit)
│ sub_2F2D9F0 (93KB, LiveRangeSplitter)
│
├─ Machine-Level Passes
│ MRPA, Block Remat, Mem2Reg, LDG, Peephole, etc.
│
└─ StructurizeCFG
sub_35CC920 (95KB, mandatory for PTX structured control flow)
Items marked ★ NVIDIA are NVIDIA-proprietary additions not present in upstream LLVM.
Stage Overview
CodeGenPrepare (detail) -- last IR-level pass before ISel. Sinks address computations, creates PHI nodes for sunk values, and splits critical edges. NVIDIA adds an optional SCEV-CGP extension.
SelectionDAG Build (detail) -- converts LLVM IR into a target-independent DAG. NVPTX intercepts for .param-space argument passing and texture/surface handle lowering.
Type Legalization (detail) -- rewrites every illegal type into legal equivalents via promote, expand, soften, or split-vector actions.
Operation Legalization -- processes nodes whose opcodes are illegal for the target. Atomic operations receive NVIDIA-specific scope-aware lowering (CTA/GPU/SYS) with per-SM feature gates.
DAG Combining -- folds redundant operations, canonicalizes patterns, and reduces the DAG before instruction selection. The KnownBits analysis feeds into combining decisions.
Instruction Selection (detail) -- matches DAG nodes against PTX instruction patterns via a three-level dispatch hierarchy. A compressed per-SM-variant legality table gates which opcodes exist on which GPU architecture.
Instruction Scheduling (detail) -- post-RA scheduling plus an optional software pipeliner. NVIDIA's custom MRPA provides incremental register pressure tracking.
Register Allocation (detail) -- pressure-driven greedy allocator adapted for PTX's virtual register model. Works with nine typed register classes; live range splitting and rematerialization reduce spill pressure.
Machine-Level Passes (detail) -- NVIDIA-proprietary and stock LLVM passes that optimize register pressure, promote stack objects back to registers, and prepare clean PTX for ptxas.
StructurizeCFG (detail) -- mandatory pass that converts arbitrary CFGs into the structured form PTX requires, rejecting irreducible CFGs and EH funclets.
Two-Stage Compilation: cicc + ptxas
CUDA compilation is a two-stage process. cicc (this binary) compiles CUDA/NVVM IR down to PTX assembly text -- a virtual ISA with unlimited registers and structured control flow. ptxas then compiles the PTX into SASS machine code for a specific SM target. This split means that many of cicc's code generation decisions (register allocation, instruction scheduling, peephole optimization) are revisited by ptxas with full hardware knowledge. cicc's code generation pipeline therefore optimizes for two audiences simultaneously: (1) reducing register pressure and producing clean PTX that gives ptxas maximum optimization freedom, and (2) performing target-aware lowering (type legalization, instruction selection, structured CFG) that ptxas cannot undo. The practical consequence is that cicc's backend is pressure-driven rather than latency-driven -- scheduling for low register count matters more than scheduling for pipeline throughput, because ptxas will re-schedule for the hardware but cannot reduce register demand below what cicc emitted.
Cross-References
- NVPTX Subtarget & feature flags -- SM processor table, type legality offsets
- GPU target feature gates -- per-SM architecture feature matrix
- DAG node structure -- SDNode 104-byte layout, operand stride
- Pattern database -- ISel pattern table format
- NVPTX machine opcodes -- opcode reference
- Address spaces -- global, shared, local, param encoding
- PTX emission -- downstream consumer of machine-level output
- Register coalescing -- pre-RA copy elimination
- PrologEpilogInserter -- .local frame layout
PTX Emission
PTX assembly output, function headers, stack frames, register declarations, special registers, atomic instructions, barriers, debug info, and output modes. Address range 0x2140000--0x21FFFFF for NVPTX-specific emission, 0x31E0000--0x3240000 for AsmPrinter.
| AsmPrinter::emitFunctionBody | sub_31EC4F0 (72KB) |
| Function header orchestrator | sub_215A3C0 (.entry/.func, .param, kernel attrs, .pragma) |
| Kernel attribute emission | sub_214DA90 (.reqntid, .maxntid, .minnctapersm, cluster) |
| Stack frame setup | sub_2158E80 (17KB, .local, .reg, __local_depot) |
| Register class map | sub_2163730 + sub_21638D0 (9 classes) |
| GenericToNVVM | sub_215DC20 / sub_215E100 (36KB, addrspace rewriting) |
| Special registers | sub_21E86B0 (%tid, %ctaid, %ntid, %nctaid) |
| Cluster registers | sub_21E9060 (15 registers, SM 90+) |
| Atomic emission | sub_21E5E70 (13 opcodes) + sub_21E6420 (L2 cache hints) |
| Memory barriers | sub_21E94F0 (membar.cta/gpu/sys, fence.sc.cluster) |
| Cluster barriers | sub_21E8EA0 (barrier.cluster.arrive/wait) |
| Global variable emission | sub_2156420 (texref/surfref/samplerref/data) |
| Global variable ordering | sub_2157D50 (5.9KB, topological sort with circular dependency detection) |
| Bitcode producer | "LLVM7.0.1" (NVVM IR compat marker, despite LLVM 20.0.0) |
Function Header Emission -- sub_215A3C0
Emits a complete PTX function prologue in this exact order:
| Step | Output | Condition |
|---|---|---|
| (a) | .pragma "coroutine";\n | Metadata node type 'N' linked to current function |
| (b) | CUDA-specific attributes | *(a1+232)->field_952 == 1 |
| (c) | .entry or .func | sub_1C2F070 (isKernelFunction) |
| (d) | Return type spec | .func only, via sub_214C940 |
| (e) | Mangled function name | sub_214D1D0 |
| (f) | .param declarations | sub_21502D0 (monotonic counter _param_0, _param_1, ...) |
| (g) | Kernel attributes | .entry only, via sub_214DA90 |
| (h) | Additional attributes | sub_214E300 |
| (i) | .noreturn | Non-kernel with noreturn attribute (metadata attr 29) |
| (j) | {\n | Open function body |
| (k) | Stack frame + registers | sub_2158E80 |
| (l) | DWARF debug info | If enabled |
Kernel Attributes -- sub_214DA90
Reads NVVM metadata and emits performance-tuning directives. Attribute emission order:
| Order | Attribute | Source Metadata | Condition |
|---|---|---|---|
| 1 | .blocksareclusters | nvvm.blocksareclusters | Fatal if reqntid not set |
| 2 | .reqntid X, Y, Z | nvvm.reqntid + sub_1C2EDB0 | Comma-separated strtol parse |
| 3 | .maxntid X, Y, Z | sub_1C2EC00 / structured | Unspecified dims default to 1 |
| 4 | .minnctapersm N | sub_1C2EF70 | -- |
| 5 | .explicitcluster | nvvm.cluster_dim | SM > 89 only |
| 6 | .reqnctapercluster X, Y, Z | Cluster dim readers | SM > 89 only |
| 7 | .maxclusterrank N | sub_1C2EF50 | SM > 89 only |
| 8 | .maxnreg N | sub_1C2EF90 | -- |
Cluster attributes (5--7) gated by *(a1+232)->field_1212 > 0x59 (SM > 89, i.e., SM 90+).
Stack Frame -- sub_2158E80
| Field | Value |
|---|---|
| Address | 0x2158E80 |
| Size | 17KB |
Emission Steps
- Local depot (if *(frame_info+48) != 0):
.local .align 16 .b8 __local_depot0[256];
Where alignment = *(frame_info+60), index = function index, size = frame size.
- Stack pointer registers:
.reg .b64 %SP; // stack pointer
.reg .b64 %SPL; // stack pointer local
Uses .b32 in 32-bit mode (checked via *(a2+8)->field_936).
- Virtual register declarations -- iterates the register map at *(a1+800), deduplicating via the hash table at a1+808:
.reg .pred %p<5>;
.reg .b16 %rs<12>;
.reg .b32 %r<47>;
.reg .b64 %rd<8>;
.reg .f32 %f<20>;
.reg .f64 %fd<3>;
Register Class Map
The complete 9-class register table (vtable addresses, PTX type suffixes, prefixes, encoded IDs, copy opcodes, and coalescing constraints) is in Register Classes. The encoding scheme (sub_21583D0: class_encoded_id | (register_index & 0x0FFFFFFF), fatal "Bad register class" on unrecognized vtable) is documented in Register Encoding Scheme.
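The encoding documented for sub_21583D0 packs the class tag into the bits above a 28-bit index mask. A sketch (helper names and the example class IDs are illustrative; the real class-encoded IDs are listed on the Register Classes page):

```cpp
#include <cassert>

// Sketch of the register encoding scheme:
//   encoded = class_encoded_id | (register_index & 0x0FFFFFFF)
unsigned encodeReg(unsigned classEncodedId, unsigned index) {
    return classEncodedId | (index & 0x0FFFFFFF);
}
unsigned regIndex(unsigned encoded) { return encoded & 0x0FFFFFFF; }
unsigned regClass(unsigned encoded) { return encoded & ~0x0FFFFFFFu; }
```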
Special Registers -- sub_21E86B0
Switch on operand value (ASCII-encoded):
| Opcode | Char | Register | Description |
|---|---|---|---|
| 0x26 | & | %tid.x | Thread ID, X |
| 0x27 | ' | %tid.y | Thread ID, Y |
| 0x28 | ( | %tid.z | Thread ID, Z |
| 0x29 | ) | %ntid.x | Block dim, X |
| 0x2A | * | %ntid.y | Block dim, Y |
| 0x2B | + | %ntid.z | Block dim, Z |
| 0x2C | , | %ctaid.x | Block ID, X |
| 0x2D | - | %ctaid.y | Block ID, Y |
| 0x2E | . | %ctaid.z | Block ID, Z |
| 0x2F | / | %nctaid.x | Grid dim, X |
| 0x30 | 0 | %nctaid.y | Grid dim, Y |
| 0x31 | 1 | %nctaid.z | Grid dim, Z |
| 0x5E | ^ | (dynamic) | Via sub_3958DA0(0, ...) -- %warpid/%laneid |
| 0x5F | _ | (dynamic) | Via sub_3958DA0(1, ...) |
Cluster Registers -- sub_21E9060 (SM 90+)
| Value | Register | Description |
|---|---|---|
| 0 | %is_explicit_cluster | Explicit cluster flag |
| 1 | %cluster_ctarank | CTA rank within cluster |
| 2 | %cluster_nctarank | CTAs in cluster |
| 3--5 | %cluster_nctaid.{x,y,z} | Cluster grid dimensions |
| 6--8 | %cluster_ctaid.{x,y,z} | CTA ID within cluster |
| 9--11 | %nclusterid.{x,y,z} | Number of clusters |
| 12--14 | %clusterid.{x,y,z} | Cluster ID |
Fatal: "Unhandled cluster info operand" on invalid value.
Atomic Instruction Emission
Operand Encoding
The atomic instruction word packs scope and operation into a single integer read from the operand array at *(operand_array + 16*a2 + 8):
Bit layout:
[3:0] — reserved
[7:4] — scope: 0=gpu (implicit), 1=cta, 2=sys
[15:8] — reserved
[23:16] — atomic opcode (BYTE2)
The scope field emits a prefix before the atomic suffix: scope 0 produces no prefix (implicit .gpu), scope 1 emits ".cta", scope 2 emits ".sys". The complete PTX instruction format is atom[.scope].op.type.
Base Atomics -- sub_21E5E70
13-operation dispatch table. The switch on BYTE2(v4) selects both the operation suffix and its type class:
| Opcode | Suffix | Type Class | PTX Semantics |
|---|---|---|---|
| 0x00 | .exch.b | bitwise | Exchange -- atomically swap value |
| 0x01 | .add.u | unsigned | Unsigned integer addition |
| 0x03 | .and.b | bitwise | Bitwise AND |
| 0x05 | .or.b | bitwise | Bitwise OR |
| 0x06 | .xor.b | bitwise | Bitwise XOR |
| 0x07 | .max.s | signed | Signed integer maximum |
| 0x08 | .min.s | signed | Signed integer minimum |
| 0x09 | .max.u | unsigned | Unsigned integer maximum |
| 0x0A | .min.u | unsigned | Unsigned integer minimum |
| 0x0B | .add.f | float | Floating-point addition |
| 0x0C | .inc.u | unsigned | Unsigned increment (wrapping) |
| 0x0D | .dec.u | unsigned | Unsigned decrement (wrapping) |
| 0x0E | .cas.b | bitwise | Compare-and-swap |
Opcodes 0x02 and 0x04 are intentionally absent -- the PTX ISA has no signed atomic add at that slot, and no bitwise operation occupies slot 4. The 13 operations exactly match the PTX atom instruction repertoire.
The type width suffix (.b32, .b64, .u32, .u64, .s32, .s64, .f32, .f64) is appended separately by the instruction printer after the operation suffix, based on the register class of the destination operand.
L2 Cache-Hinted Atomics -- sub_21E6420 (Ampere+)
A parallel emission function that inserts L2::cache_hint between the operation and type suffix to produce the extended format:
atom[.scope].op.L2::cache_hint.type
All 13 atomic operations are supported with L2 hints. The hint instructs the GPU L2 cache controller to retain (or evict) the target cache line after the atomic completes -- a data-locality optimization introduced with Ampere (SM 80).
The function uses SSE xmmword loads from precomputed string constants at addresses xmmword_435F590 through xmmword_435F620 to fast-copy 16-byte prefixes of each suffix string. This avoids per-character string construction: each atomic variant's complete suffix (e.g., .exch.L2::cache_hint.b at 22 bytes) is assembled from a 16-byte SSE load of the prefix plus a patched tail. The compiler optimized this into aligned vector moves rather than memcpy calls.
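The effect of the xmmword fast path can be reproduced with two plain copies: one 16-byte block for the shared prefix (one aligned vector move in the binary), one short tail. The addresses above are real; the byte counts here are just the string lengths:

```cpp
#include <cassert>
#include <cstring>

// Sketch: assemble the 22-byte .exch L2-hinted suffix the way the
// vectorized emitter does -- a 16-byte prefix copy plus a patched tail.
void buildExchL2Suffix(char out[32]) {
    std::memcpy(out, ".exch.L2::cache_", 16); // one 16-byte SSE move
    std::memcpy(out + 16, "hint.b", 7);       // tail + NUL terminator
}
```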
Atomic Emission Pseudocode
void emitAtomicOp(raw_ostream &OS, unsigned operand) {
unsigned scope = (operand >> 4) & 0xF;
unsigned opcode = (operand >> 16) & 0xFF; // BYTE2
OS << "atom";
if (scope == 1) OS << ".cta";
else if (scope == 2) OS << ".sys";
// scope 0 = implicit .gpu, no suffix
switch (opcode) {
case 0x00: OS << ".exch.b"; break;
case 0x01: OS << ".add.u"; break;
// 0x02 absent
case 0x03: OS << ".and.b"; break;
// 0x04 absent
case 0x05: OS << ".or.b"; break;
case 0x06: OS << ".xor.b"; break;
case 0x07: OS << ".max.s"; break;
case 0x08: OS << ".min.s"; break;
case 0x09: OS << ".max.u"; break;
case 0x0A: OS << ".min.u"; break;
case 0x0B: OS << ".add.f"; break;
case 0x0C: OS << ".inc.u"; break;
case 0x0D: OS << ".dec.u"; break;
case 0x0E: OS << ".cas.b"; break;
}
// Type width appended by caller
}
The L2-hinted variant (sub_21E6420) follows identical dispatch logic but emits .op.L2::cache_hint.type instead of .op.type.
Memory Barriers -- sub_21E94F0
| Value | Instruction | Scope |
|---|---|---|
| 0 | membar.gpu | Device |
| 1 | membar.cta | Block |
| 2 | membar.sys | System |
| 4 | fence.sc.cluster | Cluster (SM 90+) |
| 3 | -- | Fatal: "Bad membar op" |
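The dispatch is a four-way switch with a hole at value 3. A sketch (the function name is ours; the fatal path is represented by an empty string):

```cpp
#include <cassert>
#include <string>

// Sketch of the membar operand dispatch in sub_21E94F0. Value 3 is the
// "Bad membar op" fatal path in the real code.
std::string membarFor(unsigned v) {
    switch (v) {
    case 0: return "membar.gpu";
    case 1: return "membar.cta";
    case 2: return "membar.sys";
    case 4: return "fence.sc.cluster"; // cluster scope, SM 90+
    default: return "";                // real code: fatal "Bad membar op"
    }
}
```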
Cluster Barriers -- sub_21E8EA0 (SM 90+)
Encoding: bits[3:0] = operation (0=arrive, 1=wait), bits[7:4] = ordering (0=default, 1=relaxed).
| Instruction | Meaning |
|---|---|
| barrier.cluster.arrive | Signal arrival |
| barrier.cluster.arrive.relaxed | Relaxed-memory arrival |
| barrier.cluster.wait | Wait for all CTAs |
| barrier.cluster.wait.relaxed | Relaxed-memory wait |
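The nibble encoding composes the mnemonic from the two fields. A sketch (the helper name is ours):

```cpp
#include <cassert>
#include <string>

// Sketch of the cluster-barrier operand decode: bits[3:0] select
// arrive (0) vs. wait (1); bits[7:4] select default (0) vs. relaxed (1).
std::string clusterBarrier(unsigned operand) {
    std::string s = "barrier.cluster.";
    s += ((operand & 0xF) == 0) ? "arrive" : "wait";
    if (((operand >> 4) & 0xF) == 1)
        s += ".relaxed";
    return s;
}
```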
GenericToNVVM -- sub_215DC20 / sub_215E100
Pass Registration
| Field | Value |
|---|---|
| Pass name | "generic-to-nvvm" |
| Description | "Ensure that the global variables are in the global address space" |
| Pass ID | unk_4FD155C |
| Factory | sub_215D530 (allocates 320-byte state) |
| Disable knob | NVVMPassOptions[2200] (bool) |
| Pipeline position | After InstructionSimplify, before LoopSimplify (position ~22 in optimizer) |
Registration uses a once-init pattern guarded by dword_4FD1558. The 80-byte pass descriptor stores the description at offset 0, pass kind 64 (ModulePass) at offset 8, the name string at offset 16, its length 15 at offset 24, the pass ID pointer at offset 32, flags 0 at offset 40, and the factory function pointer at offset 72. Registration dispatches through sub_163A800 (the LLVM pass registration infrastructure).
A new-pass-manager version also exists: GenericToNVVMPass, registered at sub_305ED20 / sub_305E2C0 with CLI name "generic-to-nvvm".
Algorithm -- sub_215E100 (36KB)
The pass body at sub_215E100 is 36KB because it must rewrite every address-space-dependent use of every affected global. The factory function sub_215D530 allocates a 320-byte state object containing two DenseMap-like hash tables:
| Table | Offset | Purpose | Initial Capacity |
|---|---|---|---|
| GVMap | +168 | Old GlobalVariable -> New GlobalVariable | 128 buckets, 48 bytes/bucket |
| ConstMap | +248 | Old Constant -> New Constant (for constant expressions) | 128 buckets, 48 bytes/bucket |
The algorithm proceeds in three phases:
Phase 1 -- Clone globals. Iterate over all GlobalVariable objects in the module. For each global in addrspace(0) (the LLVM generic address space):
- Create a new GlobalVariable in addrspace(1) (NVPTX global memory) with identical initializer, linkage, alignment, and section attributes.
- Store the old-to-new mapping in GVMap.
Phase 2 -- Rewrite uses. For each cloned global:
- Create an addrspacecast instruction from the new global (addrspace(1)*) back to the original pointer type (addrspace(0)*). This preserves type compatibility with all existing uses.
- Call RAUW (replaceAllUsesWith) on the original global, substituting the addrspacecast value. All instructions, constant expressions, and metadata references that pointed to the original global now point through the cast.
- The ConstMap table handles the tricky case of constant expressions that embed a global reference: ConstantExpr::getAddrSpaceCast, ConstantExpr::getGetElementPtr, and similar must be reconstructed with the new global. This is the bulk of the 36KB function body -- a recursive walk over the constant expression tree, rebuilding each node.
Phase 3 -- Erase originals. Iterate GVMap and erase each original global from the module. The cleanup helper sub_215D780 iterates the map, properly managing LLVM Value reference counts during deletion.
The destructor at sub_215D1A0 / sub_215CE20 frees both hash tables and all stored Value references.
// Pseudocode for GenericToNVVM::runOnModule
bool runOnModule(Module &M) {
for (GlobalVariable &GV : M.globals()) {
if (GV.getAddressSpace() != 0) continue; // skip non-generic
if (GV.isDeclaration()) continue;
// Phase 1: Clone to addrspace(1)
GlobalVariable *NewGV = new GlobalVariable(
M, GV.getValueType(), GV.isConstant(),
GV.getLinkage(), GV.getInitializer(),
GV.getName(), /*InsertBefore=*/nullptr,
GV.getThreadLocalMode(), /*AddressSpace=*/1);
NewGV->copyAttributesFrom(&GV);
GVMap[&GV] = NewGV;
}
for (auto &[OldGV, NewGV] : GVMap) {
// Phase 2: addrspacecast + RAUW
Constant *Cast = ConstantExpr::getAddrSpaceCast(NewGV,
OldGV->getType());
OldGV->replaceAllUsesWith(Cast);
}
for (auto &[OldGV, NewGV] : GVMap) {
// Phase 3: Erase originals
OldGV->eraseFromParent();
}
return !GVMap.empty();
}
Why this exists. The CUDA frontend (EDG) generates globals in addrspace(0) (LLVM's generic/default address space). The NVPTX backend requires device globals to reside in addrspace(1) (GPU global memory) for correct PTX emission. GenericToNVVM bridges this mismatch. Upstream LLVM has an equivalent NVPTXGenericToNVVM pass, but cicc's version carries the additional ConstMap machinery for handling nested constant expression trees that reference relocated globals -- a case that upstream handles differently through its GenericToNVVM + NVPTXAssignValidGlobalAddresses split.
Global Constructor Rejection -- sub_215ACD0
if (lookup("llvm.global_ctors") && type_tag == ArrayType && count != 0)
fatal("Module has a nontrivial global ctor, which NVPTX does not support.");
if (lookup("llvm.global_dtors") && type_tag == ArrayType && count != 0)
fatal("Module has a nontrivial global dtor, which NVPTX does not support.");
GPU kernels have no "program startup" phase -- no __crt_init equivalent. Static initialization with non-trivial constructors is incompatible with the GPU execution model.
Global Variable Emission -- sub_2156420
Overview
The function sub_2156420 (20KB, printModuleLevelGV) handles PTX emission for individual global variables. It processes each global in the module, categorizing it by type (texture reference, surface reference, sampler reference, or data variable) and emitting the appropriate PTX declaration.
Skipped globals: "llvm.metadata", "llvm.*", "nvvm.*".
| Global Type | PTX Output |
|---|---|
| Texture reference | .global .texref NAME; |
| Surface reference | .global .surfref NAME; |
| Sampler reference | .global .samplerref NAME = { ... } |
| Managed memory | .attribute(.managed) |
| Demoted (addrspace 3) | // NAME has been demoted (comment only) |
Sampler Reference Initializer
Sampler references receive a structured initializer block with addressing mode, filter mode, and normalization settings. The emission format:
.global .samplerref my_sampler = {
addr_mode_0 = clamp_to_edge,
addr_mode_1 = wrap,
addr_mode_2 = mirror,
filter_mode = linear,
force_unnormalized_coords = 1
};
The addressing mode values are selected from four string literals:
| Value | String |
|---|---|
| 0 | "wrap" |
| 1 | "clamp_to_border" |
| 2 | "clamp_to_edge" |
| 3 | "mirror" |
Filter mode selects between "nearest" and "linear". The force_unnormalized_coords field is emitted only when the sampler uses unnormalized texture coordinates (integer addressing).
Address Space Qualifiers
sub_214FA80 maps NVPTX address space numbers to PTX qualifier strings (0=no qualifier, 1=.global, 3=.shared, 4=.const, 5+=.local). See Address Spaces for the complete mapping including tensor memory, shared cluster, and param spaces.
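A sketch of the number-to-qualifier mapping, covering only the values quoted here (the full table, including tensor memory, shared cluster, and param spaces, is on the Address Spaces page):

```cpp
#include <assert.h>
#include <string>

// Sketch of sub_214FA80's address-space-to-qualifier mapping, restricted
// to the rows listed in the text: 0=none, 1=.global, 3=.shared, 4=.const,
// 5+=.local. Other values are left unqualified in this sketch.
std::string asQualifier(unsigned addrSpace) {
    switch (addrSpace) {
    case 0: return "";         // generic: no qualifier
    case 1: return ".global";
    case 3: return ".shared";
    case 4: return ".const";
    default: return addrSpace >= 5 ? ".local" : "";
    }
}
```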
Additional attributes emitted by sub_214FEE0:
.attribute(.managed)for CUDA managed memory globals.attribute(.unified)or.attribute(.unified(N))for unified addressing
Data Type Emission
For aggregate or large types, the emitter uses .b8 NAME[SIZE] (byte array). For pointer types with initializers, it selects .u32 or .u64 arrays depending on the pointer width flag at *(a1+232)->field_936. Simple scalar types use the type from sub_214FBF0 (.u32, .u64, .f32, .f64, etc.).
Invalid Address Space Detection
If a global has an initializer in an address space that does not support static initialization:
fatal("initial value of 'NAME' is not allowed in addrspace(N)");
This diagnostic is emitted via sub_1C3F040.
Global Variable Ordering -- sub_2157D50 (Topological Sort)
Problem
Global variables with initializers can reference other globals. If global A's initializer contains a reference to global B, then B must be emitted before A in the PTX output. Circular dependencies are illegal and must be detected.
Algorithm -- DFS Topological Sort
sub_2157D50 (5.9KB) implements a depth-first topological sort over the global use-def chains. The algorithm:
- Build dependency graph. For each global variable in the emission set, walk its initializer constant expression tree. Every GlobalVariable reference found in the initializer creates a directed edge from the referencing global to the referenced global.
- DFS with three-color marking. Each global is in one of three states:
  - White (unvisited): not yet processed.
  - Gray (in progress): currently on the DFS stack -- its subtree is being explored.
  - Black (finished): all of its dependencies have been emitted.
- Visit procedure. For each white global, mark it gray and recurse into its dependencies. When all dependencies return, mark it black and push it onto the output ordering (post-order).
- Cycle detection. If the DFS encounters a gray node, a back-edge has been found, which means a circular dependency. The pass emits the fatal diagnostic:
"Circular dependency found in global variable set"
This is a hard error -- cicc cannot emit globals with mutual references. The PTX format requires a linear declaration order, and there is no forward-declaration mechanism for global variable initializers.
Pseudocode
// sub_2157D50 — topological sort of globals for PTX emission
void orderGlobals(SmallVectorImpl<GlobalVariable *> &Ordered,
                  ArrayRef<GlobalVariable *> Globals) {
  enum Color { White, Gray, Black };
  DenseMap<GlobalVariable *, Color> color;
  for (GlobalVariable *GV : Globals)
    color[GV] = White;
  std::function<void(GlobalVariable *)> visit =
      [&](GlobalVariable *GV) {
        if (color[GV] == Black) return;
        if (color[GV] == Gray)
          fatal("Circular dependency found in global variable set");
        color[GV] = Gray;
        // Walk the initializer for GlobalVariable references.
        if (GV->hasInitializer())
          for (GlobalVariable *Dep : globalsReferencedBy(GV->getInitializer()))
            if (color.count(Dep))
              visit(Dep);
        color[GV] = Black;
        Ordered.push_back(GV);
      };
  for (GlobalVariable *GV : Globals)
    if (color[GV] == White)
      visit(GV);
}
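The pseudocode above can be exercised as a standalone model. The following sketch uses plain ints for globals and an explicit dependency map — both stand-ins for the real LLVM types — and returns false instead of calling fatal() on a cycle; names here are invented for illustration:

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <vector>

// Standalone model of the three-color ordering: deps[g] lists the globals
// that g's initializer references. Returns false when a cycle is found.
bool orderGlobals(const std::map<int, std::vector<int>> &deps,
                  std::vector<int> &ordered) {
  enum Color { White, Gray, Black };
  std::map<int, Color> color;
  for (auto &kv : deps) color[kv.first] = White;
  bool ok = true;
  std::function<void(int)> visit = [&](int g) {
    if (!ok || color[g] == Black) return;
    if (color[g] == Gray) { ok = false; return; }  // back-edge: cycle
    color[g] = Gray;
    auto it = deps.find(g);
    if (it != deps.end())
      for (int d : it->second)
        if (color.count(d)) visit(d);
    color[g] = Black;
    ordered.push_back(g);  // post-order: dependencies land first
  };
  for (auto &kv : deps)
    if (color[kv.first] == White) visit(kv.first);
  return ok;
}
```

Because emission is post-order, a global whose initializer references another always appears after its dependency in the output list, matching the PTX declaration-order requirement.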
Interaction with Sampler References
Sampler reference globals can have structured initializers that reference other sampler state. These initializers are walked by the same DFS traversal. The topological sort ensures that any sampler whose initializer references another sampler or texture object appears after its dependencies in the PTX output.
Call Context
sub_2157D50 is called from the module-level emission entry (sub_215ACD0 -> sub_214F370) after all globals have been collected but before any global PTX text is written. The ordered list is then iterated by sub_2156420 to emit each global in dependency order.
Output Mode Selection
Compilation output mode is controlled by a bitmask in the a13 mode flags parameter, passed through the pipeline from the CLI flag parser (sub_95C880). The low bits encode the output format, while bits 8--9 encode the address width (32/64-bit).
Mode Flag Bitmask
| Bits | Value | Mode | Description |
|---|---|---|---|
| [2:0] | 0x07 | Phase control | Default = 7 (all phases: lnk + opt + llc) |
| [4] | 0x10 | Debug | Debug compile or line-info enabled |
| [5] | 0x20 | LTO gen | LTO generation enabled |
| combined | 0x21 | gen-lto | Generate LTO bitcode for later linking |
| combined | 0x23 | full LTO | Complete LTO compilation (lnk + opt + lto) |
| combined | 0x26 | link-lto | Link-time LTO phase (consume LTO bitcode) |
| combined | 0x43 | OptiX IR | Emit .optixir format |
| [7] | 0x80 | gen-opt-lto | Lowering flag for LTO |
| [8] | 0x100 | nvvm-64 | 64-bit pointer mode |
| [9] | 0x200 | nvvm-32 | 32-bit pointer mode |
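As a reading aid, the bit assignments above can be modeled as a small decoder. The struct and function names below are invented for illustration, not recovered symbols:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the a13 mode-flag bit layout described above.
struct ModeInfo {
  unsigned phases;  // bits [2:0]: 7 = all phases (lnk + opt + llc)
  bool debug;       // bit 4 (0x10): debug compile or line-info
  bool ltoGen;      // bit 5 (0x20): LTO generation enabled
  bool is64Bit;     // bit 8 (0x100): nvvm-64 pointer mode
};

inline ModeInfo decodeMode(uint32_t a13) {
  ModeInfo m;
  m.phases  = a13 & 0x07;
  m.debug   = (a13 & 0x10) != 0;
  m.ltoGen  = (a13 & 0x20) != 0;
  m.is64Bit = (a13 & 0x100) != 0;
  return m;
}
```

For example, the full-LTO value 0x23 decodes to phases = 3 with the LTO-gen bit set, consistent with the "lnk + opt + lto" description in the table.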
CLI Flag to Mode Mapping
| CLI Flag | Mode Bits Set | Pipeline Effect |
|---|---|---|
| (default) | 0x07 | All phases run, PTX text output |
| --emit-llvm-bc | (EDG flag id=59) | Emit raw LLVM bitcode .bc after optimization |
| --emit-optix-ir | (a13 & 0x300) \| 0x43 | Disables IP-MSP and LICM, emits .optixir |
| -gen-lto | (a13 & 0x300) \| 0x21 | Generates LTO-compatible bitcode |
| -gen-lto-and-llc | a13 \| 0x20 | LTO generation plus LLC codegen |
| -link-lto | (a13 & 0x300) \| 0x26 | Consumes LTO bitcode for final compilation |
| -lto | (a13 & 0x300) \| 0x23 | Full LTO mode (all phases) |
| -split-compile=N | (stored at offset+1480) | Per-function compilation, F%d_B%d output naming |
OptiX IR Mode
The --emit-optix-ir flag is valid only when the compilation mode is CUDA (a4 == 0xABBA) or OpenCL (a4 == 0xDEED). It forces two optimizer passes to be disabled by routing "-do-ip-msp=0" and "-do-licm=0" to the opt phase. The output is an .optixir file containing NVVM IR in a format consumable by the OptiX ray-tracing runtime for JIT compilation. See OptiX IR for the full format details.
Split Compilation
The -split-compile=N flag (stored at options offset +1480, with a sentinel at +1488 to detect double-definition) enables per-function or per-block compilation for large kernels. The pipeline assembler at sub_12E54A0 generates output identifiers using the "F%d_B%d" format string (function index, block index). Each split unit is compiled independently and the results are linked back together. An extended variant -split-compile-extended=N sets the additional flag at offset +1644.
When split-compile is active, the optimization level is set to negative (typically -1), triggering special handling in sub_12E1EF0: each compiled function's bitcode is re-read via sub_153BF40, validated against the "<split-module>" identifier, and linked back through sub_12F5610 with linkage attributes restored from a hash table.
LTO Modes
Three LTO modes interact with emission:
- gen-lto (0x21): Runs optimization but skips LLC. Output is optimized LLVM bitcode suitable for later link-time optimization. The -gen-lto string is forwarded to the LTO phase.
- link-lto (0x26): Consumes bitcode produced by gen-lto. Runs the LTO linker and optimizer, then proceeds to LLC for final codegen. The -link-lto string is forwarded.
- full LTO (0x23): Single-invocation LTO that runs all phases including linking and codegen.
Bitcode Producer ID
The bitcode writer at sub_1538EC0 (58KB, writeModule) stamps "LLVM7.0.1" as the producer identification string in the IDENTIFICATION_BLOCK of every output bitcode file. This is despite cicc being built on LLVM 20.0.0 internally.
Dual-Constructor Mechanism
Two separate global constructors manage producer version strings, both reading the same environment variable but with different defaults:
| Constructor | Address | Default | Stored At | Purpose |
|---|---|---|---|---|
| ctor_036 | 0x48CC90 | "20.0.0" | qword_4F837E0 | True LLVM version (internal use) |
| ctor_154 | 0x4CE640 | "7.0.1" | (separate global) | NVVM IR compatibility marker |
Both constructors execute this logic:
char *result = getenv("LLVM_OVERRIDE_PRODUCER");
if (!result) result = default_string; // "20.0.0" or "7.0.1"
producer_global = result;
The bitcode writer uses the ctor_154 value, producing "LLVM" + "7.0.1" = "LLVM7.0.1" in the output. Setting LLVM_OVERRIDE_PRODUCER in the environment overrides both constructors to the same value.
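Both constructors follow the common getenv-with-fallback idiom; a minimal sketch (the function name here is hypothetical) is:

```cpp
#include <cstdlib>
#include <string>

// Model of the shared ctor_036 / ctor_154 logic: both read the
// LLVM_OVERRIDE_PRODUCER environment variable, but each falls back
// to its own default string ("20.0.0" vs "7.0.1").
std::string producerVersion(const char *fallback) {
  const char *v = std::getenv("LLVM_OVERRIDE_PRODUCER");
  return v ? std::string(v) : std::string(fallback);
}
```

This also illustrates why setting LLVM_OVERRIDE_PRODUCER forces both globals to the same value: the override takes precedence over either default.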
Why "LLVM7.0.1"
The "LLVM7.0.1" string is the NVVM IR compatibility marker. It signals that the bitcode format conforms to the NVVM IR specification originally based on LLVM 7.0.1's bitcode structure. Even though cicc's internal passes operate at LLVM 20.0.0 capability, the output bitcode format (record encoding, metadata layout, type table) is constrained to be readable by older NVVM toolchain components (libNVVM, nvdisasm, Nsight) that expect LLVM 7.x-era bitcode. The writer achieves this by:
- Using the `IDENTIFICATION_BLOCK` producer string to declare compatibility.
- Constraining the `MODULE_BLOCK` record types to the LLVM 7.x repertoire.
- Enforcing `nvvmir.version` metadata with major == 3, minor <= 2.
The disable-bitcode-version-upgrade cl::opt (registered in ctor_036) controls whether the bitcode reader accepts version mismatches during ingestion.
Related Environment Variable
NVVM_IR_VER_CHK=0 bypasses the NVVM IR version validation at sub_157E370 and sub_12BFF60, which normally enforces major == 3, minor <= 2 and fatals with "Broken module found, compilation aborted!" on mismatch.
Address Space Operations -- sub_21E7FE0
Multi-purpose helper for cvta, MMA operands, and address space qualifiers:
| Query | Values | Output |
|---|---|---|
| "addsp" | 0=generic, 1=.global, 3=.shared, 4+=.local | cvta address space suffix |
| "ab" | 0="a", 1="b" | cvta direction |
| "rowcol" | 0="row", 1="col" | MMA layout |
| "mmarowcol" | 0--3 | "row.row"/"row.col"/"col.row"/"col.col" |
| "satf" | 0=(none), 1=".satfinite" | MMA saturation |
| "abtype" | 0--6 | "u8"/"s8"/"u4"/"s4"/"b1"/"bf16"/"tf32" |
| "trans" | 0=(none), 1=".trans" | WGMMA transpose |
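A sketch of how such a string-keyed query dispatch might look, covering a subset of the table above (the function name and structure are assumptions, not the recovered decompilation):

```cpp
#include <cassert>
#include <string>

// Hypothetical model of a sub_21E7FE0-style helper: a query name plus a
// small integer operand selects a PTX modifier/suffix string.
std::string asQuery(const std::string &query, unsigned v) {
  if (query == "addsp") {
    switch (v) {
      case 0:  return "";         // generic: no suffix
      case 1:  return ".global";
      case 3:  return ".shared";
      default: return ".local";   // 4+ per the table
    }
  }
  if (query == "rowcol") return v ? "col" : "row";
  if (query == "mmarowcol") {
    static const char *tbl[] = {"row.row", "row.col", "col.row", "col.col"};
    return tbl[v & 3];
  }
  if (query == "satf")  return v ? ".satfinite" : "";
  if (query == "trans") return v ? ".trans" : "";
  return "";
}
```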
Architecture-Gated Features
| Feature | Min Architecture | Evidence |
|---|---|---|
| Basic atomics (all 13 ops) | SM 20+ (all) | sub_21E5E70, no arch check |
| Atomic scopes (.cta/.sys) | SM 60+ (Pascal) | Scope bits in operand |
| L2 cache-hinted atomics | SM 80+ (Ampere) | sub_21E6420 separate function |
| membar.cta/gpu/sys | SM 20+ (all) | sub_21E94F0, no arch check |
| fence.sc.cluster | SM 90+ (Hopper) | Opcode 4 in membar handler |
| barrier.cluster.arrive/wait | SM 90+ (Hopper) | sub_21E8EA0 entire function |
| Cluster special registers (15) | SM 90+ (Hopper) | sub_21E9060 entire function |
| MMA row/col layout | SM 70+ (Volta) | mmarowcol in sub_21E7FE0 |
| MMA abtype: bf16/tf32 | SM 80+ (Ampere) | Ampere-class MMA formats |
| .trans modifier (WGMMA) | SM 90+ (Hopper) | WGMMA transpose |
Key Global Variables
| Variable | Purpose |
|---|---|
| byte_4FD17C0 | Pass configuration flag |
| byte_4FD16E0 | ISel dump enable |
| byte_4FD2160 | Extra ISel pass enable |
| dword_4FD26A0 | Scheduling mode (1=simple, else=full pipeline) |
| unk_4FD155C | GenericToNVVM pass ID |
| dword_4FD1558 | GenericToNVVM once-init guard |
| qword_4F837E0 | True LLVM producer version ("20.0.0") |
ptxas Interaction
The PTX text emitted by cicc is not executed directly -- it is consumed by ptxas, which parses the PTX back into an internal IR, applies its own optimization and scheduling passes (195+ knobs), performs hardware register allocation, and emits SASS machine code. Every formatting decision in emission (register naming with %r<N> angle-bracket counts, .pragma annotations, kernel attribute placement) must conform to what ptxas's PTX parser expects. The "LLVM7.0.1" producer string exists specifically because ptxas gates certain parsing behaviors on the declared producer version. Emission quality directly affects ptxas optimization scope: cleaner PTX with fewer redundant moves gives ptxas more freedom to schedule and allocate efficiently.
Cross-References
- OptiX IR -- OptiX IR output format details
- Bitcode I/O -- Bitcode reader/writer and "LLVM7.0.1" producer
- Register Classes -- Consolidated register class reference
- Address Spaces -- Consolidated address space reference
- AsmPrinter -- AsmPrinter infrastructure
- nvcc Interface -- CLI flag routing from nvcc to cicc
Debug Info Pipeline
Debug information in cicc follows a four-stage lifecycle: generation in the EDG/IR-generation frontend, preservation and selective stripping in the optimizer, verification after each pass, and emission as .loc/.file directives in the PTX backend. This page traces the full journey of debug metadata from CUDA source to PTX output, covering the three compilation modes (-g, -generate-line-info, neither), the five stripping passes, the NVIDIA-custom verification infrastructure, and the backend emission format with its non-standard inlined-at extension. Understanding this flow is essential for anyone reimplementing cicc's debug info contract, because the NVPTX target's debug model is fundamentally different from x86 DWARF: PTX is a virtual ISA with no physical registers, no real stack, and no fixed instruction encoding, so the debug metadata cicc emits is consumed by ptxas rather than directly by a debugger.
| Debug info generation | sub_9433F0 (per-parameter), sub_943430 (per-global), sub_941230 (source location) |
| Debug version module flag | sub_915400 -- emits "Debug Info Version" = 3 |
| Flag filter | sub_12C6910 -- checks -debug-compile, -g, -generate-line-info |
| Verification pass | sub_29C8000 (12,480B, 434 BBs) -- runs after each optimization pass |
| Per-instruction verifier | sub_29C3AB0 (5,592B) |
| Debugify injector | sub_29C1CB0 |
| Stripping passes | #110--#114 in the pipeline parser |
| .loc emission | sub_31D55F0 (per-instruction), sub_31E4280 (function-scope .file/.loc) |
| DWARF section emission | sub_399B1E0 (29KB, DwarfDebug::beginModule) |
| NVVM container field | DebugInfo at container offset +12 (enum: NONE/LINE_INFO/DWARF) |
| cl::opt registration | ctor_043 at 0x48D7F0 -- debug-compile, generate-line-info, line-info-inlined-at |
Three Compilation Modes
cicc supports three debug info levels. The mode is selected at the CLI layer and propagated through the flag dispatch table into both the optimizer and the backend. The flag filter function sub_12C6910 reads the CLI flags and routes them to the appropriate pipeline stages.
| CLI flag | Flag struct offset | Routing | NVVM container DebugInfo | DICompileUnit emission kind |
|---|---|---|---|---|
| -g | +296 | -debug-compile to LNK and OPT stages | NVVM_DEBUG_INFO_DWARF (2) | FullDebug |
| -generate-line-info | +328 | -generate-line-info to OPT stage only | NVVM_DEBUG_INFO_LINE_INFO (1) | LineTablesOnly |
| (neither) | -- | -- | NVVM_DEBUG_INFO_NONE (0) | NoDebug |
The distinction between -g and -generate-line-info is critical and non-obvious:
- `-g` routes as `-debug-compile` to both the linker (LNK) and optimizer (OPT) stages. The linker stage needs the flag because libdevice linking must preserve debug info from the user module when merging with the stripped libdevice bitcode. The optimizer preserves all metadata: DICompileUnit, DISubprogram, DILocalVariable, DIType, scope chains, dbg.value()/dbg.declare() intrinsics -- everything. The backend emits complete DWARF sections. cuda-gdb can step through source, inspect variables, and reconstruct inlined call stacks.
- `-generate-line-info` routes only to the OPT stage (not the linker). Early in the optimizer, StripNonLineTableDebugInfoPass strips all metadata except DILocation/DISubprogram/DICompileUnit with LineTablesOnly emission kind. This is enough for profiler source correlation (Nsight Compute maps .loc directives back to source lines) but not enough for variable inspection or source-level debugging in cuda-gdb.
- Neither flag: no debug metadata is generated. The IR-generation frontend skips all debug calls (the dword_4D046B4 / [ctx+0x170] guards prevent emission), and the module has no llvm.dbg.cu named metadata. The verification pass detects this in Phase 1 and returns immediately.
Stage 1: Frontend Debug Metadata Generation
EDG IL-to-IR Layer
The IR generation frontend creates debug metadata when the debug info flag is active. Two independent guards control this:
- dword_4D046B4: a global flag checked at parameter and statement codegen entry points. When set, the function prolog emitter (sub_938240 / Path B equivalent) calls sub_9433F0 to emit DILocalVariable metadata for each parameter, and the statement emitter (sub_9363D0) calls sub_941230 to set the IR builder's debug location from the EDG source position.
- [ctx+0x170]: a pointer to the DICompileUnit object in the codegen context. When non-null, the global variable emitter (sub_916430 and friends) calls sub_943430 to attach debug metadata to each GlobalVariable, and the module finalizer (sub_915400) emits the "Debug Info Version" module flag with value 3.
The metadata hierarchy created during IR generation:
DICompileUnit
[ctx+0x170], emission kind: FullDebug or LineTablesOnly
├── DIFile (per source file)
├── DISubprogram (per __global__ / __device__ function)
│ ├── DILocalVariable (per parameter, via sub_9433F0)
│ │ arg: 1-based index from v10 in the parameter iteration loop
│ │ scope: parent DISubprogram
│ │ file, line, type: from EDG declaration node
│ ├── DILocalVariable (per auto variable, via statement codegen)
│ └── DILocation (per instruction, via sub_941230)
│ line, column: from EDG source position
│ scope: nearest enclosing DILexicalBlock or DISubprogram
└── DIGlobalVariable (per device-side global, via sub_943430)
[gv+0xAD] < 0 indicates debug info present on the GlobalVariable
The module finalizer sub_915400 runs after all globals and functions have been code-generated. Its debug-relevant actions:
- Calls sub_9151E0 to emit nvvmir.version metadata. When [ctx+0x170] is non-null, the version tuple has 4 operands instead of 2, including address-space-qualified indices.
- Calls sub_914410 to emit nvvm.annotations metadata.
- If [ctx+0x170] != 0: calls sub_BA93D0 (Module::addModuleFlag) with ("Debug Info Version", 3). This module flag is mandatory -- without it, LLVM's DWARF backend refuses to emit debug sections.
DIBuilder Infrastructure
The actual metadata node creation uses LLVM's DIBuilder infrastructure at 0xAD0000--0xAF0000 (Zone 2 of the type system module). This includes DIBasicType / DIDerivedType / DICompositeType uniquing, scope chain construction, and the standard LLVM !dbg attachment API. cicc uses the standard LLVM DIBuilder without modifications -- the NVIDIA-specific aspects are in the calling patterns (which EDG nodes map to which DI metadata), not in the metadata creation API itself.
Stage 2: Optimizer Preservation and Stripping
The StripNonLineTableDebugInfoPass
When -generate-line-info is active (but not -g), the optimizer runs StripNonLineTableDebugInfoPass ("strip-nonlinetable-debuginfo", pipeline parser slot #114) early in the pipeline. This pass:
- Strips all DILocalVariable and DIGlobalVariable metadata
- Removes all dbg.value() and dbg.declare() intrinsics
- Strips DIType nodes, imported entities, and retained nodes
- Downgrades DICompileUnit emission kind from FullDebug to LineTablesOnly
- Preserves DISubprogram, DILocation, DIFile, and DICompileUnit (the minimum needed for .loc directives)
After this pass, the module has enough metadata for line-table-based profiling but not for source-level debugging.
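The keep/strip split can be summarized as a predicate over metadata kinds. This is a deliberate simplification for illustration -- the real pass operates on the metadata graph, not on kind names:

```cpp
#include <cassert>
#include <string>

// Simplified predicate: which metadata kinds survive
// StripNonLineTableDebugInfoPass in -generate-line-info mode.
bool survivesLineTableStrip(const std::string &kind) {
  return kind == "DILocation" || kind == "DISubprogram" ||
         kind == "DIFile" || kind == "DICompileUnit";
}
```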
The Five Stripping Passes
cicc registers five debug stripping passes in the pipeline parser, all standard LLVM passes:
| Pipeline name | Slot | LLVM pass class | What it strips | What survives |
|---|---|---|---|---|
| "strip-dead-debug-info" | #110 | StripDeadDebugInfoPass | Debug info for dead functions/globals | Everything for live code |
| "strip-debug-declare" | #112 | StripDebugDeclarePass | dbg.declare() intrinsics only | dbg.value(), all metadata |
| "strip-nondebug" | #113 | StripNonDebugSymbolsPass | Non-debug symbols | All debug metadata |
| "strip-nonlinetable-debuginfo" | #114 | StripNonLineTableDebugInfoPass | Everything except line tables | DILocation, DISubprogram, DIFile |
| (core stripping at 0xAE0000) | -- | stripDebugInfo() | All llvm.dbg.* intrinsics | Nothing |
The core debug stripping implementation at 0xAE0000 (Zone 3 of the type system module) is the nuclear option -- it calls stripDebugInfo() to remove everything. The four named passes provide finer granularity.
Optimizer Pass Behavior with Debug Info
Every standard LLVM optimization pass is expected to preserve debug metadata it does not intentionally modify. In practice, some passes degrade debug info quality:
Passes that preserve debug info well:
- InstCombine: updates dbg.value() when simplifying instructions, uses replaceAllDbgUsesWith
- SROA: splits dbg.declare() into multiple dbg.value() fragments when decomposing allocas
- GVN: preserves debug locations on replacement instructions
- SimplifyCFG: maintains DILocation through block merging
Passes that commonly degrade debug info:
- Inlining: creates new DISubprogram for inlined functions, must maintain inlined-at chains. Failure to do so triggers the verifier's "did not generate DISubprogram" diagnostic.
- LoopUnroll: duplicates instructions without always duplicating DILocation scope context
- LICM: moves instructions out of loops, potentially detaching them from their original scope
- Dead code elimination: removes instructions along with their dbg.value() references
- Tail merging / BranchFolding: merges basic blocks from different source scopes
The verification pass (sub_29C8000) runs after each optimization pass and tracks exactly which passes degrade debug info. When the debugify-each knob is active, the full Debugify-then-CheckDebugify cycle runs around every pass, injecting synthetic debug metadata before the pass and verifying it survived afterward.
Stage 3: Debug Info Verification
The verification pass sub_29C8000 is documented in detail on the Debug Info Verification page. Here we summarize its role in the pipeline.
Pipeline Integration Protocol
The pipeline runner invokes the verifier as a sandwich around each optimization pass:
// Pseudocode for the verification protocol
snapshot_debug_metadata(M); // Phase 2 of sub_29C8000: 8 hash tables
run_optimization_pass(M, "instcombine");
sub_29C8000(M, errs(), dbgCU, hashMap, "instcombine", 11, file, fileLen, jsonOut);
// Returns: true = PASS, false = FAIL (debug info degraded)
The pass name argument lets the JSON report attribute degradation to the specific pass responsible. The eight-table metadata snapshot captures DISubprogram, DIScope, DIGlobalVariable, DILocalVariable, DIType, DIImportedEntity, DILabel, and retained nodes -- far more comprehensive than upstream LLVM's CheckDebugInfoPass, which only tracks subprograms and debug variable intrinsics.
Verification Modes
Three modes of debug verification exist, controlled by LLVM knobs:
| Mode | Knob | What runs |
|---|---|---|
| Standard | verify-each or verify-after-all | sub_29C8000 after every pass |
| Debugify | debugify-each | sub_29C1CB0 (inject) + pass + sub_29C8000 (check) |
| Selective | verify-debuginfo-preserve | Lighter-weight preservation checking |
The Debugify mode is especially powerful: it first injects synthetic debug metadata via sub_29C1CB0 (ensuring every instruction has a DILocation and every variable has dbg.value()), then runs the optimization pass, then checks whether the synthetic metadata survived. This detects passes that drop debug info even when the original module had sparse or no debug metadata.
Behavior in -generate-line-info Mode
When the module is in LineTablesOnly mode (after StripNonLineTableDebugInfoPass has run), the verifier still executes but its scope is narrower. Phase 5 (per-function debug variable checking) skips variable intrinsic validation because dbg.value()/dbg.declare() were intentionally stripped. Only Phase 6 (per-instruction DILocation verification via sub_29C3AB0) remains fully active, checking that:
- Every instruction with a DebugLoc has a valid DILocation
- DILocation scope chains resolve to a valid DISubprogram
- No orphaned debug locations reference deleted subprograms
- BB-level consistency is maintained
Stage 4: Backend Emission
The .loc Directive
The AsmPrinter emits DWARF .loc directives as inline annotations in the PTX instruction stream. The per-instruction emitter sub_31D55F0 runs after each real (non-meta) instruction when HasDebugInfo (r15+0x1E8) is set. It reads the DebugLoc attached to each MachineInstr and emits:
.loc 1 42 0
ld.param.u64 %rd1, [_Z6kernelPf_param_0];
.loc 1 43 5
mul.wide.u32 %rd2, %r1, 4;
The function-scope emitter sub_31E4280 handles .file directives that establish the file index table, and sub_31E6100 (insertDebugLocEntry) maintains a file/line-to-MCSymbol mapping for MBB boundaries used in DWARF line table construction.
The NVIDIA Inlined-At Extension
Standard LLVM .loc emits only file line column. cicc extends .loc with function_name and inlined_at attributes that encode the full inlining chain:
.loc 1 42 0, function_name _Z6kernelPf, inlined_at 2 15 3
This allows ptxas to reconstruct the complete call stack at any point in inlined code, so cuda-gdb can show the user which function was inlined and where. The implementation in the AsmPrinter:
- Reads the DebugLoc from the MachineInstr
- Walks the inlined-at chain via DebugLoc::getInlinedAt()
- Builds a work list (SmallVector<DebugLoc, 8>) of the full chain
- Emits in reverse order (outer locations before inner) so ptxas sees the outermost caller first
- Tracks already-emitted inlined-at locations in an InlinedAtLocs set to prevent duplicates
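Under the simplifying assumption that a location is just file/line/column plus an optional inlined-at pointer (a stand-in for LLVM's DILocation), the chain walk and outermost-first reversal look like:

```cpp
#include <cassert>
#include <vector>

// Minimal model of an inlined-at chain: each location optionally points
// at the location it was inlined at; null marks the outermost frame.
struct Loc {
  int file, line, col;
  const Loc *inlinedAt;
};

// Walk innermost -> outermost, then reverse so the outermost caller is
// emitted first, as the AsmPrinter does for ptxas.
std::vector<const Loc *> chainOutermostFirst(const Loc *l) {
  std::vector<const Loc *> work;  // innermost first
  for (; l; l = l->inlinedAt)
    work.push_back(l);
  return {work.rbegin(), work.rend()};  // reversed: outermost first
}
```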
The line-info-inlined-at LLVM knob (registered at 0x48D7F0, cl::opt<bool>) controls whether this extension is active. The CLI flag -no-lineinfo-inlined-at disables it by setting -line-info-inlined-at=0 on the backend command line. When disabled, only the immediate source location is emitted, losing inlining context but producing smaller PTX.
The dwarf-extended-loc Knob
The dwarf-extended-loc knob (enum: Default/Enable/Disable, registered at 0x490000 area) controls whether extended flags appear in .loc directives:
| Value | Effect |
|---|---|
| Default (0) | Platform-dependent behavior |
| Enable (1) | Emit is_stmt, prologue_end, discriminator extensions |
| Disable (2) | Bare .loc file line column only |
The Disable mode exists for compatibility with older ptxas versions that do not parse extended .loc flags. When enabled, the extended flags allow cuda-gdb to identify statement boundaries (is_stmt), function entry points (prologue_end), and distinguish between multiple code paths at the same source line (discriminator).
Source Interleaving
The -show-src CLI flag (flag struct offset +808, routed to the backend as -nvptx-emit-src) enables the InterleaveSrcInPtx mode. When active, the AsmPrinter reads source file lines and emits them as comments interleaved with the PTX:
// kernel.cu:42 float val = input[idx];
.loc 1 42 0
ld.global.f32 %f1, [%rd2];
// kernel.cu:43 val = val * val;
.loc 1 43 0
mul.f32 %f2, %f1, %f1;
This is purely a readability feature -- the comments are ignored by ptxas and have no effect on debug quality. The nvptx-emit-src LLVM knob description string is "Emit source line in ptx file".
.file Directive Emission
The .file directives are emitted by emitDwarfFileEntries during doFinalization (sub_3972F10, 24KB). They map source filenames to numeric file indices referenced by .loc:
.file 1 "/path/to/kernel.cu"
.file 2 "/usr/local/cuda/include/cuda_runtime.h"
The file table is built incrementally as .loc directives reference new files during instruction emission. The DWARF line section symbols are created via sub_E808D0 (createTempSymbol for DwarfLineSection) and bound via sub_E81A00 (emitDwarfLineSection).
DWARF Section Emission
When full debug info (-g) is active, a separate DWARF emission module at 0x3990000--0x39DF000 generates complete DWARF debug sections. This is standard LLVM DWARF emission with no significant NVIDIA modifications to the section format:
| Address | Size | Function |
|---|---|---|
| sub_399B1E0 | 29KB | DwarfDebug::beginModule() -- initializes from llvm.dbg.cu, strings: "DWARF Debug Writer", "DWARF Emission" |
| sub_3997B50 | 33KB | .debug_aranges emission -- address range tables |
| sub_399D1D0 | 12KB | Range list emission (DW_RLE_base_address, DW_RLE_offset_pair, DW_RLE_start_length) |
| sub_399EB70 | 12KB | Register location expressions -- strings: "no DWARF register encoding", "sub-register" |
| sub_39BDF60 | 38KB | .debug_names accelerator table -- bucket count, name count, augmentation string |
| sub_39B6390 | 33KB | DWARF form size calculator -- switch on DW_FORM_* codes |
| sub_215ACD0 | 8.1KB | Module-level emission entry (NVPTX Debug Info Emission) |
The module-level entry sub_215ACD0 checks *(a1+240)->field_344 to determine if DWARF is enabled, then looks up the "NVPTX DWARF Debug Writer" / "NVPTX Debug Info Emission" pass info. The NVPTX backend does not emit physical register locations -- GPUs have no DWARF register numbering scheme that maps to hardware. Instead, it emits virtual register references that ptxas resolves through SASS-level debug info.
The DWARF string/enum tables at 0xE00000--0xE0FFFF (tag-to-string conversion, attribute-to-string, operation encoding) are stock LLVM 20 BinaryFormat/Dwarf.cpp utilities with no visible NVIDIA modifications.
.target Debug Suffix
The header emission function sub_214F370 appends , debug to the .target directive when MCAsmInfo::doesSupportDebugInformation() returns true:
.target sm_90, texmode_independent, debug
This suffix tells ptxas that the PTX contains debug information and should be processed accordingly. Without it, ptxas ignores .loc and .file directives.
NvvmDebugVersion
The NVVM container format includes a debug version field at header bytes 0x08--0x09:
| Offset | Size | Field |
|---|---|---|
| 0x08 | 1 byte | NvvmDebugVersion.Major |
| 0x09 | 1 byte | NvvmDebugVersion.Minor |
Current version: Major=3, Minor<=2. The version check logic in sub_CD41B0:
- `Major` must equal 3 (hard fail on mismatch: "not compatible" error, returns NULL)
- `Minor` > 2: warning printed, parse continues
- If absent: the default `{3, 2}` is assumed

This version tracks the debug metadata schema independently of the NVVM IR version (NvvmIRVersion at 0x06--0x07, current Major=2, Minor<=0x62). The separation allows debug format evolution without breaking IR compatibility -- NVIDIA can add new debug metadata fields (e.g., for new SM features) without requiring a full IR version bump.
The container's DebugInfo field (at deserialized struct offset +12) also encodes the debug level as an enum that must be consistent with the module metadata:
enum NvvmDebugInfo {
NVVM_DEBUG_INFO_NONE = 0, // no debug info
NVVM_DEBUG_INFO_LINE_INFO = 1, // -generate-line-info
NVVM_DEBUG_INFO_DWARF = 2 // -g
};
The standalone pipeline validates this at IR intake: if debug_info_present AND debug_mode_flag AND NOT debug_version_validated, the function returns error code 3 (incompatible).
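The three-way outcome of the version check can be modeled directly (the enum and function names below are invented; the behavior follows the sub_CD41B0 rules described above):

```cpp
#include <cassert>
#include <cstdint>

// Model of the NvvmDebugVersion check: Major must be 3 (hard fail),
// Minor > 2 only warns, and {3, 2} is the assumed default when absent.
enum class VerCheck { Ok, Warn, Fail };

VerCheck checkDebugVersion(uint8_t major, uint8_t minor) {
  if (major != 3) return VerCheck::Fail;  // "not compatible", returns NULL
  if (minor > 2)  return VerCheck::Warn;  // warning printed, parse continues
  return VerCheck::Ok;
}
```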
Debug Records Format
cicc v13.0 inherits LLVM 20's support for the new debug records format (DbgRecord) as an alternative to the traditional dbg.value() / dbg.declare() intrinsics. Three knobs control this:
| Knob | Type | Default | Effect |
|---|---|---|---|
| write-experimental-debuginfo | bool | true | Write debug info in new non-intrinsic format |
| write-experimental-debuginfo-iterators-to-bitcode | bool | true | Serialize debug records to bitcode |
| preserve-input-debuginfo-format | boolOrDefault | false | When true, preserve whatever format the input uses |
The write-experimental-debuginfo default of true means cicc v13.0 uses the new DbgRecord format internally by default. This is an LLVM 20 feature where debug info is stored as DbgVariableRecord and DbgLabelRecord objects attached directly to instructions rather than as separate dbg.value() intrinsic calls. The format change is transparent to the optimizer and backend -- the verification pass and AsmPrinter handle both formats identically.
End-to-End Flow Diagram
CUDA Source (.cu / .cup)
│
▼
EDG 6.6 Frontend (IL tree)
│ dword_4D046B4 / [ctx+0x170] guards debug emission
│ sub_9433F0: per-parameter DILocalVariable
│ sub_943430: per-global DIGlobalVariable
│ sub_941230: per-instruction DILocation
│ sub_915400: "Debug Info Version" = 3 module flag
▼
LLVM Module with debug metadata
│ llvm.dbg.cu → DICompileUnit → DISubprogram → ...
│
├─ If -generate-line-info:
│ StripNonLineTableDebugInfoPass (#114)
│ strips variables, types, scopes; keeps DILocation/DISubprogram
│
▼
LLVM Optimizer (sub_12E54A0)
│ ┌─────────────────────────────────────────────┐
│ │ For each pass: │
│ │ snapshot = sub_29C8000 Phase 2 (8 tables) │
│ │ run_pass(M); │
│ │ sub_29C8000(M, ..., passName, ...); │
│ │ if FAIL: JSON report + diagnostic │
│ └─────────────────────────────────────────────┘
▼
Optimized LLVM Module
│
▼
NVPTX Backend (SelectionDAG → MachineInstr)
│ DebugLoc attached to each MachineInstr
│
▼
AsmPrinter (sub_31EC4F0)
│ sub_31D55F0: per-instruction .loc emission
│ sub_31E4280: .file/.loc at function scope
│ inlined-at chain walking → function_name, inlined_at extensions
│ InterleaveSrcInPtx: source line comments
│
├─ If -g:
│ sub_399B1E0: DwarfDebug::beginModule()
│ sub_3997B50: .debug_aranges
│ sub_39BDF60: .debug_names
│
▼
PTX Output
.target sm_90, texmode_independent, debug
.file 1 "kernel.cu"
.loc 1 42 0, function_name _Z6kernelPf
ld.param.u64 %rd1, [_Z6kernelPf_param_0];
Knobs Reference
| Knob | Type | Default | Scope | Effect |
|---|---|---|---|---|
| -g / -debug-compile | bool | off | CLI | Full debug compilation (FullDebug emission) |
| -generate-line-info | bool | off | CLI | Line tables only (LineTablesOnly emission) |
| -no-lineinfo-inlined-at | bool | off | CLI | Disable inlined-at tracking (sets -line-info-inlined-at=0) |
| -show-src / -nvptx-emit-src | bool | off | CLI | Interleave source lines as PTX comments |
| dwarf-extended-loc | enum | Default | LLVM | Default/Enable/Disable extended .loc flags |
| dwarf-version | unsigned | (platform) | LLVM | DWARF version for debug sections |
| line-info-inlined-at | bool | true | LLVM | Emit inlined-at chains in .loc directives |
| debugify-each | bool | off | LLVM | Debugify + CheckDebugify around every pass |
| debugify-level | enum | location+variables | LLVM | locations or location+variables |
| debugify-quiet | bool | off | LLVM | Suppress debugify diagnostics |
| debugify-func-limit | int | unlimited | LLVM | Max functions to debugify |
| debugify-export | string | -- | LLVM | Export debugify results to file |
| verify-each | bool | off | LLVM | Run IR verifier after every pass |
| verify-debuginfo-preserve | bool | off | LLVM | Enable debug info preservation checking |
| no-inline-line-tables | bool | off | LLVM | Prevent inlining from merging line tables |
| write-experimental-debuginfo | bool | true | LLVM | Use DbgRecord format instead of intrinsics |
| preserve-input-debuginfo-format | boolOrDefault | false | LLVM | Preserve input debug info format as-is |
| NvvmDebugVersion | {u8,u8} | {3,2} | Container | Debug metadata schema version |
| qword_5008FC8 | bool | off | Global | Verbose diagnostic output enable |
| qword_5008C88 | int32 | >0 | Global | Metadata depth threshold (<=0 skips deep scope walk) |
NVIDIA Modifications vs Stock LLVM
- Inlined-at .loc extension. Upstream LLVM's NVPTX AsmPrinter emits standard .loc file line column. cicc appends function_name and inlined_at attributes that encode the full inlining chain for cuda-gdb call stack reconstruction.
- Eight-table verification. Upstream CheckDebugInfoPass tracks DISubprogram and debug variable intrinsics. NVIDIA's sub_29C8000 maintains eight separate hash tables covering subprograms, scopes, global variables, local variables, types, imported entities, labels, and retained nodes.
- JSON structured reporting. NVIDIA added a YAML/JSON serializer to the verification pass that produces machine-parseable bug reports with per-pass attribution -- no upstream equivalent.
- Metadata reconstruction. After verification, NVIDIA's pass reconstructs the module's metadata tables from verified versions (Phase 8), effectively serving as a "repair" pass that normalizes metadata after corruption.
- Container debug versioning. The NvvmDebugVersion field in the NVVM container header tracks the debug metadata schema independently of the IR version -- a concept that does not exist in upstream LLVM.
- Three-level debug info enum. The NVVM_DEBUG_INFO_NONE/LINE_INFO/DWARF enum in the container provides a compile-unit-level debug mode indicator that ptxas and libNVVM can check without parsing the full module metadata.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| Emit DILocalVariable for function parameter | sub_9433F0 | -- | -- |
| Emit debug info for GlobalVariable (conditional on [ctx+0x170]) | sub_943430 | -- | -- |
| Set IR builder DebugLoc from EDG source position | sub_941230 | -- | -- |
| Module finalizer: emit "Debug Info Version" = 3 module flag | sub_915400 | 133B | -- |
| Flag filter: checks -debug-compile, -g, -generate-line-info | sub_12C6910 | -- | -- |
| Debug info verification pass (main entry) | sub_29C8000 | 12,480B | -- |
| Per-instruction DILocation verifier | sub_29C3AB0 | 5,592B | -- |
| Debugify synthetic debug info injector | sub_29C1CB0 | -- | -- |
| NewPMCheckDebugifyPass wrapper | sub_22702B0 | -- | -- |
| NewPMDebugifyPass wrapper | sub_2270390 | -- | -- |
| Per-instruction .loc emission | sub_31D55F0 | -- | -- |
| Function-scope .file/.loc emission | sub_31E4280 | -- | -- |
| insertDebugLocEntry (file/line to MCSymbol mapping) | sub_31E6100 | -- | -- |
| Instruction-level debug comment emission | sub_31D89B0 | -- | -- |
| emitHeader (.version, .target ... , debug) | sub_214F370 | 7.2KB | -- |
| Module-level emission entry / NVPTX Debug Info Emission | sub_215ACD0 | 8.1KB | -- |
| DwarfDebug::beginModule() | sub_399B1E0 | 29KB | -- |
| .debug_aranges emission | sub_3997B50 | 33KB | -- |
| Range list emission (DW_RLE_*) | sub_399D1D0 | 12KB | -- |
| Register location expressions | sub_399EB70 | 12KB | -- |
| .debug_names accelerator table | sub_39BDF60 | 38KB | -- |
| DWARF form size calculator | sub_39B6390 | 33KB | -- |
| DIBuilder / debug metadata helper | sub_ADCDB0 | -- | -- |
| cl::opt registration: debug-compile, generate-line-info, line-info-inlined-at | sub_48D7F0 | -- | -- |
| NVVM container version check (validates NvvmDebugVersion.Major == 3) | sub_CD41B0 | -- | -- |
Cross-References
- Debug Info Verification -- detailed sub_29C8000 algorithm, 9-phase walk, JSON output format
- AsmPrinter & PTX Body Emission -- .loc/.file directive emission, per-instruction debug annotation, inlined-at chain
- PTX Emission -- module-level emission, .target ... , debug suffix
- Entry Point & CLI -- -g, -generate-line-info flag parsing in sub_8F9C90
- NVVM IR Generation -- dual-path architecture, codegen context
- CLI Flags -- flag routing through the 3-column dispatch table
- LLVM Knobs -- debugify-*, verify-each, dwarf-* knobs
- Pipeline & Ordering -- where debug verification fits in the pass ordering
- NVVM Container -- NvvmDebugVersion field in the binary header
- Inliner Cost Model -- inlining decisions that create the inlined-at chains
NVIDIA Custom Passes
25+ proprietary optimization passes not found in upstream LLVM. Registered into the New PM pipeline at sub_2342890 and into the pipeline assembler at sub_12E54A0.
| Module-level custom | 16 passes |
| Function-level custom | 9 passes |
| Loop-level custom | 1 pass |
| Custom analyses | 2 analyses |
| Machine-level custom | 13 passes |
| Registration | sub_2342890 (New PM) + sub_12E54A0 (pipeline builder) |
| Dedicated deep-dive pages | 22 |
IR-Level Module Passes
| Pass Name | Class / Function | Size | Description |
|---|---|---|---|
memory-space-opt | sub_1C70910 / sub_1CA2920 | cluster | Resolves generic pointers to specific address spaces (global/shared/local/const). Warns on illegal ops: atomics on constant mem, wmma on wrong space. Parameterized: first-time, second-time, no-warnings, warnings |
printf-lowering | sub_1CB1E60 | 31KB | Lowers printf → vprintf + local buffer. Validates format string is a literal. "vprintfBuffer.local", "bufIndexed" |
nvvm-verify | sub_2C80C90 | 230KB | Three-layer NVVM IR verifier (module + function + intrinsic). Validates triples, address spaces, atomic restrictions, pointer cast rules, architecture-gated intrinsic availability |
nvvm-pretreat | PretreatPass | — | IR pre-treatment before optimization |
check-kernel-functions | NVPTXSetFunctionLinkagesPass | — | Kernel function linkage validation |
check-gep-index | — | — | GEP index validation |
cnp-launch-check | CNPLaunchCheckPass | — | Cooperative launch validation |
ipmsp | IPMSPPass | — | Inter-procedural memory space propagation |
nv-early-inliner | — | — | NVIDIA early inlining pass |
nv-inline-must | InlineMustPass | — | Force-inline functions marked __forceinline__ |
select-kernels | SelectKernelsPass | — | Kernel selection for compilation |
set-global-array-alignment | — | — | Parameterized: modify-shared-mem, skip-shared-mem, modify-global-mem, skip-global-mem |
lower-aggr-copies | — | 72KB+58KB | Lower aggregate copies: struct splitting, memmove unrolling. Param: lower-aggr-func-args |
lower-struct-args | — | — | Lower structure arguments. Param: opt-byval |
process-restrict | — | — | Process __restrict__ annotations. Param: propagate-only |
lower-ops | LowerOpsPass | — | Lower special operations. Includes FP128/I128 emulation via 48 __nv_* library calls |
IR-Level Function Passes
| Pass Name | Function | Size | Description |
|---|---|---|---|
branch-dist | sub_1C47810 cluster | — | Branch distribution optimization. Knobs: branch-dist-block-limit, branch-dist-func-limit, branch-dist-norm |
nvvm-reflect | sub_1857160 | — | Resolves __nvvm_reflect() calls to integer constants based on target SM and FTZ mode. Runs multiple times as inlining exposes new calls |
nvvm-reflect-pp | — | — | NVVM reflect preprocessor |
nvvm-intrinsic-lowering | sub_2C63FB0 | 140KB | Lowers llvm.nvvm.* intrinsics to standard LLVM IR. Two levels: 0 = basic, 1 = barrier-aware. Runs up to 10 times in mid pipeline |
nvvm-peephole-optimizer | — | — | NVVM-specific peephole optimizations |
remat | sub_1CE7DD0 | 67KB | IR-level rematerialization. Analyzes live-in/live-out register pressure per BB. Contains IV demotion sub-pass (75KB) |
reuse-local-memory | — | — | Local memory reuse optimization |
set-local-array-alignment | — | — | Set alignment for local arrays |
sinking2 | — | — | NVIDIA-specific instruction sinking (distinct from LLVM's Sink pass) |
IR-Level Loop Pass
| Pass Name | Function | Size | Description |
|---|---|---|---|
loop-index-split | sub_2CC5900 / sub_1C7B2C0 | 69KB | Split loops on index conditions. NVIDIA-preserved pass (removed from upstream LLVM) |
Custom Analyses
| Analysis Name | Purpose |
|---|---|
rpa | Register Pressure Analysis — feeds into scheduling and rematerialization decisions |
merge-sets | Merge set computation — used by coalescing and allocation |
Machine-Level Passes
| Pass Name | Function | Pass ID | Size | Description |
|---|---|---|---|---|
| Block Remat | sub_2186D90 | nvptx-remat-block | 47KB | Two-phase candidate selection + iterative "pull-in" for register pressure reduction. "Max-Live-Function(", "Really Final Pull-in:" |
| Machine Mem2Reg | sub_21F9920 | nvptx-mem2reg | — | Promotes __local_depot stack objects back to registers post-regalloc |
| MRPA | sub_2E5A4E0 | machine-rpa | 48KB | Machine Register Pressure Analysis — incremental tracking, not in upstream LLVM |
| LDG Transform | sub_21F2780 | ldgxform | — | Transforms global loads to ldg.* (texture cache) for read-only data |
| GenericToNVVM | sub_215DC20 | generic-to-nvvm | 36KB | Moves globals from generic to global address space |
| Alloca Hoisting | sub_21BC7D0 | alloca-hoisting | — | Ensures all allocas are in entry block (PTX requirement) |
| Image Optimizer | sub_21BCF10 | — | — | Optimizes texture/surface access patterns |
| NVPTX Peephole | sub_21DB090 | nvptx-peephole | — | NVPTX-specific peephole optimization |
| Prolog/Epilog | sub_21DB5F0 | — | — | Custom frame management (PTX has no traditional prolog/epilog) |
| Replace Image Handles | sub_21DBEA0 | — | — | Replaces IR-level image handles with PTX texture/surface references |
| Extra MI Printer | sub_21E9E80 | extra-machineinstr-printer | — | Register pressure statistics reporting |
| Valid Global Names | sub_21BCD80 | nvptx-assign-valid-global-names | — | Sanitizes global names to valid PTX identifiers |
| NVVMIntrRange | sub_216F4B0 | nvvm-intr-range | — | Adds !range metadata to NVVM intrinsics (e.g., tid.x bounds) |
Major Proprietary Subsystems
Dead Synchronization Elimination — sub_2C84BA0
| Field | Value |
|---|---|
| Size | 96KB |
| Purpose | Removes redundant __syncthreads() barriers |
Bidirectional fixed-point dataflow analysis across the CFG, tracking four memory access categories per basic block through eight red-black tree maps. Each barrier deletion triggers a full restart of the analysis. Distinct from the lightweight basic-dbe. See dedicated page for full algorithm.
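The core redundancy test can be sketched with a toy straight-line model (my simplification, not CICC's algorithm: the real pass runs a bidirectional fixed-point dataflow over the whole CFG with four access categories; a single basic block with one "traffic since last barrier" bit is enough to show when a barrier is droppable):

```python
# Toy model: a __syncthreads() is redundant if no shared-memory access
# separates it from the previous barrier (or from function entry).
def eliminate_dead_barriers(ops):
    out = []
    dirty = False  # shared-memory traffic seen since the last kept barrier
    for op in ops:
        if op == "syncthreads":
            if dirty:          # barrier orders real traffic: keep it
                out.append(op)
                dirty = False
            # else: redundant barrier, drop it
        else:
            out.append(op)
            if op.startswith("shared_"):
                dirty = True
    return out

ops = ["shared_write", "syncthreads", "syncthreads", "local_add", "syncthreads"]
print(eliminate_dead_barriers(ops))  # only the first barrier survives
```

The back-to-back second barrier and the trailing barrier (preceded only by non-shared work) are both dropped, mirroring the pass's goal of removing barriers that order nothing.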
MemorySpaceOpt — Multi-Function Cluster
| Function | Size | Purpose |
|---|---|---|
sub_1C70910 | — | Pass entry point |
sub_1C6A6C0 | — | Pass variant |
sub_1CA2920 | 32KB | Address space resolution — "Cannot tell what pointer points to, assuming global memory space" |
sub_1CA9E90 | 28KB | Secondary resolver |
sub_1CA5350 | 45KB | Infrastructure |
sub_2CBBE90 | 71KB | Memory-space-specialized function cloning |
NV Rematerialization Cluster
| Function | Size | Role |
|---|---|---|
sub_1CE7DD0 | 67KB | Main driver — live-in/live-out analysis, skip decisions |
sub_1CE67D0 | 32KB | Block-level executor — "remat_", "uclone_" prefixes |
sub_1CE3AF0 | 56KB | Pull-in cost analysis — "Total pull-in cost = %d" |
NLO — Simplify Live Output
| Function | Size | Strings |
|---|---|---|
sub_1CE10B0 | 48KB | "Simplify Live Output", "nloNewBit", "newBit" |
sub_1CDC1F0 | 35KB | "nloNewAdd", "nloNewBit" |
Creates new add/bit operations to simplify live-out values at block boundaries.
IV Demotion — sub_1CD74B0
| Field | Value |
|---|---|
| Size | 75KB |
| Strings | "phiNode", "demoteIV", "newInit", "newInc", "argBaseIV", "newBaseIV", "iv_base_clone_", "substIV" |
Demotes induction variables (e.g., 64-bit to 32-bit), creates new base IVs, clones IV chains for register pressure reduction. Sub-pass of rematerialization. See dedicated page for full algorithm.
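A minimal sketch of the demotion legality check (the function name and structure are mine, not recovered symbols): a 64-bit IV can safely become 32-bit only if every value it takes, and every byte offset derived from it, fits in 32 bits.

```python
U32_MAX = 2**32 - 1

def can_demote(trip_count, stride, elem_size):
    # Largest value the induction variable reaches, and the largest
    # byte offset derived from it for addressing.
    max_iv = (trip_count - 1) * stride
    max_off = max_iv * elem_size
    return max_iv <= U32_MAX and max_off <= U32_MAX

print(can_demote(1 << 20, 1, 4))  # 1M elements: offsets fit in 32 bits
print(can_demote(1 << 31, 1, 4))  # derived byte offset overflows 32 bits
```

The payoff is register pressure: a demoted IV chain uses one 32-bit register where the original used a 64-bit pair.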
RLMCAST — sub_2D13E90
| Field | Value |
|---|---|
| Size | 67KB |
| Purpose | Register-level multicast instruction lowering |
Broadcasts a value to multiple register destinations. Uses 216-byte and 160-byte node structures.
Texture Group Merge (.Tgm) — sub_2DDE8C0
Groups texture load operations to hide latency. Uses a .Tgm suffix in scheduling and a function-pointer table (3 predicates) for grouping decisions.
NVVM Intrinsic Verifier — sub_2C7B6A0
| Field | Value |
|---|---|
| Size | 143KB |
| Purpose | Validates ALL NVVM intrinsics against SM capabilities |
Architecture-gated validation for every intrinsic call. Part of the three-layer NVVM verifier (230KB total).
NVVM Intrinsic Lowering — sub_2C63FB0
| Field | Value |
|---|---|
| Size | 140KB |
| Purpose | Lowers NVVM intrinsics to concrete operations |
Pattern-matching rewrite engine for llvm.nvvm.* intrinsics. Two levels (basic + barrier-aware), runs up to 10 times. See dedicated page for full dispatch table.
Base Address Strength Reduction — sub_2CA4A10
| Field | Value |
|---|---|
| Size | 58KB |
| Knobs | do-base-address-strength-reduce (two levels: 1 = no conditions, 2 = with conditions) |
Scans loop bodies for memory ops sharing a common base pointer, hoists the anchor computation, rewrites remaining addresses as (anchor + relative_offset). See dedicated page for the anchor selection algorithm.
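The rewrite can be illustrated on plain integer addresses (an assumption-heavy toy: the real pass operates on LLVM IR address computations, and its anchor-selection heuristic is more involved than taking the minimum):

```python
# Rewrite a set of absolute addresses used in a loop body as a single
# hoisted anchor plus small relative offsets.
def strength_reduce(addresses):
    anchor = min(addresses)                 # toy anchor choice
    return anchor, [a - anchor for a in addresses]

anchor, offsets = strength_reduce([0x1000, 0x1010, 0x1040])
print(hex(anchor), offsets)  # 0x1000 [0, 16, 64]
```

Only the anchor computation remains inside address generation; each memory op then needs just a cheap add of a compile-time offset.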
Common Base Elimination — sub_2CA8B00
| Field | Value |
|---|---|
| Size | 39KB |
| Purpose | Hoists shared base address expressions to dominating CFG points |
Operates at inter-block level (vs BASR intra-loop). The two passes form a complementary pair for comprehensive GPU address computation reduction. See dedicated page.
CSSA Transformation — sub_3720740
| Field | Value |
|---|---|
| Size | 22KB |
| Purpose | Conventional-SSA for GPU divergent control flow |
| Knobs | do-cssa, cssa-coalesce, cssa-verbosity, dump-before-cssa |
| Debug | "IR Module before CSSA" |
Rewrites PHI nodes to be safe under warp-divergent execution by inserting explicit copy instructions at reconvergence points. See dedicated page for the divergence model.
NVIDIA Codegen Knobs — sub_1C20170
70+ knobs parsed from the NVVM container format:
Graphics Pipeline
VSIsVREnabled, VSIsLastVTGStage, EnableZeroCoverageKill, AllowComputeDerivatives, AllowDerivatives, EnableNonUniformQuadDerivatives, UsePIXBAR, ManageAPICallDepth
Compute / Memory
DisableSAMRAM, DoMMACoalescing, DisablePartialHalfVectorWrites, AssumeConvertMemoryToRegProfitable, MSTSForceOneCTAPerSMForSmemEmu, AddDepFromGlobalMembarToCB
Register Allocation / Scheduling
AdvancedRemat, CSSACoalescing, DisablePredication, DisableXBlockSched, ReorderCSE, ScheduleKils, NumNopsAtStart, DisableERRBARAfterMEMBAR
Type Promotion
PromoteHalf, PromoteFixed, FP16Mode, IgnoreRndFtzOnF32F16Conv, DisableLegalizeIntegers
PGO
PGOProfileKind, PGOEpoch, PGOBatchSize, PGOCounterMemBaseVAIndex
Knob Forwarding
OCGKnobs, OCGKnobsFile, NVVMKnobsString, OmegaKnobs, FinalizerKnobs
Compile Modes — sub_1C21CE0
| Mode | Constant |
|---|---|
| Whole-program no-ABI | NVVM_COMPILE_MODE_WHOLE_PROGRAM_NOABI |
| Whole-program ABI | NVVM_COMPILE_MODE_WHOLE_PROGRAM_ABI |
| Separate ABI | NVVM_COMPILE_MODE_SEPARATE_ABI |
| Extensible WP ABI | NVVM_COMPILE_MODE_EXTENSIBLE_WHOLE_PROGRAM_ABI |
| Opt Level | Constant |
|---|---|
| None | NVVM_OPT_LEVEL_NONE |
| 1 | NVVM_OPT_LEVEL_1 |
| 2 | NVVM_OPT_LEVEL_2 |
| 3 | NVVM_OPT_LEVEL_3 |
| Debug Info | Constant |
|---|---|
| None | NVVM_DEBUG_INFO_NONE |
| Line info | NVVM_DEBUG_INFO_LINE_INFO |
| Full DWARF | NVVM_DEBUG_INFO_DWARF |
NVVMReflect
The NVVMReflect pass resolves calls to __nvvm_reflect() -- a compile-time introspection mechanism that lets CUDA device code query compilation parameters such as the target GPU architecture, flush-to-zero mode, and precision settings. Each __nvvm_reflect("__CUDA_ARCH") call is replaced with an integer constant derived from the target SM version, and each __nvvm_reflect("__CUDA_FTZ") is replaced with 0 or 1 depending on the -ftz flag. After replacement, the constant result feeds into conditional branches that standard LLVM passes (SimplifyCFG, SCCP, ADCE) can fold away, eliminating dead architecture-specific code paths at compile time. This is NVIDIA's primary mechanism for producing architecture-specialized code from a single portable source: libdevice alone contains hundreds of __nvvm_reflect calls that select between FTZ and non-FTZ instruction variants.
The pass is relatively small in code size but architecturally critical -- it runs multiple times at different pipeline positions because inlining, loop unrolling, and other transformations continuously expose new __nvvm_reflect calls that were previously hidden inside un-inlined function bodies.
Key Facts
| Property | Value |
|---|---|
| Pass factory | sub_1857160 |
| Pass level | Function pass (runs per-function) |
| Registration | Legacy PM only (not separately registered in New PM); post-processor nvvm-reflect-pp is New PM #381 at line 2237 |
| Runtime positions | Tier 0 #7; Tier 1/2/3 #9, #73 (see Pipeline) |
| Pipeline disable flag | NVVMPassOptions offset +880 |
| Knob | nvvm-reflect-enable (boolean, default: true) |
| Global knob constructor | ctor_271 |
| Vtable (likely) | unk_3C2026C |
| Post-processing pass | nvvm-reflect-pp = SimplifyConstantConditionalsPass |
| New PM registration | Not separately registered -- NVVMReflect is a legacy-PM pass invoked from the pipeline assembler; nvvm-reflect-pp is the New PM companion at registration line 2237 of sub_2342890 |
| Upstream equivalent | NVVMReflect in llvm/lib/Target/NVPTX/NVVMReflect.cpp |
| Occurrences in pipeline | ~8 invocations across all paths (see Multi-Run Pattern) |
Reflect Query Names
The __nvvm_reflect mechanism supports a fixed set of query strings. These are embedded as global string constants in NVVM IR (typically from libdevice bitcode) and matched by the pass:
| Query String | Meaning | Value Source |
|---|---|---|
__CUDA_ARCH | Target GPU compute capability | -arch=compute_XX flag, encoded as major*100 + minor*10 |
__CUDA_FTZ | Flush-to-zero mode for single-precision | -ftz=1 sets to 1; default 0 |
__CUDA_PREC_DIV | Precise division mode | -prec-div=1 sets to 1; default 0 |
__CUDA_PREC_SQRT | Precise square root mode | -prec-sqrt=1 sets to 1; default 0 |
__CUDA_ARCH Values
The __CUDA_ARCH value is an integer encoding SM_major * 100 + SM_minor * 10, propagated from the CLI through the EDG frontend as -R __CUDA_ARCH=NNN:
| Architecture | __CUDA_ARCH | SM Variants |
|---|---|---|
| Turing | 750 | sm_75 |
| Ampere | 800, 860, 870, 880 | sm_80, sm_86, sm_87, sm_88 |
| Ada Lovelace | 890 | sm_89 |
| Hopper | 900 | sm_90, sm_90a (both share 900) |
| Blackwell | 1000, 1030 | sm_100/100a/100f, sm_103/103a/103f |
| (SM 11.x) | 1100 | sm_110/110a/110f |
| (SM 12.x) | 1200, 1210 | sm_120/120a/120f, sm_121/121a/121f |
Note: Architecture variants with an a (accelerated) or f (forward-compatible) suffix share the same __CUDA_ARCH value as their base. They differ only in -opt-arch and -mcpu flags, which affect instruction selection and scheduling but not reflect queries.
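The encoding rule above can be captured in a few lines (a sketch; the suffix stripping mirrors the note that a/f variants share their base value):

```python
# __CUDA_ARCH = SM_major * 100 + SM_minor * 10, as documented above.
def cuda_arch(sm):
    digits = sm.removeprefix("sm_").rstrip("af")  # "sm_90a" -> "90"
    major, minor = int(digits[:-1]), int(digits[-1])
    return major * 100 + minor * 10

for sm in ("sm_75", "sm_90", "sm_90a", "sm_100f", "sm_121"):
    print(sm, cuda_arch(sm))
```

Note how the minor digit is scaled by 10, which is why sm_103 maps to 1030 rather than 1003.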
Algorithm
The NVVMReflect pass implements a straightforward pattern-matching replacement. In pseudocode:
bool NVVMReflectPass::runOnFunction(Function &F) {
  bool changed = false;
  if (!nvvm_reflect_enable)  // controlled by 'nvvm-reflect-enable' knob
    return false;

  SmallVector<CallInst *, 8> reflect_calls;

  // Phase 1: Collect all __nvvm_reflect call sites
  for (BasicBlock &BB : F) {
    for (Instruction &I : BB) {
      if (auto *CI = dyn_cast<CallInst>(&I)) {
        Function *callee = CI->getCalledFunction();
        if (callee && callee->getName() == "__nvvm_reflect")
          reflect_calls.push_back(CI);
      }
    }
  }

  // Phase 2: Resolve each call to a constant
  for (CallInst *CI : reflect_calls) {
    // Extract the query string from the first argument.
    // The argument is a pointer to a global constant string:
    //   @.str = private constant [12 x i8] c"__CUDA_ARCH\00"
    // The pass traces through the GEP/bitcast to find the
    // ConstantDataArray initializer.
    StringRef query = extractStringArgument(CI->getArgOperand(0));

    int result = 0;
    if (query == "__CUDA_ARCH")
      result = sm_version;            // e.g., 900 for sm_90
    else if (query == "__CUDA_FTZ")
      result = ftz_enabled ? 1 : 0;
    else if (query == "__CUDA_PREC_DIV")
      result = prec_div ? 1 : 0;
    else if (query == "__CUDA_PREC_SQRT")
      result = prec_sqrt ? 1 : 0;
    else
      result = 0;                     // unknown query => 0

    // Replace the call with the constant integer
    CI->replaceAllUsesWith(ConstantInt::get(CI->getType(), result));
    CI->eraseFromParent();
    changed = true;
  }
  return changed;
}
The string extraction logic must handle the IR pattern produced by the CUDA frontend and libdevice linking:
@.str = private unnamed_addr constant [12 x i8] c"__CUDA_ARCH\00", align 1
%1 = call i32 @__nvvm_reflect(ptr @.str)
The pass walks through the argument operand, stripping ConstantExpr GEPs and bitcasts, to reach the ConstantDataArray containing the query string. If the argument is not a resolvable constant string, the call is left unmodified as a safety fallback -- in practice, all reflect calls use literal string arguments.
Interaction with Constant Propagation and Dead Code Elimination
The reflect replacement produces a constant integer that feeds directly into an icmp and conditional branch. This is the canonical pattern in libdevice:
Before NVVMReflect (from libdevice.10.ll, function __nv_floorf):
define float @__nv_floorf(float %f) {
  %1 = call i32 @__nvvm_reflect(ptr @.str)   ; @.str = "__CUDA_FTZ"
  %2 = icmp ne i32 %1, 0
  br i1 %2, label %ftz_path, label %precise_path
ftz_path:
  %3 = call float @llvm.nvvm.floor.ftz.f(float %f)
  br label %merge
precise_path:
  %4 = call float @llvm.nvvm.floor.f(float %f)
  br label %merge
merge:
  %.0 = phi float [ %3, %ftz_path ], [ %4, %precise_path ]
  ret float %.0
}
After NVVMReflect (with -ftz=1):
define float @__nv_floorf(float %f) {
  %2 = icmp ne i32 1, 0                      ; constant 1 replaces the call
  br i1 %2, label %ftz_path, label %precise_path
ftz_path:
  %3 = call float @llvm.nvvm.floor.ftz.f(float %f)
  br label %merge
precise_path:                                ; now unreachable
  %4 = call float @llvm.nvvm.floor.f(float %f)
  br label %merge
merge:
  %.0 = phi float [ %3, %ftz_path ], [ %4, %precise_path ]
  ret float %.0
}
After SimplifyCFG / SCCP / ADCE (subsequent passes):
define float @__nv_floorf(float %f) {
  %1 = call float @llvm.nvvm.floor.ftz.f(float %f)
  ret float %1
}
The icmp ne i32 1, 0 folds to true, SimplifyCFG eliminates the dead branch, and ADCE removes the unused llvm.nvvm.floor.f call. The function collapses from 4 basic blocks to 1.
This pattern repeats for every libdevice math function: __nv_fabsf, __nv_fminf, __nv_fmaxf, __nv_rsqrtf, __nv_exp2f, and dozens more all contain the same __nvvm_reflect("__CUDA_FTZ") branch. After reflect resolution, each function specializes to either FTZ or precise mode.
__CUDA_ARCH branching pattern
For architecture-dependent code, the pattern uses inequality comparisons:
  %arch = call i32 @__nvvm_reflect(ptr @.str.1)   ; "__CUDA_ARCH"
  %is_sm80_plus = icmp sge i32 %arch, 800
  br i1 %is_sm80_plus, label %sm80_path, label %legacy_path
sm80_path:
  ; use SM 8.0+ specific intrinsics (e.g., async copy, cp.async)
  ...
legacy_path:
  ; fallback path for older architectures
  ...
After NVVMReflect replaces %arch with (e.g.) 900 for Hopper, the comparison icmp sge i32 900, 800 folds to true, and the legacy path is eliminated.
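The fold is simple enough to model directly (a toy, not CICC code): once the reflect result is a known constant, the comparison is decidable at compile time and exactly one path is selected.

```python
# Model of the folded __CUDA_ARCH branch: with %arch constant, the
# icmp sge comparison reduces to a plain Python comparison and the
# compiler keeps only the winning path.
def select_path(arch, threshold=800):
    return "sm80_path" if arch >= threshold else "legacy_path"

print(select_path(900))  # Hopper takes the SM 8.0+ path
print(select_path(750))  # Turing takes the legacy fallback
```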
Multi-Run Pattern
NVVMReflect (sub_1857160) is invoked multiple times across the pipeline because optimization passes continuously expose new reflect calls. The key insight is that __nvvm_reflect calls originate primarily from libdevice functions, which are linked as bitcode and initially exist as un-inlined function calls. Each inlining pass expands these functions inline, exposing their internal __nvvm_reflect calls to the containing function.
Tier 0 Pipeline (Full Optimization via sub_12DE330)
In the Tier 0 (O1/O2/O3) full optimization pipeline, NVVMReflect appears once:
| Position | Factory | Context |
|---|---|---|
| #7 | sub_1857160() | After CGSCC inliner (#2), GVN (#5-6). Catches reflect calls exposed by first-round inlining |
"mid" Path Pipeline (Ofcmid/Ofcmin via sub_12E54A0 PATH B)
In the "mid" fast-compile path, NVVMReflect appears at three distinct positions:
| Position | Factory | Guard | Context |
|---|---|---|---|
| After CGSCC pipeline #8 | sub_1857160() | !opts[880] | After aggressive CGSCC inlining (8 iterations). Catches reflect calls from freshly inlined libdevice bodies |
| After Sinking2 + EarlyCSE | sub_1857160() | !opts[880] | After loop transformations and code motion. Catches reflect calls in loop bodies after unrolling |
| (appears once more in late position) | sub_1857160() | !opts[880] | Final cleanup after late CGSCC pass and NVVMIntrinsicLowering |
Default/General Path Pipeline (PATH C)
In the default path (external bitcode input), NVVMReflect appears at three positions:
| Position | Factory | Context |
|---|---|---|
| After CGSCC pipeline #4 | sub_1857160() | First resolution after initial inlining |
| After NVVMIntrinsicLowering | sub_1857160() | Intrinsic lowering may expose new reflect patterns |
| After LoopUnroll + InstCombine | sub_1857160() | Loop unrolling duplicates loop bodies containing reflect calls |
Tiered Pipeline Insertions (sub_12DE8F0)
Within the tiered sub-pipeline, NVVMReflect appears with additional gating:
| Tier | Guard | Position |
|---|---|---|
| 1, 2, 3 | opts[3200] && !opts[880] | Mid-tier, after NVVMVerifier and IPConstPropagation |
| 3 only | opts[3200] && tier==3 && !opts[880] | Late-tier, after ADCE and LoopOpt/BarrierOpt. This extra run at O3 catches reflect calls exposed by the most aggressive transformations |
Why Multiple Runs Are Necessary
Consider this scenario:
1. User code calls __nv_sinf(x) (a libdevice function).
2. Initially, __nv_sinf is an external function call -- its body contains __nvvm_reflect("__CUDA_FTZ"), but the reflect call is not visible to the optimizer.
3. First NVVMReflect run: no-op for this function (the reflect is inside __nv_sinf's body, which has not been inlined yet).
4. CGSCC Inliner runs: inlines __nv_sinf into the caller, expanding its body with the __nvvm_reflect call.
5. Second NVVMReflect run: now sees the freshly-inlined __nvvm_reflect call and resolves it to a constant.
6. Loop Unrolling runs: if the __nv_sinf call was inside a loop, unrolling duplicates the call site. If the loop body was too complex to inline before unrolling simplified it, a third inlining opportunity may arise.
7. Third NVVMReflect run: resolves any remaining reflect calls exposed by unrolling + re-inlining.
Without multiple runs, libdevice functions inlined late in the pipeline would retain their reflect-based branching, defeating the specialization mechanism and leaving dead code paths in the final binary.
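The scenario above can be simulated with a toy inliner and reflect pass (all names and the op encoding are mine): a single reflect run before inlining resolves nothing, while interleaved runs reach the fixed point.

```python
FTZ = 1  # assume -ftz=1 for this sketch

def reflect_run(body):
    # Resolve every visible __nvvm_reflect("__CUDA_FTZ") to a constant.
    return [("const", FTZ) if op == ("reflect", "__CUDA_FTZ") else op
            for op in body]

def inline_run(body, lib):
    # Splice known library function bodies into the caller.
    out = []
    for op in body:
        if op[0] == "call" and op[1] in lib:
            out.extend(lib[op[1]])
        else:
            out.append(op)
    return out

lib = {"__nv_sinf": [("reflect", "__CUDA_FTZ"), ("math", "sin")]}
kernel = [("call", "__nv_sinf")]

kernel = reflect_run(kernel)      # run 1: reflect hidden in callee, no-op
kernel = inline_run(kernel, lib)  # inliner exposes the reflect call
kernel = reflect_run(kernel)      # run 2: now it resolves to a constant
print(kernel)
```

Dropping the second reflect_run leaves the ("reflect", ...) op in the kernel, which is exactly the failure mode the multi-run pipeline prevents.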
The nvvm-reflect-pp Post-Processing Pass
After NVVMReflect replaces calls with constants, the resulting IR contains trivially-foldable comparisons and dead branches. While standard LLVM passes (SimplifyCFG, ADCE) handle most of this, NVIDIA registers a dedicated post-processing pass under the misleading name nvvm-reflect-pp.
Despite its name, nvvm-reflect-pp is SimplifyConstantConditionalsPass (class llvm::SimplifyConstantConditionalsPass), not a reflection pass. It is a targeted dead-branch elimination pass that:
- Finds conditional branches where the condition is a constant (icmp with both operands constant).
- Replaces the branch with an unconditional branch to the taken target.
- Marks the not-taken successor as potentially unreachable.
- Cleans up resulting dead phi nodes and empty blocks.
This pass is registered in the New PM at sub_2342890 line 2237 as a function-level pass. It runs immediately after NVVMReflect in some pipeline configurations to ensure that reflected constants are cleaned up before subsequent optimization passes see the IR.
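A minimal model of this constant-branch cleanup (a sketch under my own toy CFG encoding, not the recovered implementation):

```python
def simplify(cfg, entry):
    # cfg maps block -> terminator:
    #   ("cbr", cond, taken, nottaken) | ("br", target) | ("ret",)
    # Step 1: rewrite constant conditional branches to unconditional ones.
    for blk, term in list(cfg.items()):
        if term[0] == "cbr" and isinstance(term[1], int):
            cfg[blk] = ("br", term[2] if term[1] else term[3])
    # Step 2: drop blocks no longer reachable from entry.
    reachable, work = set(), [entry]
    while work:
        b = work.pop()
        if b in reachable:
            continue
        reachable.add(b)
        term = cfg[b]
        if term[0] == "br":
            work.append(term[1])
        elif term[0] == "cbr":
            work.extend(term[2:])
    return {b: t for b, t in cfg.items() if b in reachable}

cfg = {"entry": ("cbr", 1, "ftz", "precise"),
       "ftz": ("br", "merge"), "precise": ("br", "merge"), "merge": ("ret",)}
print(simplify(cfg, "entry"))  # the precise block is gone
```

This mirrors the libdevice example above: the constant condition selects the FTZ path, and the precise path becomes unreachable and is removed.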
Configuration
| Knob | Type | Default | Effect |
|---|---|---|---|
nvvm-reflect-enable | bool | true | Master enable for NVVMReflect. When false, all __nvvm_reflect calls are left unresolved (they default to 0 at link time, selecting the non-FTZ/non-precise/lowest-arch path). |
Pipeline Disable Flag
NVVMPassOptions offset +880 is the per-compilation disable flag for NVVMReflect. When set (e.g., by an internal debugging mechanism), all pipeline insertion points skip the pass via the !opts[880] guard. This flag is distinct from the nvvm-reflect-enable knob: the knob controls the pass's internal behavior, while the pipeline flag prevents the pass from being added to the pipeline at all.
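The two-level gating can be sketched as follows (structure inferred from the text above; the dict-based opts table and function-list pipeline are my stand-ins for NVVMPassOptions and the pipeline assembler):

```python
def build_pipeline(opts, nvvm_reflect_enable=True):
    pipeline = []
    if not opts.get(880, 0):                    # pipeline disable flag
        def nvvm_reflect(module):
            if not nvvm_reflect_enable:         # cl::opt knob, checked at run time
                return module
            return [op for op in module if op != "reflect_call"] + ["const"]
        pipeline.append(nvvm_reflect)
    return pipeline

module = ["reflect_call", "other"]
for p in build_pipeline({880: 0}):
    module = p(module)
print(module)                                   # reflect resolved
print(build_pipeline({880: 1}))                 # pass never inserted: []
```

The distinction matters for debugging: flipping opts[880] removes the pass from the pipeline entirely, while disabling the knob leaves the pass scheduled but inert.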
Reflect Value Propagation Path
The reflect query values flow from the CLI through three layers:
- CLI: -arch=compute_90 is parsed by sub_95EB40 / sub_12C8DD0
- EDG frontend: receives -R __CUDA_ARCH=900 and defines the preprocessor macro
- Optimizer: receives -opt-arch=sm_90. The NVVMReflect pass reads the SM version from the target machine configuration (not from -R flags -- those are for the preprocessor)
For FTZ/precision flags, the path is:
- -ftz=1 maps to -R __CUDA_FTZ=1 (EDG) and -nvptx-f32ftz (optimizer/backend)
- The NVVMReflect pass reads the FTZ setting from the NVPTX subtarget or a global variable set during pipeline configuration
Differences from Upstream LLVM
Upstream LLVM's NVVMReflect pass (in llvm/lib/Target/NVPTX/NVVMReflect.cpp) is functionally similar, but CICC v13.0 differs from it in several respects:
| Aspect | Upstream LLVM | CICC v13.0 |
|---|---|---|
| Pipeline placement | Runs once, typically early | Runs ~8 times at strategic positions throughout the pipeline |
| Post-processing | Relies on standard SimplifyCFG | Has dedicated nvvm-reflect-pp (SimplifyConstantConditionalsPass) |
| Pipeline integration | New PM function pass | Legacy PM function pass invoked from the pipeline assembler (sub_12E54A0), with the pipeline disable flag at NVVMPassOptions[880] |
| Tier 3 extra run | Not applicable | Extra late-pipeline run gated by tier==3 for O3-only cleanup |
| Query string set | __CUDA_ARCH, __CUDA_FTZ | Same set plus __CUDA_PREC_DIV, __CUDA_PREC_SQRT |
The multi-run strategy is the most significant difference. Upstream LLVM assumes that NVVMReflect runs once before optimization, resolving all reflect calls in the linked libdevice bitcode. CICC's pipeline accounts for the reality that aggressive inlining and loop transformations in a GPU-focused compiler expose reflect calls at many different pipeline stages.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| NVVMReflect pass factory | sub_1857160 | -- | Creates and returns a new NVVMReflect pass instance |
| NVVMReflect constructor knob | ctor_271 | -- | Registers nvvm-reflect-enable cl::opt |
| SimplifyConstantConditionalsPass (nvvm-reflect-pp) | registered at line 2237 of sub_2342890 | -- | Post-reflect dead branch cleanup |
| Pipeline assembler | sub_12E54A0 | -- | Inserts NVVMReflect at multiple positions |
| Tier 0 pipeline builder | sub_12DE330 | -- | Inserts NVVMReflect as pass #7 |
| Tiered sub-pipeline | sub_12DE8F0 | -- | Inserts NVVMReflect at tier-gated positions |
| Architecture detection table | sub_95EB40 | -- | Maps -arch=compute_XX to __CUDA_ARCH values |
| Architecture detection (libnvvm) | sub_12C8DD0 | -- | Parallel mapping table for the libnvvm path |
Test This
The following kernel calls a libdevice math function whose implementation branches on __CUDA_FTZ and __CUDA_ARCH. Compile for two configurations and compare the PTX to see NVVMReflect in action.
#include <math.h>
__global__ void reflect_test(float* out, const float* in, int n) {
int tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid < n) {
out[tid] = sinf(in[tid]);
}
}
Compile twice:
nvcc -ptx -arch=sm_90 -ftz=true reflect_test.cu -o reflect_ftz.ptx
nvcc -ptx -arch=sm_90 -ftz=false reflect_test.cu -o reflect_noftz.ptx
What to look for in PTX:
- With -ftz=true: the PTX should contain flush-to-zero math instructions (e.g., sin.approx.ftz.f32). The NVVMReflect pass resolved __nvvm_reflect("__CUDA_FTZ") to 1, SimplifyCFG folded the branch, and only the FTZ code path survived.
- With -ftz=false: the PTX should contain precise math instructions without the .ftz suffix. The reflect resolved to 0, selecting the non-FTZ path.
- The key evidence is that the PTX contains only one code path -- no conditional branch choosing between FTZ and non-FTZ variants. If both paths survive, NVVMReflect or its downstream cleanup passes failed.
- Comparing -arch=sm_75 vs. -arch=sm_90 exercises the __CUDA_ARCH reflect. Functions like __nv_dsqrt_rn use architecture comparisons (icmp sge i32 %arch, 800) to select between SM 8.0+ instruction sequences and legacy fallbacks.
Common Pitfalls
These are mistakes a reimplementor is likely to make when building an equivalent compile-time reflection mechanism.
1. Returning the wrong __CUDA_ARCH encoding. The __CUDA_ARCH value is major * 100 + minor * 10, not major * 10 + minor. For SM 9.0, the correct value is 900, not 90. For SM 10.0, the correct value is 1000, not 100. A reimplementation that uses the wrong encoding will select the wrong code paths in libdevice, potentially enabling instructions not supported by the target architecture (e.g., SM 7.0 paths on an SM 9.0 target) or disabling instructions that should be available. This encoding is also used by the CUDA preprocessor (__CUDA_ARCH__), so consistency between the frontend macro and the reflect value is critical.
2. Running NVVMReflect only once in the pipeline. The pass must run multiple times (approximately 8 invocations across the full pipeline) because __nvvm_reflect calls are hidden inside un-inlined libdevice function bodies. The first run resolves calls visible at the top level, but each subsequent inlining pass exposes new reflect calls from freshly inlined libdevice functions. A reimplementation with a single early invocation will leave reflected branches unresolved in all functions inlined after that point, resulting in both FTZ and non-FTZ code paths surviving to the final binary -- doubling code size and defeating the entire specialization mechanism.
3. Not running SimplifyConstantConditionalsPass (nvvm-reflect-pp) after reflect resolution. After NVVMReflect replaces __nvvm_reflect("__CUDA_FTZ") with the constant 1, the IR contains icmp ne i32 1, 0 feeding a conditional branch. If no pass simplifies this to an unconditional branch, the dead code path survives through the rest of the pipeline, consuming compile time in every subsequent pass and inflating the final binary. While standard LLVM SimplifyCFG will eventually handle it, the dedicated nvvm-reflect-pp pass provides immediate cleanup at the point where it matters most.
4. Returning 0 for unknown query strings instead of propagating a diagnostic. The pass returns 0 for any unrecognized __nvvm_reflect query string. This is the correct behavior (documented default), but a reimplementation that raises an error or leaves the call unresolved will break forward compatibility: future CUDA toolkit versions may introduce new query strings that libdevice checks. The value 0 is the safe default because libdevice code always treats 0 as "feature not available" and falls back to the conservative code path.
5. Reading the SM version from the wrong source. The reflect query values flow through three layers: CLI (-arch=compute_90), EDG frontend (-R __CUDA_ARCH=900), and optimizer (-opt-arch=sm_90). The NVVMReflect pass must read the SM version from the target machine configuration (the optimizer-level value), not from the -R preprocessor flags. A reimplementation that reads from the wrong layer may get a stale or mismatched value, especially in LTO scenarios where the preprocessor flags were consumed during an earlier compilation phase.
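Pitfalls 1 and 4 can be folded into a tiny model of the reflect resolver. The following sketch captures the arch encoding and the default-0 fallback described above; the function and table names are illustrative, not recovered from the binary:

```python
# Minimal model of __nvvm_reflect query resolution.
# Encoding rule: __CUDA_ARCH = major * 100 + minor * 10 (pitfall 1);
# unknown queries resolve to 0, never an error (pitfall 4).
def cuda_arch_value(major: int, minor: int) -> int:
    return major * 100 + minor * 10

def resolve_reflect(query: str, sm_major: int, sm_minor: int, ftz: bool) -> int:
    table = {
        "__CUDA_ARCH": cuda_arch_value(sm_major, sm_minor),
        "__CUDA_FTZ": 1 if ftz else 0,
    }
    # 0 means "feature not available" -- libdevice falls back to the
    # conservative code path, preserving forward compatibility.
    return table.get(query, 0)
```

The same encoding is what the frontend exposes as `__CUDA_ARCH__`, which is the consistency requirement pitfall 1 describes.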
Cross-References
- Optimizer Pipeline -- NVVMReflect pipeline positions and the NVVMPassOptions system
- NVIDIA Custom Passes -- registry of all NVIDIA-proprietary passes
- NVVM Intrinsic Constant-Fold Eligibility (K02) -- sub_14D90D0, the companion pass that checks whether an intrinsic can be constant-folded (NVVMReflect calls are resolved before K02 runs)
- Architecture Detection -- the sub_95EB40 table that maps CLI flags to __CUDA_ARCH values
- Optimization Levels -- how NVVMReflect placement varies across O0/O1/O2/O3 and fast-compile tiers
NVVM IR Verifier (Deep Dive)
The NVVM IR Verifier (nvvm-verify) is NVIDIA's three-layer correctness gate that runs between optimization passes throughout the CICC pipeline. Unlike LLVM's generic Verifier pass, which validates structural IR invariants, this pass enforces the complete NVVM IR contract: valid target triples, legal address space usage, architecture-gated intrinsic availability, MMA dimension/type constraints, function attribute restrictions, and atomic operation rules. It is the single largest verification subsystem in CICC at approximately 230KB across three cooperating functions. The verifier is inserted at roughly a dozen points in every optimization tier, guarded only by NVVMPassOptions[600] (disable). Every NVVM intrinsic call, every address space cast, and every unsupported CPU-oriented feature triggers a check here; failure produces a diagnostic message and sets the module error flag, but compilation continues to collect as many errors as possible in a single run.
Key Facts
| Property | Value |
|---|---|
| Pass name | nvvm-verify |
| Pass class | llvm::NVVMIRVerifierPass |
| Registration | sub_2342890 (New PM), sub_12E54A0 (pipeline builder) |
| Entry point | sub_12D4560 |
| Module verifier | sub_2C80C90 (51KB, ~1671 lines) |
| Function verifier | sub_2C771D0 (36KB, ~1165 lines) |
| Intrinsic verifier | sub_2C7B6A0 (143KB, ~4139 lines) |
| Binary size | ~230KB decompiled |
| Pipeline slot | ~12 per tier (O1-O3), after GVN, after DSE, after LICM, etc. |
| Disable flag | NVVMPassOptions[600] (bool) |
| Primary knobs | nvvm-verify-show-info |
| Error model | Accumulate-and-continue (no early abort) |
| SM encoding | Internal SM * 10 (e.g., sm_90 = 900) at context offset +8 |
| Upstream equivalent | None -- fully proprietary |
Three-Layer Verification Architecture
The pass operates as three nested verification functions. The module verifier is the entry point; it calls the function verifier once per function, and the function verifier dispatches to the intrinsic verifier for every intrinsic call instruction.
sub_2C80C90 (NVVMModuleVerifier)
|
+-- Validate data layout string
+-- Validate target triple against whitelist
+-- sub_2C797D0() for each global variable
+-- sub_2C7A130() for each function declaration
+-- sub_2C7AA20() for each named metadata node
|
+-- For each function:
| |
| +-- sub_2C771D0 (NVVMFunctionVerifier)
| | +-- Cluster dimension validation (Hopper+ gate)
| | +-- Parameter width validation (>=32-bit or sext/zext)
| | +-- Function attribute rejection (17 attributes)
| | +-- Entry/exit handler constraints
| |
| +-- For each instruction in each basic block:
| |
| +-- Switch on opcode 0x1E..0x60
| +-- Opcode 0x55 (intrinsic call) --> sub_2C7B6A0
| (NVVMIntrinsicVerifier, 143KB)
| +-- Switch on intrinsic ID
| +-- SM version gate checks
| +-- Type, address space, constant arg validation
| +-- MMA shape/type cross-validation
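The nesting above can be modeled as three functions sharing one context. A schematic sketch -- the data shapes, names, and the two example checks are invented for illustration, only the call structure mirrors the recovered design:

```python
# Schematic of the three-layer dispatch: module -> function -> intrinsic.
# Instructions are modeled as dicts; opcode 0x55 marks an intrinsic call.
def verify_module(ctx, module):
    for func in module["functions"]:
        verify_function(ctx, func)
        for inst in func["instructions"]:
            if inst["opcode"] == 0x55:          # intrinsic call
                verify_intrinsic(ctx, inst)

def verify_function(ctx, func):
    # Example function-level check: cluster dims gated on Hopper+ (sm >= 900).
    if func.get("cluster_dims") and ctx["sm"] <= 899:
        ctx["errors"].append("Cluster dimensions ... pre-Hopper Architectures")

def verify_intrinsic(ctx, inst):
    # Example intrinsic-level check: SM version gate.
    if inst.get("min_sm", 0) > ctx["sm"]:
        ctx["errors"].append(f"{inst['name']}: not supported on this SM")
```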
Context Object Layout
All three verifiers share a context object passed as the first argument:
| Offset | Type | Field |
|---|---|---|
| 0 | int32 | Mode (0 = standard, 1 = UnifiedNVVMIR) |
| 4 | int32 | Flags |
| 8 | int32 | SM version (SM * 10, e.g., 900 for sm_90) |
| 16 | ptr | Error flag pointer (set on any validation failure) |
| 24 | ptr | Error output stream |
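The recovered layout can be written down as an explicit struct. A sketch using Python's ctypes to confirm the offsets -- field names are descriptive guesses (the binary is stripped), and the 4-byte pad at offset 12 is implied by the jump from the int32 at offset 8 to the pointer at offset 16:

```python
import ctypes

# Mirrors the recovered 32-byte context layout on x86-64.
class VerifierContext(ctypes.Structure):
    _fields_ = [
        ("mode",       ctypes.c_int32),   # 0 = standard, 1 = UnifiedNVVMIR
        ("flags",      ctypes.c_int32),
        ("sm_version", ctypes.c_int32),   # SM * 10, e.g. 900 for sm_90
        ("_pad",       ctypes.c_int32),   # alignment padding (inferred)
        ("error_flag", ctypes.c_void_p),  # set on any validation failure
        ("error_out",  ctypes.c_void_p),  # error output stream
    ]
```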
Target Triple Whitelist
The module verifier validates the module's target triple against two whitelists depending on mode.
UnifiedNVVMIR Mode (mode == 1) -- Exact Match
Eight triples are accepted:
| Triple | Arch | API |
|---|---|---|
| nvptx-nvidia-cuda | 32-bit PTX | CUDA |
| nvptx64-nvidia-cuda | 64-bit PTX | CUDA |
| nvptx-nvidia-nvcl | 32-bit PTX | OpenCL |
| nvptx64-nvidia-nvcl | 64-bit PTX | OpenCL |
| nvsass-nvidia-cuda | SASS direct | CUDA |
| nvsass-nvidia-nvcl | SASS direct | OpenCL |
| nvsass-nvidia-directx | SASS direct | DirectX |
| nvsass-nvidia-spirv | SASS direct | SPIR-V |
The nvsass triples confirm that CICC can compile directly to native GPU assembly (SASS) without the PTX intermediate step, and can do so for DirectX shader and SPIR-V/Vulkan shader pipelines. This reveals CICC's role in NVIDIA's shader compiler toolchain beyond CUDA.
Failure message: "Invalid target triple".
Standard Mode (mode != 1) -- Prefix + Suffix Match
The triple must begin with "nvptx-" or "nvptx64-" and end with "-cuda". The middle component is wildcarded.
Failure message: "Invalid target triple (<actual>), must be one of:" followed by "nvptx-*-cuda" and "nvptx64-*-cuda".
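The two-mode check reduces to an exact-match set plus a prefix/suffix test. A sketch (set contents come from the table above; the function name is illustrative):

```python
# Sketch of the two-mode triple validation described above.
UNIFIED_TRIPLES = {
    "nvptx-nvidia-cuda", "nvptx64-nvidia-cuda",
    "nvptx-nvidia-nvcl", "nvptx64-nvidia-nvcl",
    "nvsass-nvidia-cuda", "nvsass-nvidia-nvcl",
    "nvsass-nvidia-directx", "nvsass-nvidia-spirv",
}

def triple_is_valid(triple: str, mode: int) -> bool:
    if mode == 1:                       # UnifiedNVVMIR: exact match
        return triple in UNIFIED_TRIPLES
    # Standard mode: nvptx[64]- prefix, -cuda suffix, vendor wildcarded
    return (triple.startswith(("nvptx-", "nvptx64-"))
            and triple.endswith("-cuda"))
```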
Data Layout Validation
If the module's data layout string is empty: "Empty target data layout, must exist".
Otherwise, sub_2C74F70 parses and validates the layout. On failure, the verifier prints "Example valid data layout:" with reference strings from:
| Global | Description |
|---|---|
| off_4C5D0A0 | 32-bit layout example |
| off_4C5D0A8 | 64-bit layout example |
| off_4C5D070 | 64-bit with mixed pointer widths (p3:32:32:32) |
Per-Instruction Validation (Module Verifier)
After calling sub_2C771D0 for function-level checks, the module verifier iterates every instruction in every basic block and dispatches on the LLVM IR opcode. The opcode range is 0x1E through 0x60:
| Opcode | IR Instruction | Validation |
|---|---|---|
| 0x1F | call (non-intrinsic) | Calls sub_2C795F0. Checks for "pragma" metadata; rejects "unroll" pragma with: "pragma unroll is not supported. Please use llvm.loop.unroll.count instead". Validates branch pragma operand count. |
| 0x21 | indirectbr | Rejected via sub_2C76F10(ctx, "indirectbr", instr) |
| 0x22 | invoke | Rejected via sub_2C76F10(ctx, "invoke", instr) |
| 0x23 | resume | Rejected via sub_2C76F10(ctx, "resume", instr) |
| 0x3C | alloca | Alignment must be <= 2^23. Address space must be Generic (AS 0): "Allocas are not supported on address spaces except Generic" |
| 0x3D | load | Rejects atomic loads: "Atomic loads/stores are not supported". Rejects tensor memory (AS 6): "Tensor Memory loads/stores are not supported" |
| 0x3E | store | Same atomic and tensor memory checks as load |
| 0x40 | fence | In UnifiedNVVMIR mode: only acq_rel and seq_cst allowed. Otherwise: rejected entirely via sub_2C76F10 |
| 0x41 | cmpxchg | Only i32/i64/i128 types. Pointer must be in generic, global, or shared AS |
| 0x42 | (GEP/addrspacecast helper) | Calls sub_2C7AF00 |
| 0x4F | addrspacecast | Validates source and target AS are in range. "Cannot cast non-generic pointer to different non-generic pointer" -- at least one side must be AS 0 (generic) |
| 0x55 | call (intrinsic) | Dispatches to sub_2C7B6A0 (NVVMIntrinsicVerifier) |
| 0x5F | landingpad | Rejected: "landingpad" unsupported |
The unsupported instructions -- indirectbr, invoke, resume, landingpad -- are CPU exception-handling features with no GPU equivalent. Their rejection at the IR level prevents downstream passes from encountering them.
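A sketch of this dispatch for a few representative opcodes. The opcode constants and error strings come from the table above; the context shape and function name are invented for illustration:

```python
# Opcodes rejected outright, per the per-instruction dispatch table.
REJECTED = {0x21: "indirectbr", 0x22: "invoke",
            0x23: "resume", 0x5F: "landingpad"}

def check_instruction(ctx, opcode, props):
    if opcode in REJECTED:
        ctx["errors"].append(f"{REJECTED[opcode]} is not supported")
    elif opcode == 0x3C:                       # alloca
        if props.get("addrspace", 0) != 0:
            ctx["errors"].append(
                "Allocas are not supported on address spaces except Generic")
    elif opcode in (0x3D, 0x3E):               # load / store
        if props.get("atomic"):
            ctx["errors"].append("Atomic loads/stores are not supported")
        if props.get("addrspace") == 6:        # tensor memory
            ctx["errors"].append(
                "Tensor Memory loads/stores are not supported")
```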
Address Space Casting Rules
The addrspacecast validation enforces NVIDIA's GPU address space model:
Rule: At least one operand of addrspacecast must be AS 0 (generic).
Non-generic-to-non-generic casts are illegal.
Legal:   addrspacecast i32 addrspace(0)* %p to i32 addrspace(1)*  ; generic -> global
Legal:   addrspacecast i32 addrspace(3)* %p to i32 addrspace(0)*  ; shared -> generic
Illegal: addrspacecast i32 addrspace(3)* %p to i32 addrspace(1)*  ; shared -> global
The valid address space range check uses the expression ((AS + ~2) & 0xFFFFFF) > 2, which means AS values 0 (generic), 1 (global), and 3 (shared) are always valid for atomic and cast operations. AS 2 (constant) and higher values have restricted usage contexts.
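The core casting rule reduces to a one-line predicate. A sketch (constant and function names are illustrative):

```python
# Sketch of the addrspacecast rule: at least one side must be generic.
GENERIC, GLOBAL, CONSTANT, SHARED = 0, 1, 2, 3

def addrspacecast_is_legal(src_as: int, dst_as: int) -> bool:
    # Non-generic-to-non-generic casts are illegal.
    return src_as == GENERIC or dst_as == GENERIC
```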
Function Attribute Rejection
The function verifier (sub_2C771D0) rejects 17 LLVM function attributes that have no GPU meaning. Each is identified by its LLVM attribute kind ID:
| Attr ID | Attribute Name | Error Message |
|---|---|---|
| 4 | builtin | "builtin function attribute is not supported." |
| 17 | jumptable | "jumptable function attribute is not supported." |
| 20 | naked | "naked function attribute is not supported." |
| 23 | nobuiltin | "nobuiltin function attribute is not supported." |
| 30 | noimplicitfloat | "noimplicitfloat function attribute is not supported." |
| 35 | noredzone | "noredzone function attribute is not supported." |
| 42 | nonlazybind | "nonlazybind function attribute is not supported." |
| 53 | returns_twice | "returns_twice function attribute is not supported." |
| 55 | safestack | "safestack function attribute is not supported." |
| 56 | sanitize_address | "sanitize_address function attribute is not supported." |
| 59 | sanitize_memory | "sanitize_memory function attribute is not supported." |
| 63 | sanitize_thread | "sanitize_thread function attribute is not supported." |
| 69 | ssp | "ssp function attribute is not supported." |
| 70 | sspreq | "sspreq function attribute is not supported." |
| 71 | sspstrong | "sspstrong function attribute is not supported." |
| 86 | alignstack | "alignstack function attribute is not supported." |
| 95 | uwtable | "uwtable function attribute is not supported." |
These attributes fall into four categories: (1) CPU ABI (naked, alignstack, noredzone), (2) security hardening (ssp/sspreq/sspstrong, safestack, sanitizers), (3) EH-related (uwtable, returns_twice, personality), and (4) linker features (jumptable, nonlazybind, builtin, nobuiltin). None have GPU equivalents.
Additional Function-Level Checks
| Check | Error Message | Notes |
|---|---|---|
| Cluster dimensions on pre-Hopper | "Cluster dimensions and cluster maximum blocks are not supported on pre-Hopper Architectures" | SM version <= 899 (i.e., before sm_90) |
| Cluster dims on non-kernel | "Cluster dimensions and cluster maximum blocks are only allowed for kernel functions" | Checked via sub_CE9220 |
| Partial zero cluster dims | "If any cluster dimension is specified as 0 then all other dimensions must be specified as 0" | |
| Zero max cluster blocks | "Cluster maximum blocks must be non-zero" | |
| Narrow int param without sign attr | "Integer parameter less than 32-bits without sext/zext flag" | PTX requires >=32-bit params |
| Narrow int return without sign attr | "Integer return less than 32-bits without sext/zext flag" | |
| InReg attribute | "InReg attribute on parameter will be ignored" | Warning only |
| Nest attribute | "Nest attribute on parameter will be ignored" | Warning only |
| Explicit section | "Explicit section marker <name> is not allowed." | |
| Explicit alignment | "Explicit alignment is not allowed." | |
| Prefix data | "Prefix data is not allowed." | CPU feature |
| Prologue data | "Prologue data is not allowed." | CPU feature |
| Personality function | "Personality function is not allowed." | EH feature |
| GC names | "GC names are not supported." | |
| Non-void kernel/entry | "non-void entry function." | Return type must be void |
| Entry with params | "entry function with parameters." | Non-kernel entries only |
| Non-void exit handler | "non-void exit handler function." | |
| Exit handler with params | "exit handler function with parameters." | |
Architecture Gates (SM-Gated Features)
The intrinsic verifier (sub_2C7B6A0) uses the SM version stored at context offset +8 (encoded as SM*10) to gate feature availability. The threshold checks use <=, so e.g. <= 899 means "below sm_90".
| SM Gate | Threshold | Intrinsics / Features | Error Message |
|---|---|---|---|
| sm_70 (Volta) | <= 699 | llvm.nvvm.branch.if.all.convergent (ID 0x205A) | "...not supported on pre-Volta Architectures" |
| sm_72 (Volta+) | <= 719 | llvm.nvvm.cvt base conversion (ID 0x2106) | "this instrinsic is only supported for Volta (sm_72)+" |
| sm_75 (Turing) | <= 749 | cvt extended types -- BF16, TF32 conversions (within ID 0x2106) | "conversion type only supported for Turing (sm_75)+" |
| sm_80 (Ampere) | <= 799 | llvm.nvvm.branch.if.convergent (ID 0x205B) | "...not supported on pre-Ampere Architectures" |
| sm_89 (Ada) | <= 889 | Extended type conversion intrinsic (ID 0x2107) | "this instrinsic is only supported for Ada (sm_89)+" |
| sm_90 (Hopper) | <= 899 | TMA, async copy (IDs 0x2279, 0x232D), cluster dims, bulk async (IDs 0x244D-0x2459, 0x2487-0x2489) | "this intrinsic is only supported for Hopper+" |
| sm_90 (Hopper) | <= 899 | 64-bit pointer requirement for TMA | "this intrinsic is only supported when pointer size is >= 64 bits" |
| sm_100+ (Blackwell) | <= 1199 | .offset.bindless intrinsics (checked via sub_CEA320) | ".offset.bindless intrinsics are not supported on pre-Blackwell architectures" |
Note the typo "instrinsic" in the Volta and Ada messages -- this is present in the binary. The Blackwell gate threshold of 1199 means the .offset.bindless intrinsics are available on sm_120 (value 1200) and above, covering all Blackwell-generation architectures including consumer (sm_120/121) and datacenter (sm_100/103).
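Under the SM*10 encoding, each gate is an inclusive upper bound on rejected versions. A sketch of the gate logic (thresholds from the table above; the "tma.bulk.async" key is a placeholder label, not a recovered intrinsic name):

```python
# SM gates compare the SM*10 encoding against inclusive rejection bounds:
# a gate of 899 rejects everything below sm_90 (value 900).
GATES = {
    "llvm.nvvm.branch.if.all.convergent": 699,   # pre-Volta rejected
    "llvm.nvvm.branch.if.convergent":     799,   # pre-Ampere rejected
    "tma.bulk.async":                     899,   # pre-Hopper rejected (placeholder name)
}

def gate_ok(intrinsic: str, sm_times_10: int) -> bool:
    # Ungated intrinsics (threshold 0) pass on every architecture.
    return sm_times_10 > GATES.get(intrinsic, 0)
```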
Intrinsic Verification Categories
The intrinsic verifier is a single monolithic switch on the NVVM internal intrinsic ID (stored at function value offset +36). The 143KB function covers 26+ validation categories:
A. Constant Argument Validation
Many NVVM intrinsics require one or more arguments to be compile-time constants (typically mode selectors, masks, or task IDs):
- "arg0 of intrinsic not constant"
- "op0 of intrinsic not constant" / "op1 of intrinsic not constant"
- "Flag argument must be an immediate."
- "the task_id parameter must be constant"
- "the mask parameter must be constant"
- "Mode operand must be constant"
B. Rounding Mode Validation
Rounding mode encoding: bits[2:0] of the mode word
Valid range: 1..4 (round-to-nearest-even, round-down, round-up, round-to-zero)
Reject: value == 0 or value > 4
Message: "rounding mode not a valid value"
C. Subword Mode Validation
For conversion intrinsics that operate on sub-word portions:
Source subword mode: bits[9:7], valid range 0..2
Dest subword mode: bits[12:10], valid range 0..2
Messages: "src subword mode not a valid value"
"dest subword mode not a valid value"
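The rounding and subword bit ranges above can be checked with straightforward masking. A sketch (function names are illustrative):

```python
# Decode of the conversion mode word per the recovered bit ranges:
# rounding mode in bits[2:0], source subword in bits[9:7],
# destination subword in bits[12:10].
def decode_mode_word(mode: int):
    return {
        "rounding":    mode         & 0x7,   # valid 1..4
        "src_subword": (mode >> 7)  & 0x7,   # valid 0..2
        "dst_subword": (mode >> 10) & 0x7,   # valid 0..2
    }

def mode_word_is_valid(mode: int) -> bool:
    f = decode_mode_word(mode)
    return (1 <= f["rounding"] <= 4
            and f["src_subword"] <= 2
            and f["dst_subword"] <= 2)
```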
D. Reserved Bits Checking
Multiple locations verify that high/reserved bits in mode words are zero:
"reserved flag bits used"
This prevents future-proofing conflicts if NVIDIA later assigns meaning to currently reserved fields.
E. Address Space Validation
Intrinsics that access memory enforce specific address space requirements:
| Check | Message |
|---|---|
| Global pointer required | "pointer address space not global" |
| Invalid arg1 address space | "arg1 invalid addrspace" |
| Arg0 must be pointer | "arg0 of intrinsic not pointer" |
| Constant AS required | "Operand must be in constant address space" |
| Memcpy/memmove targets constant AS | "memmove/memcpy cannot target constant address space" |
| Memset targets constant AS | "memset cannot point to constant address space" |
| Stack ops require local AS (5) | "llvm.nvvm.stackrestore is only supported with local address space pointers" |
| Stack ops require local AS (5) | "llvm.nvvm.stacksave is only supported with local address space pointers" |
F. Type Validation
| Check | Message |
|---|---|
| bswap operand | "Invalid type for bswap, need i16, i32, or i64" |
| ctpop/ctlz/cttz operand | "Invalid type for ctpop/ctlz/cttz, need i8, i16, i32, ..." (i64) |
| Arithmetic overflow | "Invalid type for arithmetic overflow intrinsic, need i16, i32, or i64" |
| Inline asm type | "Invalid type in inline assembly, must be i1, i8, i16, i32, i64, float, or double" |
| MMA element | "op1 of intrinsic not containing f32 or i32 element" |
Inline assembly type validation uses a bitmask check: valid bit widths are 1, 8, 16, 32, 64 (encoded as 0x1000000010001 for fast lookup).
G. Atomic Intrinsic Validation
| Check | Message |
|---|---|
| CAS opcode mismatch | "the opcode of atomic_cas must be CAS" |
| RMW opcode error | "the opcode of atomic_rmw must not be CAS, CAST or CAST_SPIN" |
| CAST opcode error | "the opcode of atomic_cast must be CAST or CAST_SPIN" |
| CAST type restriction | "atomic.cast only overloads on i32 and i64" |
| CAST pointer restriction | "atomic.cast is only allowed on shared pointers" |
| CAST ordering restriction | "atomic.cast works on shared memory, so cannot be ordered" |
| Global ordering scope | "Global ordering on atomics is only allowed on generic/global pointers" |
| Ordering mode | "ordering mode not a valid value" |
| Scope mode | "scope mode not a valid value" |
| Cache hint | "Cache operation hint not a valid value" |
| Operation mode | "operation mode not a valid value" |
H. Texture/Surface Validation
| Check | Message |
|---|---|
| Texture dimensionality | "dimensionality not a valid value" |
| LOD adjust | "LOD Adjust mode not a valid value" |
| Binding mode | "Binding Mode is not a valid value" |
| Border mode | "border mode not a valid value" |
| Address mode | "address mode not a valid value" |
| Scope | "scope not a valid value" |
| Semantic mode | "semantic mode not a valid value" |
| Query mode | "query mode is not a valid value" |
| Handle source | "Op0 of nvvm.texsurf.handle must be a metadata wrapper around a tex/surf GlobalVariable" |
| Deprecated desc | "Desc parameter is deprecated and should be undef." (IDs 8937, 9549) |
I. SATF (Saturate-to-Float) Validation
For math intrinsics with saturation control (IDs 0x2281-0x229C, covering fma/mul/add variants):
Message: "satf operand must be a constant zero"
The satf parameter was deprecated but the intrinsic signatures retain it for ABI compatibility. The verifier enforces it must be zero.
J. Constant Load Validation
For ID 0x2310 (constant bank load):
| Check | Message |
|---|---|
| Load kind | "Invalid constant load kind" |
| Bound bank type | "Bound bank must be i32" |
| Bindless bank type | "Bindless bank must be i64" |
K. TMA/Shared Memory Validation
For IDs 0x2319-0x231B:
| Check | Message |
|---|---|
| Column-major restriction | "ColMajor is not supported for this size" |
| Size encoding | "Invalid size" (bits[3:1] > 4) |
L. Load Bounds Check
For ID 0x231C:
Validation: (value & 7) must be <= 2
Message: "invalid load bounds check type"
Also: "pointer address space not global"
M. Convergent Branch Result Validation
For IDs 8282 (llvm.nvvm.branch.if.all.convergent) and 8283 (llvm.nvvm.branch.if.convergent):
Message: "result of llvm.nvvm.branch.if.convergent and
llvm.nvvm.branch.if.all.convergent can only be
used by exactly one branch instruction"
This enforces that the convergent branch intrinsic's boolean result flows directly to a single terminator branch, preventing misuse that would break convergence guarantees.
N. MMA (Matrix Multiply-Accumulate) Validation
The most complex validation category (ID 0x2366 = 9062). Validates WMMA/MMA intrinsics against a multidimensional constraint space:
Opcode byte encoding:
| Byte | Bits | Field |
|---|---|---|
| byte0 | [2:0] | Rounding mode |
| byte0 | [7:4] | MMA opcode |
| byte1 | all | A matrix element type (1-13, lookup via dword_43A2620) |
| byte2 | all | B matrix element type |
| byte4 | all | MNK dimension encoding (cases 1-0x19) |
| byte5 | all | Additional type info |
MNK dimension decoding (selected cases):
| Encoding | M | N | K | Notes |
|---|---|---|---|---|
| 1 | 8 | 8 | 8 | Legacy HMMA |
| 0x10 | 16 | 8 | 8 | |
| 0x17 | 16 | 8 | 16 | |
| 0x18 | 32 | 8 | 8 | |
| 0x19 | 16 | 8 | 16 | |
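The recovered byte4 cases can be expressed as a lookup. A sketch, assuming byte4 occupies bits [39:32] of the opcode word (the byte-to-bit mapping beyond the table above is an assumption):

```python
# Lookup for the byte4 MNK encodings recovered above (selected cases only).
MNK_TABLE = {
    0x01: (8, 8, 8),      # legacy HMMA
    0x10: (16, 8, 8),
    0x17: (16, 8, 16),
    0x18: (32, 8, 8),
    0x19: (16, 8, 16),
}

def decode_mnk(opcode_word: int):
    byte4 = (opcode_word >> 32) & 0xFF     # assumed position of byte4
    if byte4 not in MNK_TABLE:
        raise ValueError("Invalid MMA MNK")
    return MNK_TABLE[byte4]
```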
Validation checks:
| Check | Message |
|---|---|
| MNK dimensions | "Invalid MMA MNK" |
| A element type | "Invalid MMA AType" |
| Fragment A bit width | "Invalid MMA FragASize" |
| Fragment B bit width | "Invalid MMA FragBSize" |
| Fragment C bit width | "Invalid MMA FragCSize" |
| Fragment A IR type | "Invalid fragA type" |
| Rounding mode | "Invalid MMA Rounding Mode" |
| MMA opcode | "Invalid MMA Opcode" |
| A/B type match | "Mismatched MMA A B Type" |
| Fragment element consistency | "Mismatched fragA, fragB and fragC element type" |
O. Type Conversion Validation
For IDs 0x2106 and 0x2107:
Conversion type: bits[3:1], must be 1..4
Messages: "conversion type not a valid value"
"Invalid dst type" / "Invalid src type"
"Src and dst type must be different types"
"Src and dst type must be different bit widths"
P. Other Validation Categories
| Category | IDs | Key Messages |
|---|---|---|
| Coroutine | -- | "llvm.nvvm.coro.create.suspend must have exactly one argument, which must be a constant integer" |
| Subop mode | 9383-9384 | "Invalid subop mode" (bits[3:1] > 5) |
| Geometry output | -- | "geometry out mode not a valid value", "op1 of GeometryOut intrinsic must be constant when CUT mode", "op1 of GeometryOut intrinsic must be 0 when CUT mode" |
| Syncwarp | -- | "syncwarp mode not a valid value" |
| Cache operations | -- | "invalid cache type", "invalid cache op" |
| Wait intrinsic | -- | "Invalid wait mode" |
| ISBE | 0x2BC1 (11201) | "Only writes to MAP or ATTR are supported", "Cannot write to input ISBE" |
| Unsupported fallback | -- | "Unsupported intrinsic: <name>" |
Cmpxchg Restrictions
The module verifier enforces strict constraints on cmpxchg:
Allowed types: i32, i64, i128
Allowed spaces: generic (AS 0), global (AS 1), shared (AS 3)
Messages:
"Atomic operations on non-i32/i64/i128 types are not supported"
"cmpxchg pointer operand must point to generic, global, or shared address space"
This rules out i8/i16 atomics (hardware does not support sub-word CAS) and atomics on constant/local address spaces.
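A sketch of the two cmpxchg checks (type widths and address space numbers from the rules above; the accumulating return style mirrors the verifier's error model):

```python
# Sketch of the cmpxchg constraints: i32/i64/i128 only, and the pointer
# must live in generic (0), global (1), or shared (3) address space.
ALLOWED_WIDTHS = {32, 64, 128}
ALLOWED_SPACES = {0, 1, 3}

def check_cmpxchg(bit_width: int, addrspace: int):
    errors = []
    if bit_width not in ALLOWED_WIDTHS:
        errors.append(
            "Atomic operations on non-i32/i64/i128 types are not supported")
    if addrspace not in ALLOWED_SPACES:
        errors.append("cmpxchg pointer operand must point to "
                      "generic, global, or shared address space")
    return errors
```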
Tensor Memory Restrictions
Load and store instructions targeting address space 6 (tensor memory) are rejected at the IR level:
Message: "Tensor Memory loads/stores are not supported"
Tensor memory access is handled through dedicated intrinsics (TMA/cp.async) rather than generic load/store instructions. The verifier enforces this indirection.
Pipeline Placement
The NVVMVerifier is inserted repeatedly throughout the optimization pipeline, not just once. In the pipeline assembler (sub_12E54A0), it appears after nearly every major optimization pass, gated by !NVVMPassOptions[600]:
| Position | After Pass | Notes |
|---|---|---|
| 10 (O1 tier) | GVN | Verify IR after value numbering |
| After DSE | Dead Store Elimination | Verify after store removal |
| After EarlyCSE | Early CSE | O2+ only |
| After LoopIndexSplit | Loop Index Split | O2+ only |
| After NVVMReflect | NVVM Reflect | Common tail |
| After LICM | Loop-Invariant Code Motion | Common tail |
| After LowerSwitch | Switch lowering | Final position in common tail |
This aggressive re-verification catches bugs introduced by any optimization pass. In debug/development builds, this is the primary mechanism for detecting optimizer-introduced IR invalidity.
Configuration
| Knob | Storage | Type | Default | Description |
|---|---|---|---|---|
| NVVMPassOptions[600] | opts array | bool | false | When true, disables ALL NVVMVerifier insertions in the pipeline |
| nvvm-verify-show-info | ctor_257 | bool | false | Enables informational messages (e.g., "IR Kind is UnifiedNVVMIR") |
Diagnostic Infrastructure
Error messages are produced through a chain of helper functions:
| Function | Role |
|---|---|
| sub_2C764C0 | Create diagnostic message with severity level |
| sub_2C76A00 | Create error diagnostic for a specific instruction |
| sub_2C76240 | Flush diagnostic to error stream |
| sub_2C76F10 | Report an unsupported instruction by name (takes a string literal like "indirectbr") |
| sub_904010 | Append string to diagnostic buffer |
| sub_CB6200 | Write raw bytes to output buffer |
| sub_CB5AE0 | Flush buffer |
The error model is accumulate-and-continue: the verifier sets the error flag at context offset +16 and writes the diagnostic, but does not abort. This allows a single verification run to report all errors in the module.
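A sketch of this accumulate-and-continue model (class and function names are illustrative; the flag-plus-stream pairing mirrors the context fields at offsets +16 and +24):

```python
# Every failure sets the error flag and records a diagnostic, but
# verification never aborts, so one run reports every problem.
class DiagSink:
    def __init__(self):
        self.error_flag = False
        self.messages = []

    def report(self, msg: str):
        self.error_flag = True
        self.messages.append(msg)

def run_checks(sink, checks):
    for ok, msg in checks:
        if not ok:
            sink.report(msg)      # record and keep going
    return not sink.error_flag
```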
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| NVVMModuleVerifier | sub_2C80C90 | 51KB | Module entry: triples, data layout, per-instruction dispatch |
| NVVMFunctionVerifier | sub_2C771D0 | 36KB | Function-level: attributes, params, cluster dims, entry funcs |
| NVVMIntrinsicVerifier | sub_2C7B6A0 | 143KB | Intrinsic-level: SM gates, types, MMA, atomics, tex/surf |
| NVVMVerifier pass wrapper | sub_12D4560 | small | Pipeline entry point, creates context, invokes module verifier |
| Verify global variable | sub_2C797D0 | -- | Per-global validation |
| Verify function declaration | sub_2C7A130 | -- | Checks function declarations (not definitions) |
| Verify named metadata | sub_2C7AA20 | -- | Named metadata validation |
| Verify address space cast | sub_2C7AF00 | -- | addrspacecast / GEP rule checker |
| Verify generic call | sub_2C795F0 | -- | Non-intrinsic call validation, pragma check |
| Report unsupported instruction | sub_2C76F10 | -- | Produces "<name> is not supported" diagnostics |
| Is kernel function? | sub_CE9220 | -- | Checks kernel calling convention |
| Extract cluster dimensions | sub_CE8EA0 | -- | Reads cluster dims from function metadata |
| Extract cluster max blocks | sub_CE9030 | -- | Reads max cluster blocks from metadata |
| Check function attribute | sub_A73ED0 | -- | Tests presence of attribute by ID |
| Is .offset.bindless? | sub_CEA320 | -- | Blackwell gate predicate |
| Get intrinsic name string | sub_BD5D20 | -- | Returns intrinsic name for error messages |
| Get integer bit width | sub_BCAE30 | -- | Type query helper |
| Compute total bit width | sub_CA1930 | -- | Aggregate/vector width computation |
Cross-References
- GPU Target Architecture -- SM table and architecture gating
- Hopper (sm_90) -- TMA, cluster operations, WGMMA
- Blackwell (sm_100) -- tcgen05, .offset.bindless
- Memory Space Optimization -- address space enforcement and resolution
- NVIDIA Custom Passes index -- pass inventory
- IP Memory Space Propagation -- inter-procedural address space analysis
NVVM Intrinsic Lowering
The NVVMIntrinsicLowering pass is a pattern-matching rewrite engine that transforms NVVM intrinsic calls into equivalent sequences of standard LLVM IR operations. NVVM IR uses hundreds of target-specific intrinsics (llvm.nvvm.*) for GPU-specific operations -- texture/surface access, warp shuffles, type conversions, wide vector manipulations, barrier synchronization, and tensor core primitives. These intrinsics encode NVIDIA-specific semantics that have no direct LLVM IR equivalent. This pass bridges the gap: it matches each intrinsic call against a database of lowering rules and, when a match is found, replaces the call with a combination of standard LLVM instructions (shufflevector, extractelement, insertelement, bitcast, arithmetic) that express the same semantics in a form amenable to standard LLVM optimization passes.
The pass runs repeatedly throughout the pipeline -- up to 10 times in the "mid" compilation path -- because other optimization passes (NVVMReflect, InstCombine, inlining) can expose new intrinsic calls or simplify existing ones into forms that become lowerable. Two distinct invocation levels exist: level 0 for basic intrinsic lowering, and level 1 for barrier-related intrinsic lowering that must happen after barrier analysis infrastructure is in place.
| Property | Value |
|---|---|
| Pass factory | sub_1CB4E40 (creates pass instance with level parameter) |
| Core engine | sub_2C63FB0 (140KB, 2,460 lines) |
| Pass type | FunctionPass (Legacy PM) |
| Registration | Legacy PM only (not separately registered in New PM); invoked from pipeline assembler |
| Runtime positions | Tier 1/2/3 #1, #3, #28, #50, #64 (level 1); "mid" path has 4 level-0 invocations (see Pipeline) |
| NVVMPassOptions slot | 99 (offset 2000, BOOL_COMPACT, default = 0 = enabled) |
| Disable flag | opts[2000] = 1 disables all invocations |
| Level parameter | 0 = basic lowering, 1 = barrier-aware lowering |
| Iteration limit | 30 (global qword_5010AC8) |
| Upstream equivalent | None -- entirely NVIDIA-proprietary |
| Address range | 0x2C4D000--0x2C66000 (lowering engine cluster) |
Algorithm
Entry and Dispatch
The pass factory sub_1CB4E40 takes a single integer parameter -- the lowering level. Level 0 performs basic intrinsic lowering (type conversions, vector decomposition, shuffle lowering). Level 1 adds barrier-related intrinsic lowering that depends on barrier analysis having already run. The factory allocates a pass object and stores the level in it; the pass entry point reads this level to filter which intrinsics are candidates for lowering.
At the core engine (sub_2C63FB0), the entry check validates that the instruction is an intrinsic call: the byte at call->arg_chain->offset_8 must equal 17 (intrinsic call marker), and call->offset_16 must be non-null (the callee exists). If either check fails, the function returns 0 (no lowering performed).
Pattern-Matching Rewrite Loop
The algorithm operates as a worklist-driven rewrite system:
function lowerIntrinsic(ctx, call, level):
if not isIntrinsicCall(call): return 0
if not hasCallee(call): return 0
operands = collectOperands(call) // v285/v286 arrays
worklist_direct = [] // v288: direct operand replacements
worklist_typed = [] // v294: type-changed operands
worklist_shuf = [] // v300: shuffle/reorganized operands
iterations = 0
while iterations < qword_5010AC8: // default 30
iterations++
// Phase 1: build candidate lowerings
candidates = buildCandidates(operands) // sub_2C4D470
for each candidate in candidates:
pattern = extractPattern(candidate) // sub_2C4D5A0
// Phase 2: type compatibility check
if not checkTypeCompat(pattern, operands): // sub_AD7630
continue
// Phase 3: operand matching
if not matchOperands(pattern, operands):
continue
// Phase 4: additional pattern checks
if not additionalChecks(pattern): // sub_2C50020
continue
// Phase 5: core lowering -- create replacement
replacement = buildReplacement( // sub_2C515C0
ctx, operands,
worklist_direct, worklist_typed, worklist_shuf)
// Phase 6: substitute
replaceAllUses(call, replacement) // sub_BD84D0
transferMetadata(call, replacement) // sub_BD6B90
queueForDeletion(call) // sub_F15FC0
return 1
return 0 // no lowering found within iteration limit
The iteration limit of 30 (stored in qword_5010AC8) exists because lowering one intrinsic can produce new intrinsic calls that themselves need lowering. For example, lowering a wide vector intrinsic into narrower operations may produce calls to narrower intrinsics. Without the limit, pathological patterns could cause infinite expansion. In practice, most intrinsics lower in a single iteration; the limit is a safety net.
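The bounded fixpoint behavior can be sketched in Python. The rule table and operation names below are invented stand-ins for NVVM intrinsics; only the iteration-limit structure mirrors the recovered engine:

```python
# Minimal sketch of the bounded rewrite loop. Each rule maps one
# "intrinsic" to replacement operations that may themselves be
# intrinsics, mirroring how lowering a wide vector op can emit
# narrower intrinsic calls that need another iteration.
MAX_ITERATIONS = 30  # mirrors qword_5010AC8

RULES = {
    "wide.op.v4": ["narrow.op", "narrow.op", "narrow.op", "narrow.op"],
    "narrow.op": ["scalar.op"],  # lowers again on a later iteration
}

def lower_all(worklist):
    iterations = 0
    while any(op in RULES for op in worklist):
        if iterations >= MAX_ITERATIONS:
            break  # safety net against pathological expansion
        iterations += 1
        next_list = []
        for op in worklist:
            next_list.extend(RULES.get(op, [op]))
        worklist = next_list
    return worklist, iterations

ops, n = lower_all(["wide.op.v4"])
# ops is now four copies of "scalar.op", reached in 2 iterations
```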
Three Worklist Structures
The rewrite engine maintains three parallel worklist structures that categorize how operands are transformed:
| Worklist | Variable | Purpose |
|---|---|---|
| Direct | v288 | Operands that pass through unchanged -- same value, same type |
| Type-changed | v294 | Operands that need a type conversion (e.g., NVVM-specific type to standard LLVM type) |
| Shuffle/reorganized | v300 | Operands that need positional rearrangement (vector lane reordering, element extraction) |
When sub_2C515C0 builds the replacement instruction, it reads all three worklists to assemble the final operand list: direct operands are copied verbatim, type-changed operands go through a bitcast or type conversion, and shuffle operands are processed through a shufflevector or extractelement/insertelement sequence.
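A minimal sketch of that operand assembly, assuming a simplified model where each worklist entry maps to one operand-producing action; the tuple encoding is illustrative, not the recovered data layout:

```python
# Sketch of how the three worklists could drive operand construction:
# direct entries pass through, type-changed entries get a conversion
# wrapper, shuffle entries get a positional extraction.
def build_replacement_operands(direct, typed, shuffled):
    ops = []
    for v in direct:
        ops.append(v)                            # copied verbatim
    for v, new_ty in typed:
        ops.append(("bitcast", v, new_ty))       # type conversion
    for v, lane in shuffled:
        ops.append(("extractelement", v, lane))  # lane rearrangement
    return ops

ops = build_replacement_operands(
    direct=["%a"],
    typed=[("%b", "float")],
    shuffled=[("%v", 2)],
)
# -> ['%a', ('bitcast', '%b', 'float'), ('extractelement', '%v', 2)]
```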
Lowering Categories
Vector Operation Decomposition
Wide vector NVVM intrinsics (operating on v4f32, v2f64, v4i32, etc.) are decomposed into sequences of narrower operations. The NVVM IR frontend emits vector intrinsics to express data-parallel GPU operations, but the NVPTX backend's instruction selector handles scalar or narrow-vector operations more efficiently.
The decomposition pattern:
// Before: single wide-vector intrinsic call
%result = call <4 x float> @llvm.nvvm.wide.op(<4 x float> %a, <4 x float> %b)
// After: four scalar operations + vector reconstruction
%a0 = extractelement <4 x float> %a, i32 0
%a1 = extractelement <4 x float> %a, i32 1
%a2 = extractelement <4 x float> %a, i32 2
%a3 = extractelement <4 x float> %a, i32 3
%b0 = extractelement <4 x float> %b, i32 0
...
%r0 = call float @llvm.nvvm.narrow.op(float %a0, float %b0)
%r1 = call float @llvm.nvvm.narrow.op(float %a1, float %b1)
...
%v0 = insertelement <4 x float> undef, float %r0, i32 0
%v1 = insertelement <4 x float> %v0, float %r1, i32 1
...
This decomposition enables scalar optimizations (constant folding, CSE) to work on individual lanes, and the narrower intrinsics may themselves lower in subsequent iterations -- hence the iteration limit.
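The semantics of the decomposition can be checked with a small simulation. The wide/narrow operation names are placeholders; the real pass emits extractelement/insertelement IR rather than Python lists:

```python
# Simulation of the decomposition's semantics: applying a scalar op
# per lane and reassembling reproduces the wide vector op.
def wide_op(a, b):
    return [x + y for x, y in zip(a, b)]   # stand-in for the v4f32 intrinsic

def narrow_op(x, y):
    return x + y                           # stand-in for the scalar intrinsic

def decomposed(a, b):
    lanes = [narrow_op(a[i], b[i]) for i in range(4)]  # extract + scalar call
    result = [0.0] * 4                                 # "undef" vector
    for i, r in enumerate(lanes):
        result[i] = r                                  # insertelement chain
    return result

a, b = [1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]
assert decomposed(a, b) == wide_op(a, b)
```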
Shuffle Vector Lowering
When an NVVM intrinsic performs pure data reorganization -- lane permutation, broadcast, or subvector extraction -- without any arithmetic, the pass replaces it with an LLVM shufflevector instruction. The core lowering for this path goes through sub_DFBC30, which takes:
sub_DFBC30(context, operation=6, type_info, shuffle_indices, count, flags)
The operation=6 constant identifies this as a shufflevector creation. The shuffle_indices array encodes the lane mapping: for a warp shuffle that broadcasts lane 0 to all lanes, the mask would be <0, 0, 0, 0, ...>. For a rotation, it might be <1, 2, 3, 0>.
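A shufflevector mask is just an index map into the source lanes. A sketch covering the two masks mentioned above:

```python
# A shuffle mask selects, per output lane, which input lane to read.
def shufflevector(vec, mask):
    return [vec[i] for i in mask]

v = [10, 20, 30, 40]
assert shufflevector(v, [0, 0, 0, 0]) == [10, 10, 10, 10]  # broadcast lane 0
assert shufflevector(v, [1, 2, 3, 0]) == [20, 30, 40, 10]  # rotation
```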
Shuffle lowering handles several NVVM intrinsic families:
- Warp-level shuffle operations (__shfl_sync, __shfl_up_sync, etc.) when the shuffle amount is a compile-time constant
- Subvector extraction from wider types (e.g., extracting the low v2f16 from a v4f16)
- Lane broadcast patterns used in matrix fragment loading
Type Conversion Lowering
NVVM defines intrinsic-based type conversions for types that LLVM's standard type system does not directly support, such as:
- BF16 (bfloat16) to/from FP32 -- intrinsic ID 0x2106, gated by sm_72+
- TF32 (tensorfloat32) conversions -- intrinsic ID 0x2106 with conversion type 3+, gated by sm_75+
- FP8 (E4M3/E5M2) conversions -- intrinsic ID 0x2107, gated by sm_89+ (Ada)
- Extended type conversions with saturate, rounding mode control
The lowering replaces these intrinsic calls with sequences of:
- bitcast for reinterpretation between same-width types
- fptrunc/fpext for standard floating-point width changes
- trunc/zext/sext for integer width changes
- Arithmetic sequences for rounding mode emulation when the hardware rounding mode is not directly expressible
The sub_2C52B30 helper ("get canonical type") resolves NVVM-specific type encodings to their standard LLVM Type* equivalents during this process.
Multi-Run Pattern
NVVMIntrinsicLowering appears more times in the compilation pipeline than any other NVIDIA custom pass. In the "mid" path (standard CUDA compilation), it runs approximately 10 times across the main path and the Tier 1/2/3 sub-pipelines. The pattern reveals a deliberate interleaving strategy.
"mid" Path Invocations (Level 0)
All four invocations in the main "mid" path use level 0 and are guarded by !opts[2000]:
| Position | Context | Preceding Pass | Following Pass | Purpose |
|---|---|---|---|---|
| 1st | Early pipeline | ConstantMerge | MemCpyOpt | Lower intrinsics from the original IR before SROA/GVN operate |
| 2nd | After InstCombine + standard pipeline #5 | LLVM standard #5 | DeadArgElim | Re-lower intrinsics that InstCombine may have simplified or inlining may have exposed |
| 3rd | After NVVMReflect + standard pipeline #8 | LLVM standard #8 | IPConstProp | Lower intrinsics whose arguments became constant after NVVMReflect resolved __nvvm_reflect() calls |
| 4th | Late pipeline | LICM | NVVMBranchDist | Final cleanup of any remaining lowerable intrinsics before register-pressure-sensitive passes |
Tier 1/2/3 Invocations (Level 1)
Within the sub_12DE8F0 tier sub-pipeline, the pass runs with level 1 at five distinct points:
| Position | Context | Notes |
|---|---|---|
| 1st | Tier entry | Immediately at tier start -- lower barrier-related intrinsics before barrier analysis |
| 2nd | After 1st NVVMIRVerification | Re-lower after verification may have canonicalized IR |
| 3rd | After CVP + NVVMVerifier + NVVMIRVerification | Post-optimization cleanup of barrier intrinsics |
| 4th | After LoopUnswitch + standard pipeline #1 | Re-lower intrinsics exposed by loop transformations |
| 5th | After DSE + DCE + standard pipeline #1 | Final tier cleanup before MemorySpaceOpt |
Each tier (1, 2, and 3) runs this same sequence independently, so in a full compilation with all three tiers active, level-1 lowering executes up to 15 times total.
Level Parameter Semantics
The level parameter partitions the intrinsic lowering rules into two sets:
Level 0 -- Basic lowering. Handles intrinsics whose lowering depends only on the intrinsic's operands and types. This includes vector decomposition, shuffle lowering, and standard type conversions. These are safe to run at any point in the pipeline because they have no dependencies on analysis results. The "mid" path runs level 0 exclusively.
Level 1 -- Barrier-aware lowering. Handles intrinsics related to synchronization barriers (__syncthreads, __syncwarp, barrier-guarded memory operations) whose lowering must coordinate with the barrier analysis infrastructure. In the tier sub-pipeline, level 1 runs at the entry point before NVVMBarrierAnalysis and NVVMLowerBarriers, and again after those passes have run. This two-phase pattern within the tier ensures that:
- Barrier intrinsics are lowered to a canonical form that the barrier analysis can recognize
- After barrier analysis and lowering, any residual barrier-related intrinsics are cleaned up
The reason level 1 is restricted to tiers rather than the main "mid" path: the tier sub-pipeline (sub_12DE8F0) sets up the barrier analysis state (via sub_18E4A00 / sub_1C98160) that level 1 lowering depends on. Running level 1 in the main path before this state exists would produce incorrect results.
Interaction with NVVMReflect
NVVMReflect resolves compile-time queries about the target GPU architecture:
%arch = call i32 @llvm.nvvm.reflect(metadata !"__CUDA_ARCH__")
; After NVVMReflect: %arch = i32 900 (for sm_90)
This resolution has a cascading effect on intrinsic lowering. Many NVVM intrinsics are conditionally emitted by the frontend behind architecture checks:
if (__CUDA_ARCH__ >= 900) {
// Hopper-specific intrinsic
__nvvm_tma_load_async(...);
} else {
// Fallback path using standard loads
}
After NVVMReflect replaces the architecture query with a constant, and nvvm-reflect-pp (SimplifyConstantConditionalsPass) eliminates the dead branch, the surviving path may contain intrinsics that were previously unreachable. The pipeline runs NVVMIntrinsicLowering after NVVMReflect specifically to catch these newly-exposed intrinsics. This is why the 3rd invocation in the "mid" path immediately follows NVVMReflect + LLVM standard pipeline #8.
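The cascade can be modeled in a few lines. This is a toy model; the function and intrinsic names below are illustrative, not recovered code:

```python
# Toy model of the NVVMReflect cascade: once the arch query folds to a
# constant, one branch becomes dead, and the surviving branch's
# intrinsics become visible to the next NVVMIntrinsicLowering run.
def nvvm_reflect(query, sm_arch):
    return sm_arch if query == "__CUDA_ARCH__" else 0

def resolve_branch(sm_arch):
    arch = nvvm_reflect("__CUDA_ARCH__", sm_arch)  # NVVMReflect resolution
    if arch >= 900:                                # now a constant condition;
        return ["llvm.nvvm.tma.load.async"]        # dead-branch elimination
    return ["standard.load"]                       # keeps only one side

assert resolve_branch(900) == ["llvm.nvvm.tma.load.async"]
assert resolve_branch(750) == ["standard.load"]
```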
Configuration
NVVMPassOptions
| Slot | Offset | Type | Default | Semantics |
|---|---|---|---|---|
| 98 | 1976 | STRING | (empty) | Paired string parameter for the pass (unused or reserved) |
| 99 | 2000 | BOOL_COMPACT | 0 | Disable flag: 0 = enabled, 1 = disabled |
Setting slot 99 to 1 disables all invocations of NVVMIntrinsicLowering across the entire pipeline -- both level 0 and level 1. There is no mechanism to disable one level independently.
Global Variables
| Variable | Default | Purpose |
|---|---|---|
| qword_5010AC8 | 30 | Maximum iterations per invocation of the rewrite loop |
This global is not exposed as a user-facing knob. It is initialized at program startup and is constant for the lifetime of the process.
Key Helper Functions
Pattern Matching
| Function | Role |
|---|---|
| sub_2C4D470 | Build candidate lowering list from intrinsic operands |
| sub_2C4D5A0 | Extract pattern from candidate -- returns the lowering rule |
| sub_2C50020 | Additional pattern compatibility checks beyond type matching |
| sub_2C52B30 | Get canonical LLVM type for an NVVM-specific type encoding |
| sub_AD7630 | Type-lowering query -- checks if source type can lower to target type |
Instruction Construction
| Function | Role |
|---|---|
| sub_2C515C0 | Build replacement instruction from three worklist structures |
| sub_2C4FB60 | Opcode dispatch -- selects the LLVM opcode for the lowered operation |
| sub_DFBC30 | Create shufflevector or similar vector IR construct (operation=6) |
IR Mutation
| Function | Role |
|---|---|
| sub_BD84D0 | Replace all uses of old instruction with new value |
| sub_BD6B90 | Transfer metadata from old instruction to replacement |
| sub_F15FC0 | Queue old instruction for deletion |
Pass Infrastructure
| Function | Address | Role |
|---|---|---|
| sub_1CB4E40 | 0x1CB4E40 | Pass factory -- creates pass with level parameter |
| sub_2C63FB0 | 0x2C63FB0 | Core lowering engine (140KB, 2,460 lines) |
Diagnostic Strings
The core engine at sub_2C63FB0 contains no user-visible diagnostic strings. This is unusual for a 140KB function and reflects the fact that intrinsic lowering is a mechanical pattern-matching operation: either a lowering rule matches (silently applied) or it does not (silently skipped). Failures are not reported because an unlowered intrinsic is not necessarily an error -- it may be handled by a later pass (NVVMLowerBarriers, GenericToNVVM) or by the NVPTX instruction selector directly.
The pass factory sub_1CB4E40 similarly contains no diagnostic strings.
Pipeline Position Summary
sub_12E54A0 (Master Pipeline Assembly)
│
├─ "mid" path (level 0, 4 invocations):
│ ├─ #1: After ConstantMerge, before MemCpyOpt/SROA
│ ├─ #2: After InstCombine + LLVM standard #5
│ ├─ #3: After NVVMReflect + LLVM standard #8
│ └─ #4: After LICM, before NVVMBranchDist/Remat
│
├─ "ptx" path (level 0, 0 invocations):
│ └─ (not present -- PTX input already has intrinsics lowered)
│
├─ default path (level 0, 1 invocation):
│ └─ #1: After NVVMReflect, before NVVMPeephole
│
└─ Tier 1/2/3 sub-pipeline (level 1, 5 invocations per tier):
├─ #1: Tier entry
├─ #2: After NVVMIRVerification
├─ #3: After CVP + NVVMVerifier
├─ #4: After LoopUnswitch + LLVM standard #1
└─ #5: After DSE + DCE + LLVM standard #1
Cross-References
- LLVM Optimizer -- master pipeline assembly and tier system
- NVIDIA Custom Passes -- Inventory -- pass registry and classification
- Rematerialization -- runs after intrinsic lowering in "mid" path
- NVVM Peephole -- peephole patterns that may expose new lowerable intrinsics
- MemorySpaceOpt -- runs after level-1 lowering in tier sub-pipeline
FP128/I128 Emulation
No NVIDIA GPU in any SM generation has native 128-bit arithmetic hardware. Neither fp128 (IEEE 754 binary128) nor i128 (128-bit integer) operations can be lowered to PTX instructions directly. CICC handles this by replacing every fp128 and i128 operation in LLVM IR with a call to one of 48 distinct NVIDIA runtime library functions whose implementations live in a separate bitcode module. The pass at sub_1C8C170 walks each function in the module, inspects every instruction, dispatches on the LLVM opcode byte, and emits the appropriate __nv_* call in place of the original operation. This is a correctness-critical legalization pass -- if any fp128/i128 operation survives past it, instruction selection will abort because NVPTX has no patterns for 128-bit types.
The pass is structurally part of lower-ops (LowerOpsPass), NVIDIA's umbrella module pass for lowering operations that the NVPTX backend cannot handle natively. Within the lower-ops framework, sub_1C8C170 is the dedicated handler for 128-bit types. It runs as a module-level pass early in the pipeline, after libdevice linking and before the main optimization sequence, so that the generated calls can be inlined and optimized by subsequent passes.
| Entry point | sub_1C8C170 |
| Size | 25 KB (~960 lines decompiled) |
| Pass framework | Part of lower-ops / LowerOpsPass (module pass) |
| Registration | New PM slot 144 at sub_2342890; param enable-optimization |
| Runtime functions | 48 distinct __nv_* library calls |
| Upstream equivalent | None. Upstream LLVM lowers fp128 through SoftenFloat in type legalization. CICC replaces this with explicit call insertion at the IR level. |
Opcode Dispatch
The pass reads the LLVM instruction opcode from the byte at offset +16 of the instruction node and dispatches through a dense switch. The following table lists every handled opcode and the corresponding lowering action. All unlisted opcodes in the range 0x18--0x58 produce an early return (no 128-bit type involvement, or handled elsewhere).
| Opcode | LLVM Instruction | Lowering Target | Handler |
|---|---|---|---|
| 0x24 | fadd | __nv_add_fp128 | sub_1C8A5C0 |
| 0x26 | fsub | __nv_sub_fp128 | sub_1C8A5C0 |
| 0x28 | fmul | __nv_mul_fp128 | sub_1C8A5C0 |
| 0x29 | udiv | __nv_udiv128 | sub_1C8BD70 |
| 0x2A | sdiv | __nv_idiv128 | sub_1C8BD70 |
| 0x2B | fdiv | __nv_div_fp128 | sub_1C8A5C0 |
| 0x2C | urem | __nv_urem128 | sub_1C8BD70 |
| 0x2D | srem | __nv_irem128 | sub_1C8BD70 |
| 0x2E | frem | __nv_rem_fp128 | sub_1C8A5C0 |
| 0x36 | trunc/ext | Type-based conversion | sub_1C8ADC0 |
| 0x3F | fptoui | __nv_fp128_to_uint* or __nv_cvt_f*_u128_rz | sub_1C8ADC0 / sub_1C8BF90 |
| 0x40 | fptosi | __nv_fp128_to_int* or __nv_cvt_f*_i128_rz | sub_1C8ADC0 / sub_1C8BF90 |
| 0x41 | uitofp | __nv_uint*_to_fp128 or __nv_cvt_u128_f*_rn | sub_1C8ADC0 / sub_1C8BF90 |
| 0x42 | sitofp | __nv_int*_to_fp128 or __nv_cvt_i128_f*_rn | sub_1C8ADC0 / sub_1C8BF90 |
| 0x43 | fptrunc | __nv_fp128_to_float or __nv_fp128_to_double | sub_1C8ADC0 |
| 0x44 | fpext | __nv_float_to_fp128 or __nv_double_to_fp128 | sub_1C8ADC0 |
| 0x4C | fcmp | __nv_fcmp_* (predicate-selected) | dedicated |
Ignored opcode ranges: 0x18--0x23, 0x25, 0x27, 0x2F--0x35, 0x37--0x3E, 0x45--0x4B, 0x4D--0x58. Opcode 0x37 (store) receives a type check similar to 0x36's, but applied to the store target type.
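The dispatch amounts to a lookup from opcode byte to runtime function name. Below is a sketch restricted to the binary arithmetic cases from the table above; the dict-based encoding is illustrative, while the binary uses a dense switch:

```python
# Opcode-to-runtime-function mapping for the binary fp128/i128 cases,
# taken from the dispatch table recovered above.
DISPATCH = {
    0x24: "__nv_add_fp128", 0x26: "__nv_sub_fp128",
    0x28: "__nv_mul_fp128", 0x2B: "__nv_div_fp128",
    0x2E: "__nv_rem_fp128",
    0x29: "__nv_udiv128",   0x2A: "__nv_idiv128",
    0x2C: "__nv_urem128",   0x2D: "__nv_irem128",
}

def lower_binop(opcode):
    fn = DISPATCH.get(opcode)
    if fn is None:
        return None  # unhandled opcode: early return, no 128-bit lowering
    return fn

assert lower_binop(0x24) == "__nv_add_fp128"
assert lower_binop(0x2A) == "__nv_idiv128"
assert lower_binop(0x25) is None  # ignored range
```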
Library Function Inventory
FP128 Arithmetic (5 functions)
Binary operations on IEEE 754 binary128. Each takes two fp128 operands, returns fp128.
| Function | Operation | String Length |
|---|---|---|
| __nv_add_fp128 | fp128 addition | 14 |
| __nv_sub_fp128 | fp128 subtraction | 14 |
| __nv_mul_fp128 | fp128 multiplication | 14 |
| __nv_div_fp128 | fp128 division | 14 |
| __nv_rem_fp128 | fp128 remainder | 14 |
All five are lowered through sub_1C8A5C0, which constructs the call with a fixed string length of 14 characters.
I128 Division and Remainder (4 functions)
Integer division and remainder for i128. No native PTX instruction exists for 128-bit integer divide.
| Function | Operation | Signedness | String Length |
|---|---|---|---|
| __nv_udiv128 | i128 division | unsigned | 12 |
| __nv_idiv128 | i128 division | signed | 12 |
| __nv_urem128 | i128 remainder | unsigned | 12 |
| __nv_irem128 | i128 remainder | signed | 12 |
Lowered through sub_1C8BD70 with string length 12. Note: i128 add/sub/mul are NOT lowered here -- those can be decomposed into pairs of 64-bit operations by standard LLVM legalization. Only division and remainder require the runtime call path because they involve complex multi-word algorithms.
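Why the asymmetry: a 128-bit add has a fixed two-word decomposition (two 64-bit adds plus a carry), which standard legalization can emit inline, while division requires a variable-length multi-word algorithm. A sketch of the add case:

```python
# A 128-bit add decomposed into two 64-bit halves with carry propagation,
# the decomposition standard LLVM legalization emits inline.
MASK64 = (1 << 64) - 1

def add128(a_lo, a_hi, b_lo, b_hi):
    lo = (a_lo + b_lo) & MASK64
    carry = 1 if a_lo + b_lo > MASK64 else 0
    hi = (a_hi + b_hi + carry) & MASK64
    return lo, hi

# 2^64 - 1 plus 1 carries into the high word
lo, hi = add128(MASK64, 0, 1, 0)
assert (lo, hi) == (0, 1)
```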
FP128-to-Integer Conversions (10 functions)
Convert fp128 to integer types of various widths. The target width is determined via the bit-width test sub_1642F90 and the type's bit-width field (type_id >> 8).
| Function | Conversion |
|---|---|
| __nv_fp128_to_uint8 | fp128 -> i8 (unsigned) |
| __nv_fp128_to_uint16 | fp128 -> i16 (unsigned) |
| __nv_fp128_to_uint32 | fp128 -> i32 (unsigned) |
| __nv_fp128_to_uint64 | fp128 -> i64 (unsigned) |
| __nv_fp128_to_uint128 | fp128 -> i128 (unsigned) |
| __nv_fp128_to_int8 | fp128 -> i8 (signed) |
| __nv_fp128_to_int16 | fp128 -> i16 (signed) |
| __nv_fp128_to_int32 | fp128 -> i32 (signed) |
| __nv_fp128_to_int64 | fp128 -> i64 (signed) |
| __nv_fp128_to_int128 | fp128 -> i128 (signed) |
Integer-to-FP128 Conversions (10 functions)
Convert integer types to fp128.
| Function | Conversion |
|---|---|
| __nv_uint8_to_fp128 | i8 (unsigned) -> fp128 |
| __nv_uint16_to_fp128 | i16 (unsigned) -> fp128 |
| __nv_uint32_to_fp128 | i32 (unsigned) -> fp128 |
| __nv_uint64_to_fp128 | i64 (unsigned) -> fp128 |
| __nv_uint128_to_fp128 | i128 (unsigned) -> fp128 |
| __nv_int8_to_fp128 | i8 (signed) -> fp128 |
| __nv_int16_to_fp128 | i16 (signed) -> fp128 |
| __nv_int32_to_fp128 | i32 (signed) -> fp128 |
| __nv_int64_to_fp128 | i64 (signed) -> fp128 |
| __nv_int128_to_fp128 | i128 (signed) -> fp128 |
String lengths for both fp128-to-integer and integer-to-fp128 conversions vary from 18 to 21 characters depending on the function name. Lowered through sub_1C8ADC0.
FP128-to-Float/Double Conversions (4 functions)
Truncation and extension between fp128 and the native floating-point types.
| Function | Conversion | Opcode |
|---|---|---|
| __nv_fp128_to_float | fp128 -> float | 0x43 (fptrunc) |
| __nv_fp128_to_double | fp128 -> double | 0x43 (fptrunc) |
| __nv_float_to_fp128 | float -> fp128 | 0x44 (fpext) |
| __nv_double_to_fp128 | double -> fp128 | 0x44 (fpext) |
I128-to-Float/Double Conversions (8 functions)
These handle the non-fp128 path: converting i128 directly to/from float/double without going through fp128 as an intermediate. The _rz suffix denotes round-toward-zero mode; _rn denotes round-to-nearest-even.
| Function | Conversion | Rounding | String Length |
|---|---|---|---|
| __nv_cvt_f32_u128_rz | i128 (unsigned) -> float | toward zero | 20 |
| __nv_cvt_f32_i128_rz | i128 (signed) -> float | toward zero | 20 |
| __nv_cvt_f64_u128_rz | i128 (unsigned) -> double | toward zero | 20 |
| __nv_cvt_f64_i128_rz | i128 (signed) -> double | toward zero | 20 |
| __nv_cvt_u128_f32_rn | float -> i128 (unsigned) | to nearest | 20 |
| __nv_cvt_i128_f32_rn | float -> i128 (signed) | to nearest | 20 |
| __nv_cvt_u128_f64_rn | double -> i128 (unsigned) | to nearest | 20 |
| __nv_cvt_i128_f64_rn | double -> i128 (signed) | to nearest | 20 |
All eight are lowered through sub_1C8BF90 with a fixed string length of 20 characters. The rounding mode choice is deliberate: _rz for integer-from-float (truncation semantics matching C/C++ cast behavior) and _rn for float-from-integer (IEEE 754 default rounding for conversions).
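The two rounding behaviors can be illustrated directly. Python's math.trunc and round() model _rz and _rn respectively for values where the arithmetic is exact:

```python
import math

def cvt_rz(x):
    # round toward zero: matches C/C++ float-to-int cast truncation
    return math.trunc(x)

def cvt_rn(x):
    # round to nearest, ties to even: the IEEE 754 default;
    # Python's round() implements exactly this tie-breaking rule
    return round(x)

assert cvt_rz(2.9) == 2
assert cvt_rz(-2.9) == -2   # toward zero, not toward negative infinity
assert cvt_rn(2.5) == 2     # ties go to even
assert cvt_rn(3.5) == 4
```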
The dispatch logic selects between the __nv_fp128_to_* / __nv_*_to_fp128 family and the __nv_cvt_* family based on whether the source or destination type is fp128 (type_id == 5). If neither operand is fp128 but one is i128, the __nv_cvt_* path is taken.
FP128 Comparison Predicates
The fcmp instruction (opcode 0x4C) is dispatched by extracting the comparison predicate from bits 0--14 of the halfword at instruction offset +18. Each LLVM fcmp predicate maps to a dedicated runtime function.
Ordered Comparisons (7 functions)
Ordered comparisons return false if either operand is NaN.
| Function | Predicate | Semantics |
|---|---|---|
| __nv_fcmp_oeq | oeq | ordered equal |
| __nv_fcmp_ogt | ogt | ordered greater-than |
| __nv_fcmp_oge | oge | ordered greater-or-equal |
| __nv_fcmp_olt | olt | ordered less-than |
| __nv_fcmp_ole | ole | ordered less-or-equal |
| __nv_fcmp_one | one | ordered not-equal |
| __nv_fcmp_ord | ord | ordered (neither is NaN) |
Unordered Comparisons (7 functions)
Unordered comparisons return true if either operand is NaN.
| Function | Predicate | Semantics |
|---|---|---|
| __nv_fcmp_uno | uno | unordered (either is NaN) |
| __nv_fcmp_ueq | ueq | unordered or equal |
| __nv_fcmp_ugt | ugt | unordered or greater-than |
| __nv_fcmp_uge | uge | unordered or greater-or-equal |
| __nv_fcmp_ult | ult | unordered or less-than |
| __nv_fcmp_ule | ule | unordered or less-or-equal |
| __nv_fcmp_une | une | unordered or not-equal |
The predicate naming follows the standard LLVM fcmp convention: o prefix = ordered, u prefix = unordered. The 14 predicates cover the complete set of IEEE 754 comparison semantics excluding true and false (which are constant-folded before reaching this pass). Each function takes two fp128 operands and returns i1.
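The ordered/unordered split reduces to how NaN operands are treated. A sketch for the eq pair; the other predicates follow the same pattern:

```python
import math

# Ordered: False if either operand is NaN; unordered: True if either is.
def fcmp_oeq(a, b):
    return not (math.isnan(a) or math.isnan(b)) and a == b

def fcmp_ueq(a, b):
    return math.isnan(a) or math.isnan(b) or a == b

nan = float("nan")
assert fcmp_oeq(1.0, 1.0) and not fcmp_oeq(nan, 1.0)
assert fcmp_ueq(nan, 1.0) and fcmp_ueq(1.0, 1.0)
assert not fcmp_ueq(1.0, 2.0)
```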
Trunc/Ext Handling (Opcode 0x36)
The trunc/zext/sext opcode path requires special logic because it must distinguish between genuine 128-bit truncation/extension and other type conversions that happen to use the same opcode.
sub_1C8C170::handle_trunc_ext(inst):
if sub_1642F90(*operand, 128): // Is the operand type 128-bit?
// Determine source and dest bit-widths from DataLayout
src_bits = type_id >> 8 // Bit-width encoded in high byte
dst_bits = target_type_id >> 8
if src_bits > dst_bits:
emit_truncation(inst, src_bits, dst_bits)
else:
emit_extension(inst, src_bits, dst_bits, is_signed)
elif type_id == 5: // fp128 type marker
emit_fp128_conversion(inst)
else:
return // Not a 128-bit operation
The type_id value 5 is the LLVM type tag for fp128 in CICC's internal representation (consistent with the type code table: 1=half, 2=float, 3=double, 4=fp80, 5=fp128, 6=bf16, 0xB=integer with bit-width at type_id >> 8).
Lowering Helpers
Four internal helper functions perform the actual call construction. Each creates a new CallInst with the library function name, replaces all uses of the original instruction with the call result, and erases the original instruction.
| Helper | Address | Purpose | Name Length |
|---|---|---|---|
| sub_1C8A5C0 | 0x1C8A5C0 | Binary fp128 arithmetic (add/sub/mul/div/rem) | 14 |
| sub_1C8BD70 | 0x1C8BD70 | Binary i128 division (udiv/idiv/urem/irem) | 12 |
| sub_1C8ADC0 | 0x1C8ADC0 | FP128 conversions (to/from all integer widths, to/from float/double) | 18--21 (varies) |
| sub_1C8BF90 | 0x1C8BF90 | I128-to/from-float/double conversions | 20 |
The "name length" column refers to the string length passed to the call construction routine. This is a fixed constant in each helper, not computed at runtime, which means the function name strings are embedded as literals in the binary (confirmed by string sweep at 0x1C8C170).
Each helper follows the same pattern:
helper(module, instruction, name_string, name_length):
// 1. Get or create function declaration in module
func = module.getOrInsertFunction(name_string, return_type, param_types...)
// 2. Build argument list from instruction operands
args = extract_operands(instruction)
// 3. Create CallInst
call = IRBuilder.CreateCall(func, args)
// 4. Replace uses and erase
instruction.replaceAllUsesWith(call)
instruction.eraseFromParent()
Libdevice Resolution
The 48 __nv_* functions emitted by this pass are not present in the standard libdevice.10.bc. The standard libdevice (455,876 bytes embedded at unk_3EA0080 / unk_420FD80) contains ~400+ math functions (__nv_sinf, __nv_expf, etc.) but does not include any fp128 or i128 emulation routines.
Instead, these functions are resolved through one of two mechanisms:
- Separate bitcode library: A dedicated 128-bit emulation bitcode module linked after lower-ops runs. This module contains the actual multi-word software implementations of 128-bit arithmetic using 64-bit operations.
- Late synthesis during type legalization: The SelectionDAG type legalization pass (SoftenFloat action) can also handle fp128 operations, but CICC's IR-level lowering preempts this by replacing operations before they reach the backend. The __nv_* functions, once declared in the module, must be resolvable at link time.
The call declarations emitted by the pass use external linkage, meaning the linker must supply definitions. If a definition is missing, the compilation will fail at the NVPTX link stage with an unresolved symbol error. The benefit of performing this lowering at the IR level rather than in SelectionDAG is that the resulting calls are visible to the LLVM optimizer: the inliner can inline the emulation routines, SROA can decompose the intermediate values, and the loop optimizers can hoist invariant 128-bit computations.
Configuration
The pass has no dedicated knobs. It is controlled indirectly through the lower-ops pass framework:
| Parameter | Effect |
|---|---|
enable-optimization | Parameter to LowerOpsPass registration (slot 144). When enabled, the lowered calls may be marked with optimization attributes. |
There are no knobs in knobs.txt specific to fp128 or i128 lowering. The pass runs unconditionally whenever lower-ops is in the pipeline -- there is no way to disable 128-bit emulation because leaving fp128/i128 operations in the IR would cause a fatal error in the NVPTX backend.
Diagnostic Strings
The pass itself emits no diagnostic messages or debug prints. All diagnostic information comes from the embedded function name strings:
"__nv_add_fp128" "__nv_sub_fp128" "__nv_mul_fp128"
"__nv_div_fp128" "__nv_rem_fp128"
"__nv_udiv128" "__nv_idiv128"
"__nv_urem128" "__nv_irem128"
"__nv_fp128_to_uint8" "__nv_fp128_to_int8"
"__nv_fp128_to_uint16" "__nv_fp128_to_int16"
"__nv_fp128_to_uint32" "__nv_fp128_to_int32"
"__nv_fp128_to_uint64" "__nv_fp128_to_int64"
"__nv_fp128_to_uint128" "__nv_fp128_to_int128"
"__nv_uint8_to_fp128" "__nv_int8_to_fp128"
"__nv_uint16_to_fp128" "__nv_int16_to_fp128"
"__nv_uint32_to_fp128" "__nv_int32_to_fp128"
"__nv_uint64_to_fp128" "__nv_int64_to_fp128"
"__nv_uint128_to_fp128" "__nv_int128_to_fp128"
"__nv_fp128_to_float" "__nv_fp128_to_double"
"__nv_float_to_fp128" "__nv_double_to_fp128"
"__nv_cvt_f32_u128_rz" "__nv_cvt_f32_i128_rz"
"__nv_cvt_f64_u128_rz" "__nv_cvt_f64_i128_rz"
"__nv_cvt_u128_f32_rn" "__nv_cvt_i128_f32_rn"
"__nv_cvt_u128_f64_rn" "__nv_cvt_i128_f64_rn"
"__nv_fcmp_oeq" "__nv_fcmp_ogt" "__nv_fcmp_oge"
"__nv_fcmp_olt" "__nv_fcmp_ole" "__nv_fcmp_one"
"__nv_fcmp_ord" "__nv_fcmp_uno" "__nv_fcmp_ueq"
"__nv_fcmp_ugt" "__nv_fcmp_uge" "__nv_fcmp_ult"
"__nv_fcmp_ule" "__nv_fcmp_une"
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| Main entry | sub_1C8C170 | 25 KB | Opcode dispatch, instruction walk, type checks |
| FP128 binary lowering | sub_1C8A5C0 | -- | Emits __nv_{add,sub,mul,div,rem}_fp128 calls |
| FP128 conversion lowering | sub_1C8ADC0 | -- | Emits __nv_fp128_to_* / __nv_*_to_fp128 calls |
| I128 division lowering | sub_1C8BD70 | -- | Emits __nv_{u,i}div128 / __nv_{u,i}rem128 calls |
| I128-float lowering | sub_1C8BF90 | -- | Emits __nv_cvt_* calls (rz/rn variants) |
| Type width check | sub_1642F90 | -- | Tests whether a type has a given bit-width (e.g., 128) |
Cross-References
- NVIDIA Custom Passes -- pass registry including lower-ops
- Other NVIDIA Passes -- summary entry for this pass
- Type Legalization -- SelectionDAG SoftenFloat path for fp128 (preempted by this pass)
- Libdevice Linking -- how the embedded libdevice is linked (standard math, not fp128)
- Cast Codegen -- EDG frontend cast generation, type tag 5 = fp128
- Struct Splitting -- sibling pass within the same address cluster
Struct/Aggregate Splitting
GPU register files are typed and scalar. An SM has no concept of loading a struct, storing a struct, or passing a struct through a register -- every value that survives past IR lowering must reduce to a set of individually-named scalar registers. LLVM's standard SROA pass handles alloca-based aggregates by promoting them to scalars, but a large class of aggregate operations never touch an alloca: return values, call arguments, PHI nodes carrying struct types, and aggregate load/store patterns from memcpy lowering. NVIDIA's struct-splitting pass operates on these non-alloca aggregate operations at the NVVM IR level, decomposing every struct-typed value into its constituent scalar fields so that downstream register allocation sees only scalar types.
The pass exists in two binary instances. The primary implementation at sub_1C86CA0 (72KB, ~1,200 lines, 500+ locals) lives in the aggregate-splitting cluster at 0x1C80000--0x1CBFFFF and operates on NVVM IR using NVIDIA-proprietary type IDs. A second, closely related implementation at sub_2CCF450 (58KB) handles the lower-aggr-copies pipeline pass and shares the same string constants ("splitStruct", "srcptr", "dstptr", "remsrc", "remdst", "split", "vld"). Both instances produce the same fundamental transformation: aggregate operations become sequences of scalar operations on individual struct elements.
Key Facts
| Property | Value |
|---|---|
| Entry point | sub_1C86CA0 |
| Size | 72KB (~1,200 lines decompiled), 500+ local variables |
| Binary cluster | 0x1C80000--0x1CBFFFF (Aggregate Splitting + Memory Ops) |
| Second instance | sub_2CCF450 (58KB, lower-aggr-copies pass) |
| Pipeline pass name | lower-aggr-copies (parameterized: lower-aggr-func-args) |
| Related pass | lower-struct-args (parameterized: opt-byval) |
| IR level | NVVM IR (NVIDIA-proprietary type IDs, not LLVM Type::TypeID) |
| Key opcode | 32 (splitStruct instruction) |
| Use replacement | sub_164D160 (RAUW -- Replace All Uses With) |
| LLVM upstream | No equivalent -- this is entirely NVIDIA-proprietary |
Algorithm
The pass walks every instruction in a function, looking for operations whose result type or operand type is an aggregate (struct or array). For each such operation, it decomposes the aggregate into its scalar elements, creates a splitStruct multi-output instruction, and rewires all uses to reference individual element extractions.
Step 1: Type Decomposition
For each struct type encountered, the pass retrieves the struct layout from the DataLayout and enumerates its elements:
function decomposeStructType(struct_type, data_layout):
layout = sub_1643350(data_layout, struct_type) // GetStructLayout
element_types = []
for each element in struct_type.elements:
scalar_ty = sub_159C470(element) // getScalarType
element_types.append(scalar_ty)
return element_types
sub_1643350 retrieves the StructLayout from the DataLayout, giving byte offsets and sizes for each field. sub_159C470 maps each element to its scalar type -- for nested structs, this recurses; for arrays, it yields the element type; for scalars, it returns the type directly.
The element types accumulate in a local array v505[] with the count tracked in v506. This flattened type list drives all subsequent instruction creation.
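The recursive flattening can be modeled as a small sketch. The tuple encoding and function name here are illustrative -- they mirror the behavior described for sub_159C470, not the binary's actual data structures:

```python
# Illustrative model of the recursive flattening done via sub_159C470:
# nested structs flatten in place, arrays yield their element type,
# scalars pass through (the v505[]/v506 accumulation).
def flatten_type(ty, out):
    kind = ty[0]
    if kind == "struct":
        for field in ty[1]:       # recurse into each field in order
            flatten_type(field, out)
    elif kind == "array":
        flatten_type(ty[1], out)  # arrays contribute their element type
    else:
        out.append(ty)            # primitive scalar or pointer
    return out

# struct {i32, {f32, f64}, i16} flattens to four scalars, not three elements
nested = ("struct", [("i32",), ("struct", [("f32",), ("f64",)]), ("i16",)])
print(flatten_type(nested, []))  # [('i32',), ('f32',), ('f64',), ('i16',)]
```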
Step 2: splitStruct Instruction Creation (Opcode 32)
The pass creates a new multi-output instruction with NVVM opcode 32:
function createSplitStruct(original_inst, element_types, count):
composite_ty = sub_15F9F50(element_types, count) // ComputeCompositeType
aligned_ty = sub_1646BA0(composite_ty, data_layout) // SetAlignmentFromDL
// If original was a vector type (type_id == 16), wrap in vector
if getTypeId(original_inst.type) == 16:
aligned_ty = sub_16463B0(aligned_ty) // WrapInVectorType
split_inst = sub_15F1EA0(aligned_ty, 32, parent, nops, flags)
// InitInstruction(opcode=32)
// Store original type info at inst+56, composite at inst+64
split_inst[+56] = original_type_info
split_inst[+64] = sub_15F9F50(composite_ty)
return split_inst
The splitStruct instruction is the NVVM-specific multi-result node that represents the decomposition. It produces N outputs, one per struct element. The instruction stores both the original aggregate type (at offset +56) and the composite element type (at offset +64) for later phases that may need to reconstruct type information.
Step 3: Element Pointer Extraction
For each element of the decomposed struct, the pass creates an indexed load from the splitStruct result:
for i in 0..count:
ptr = sub_15FD590(split_inst, element_types[i],
operand=i, name="ptr", insertion_point)
// Creates opcode 56 (extractvalue-like) with type=1
sub_15FD590 creates an instruction with opcode 56 that extracts the i-th element from the multi-output splitStruct node. The "ptr" name prefix appears in debug output. Each extraction yields a scalar-typed value that downstream passes can assign to an individual PTX register.
Step 4: Split Load with Alignment Preservation
For the actual memory access that feeds the splitStruct, the pass creates a split load instruction:
function createSplitLoad(original_load, element_types):
alignment = computeAlignment(original_load)
split_load = sub_15F90A0(element_types, alignment, ...)
additional_align = sub_1CCB4A0(data_layout, element_types)
final_align = alignment & (-additional_align) // min power-of-2
return split_load
The resulting instruction carries the "split" name prefix. The alignment computation is described in detail in the next section.
Step 5: Use Replacement
After creating all scalar operations, sub_164D160 (RAUW -- Replace All Uses With) replaces every use of the original aggregate operation with the corresponding scalar element extraction:
sub_164D160(original_aggregate_inst, split_inst)
This is the same RAUW infrastructure used across CICC (also called from GlobalOpt, DSE, the inliner, and other passes). After replacement, the original aggregate instruction has zero uses and is eligible for dead code elimination.
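A toy def-use model of the replacement step; the classes are illustrative, not the binary's structures:

```python
# Toy RAUW: every operand slot pointing at the old aggregate result is
# redirected to the replacement node, after which the original is dead.
class Node:
    def __init__(self, op, operands=()):
        self.op = op
        self.operands = list(operands)

def replace_all_uses(users, old, new):
    """Rewire operand slots pointing at `old` to `new`; return the count."""
    count = 0
    for user in users:
        for i, v in enumerate(user.operands):
            if v is old:
                user.operands[i] = new
                count += 1
    return count

agg = Node("load_struct")            # original aggregate operation
ret = Node("ret", [agg])             # a use of the aggregate
split = Node("splitStruct", [agg])   # replacement (opcode-32 analogue)
print(replace_all_uses([ret], agg, split), ret.operands[0].op)  # 1 splitStruct
```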
Alignment Preservation
The pass must preserve memory alignment when splitting aggregate loads/stores into per-element accesses. GPU memory transactions have strict alignment requirements: a misaligned access can silently produce wrong results or trap, depending on the address space and SM architecture.
The Alignment Formula
The decompiled alignment calculation is:
aligned_value = (1 << (alignment_field >> 1)) >> 1
Breaking this down:
- alignment_field >> 1 -- the alignment is stored in a compressed encoding where the field value is approximately 2 * log2(alignment) + bias; halving the field recovers log2(alignment) + 1.
- 1 << (result) -- converts back to a power of two, which at this point is twice the target alignment.
- >> 1 -- the final shift divides out that factor of two, yielding the alignment itself.
For example, if alignment_field = 9, then 9 >> 1 = 4, 1 << 4 = 16, 16 >> 1 = 8, yielding 8-byte alignment. This encoding is compact and used throughout NVVM's type system to store alignment in a single byte.
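A quick sketch of the decode, plus a guessed inverse. The exact bias and the meaning of the field's low bit are inferences from the worked example, not recovered constants:

```python
# Decode per the decompiled formula: (1 << (field >> 1)) >> 1.
def decode_alignment(field):
    return (1 << (field >> 1)) >> 1

# Assumed inverse: for a power-of-two alignment, field >> 1 must equal
# log2(align) + 1, so field = 2 * (log2(align) + 1); the low bit is
# ignored by the decode and may carry a separate flag.
def encode_alignment(align):
    return 2 * align.bit_length()

print(decode_alignment(9))  # 8, matching the worked example
for a in (1, 2, 4, 8, 16, 32, 64):
    assert decode_alignment(encode_alignment(a)) == a
```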
Additional Alignment Computation
sub_1CCB4A0 provides a DataLayout-aware alignment computation for the element type. The final alignment is the minimum of the original alignment and the element's natural alignment, computed via:
final_align = original_align & (-element_natural_align)
The bitwise AND with the negation of the element alignment selects the largest power-of-two that divides both values, ensuring the per-element access is always naturally aligned for its type without exceeding the original aggregate's alignment guarantee.
NVVM Type ID System
The pass operates on NVVM's proprietary type ID system, not LLVM's Type::TypeID. The size classification logic (decompiled lines 997--1030) reveals the mapping:
| NVVM Type ID | Type | Bit Width |
|---|---|---|
| 1 | BFloat16 (i8 pair with padding) | 16 |
| 2 | Float | 32 |
| 3 | Double / i64 (context-dependent) | 64 |
| 4 | x86_fp80 | 80 (with padding to 10 bytes) |
| 5, 6 | FP128 / PPC FP128 | 128 |
| 7 | Pointer | 8 * DataLayout::getPointerSizeInBits(0) |
| 9 | Float (alternate, possibly metadata) | 64 |
| 0xB (11) | Integer (arbitrary width) | element_encoding >> 8 |
| 0xD (13) | Array | 8 * DataLayout::getStructLayout(type) total size |
| 0xE (14) | Struct | Recursive sum of element sizes |
| 16 | Vector | Triggers vector-type wrapping via sub_16463B0 |
For struct types (ID 0xE), the size computation is recursive: the pass sums the sizes of all elements, each resolved through the same type-ID dispatch table. Array types (ID 0xD) use sub_15A9930 to look up the total allocation size from the DataLayout's StructLayout cache (which also handles arrays despite the name).
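The size dispatch can be sketched for the unambiguous rows of the table; the tuple encoding and parameterized pointer width are assumptions for illustration:

```python
# Sketch of the size dispatch over NVVM type IDs (per the table above).
def type_size_bits(ty, ptr_bits=64):
    tid = ty[0]
    if tid == 2:
        return 32                      # float
    if tid in (5, 6):
        return 128                     # fp128 / ppc_fp128
    if tid == 7:
        return ptr_bits                # pointer: DataLayout-dependent
    if tid == 0xB:
        return ty[1] >> 8              # integer: width encoded above bit 8
    if tid == 0xE:                     # struct: recursive sum of elements
        return sum(type_size_bits(e, ptr_bits) for e in ty[1])
    raise ValueError(f"unmodeled type ID {tid}")

i32 = (0xB, 32 << 8)
pair = (0xE, [i32, (2,)])   # struct {i32, float}
print(type_size_bits(pair))  # 64
```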
Nested Struct and Array Handling
When a struct element is itself a struct or an array, the pass recurses. The sub_159C470 (getScalarType) call during type decomposition flattens nested aggregates: a struct {i32, {f32, f64}, i16} decomposes not into three elements but into four scalars: i32, f32, f64, i16. The flattening continues until every element is a primitive scalar or a pointer.
Arrays within structs are handled differently depending on their size. Small arrays may be fully unrolled into individual element accesses. The size threshold is governed by the max-aggr-copy-size and large-aggr-store-limit knobs. Arrays that exceed the threshold are not decomposed into per-element loads but instead lowered to byte-copy loops (the "remsrc" / "remdst" / "i8dst" paths correspond to this remainder-byte handling when the aggregate cannot be evenly split into typed elements).
The remainder path:
- Computes the number of whole elements that can be extracted as typed loads.
- For any trailing bytes that do not fill a complete element, generates an i8 byte loop: "remsrc" is the source pointer for the remainder, "remdst" is the destination, and "i8dst" is the byte-typed destination pointer.
Relationship with SROA
LLVM's SROA (Scalar Replacement of Aggregates) and NVIDIA's struct splitting are complementary, not overlapping:
| Aspect | LLVM SROA | NVIDIA Struct Splitting |
|---|---|---|
| Target | alloca instructions in entry block | Non-alloca aggregate operations |
| Scope | Stack-allocated structs | Return values, call args, PHI nodes, memcpy results |
| IR level | LLVM IR (standard Type::TypeID) | NVVM IR (proprietary type IDs) |
| Pipeline position | Early scalar optimization passes | After LLVM optimization, NVVM lowering phase |
| Output | SSA scalars replacing alloca uses | splitStruct (opcode 32) multi-output nodes |
| Upstream | Standard LLVM pass | No upstream equivalent |
SROA runs during the standard LLVM optimization pipeline and eliminates alloca-based aggregates. By the time struct splitting runs, all remaining aggregate operations are those SROA could not handle: function return values carrying struct types, call sites passing or receiving struct-typed parameters, and aggregate-typed PHI nodes at control flow merges. Struct splitting is the final lowering step that ensures no aggregate-typed values survive into register allocation.
PTX Register Mapping
After struct splitting, every value in the IR is scalar-typed. During instruction selection and register allocation, each scalar maps to a PTX virtual register of the corresponding type:
// Before struct splitting:
%result = load {i32, f32, i64}, ptr %p, align 8
// After struct splitting:
%split = splitStruct {i32, f32, i64} // opcode 32, multi-output
%r0 = extractelement %split, 0 // i32 -> %r1 (32-bit register)
%r1 = extractelement %split, 1 // f32 -> %f1 (32-bit FP register)
%r2 = extractelement %split, 2 // i64 -> %rd1 (64-bit register)
In PTX, register types are explicit:
- %r registers: 32-bit integers
- %rd registers: 64-bit integers
- %f registers: 32-bit floats
- %fd registers: 64-bit floats
- %h registers: 16-bit values (half/bfloat)
- %p registers: predicates (1-bit)
Without struct splitting, the register allocator would need to handle aggregate-typed live ranges, which is impossible on GPU hardware where the register file has no concept of a "struct register." The pass is therefore a hard prerequisite for correct register allocation.
Pipeline Position
The pass runs as part of the NVVM lowering phase, after the main LLVM optimization pipeline has completed. It is registered as lower-aggr-copies in the New PM pipeline parser at index 417 (sub_2342890), with parameter lower-aggr-func-args controlling whether function argument aggregates are also lowered.
Pipeline position:
LLVM Optimizer (SROA, GVN, DSE, etc.)
-> NVIDIA NVVM Lowering Phase
-> lower-struct-args (opt-byval) [lower struct function args]
-> lower-aggr-copies (lower-aggr-func-args) [struct splitting]
-> memory-space-opt [address space resolution]
-> register allocation preparation
The companion pass lower-struct-args (pass index 418) handles byval-attributed function parameters specifically, converting struct-typed byval parameters into explicit copy + scalar access patterns. It runs before lower-aggr-copies to ensure that byval struct arguments are already decomposed when the main splitting pass encounters them.
Configuration
Knobs (ctor_265 at 0x4F48E0)
| Knob | Default | Description |
|---|---|---|
| devicefn-param-always-local | -- | Treat parameter space as local in device functions |
| skiploweraggcopysafechk | false | Skip safety check in aggregate copy lowering |
| large-aggr-store-limit | -- | Threshold for large aggregate store unrolling |
| max-aggr-copy-size | -- | Maximum aggregate size for full decomposition |
| lower-aggr-unrolled-stores-limit | -- | Limit on unrolled stores per aggregate copy |
InstCombine Aggregate Knobs (ctor_086 at 0x49E670)
| Knob | Default | Description |
|---|---|---|
| max-aggr-lower-size | 128 | Size threshold (bytes) below which InstCombine lowers aggregates |
| aggressive-max-aggr-lower-size | 256 | Aggressive threshold for aggregate lowering |
| instcombine-merge-stores-from-aggr | true | Merge stores originating from aggregate decomposition |
Related Passes
| Knob | Scope | Description |
|---|---|---|
| lsa-opt | lower-struct-args | Controls struct argument lowering |
| lower-read-only-devicefn-byval | lower-struct-args | Lower read-only device function byval params |
| hoist-load-param | lower-struct-args | Hoist parameter loads |
| nvptx-force-min-byval-param-align | backend | Force 4-byte minimum alignment for byval params |
| nvptx-early-byval-copy | backend | Copy byval arguments early in the pipeline |
Diagnostic Strings
"splitStruct" -- Name prefix for the opcode-32 multi-output node
"srcptr" -- Source pointer in aggregate copy lowering
"dstptr" -- Destination pointer in aggregate copy lowering
"remsrc" -- Remainder source pointer (byte-copy tail loop)
"remdst" -- Remainder destination pointer (byte-copy tail loop)
"i8dst" -- Byte-typed destination for remainder copies
"split" -- Name prefix for the per-element split load
"ptr" -- Name prefix for element pointer extractions
"vld" -- Vector load variant in the second instance
Function Map
Primary Instance (sub_1C86CA0, 72KB)
| Function | Address | Role |
|---|---|---|
| Main driver | sub_1C86CA0 | Top-level struct splitting pass |
| StructLayout query | sub_1643350 | DataLayout::getStructLayout |
| Scalar type query | sub_159C470 | Get scalar element type (recursive for nested structs) |
| Composite type creation | sub_15F9F50 | Build composite type from element array |
| Alignment from DL | sub_1646BA0 | Set type alignment from DataLayout |
| Vector type wrapping | sub_16463B0 | Wrap in vector type if original was vector |
| Instruction creation | sub_15F1EA0 | InitInstruction(type, opcode=32, parent, nops, flags) |
| Element extraction | sub_15FD590 | Create indexed load from multi-output node |
| Split load creation | sub_15F90A0 | Create load with alignment preservation |
| Alignment computation | sub_1CCB4A0 | DataLayout-aware alignment for element type |
| Use replacement | sub_164D160 | RAUW (Replace All Uses With) |
| Pointer size query | sub_15A9520 | DataLayout::getPointerSizeInBits(AS) |
| Struct size query | sub_15A9930 | DataLayout::getStructLayout for size lookup |
Second Instance (sub_2CCF450, 58KB)
| Function | Address | Role |
|---|---|---|
| Aggregate lowering | sub_2CCF450 | lower-aggr-copies pass implementation |
Pipeline Registration
| Function | Address | Role |
|---|---|---|
| New PM registration | sub_2342890 | Pass index 417 (lower-aggr-copies) |
| Parameter parser | sub_233A3B0 | Parses lower-aggr-func-args parameter |
| lower-struct-args parser | sub_233A370 | Parses opt-byval parameter |
Test This
The following kernel returns a struct from a device function. Struct splitting should decompose the aggregate return value into individual scalar registers.
struct Result {
float value;
int index;
float confidence;
};
__device__ Result compute(const float* data, int tid) {
Result r;
r.value = data[tid] * 2.0f;
r.index = tid;
r.confidence = 0.95f;
return r;
}
__global__ void struct_split_test(const float* in, float* out_val,
int* out_idx, float* out_conf, int n) {
int tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid >= n) return;
Result r = compute(in, tid);
out_val[tid] = r.value;
out_idx[tid] = r.index;
out_conf[tid] = r.confidence;
}
What to look for in PTX:
- The compute function should be inlined, but even if it is not, the struct return should be decomposed. Look for the absence of .local memory for the Result struct -- all three fields (value, index, confidence) should live in individual PTX registers (%f for floats, %r for int).
- No ld.local/st.local pairs for passing the struct between compute and the kernel. If the struct survives unsplit, the caller allocates local memory for the return value, the callee stores into it, and the caller loads from it -- a 200+ cycle penalty per field.
- In the PTX, the three stores to out_val, out_idx, out_conf should use values directly from registers without any intermediate local memory traffic. Look for st.global.f32 and st.global.u32 with register operands, not loaded-from-local operands.
- To see the unsplit case, make compute a __noinline__ function and compile at -O0. The struct will be passed through .param space with explicit st.param/ld.param sequences, showing the overhead that struct splitting eliminates.
Cross-References
- SROA -- upstream SROA handles alloca-based aggregates; complements StructSplitting's inter-procedural splitting
- Rematerialization -- struct splitting reduces aggregate live ranges before remat
- Memmove Unrolling -- companion pass at sub_1C82A50 that unrolls memmove/memcpy loops
- FP128/I128 Emulation -- companion pass at sub_1C8C170 in the same binary cluster
- Pipeline & Ordering -- pass ordering in the New PM pipeline
- NVIDIA Custom Passes Overview -- master inventory of all NVIDIA passes
- Code Generation -- register allocation that consumes split scalars
Memmove Unrolling
CUDA GPUs have no hardware instruction for bulk memory copy. On a CPU, memcpy and memmove compile down to optimized microcode sequences (REP MOVSB, AVX-512 scatter/gather, or libc hand-tuned SIMD loops). On an SM, every byte of a copy must pass through explicit load and store instructions executed by individual threads. LLVM's standard memcpy lowering in SelectionDAG produces reasonable load/store sequences, but it operates late in the pipeline and cannot reason about NVVM IR semantics -- address spaces, alignment guarantees from the CUDA memory model, or the interaction between copy direction and overlapping shared-memory buffers. NVIDIA's memmove unrolling pass replaces llvm.memmove and llvm.memcpy intrinsic calls at the NVVM IR level with explicit element-wise copy loops, generating both forward and reverse copy paths to handle overlapping memory correctly.
The pass lives in the aggregate-lowering cluster at 0x1C80000--0x1CBFFFF, adjacent to struct splitting (sub_1C86CA0) and FP128/I128 emulation (sub_1C8C170). It is part of the lower-aggr-copies pipeline pass (pass index 417), which coordinates memmove unrolling, struct splitting, and aggregate store lowering as a single pipeline unit. Upstream LLVM has no equivalent IR-level memmove unroller -- this is entirely NVIDIA-proprietary.
Key Facts
| Property | Value |
|---|---|
| Entry point | sub_1C82A50 |
| Size | 39KB (~1,200 lines decompiled) |
| Binary cluster | 0x1C80000--0x1CBFFFF (Aggregate Splitting + Memory Ops) |
| Pipeline pass | lower-aggr-copies (pass index 417, parameterized: lower-aggr-func-args) |
| Pass registration | sub_233A3B0 (parameter parser for LowerAggrCopiesPass) |
| IR level | NVVM IR (pre-instruction-selection) |
| Unroll threshold global | dword_4FBD560 |
| Knob constructor | ctor_265 at 0x4F48E0 |
| LLVM upstream | No equivalent -- NVIDIA-proprietary |
| Neighbor passes | Struct splitting (sub_1C86CA0), FP128 emulation (sub_1C8C170) |
Why This Pass Exists
On a CPU, memmove(dst, src, n) is a single function call that the runtime library implements with architecture-specific optimized loops, often using SIMD instructions that move 32 or 64 bytes per cycle. On a GPU:
- No bulk copy instruction. PTX and SASS have ld and st but no memcpy or rep movsb equivalent. Every byte must be an explicit load followed by an explicit store.
- Per-thread execution model. Each thread in a warp copies its own portion of data. A 128-byte struct copy in a kernel with 1024 threads means 1024 independent 128-byte copy sequences, all of which must resolve to individual load/store pairs.
- Address space semantics. The source and destination may live in different address spaces (global, shared, local, constant). Generic-pointer memmove requires runtime address-space resolution, but if the compiler can resolve the spaces at IR time, it can emit space-qualified loads and stores that map directly to the correct PTX instructions.
- Overlap semantics. memmove guarantees correct behavior when source and destination overlap. The pass must emit both a forward path (for dst < src) and a reverse path (for dst >= src) to preserve this guarantee. memcpy is also routed through this pass because the NVVM verifier enforces overlap-safety uniformly.
Algorithm
The pass scans each function for llvm.memmove and llvm.memcpy intrinsic calls. For each call, it replaces the intrinsic with a 4-block CFG that implements element-wise copying. The generated code has two paths: one for when the element count is statically known and small enough to fully unroll, and one for dynamic or large counts that use a loop with a PHI induction variable.
Step 1: Basic Block Structure Creation
The pass creates four new basic blocks, splitting the block containing the memmove call:
           +-------+
           | split |                 (direction comparison)
           +---+---+
              / \
 +-------------+ +-------------+
 | forward.for | | reverse.for |
 +------+------+ +------+------+
         \             /
        +-------------+
        | nonzerotrip |             (exit / continuation)
        +-------------+
| Block | Name string | Purpose |
|---|---|---|
| Entry | "split" | Compares src and dst addresses to choose copy direction |
| Forward | "forward.for" | Copies elements from index 0 upward |
| Reverse | "reverse.for" | Copies elements from index count-1 downward |
| Exit | "nonzerotrip" | Continuation after the copy completes |
Step 2: Forward vs. Reverse Decision
The split block determines copy direction by comparing the source and destination base addresses:
; Pseudocode for the split block
%cmp = icmp ult ptr %dst, ptr %src ; sub_12AA0C0, opcode 0x22 (34)
br i1 %cmp, label %forward.for, label %reverse.for ; sub_15F83E0
The ICMP instruction is created via sub_12AA0C0 with opcode 0x22 (34 decimal, corresponding to an unsigned-less-than integer comparison). The conditional branch is created via sub_15F83E0. When dst < src, memory does not overlap in the forward direction, so the forward path is safe. When dst >= src, copying forward would overwrite source bytes before they are read, so the reverse path is required.
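The direction choice can be demonstrated with a toy byte-copy model (plain Python, no recovered code): a forward copy corrupts data when the destination overlaps the source at a higher address, while the reverse loop preserves memmove semantics:

```python
# Toy model of forward.for vs reverse.for over one overlapping buffer.
def copy_bytes(buf, dst, src, n, forward):
    idxs = range(n) if forward else range(n - 1, -1, -1)
    for i in idxs:
        buf[dst + i] = buf[src + i]

data = list(b"abcdef")
copy_bytes(data, 2, 0, 4, forward=True)    # dst >= src: forward clobbers
print(bytes(data))                          # b'ababab' -- wrong

data = list(b"abcdef")
copy_bytes(data, 2, 0, 4, forward=False)   # reverse path is required here
print(bytes(data))                          # b'ababcd' -- correct memmove
```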
Step 3: Copy Generation -- Small/Static Path
When the copy size is statically known and satisfies size <= dword_4FBD560 (the compile-time unroll threshold), the pass generates fully unrolled element-by-element copies with no loop overhead.
Reverse copy (decompiled lines 606--690):
; Fully unrolled reverse copy, count elements
; For i = count-1 downto 0:
%src.gep.N = getelementptr i8, ptr %src, i64 N ; named "src.memmove.gep.unroll"
%val.N = load i8, ptr %src.gep.N, align A ; sub_15F9210 (InitLoadInstruction)
%dst.gep.N = getelementptr i8, ptr %dst, i64 N ; named "dst.memmove.gep,unroll" [sic]
store i8 %val.N, ptr %dst.gep.N, align A ; sub_15F9650 (InitStoreInstruction)
; ... repeated for each index from count-1 down to 0
Forward copy (decompiled lines 1036--1123):
; Fully unrolled forward copy, count elements
; For i = 0 to count-1:
%src.gep.N = getelementptr i8, ptr %src, i64 N ; "src.memmove.gep.unroll"
%val.N = load i8, ptr %src.gep.N, align A
%dst.gep.N = getelementptr i8, ptr %dst, i64 N ; "dst.memmove.gep,unroll" [sic]
store i8 %val.N, ptr %dst.gep.N, align A
; ... repeated for each index from 0 up to count-1
Each load is created via sub_15F9210 (InitLoadInstruction, opcode 64 type 1) and each store via sub_15F9650 (InitStoreInstruction, opcode 64 type 2). Alignment is set on both loads and stores via sub_15F8F50 / sub_15F9450, preserving the alignment from the original memmove intrinsic call (passed as parameter a15). Memory attributes (volatile flags, etc.) are propagated through parameters a16 and a17.
Step 4: Copy Generation -- Large/Dynamic Path
When the copy size exceeds the threshold or is not statically known, the pass generates a single-iteration loop body with a PHI induction variable:
Forward loop:
forward.for:
%iv = phi i64 [ 0, %split ], [ %iv.next, %forward.for ] ; sub_15F1EA0, opcode 53
%src.gep = getelementptr i8, ptr %src, i64 %iv
%val = load i8, ptr %src.gep, align A
%dst.gep = getelementptr i8, ptr %dst, i64 %iv
store i8 %val, ptr %dst.gep, align A
%iv.next = add i64 %iv, 1 ; sub_15A0680 (constant 1) + sub_15FB440 (ADD, opcode 13)
%done = icmp eq i64 %iv.next, %count
br i1 %done, label %nonzerotrip, label %forward.for ; sub_15F83E0
Reverse loop:
reverse.for:
%iv = phi i64 [ %count.minus1, %split ], [ %iv.next, %reverse.for ]
%src.gep = getelementptr i8, ptr %src, i64 %iv
%val = load i8, ptr %src.gep, align A
%dst.gep = getelementptr i8, ptr %dst, i64 %iv
store i8 %val, ptr %dst.gep, align A
%iv.next = sub i64 %iv, 1
%done = icmp eq i64 %iv.next, -1 ; or icmp slt i64 %iv.next, 0
br i1 %done, label %nonzerotrip, label %reverse.for
The PHI node is created via sub_15F1EA0 with opcode 53. The constant 1 for the increment is created via sub_15A0680. The addition/subtraction uses sub_15A2B60 or sub_15FB440 (the 5-argument node constructor, opcode 13 for ADD). The nonzerotrip block serves as the exit target for both loop directions.
Step 5: Alignment Propagation
The pass preserves the alignment annotation from the original memmove/memcpy intrinsic call. The alignment value is passed through the internal parameter a15 to the load/store alignment setter functions sub_15F8F50 (SetLoadAlignment) and sub_15F9450 (SetStoreAlignment). This matters because downstream PTX emission can generate wider loads (e.g., ld.global.v4.b32 for 16-byte aligned accesses) if the alignment permits it.
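The payoff can be sketched as a width-selection rule; the width table here is an assumption about what the backend could emit given an alignment, not recovered logic:

```python
# Illustrative mapping from preserved alignment (power of two) to the
# widest access the backend could emit, e.g. 16 bytes -> ld.global.v4.b32.
def widest_access_bytes(align):
    for width in (16, 8, 4, 2, 1):   # v4.b32, b64, b32, b16, b8
        if align % width == 0:
            return width
    return 1

print(widest_access_bytes(16))  # 16
print(widest_access_bytes(4))   # 4
```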
Step 6: Cleanup
After generating the replacement CFG, the original memmove/memcpy intrinsic call is erased. The pass uses sub_164D160 (RAUW -- Replace All Uses With) to rewire any remaining references.
Unroll Threshold
The global variable dword_4FBD560 controls the boundary between full unrolling and loop generation. This value is registered at ctor_265 (0x4F48E0) as part of the aggregate copy lowering knob group.
| Condition | Code generation |
|---|---|
| count statically known AND count <= dword_4FBD560 | Fully unrolled: N load/store pairs with no loop overhead |
| count statically known AND count > dword_4FBD560 | Dynamic loop with PHI induction variable |
| count not statically known | Dynamic loop with PHI induction variable |
The tradeoff is straightforward: full unrolling eliminates loop overhead (branch, PHI, compare) but increases code size linearly. For GPU kernels where instruction cache pressure is rarely the bottleneck, unrolling small copies is almost always profitable. The threshold prevents pathological code size explosion for large static copies (e.g., a 4KB struct assignment would generate 4,096 load/store pairs without the limit).
The related knob lower-aggr-unrolled-stores-limit provides an additional limit on the number of stores generated in unrolled mode, and large-aggr-store-limit controls when aggregate stores transition from unrolled sequences to loops.
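The decision reduces to a sketch like the following; the threshold value used here is illustrative, since the binary's default in dword_4FBD560 is not documented above:

```python
# Sketch of the unroll-vs-loop decision driven by the dword_4FBD560
# threshold; count is None when the byte count is not a compile-time
# constant.
def lower_copy_strategy(count, threshold):
    if count is not None and count <= threshold:
        return "unrolled"    # N load/store pairs, no loop overhead
    return "loop"            # PHI induction variable + compare + branch

print(lower_copy_strategy(8, 16))      # unrolled
print(lower_copy_strategy(4096, 16))   # loop
print(lower_copy_strategy(None, 16))   # loop (dynamic count)
```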
Naming Conventions
The pass names its generated GEP instructions with distinctive prefixes that are visible in IR dumps and useful for debugging:
| Instruction | Name string | Notes |
|---|---|---|
| Source GEP | "src.memmove.gep.unroll" | Period-separated |
| Destination GEP | "dst.memmove.gep,unroll" | Comma before unroll -- a typo in the binary [sic] |
The comma in "dst.memmove.gep,unroll" (where a period would be expected by analogy with the source GEP name) is a benign naming inconsistency baked into the binary string table. It has no semantic effect -- LLVM IR value names are arbitrary strings -- but it serves as a reliable fingerprint for identifying output from this specific pass. A reimplementation should preserve this exact string if binary-identical IR output is desired, or normalize it to "dst.memmove.gep.unroll" if not.
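Because the string is baked into the binary, a literal match on the comma distinguishes this pass's GEPs from any normalized name; a minimal check:

```python
import re

# The comma is the fingerprint: it never matches the "corrected" name.
FINGERPRINT = re.compile(r"dst\.memmove\.gep,unroll")

ir_line = '%dst.gep = getelementptr i8, ptr %d, i64 3 ; "dst.memmove.gep,unroll"'
print(bool(FINGERPRINT.search(ir_line)))                     # True
print(bool(FINGERPRINT.search("dst.memmove.gep.unroll")))    # False
```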
Configuration
Knobs registered at ctor_265 (0x4F48E0), applicable to the lower-aggr-copies pass cluster:
| Knob | Global | Description |
|---|---|---|
| lower-aggr-unrolled-stores-limit | -- | Maximum number of stores in unrolled mode |
| large-aggr-store-limit | -- | Element count above which aggregate stores use a loop |
| max-aggr-copy-size | -- | Maximum aggregate copy size the pass will handle |
| skiploweraggcopysafechk | -- | Skip safety check in aggregate copy lowering |
| devicefn-param-always-local | -- | Treat device function parameter space as local |
The pass can be invoked via the pipeline text interface:
-Xcicc "-passes=lower-aggr-copies"
-Xcicc "-passes=lower-aggr-copies<lower-aggr-func-args>"
Related aggregate lowering knobs from ctor_089 (0x4A0D60):
| Knob | Default | Description |
|---|---|---|
| max-aggr-lower-size | 128 | Threshold size (bytes) below which aggregates are lowered |
| aggressive-max-aggr-lower-size | 256 | Aggressive threshold for aggregate lowering |
Diagnostic Strings
"split"
"forward.for"
"reverse.for"
"nonzerotrip"
"src.memmove.gep.unroll"
"dst.memmove.gep,unroll"
"memmove/memcpy cannot target constant address space" (from nvvm-verify)
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| Memmove unroller | sub_1C82A50 | 39KB | Main pass: CFG construction, copy generation |
| ICMP creation | sub_12AA0C0 | -- | Creates integer comparison (opcode 0x22) |
| Conditional branch | sub_15F83E0 | -- | Creates br i1 |
| InitLoadInstruction | sub_15F9210 | -- | Creates load instruction (opcode 64, type 1) |
| InitStoreInstruction | sub_15F9650 | -- | Creates store instruction (opcode 64, type 2) |
| SetLoadAlignment | sub_15F8F50 | -- | Sets alignment on load |
| SetStoreAlignment | sub_15F9450 | -- | Sets alignment on store |
| InitInstruction (PHI) | sub_15F1EA0 | -- | Creates PHI node (opcode 53) |
| CreateConstant | sub_15A0680 | -- | Creates integer constant (e.g., 1 for increment) |
| CreateBinaryOp | sub_15FB440 | -- | Creates binary operation node (5-arg constructor) |
| CreateBinaryOp (variant) | sub_15A2B60 | -- | Alternative binary op constructor |
| RAUW | sub_164D160 | -- | Replace All Uses With |
| Pipeline param parser | sub_233A3B0 | -- | Parses lower-aggr-func-args parameter |
Cross-References
- Struct/Aggregate Splitting -- sibling pass in the same lower-aggr-copies pipeline unit; decomposes struct-typed operations into scalar field operations
- FP128/I128 Emulation -- neighbor in the 0x1C80000 cluster; replaces wide arithmetic with runtime library calls
- NVVM Verifier -- validates that memmove/memcpy targets are not in constant address space
- NVIDIA Custom Passes -- master index of all proprietary passes
- SROA -- upstream LLVM pass that splits alloca-based aggregates; handles memcpy/memmove during alloca rewriting
printf-lowering
The printf lowering pass rewrites device-side printf() calls into CUDA's runtime vprintf() ABI. GPU hardware does not support C variadic function calls, so the compiler must pack all arguments into a stack buffer and emit a two-argument call to vprintf(format_string, arg_buffer_ptr). CICC implements this transformation at two levels: a module-level IR pass and an AST-level lowering function.
| Property | Value |
|---|---|
| Pass name | printf-lowering |
| Class | llvm::PrintfLoweringPass |
| Scope | Module pass |
| Registration | New PM slot 130, sub_2342890 |
| Module-level entry | sub_1CB1E60 (31 KB) |
| AST-level lowering | sub_12992B0 (24 KB) |
| Enable knob | nvvm-lower-printf (registered at ctor_269) |
Two Lowering Stages
Printf lowering happens at two points in the compilation pipeline:
Stage 1 -- AST-level (sub_12992B0): During initial IR generation from the EDG frontend output, when the code generator encounters a direct call to printf, it intercepts the call and emits the vprintf rewrite inline. This is the earlier, more detailed pass that handles type promotion, buffer packing, and alloca management.
Stage 2 -- Module-level (sub_1CB1E60): A cleanup pass that runs during the LLVM optimization pipeline. It catches any remaining printf calls that survived the AST lowering (e.g., from linked bitcode modules or inlined functions) and applies the same transformation. This pass validates that the format string is a string literal: "The first argument for printf must be a string literal!".
AST-Level Lowering Algorithm (sub_12992B0)
The AST-level lowering is the more thoroughly analyzed implementation. It operates in six phases:
Phase 1: Resolve the vprintf Symbol
The pass looks up or creates the "vprintf" function declaration in the module:
- Build the vprintf parameter type list: (i8*, i8*)
- Create the FunctionType via sub_1644EA0
- Call sub_1632190(Module*, "vprintf", 7, funcType) -- this is Module::getOrInsertFunction

The literal string "vprintf" with length 7 is stored in a local variable.
Phase 2: Set Up Argument List
- The format string (**a3) becomes the first argument
- The remaining varargs (a3[1..]) are collected into a dynamic argument array
- A 22-QWORD (176-byte) stack small-buffer optimization avoids heap allocation for typical printf calls with fewer than ~16 arguments
Fast path: if argCount <= 1 (format string only, no varargs), the pass skips buffer creation entirely and emits vprintf(fmt, undef) using sub_15A06D0 (UndefValue::get).
Phase 3: Allocate Packed Argument Buffer
For the varargs case, a stack buffer named "tmp" is allocated:
- sub_127FC40(context, type, "tmp", alignment=8, addrspace=0) creates an alloca
- The alloca is cached at a1[19] and reused across multiple printf calls within the same function
- If a cached alloca exists, its size is reused (and potentially grown in Phase 5)
Phase 4: Per-Argument Processing
For each vararg, the pass:
- Float promotion: per the C variadic calling convention, float arguments are promoted to double via an fpext instruction. Detected when type_info[+12] == 2 and type_info[+16] != 0.
- Type size calculation: a multi-level switch on the LLVM type tag computes the byte width:

| Type tag | Size (bits) | Notes |
|---|---|---|
| 1 | 16 | half / i16 |
| 2 | 32 | float / i32 |
| 3, 9 | 64 | double / i64 |
| 4 | 80 | x86_fp80 |
| 5, 6 | 128 | fp128 / ppc_fp128 |
| 7 | target-dependent | Pointer size from DataLayout |
| 11 | custom | dword >> 8 (arbitrary-width integer) |
| 13 | aggregate | Struct size from DataLayout |
| 14 | packed struct | Complex alignment calculation, up to 3 levels of nesting |

- Alignment and offset: each argument is placed at the next naturally-aligned offset in the buffer. If offset % argSize != 0, the offset is rounded up.
- GEP creation: a GetElementPtr named "buf.indexed" indexes into the packed buffer at the computed byte offset.
- Bitcast: if the GEP result type differs from the argument type, a bitcast instruction named "casted" (opcode 47) is emitted.
- Store: the argument value is stored into the buffer slot via a StoreInst.
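The promotion and placement rules above amount to a small packing loop. The following is an illustrative Python model of that loop, not CICC's code; pack_args and the (type, size) tuples are names invented for the sketch:

```python
def pack_args(arg_types):
    """Model of printf vararg packing: promote float to double,
    then place each argument at its next naturally aligned offset."""
    offsets = []
    offset = 0
    for ty, size in arg_types:
        if ty == "float":                  # C variadic promotion: float -> double
            ty, size = "double", 8
        if offset % size != 0:             # round up to natural alignment
            offset += size - (offset % size)
        offsets.append((ty, offset))
        offset += size
    return offsets, offset                 # per-arg offsets, total buffer size

# float is promoted to 8 bytes, i32 lands at 8, double rounds 12 up to 16
offsets, total = pack_args([("float", 4), ("i32", 4), ("double", 8)])
```

Under these rules the example buffer is 24 bytes, matching the "next naturally-aligned offset" behavior described for Phase 4.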
Phase 5: Alloca Resize
After processing all arguments, the pass checks whether the total packed size exceeds the current alloca size. If so, it patches the alloca's size operand in-place by manipulating the use-def chain directly -- unlinking the old size constant and linking a new one. This unusual technique avoids creating a second alloca while ensuring a single allocation dominates all printf pack sites.
Phase 6: Emit vprintf Call
sub_1285290 emits the final call: vprintf(format_string, arg_buffer_ptr).
Cleanup frees any heap-allocated argument arrays (from the small-buffer overflow path).
Module-Level Pass (sub_1CB1E60)
The module-level pass at 0x1CB1E60 (31 KB) performs a similar transformation but operates on already-lowered LLVM IR rather than AST nodes. Key recovered strings:
| String | Purpose |
|---|---|
"DataLayout must be available for lowering printf!" | Guard: DataLayout required |
"vprintf" | Target function name |
"The first argument for printf must be a string literal!" | Format string validation |
"vprintfBuffer.local" | Name of the packed argument buffer alloca |
"bufIndexed" | Name of GEP instructions into the buffer |
The module-level pass uses "vprintfBuffer.local" as the alloca name (versus "tmp" in the AST-level lowering), and "bufIndexed" for the GEP instructions (versus "buf.indexed"). These naming differences confirm the two implementations are distinct codepaths.
Implementation Details
Small-buffer optimization: the argument array uses a 22-QWORD (176-byte) stack buffer. Only if more than ~16 arguments overflow does it heap-allocate via the SmallVector grow path (sub_16CD150). This avoids malloc for typical printf calls.
Alloca caching: a1[19] in the IRGenState caches the "tmp" alloca across multiple printf calls within the same function. This reduces alloca instruction count in functions with many printf calls.
Struct nesting limit: the type-size calculation handles up to 3 levels of nested struct packing (three nested switch statements in the decompilation). Deeper nesting hits a JUMPOUT at 0x129A22F -- likely an assertion for structs nested more than 3 levels in printf arguments.
Pointer tag bits: the basic block instruction list uses an intrusive doubly-linked list where the low 3 bits of next/prev pointers carry metadata tags (masked with 0xFFFFFFFFFFFFFFF8). This is consistent with LLVM's ilist implementation using pointer-int pairs.
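The pointer-int pair trick is easy to model: because list nodes are at least 8-byte aligned, the low 3 bits of a link pointer are always zero and can carry metadata. A minimal sketch (addresses and tag values are illustrative):

```python
MASK = 0xFFFFFFFFFFFFFFF8            # clears the 3 low tag bits

def pack(addr, tag):
    """Store a small tag in the unused low bits of an aligned pointer."""
    assert addr & 0x7 == 0 and tag < 8   # pointer must be 8-byte aligned
    return addr | tag

def unpack(word):
    """Recover (pointer, tag) from a tagged word."""
    return word & MASK, word & 0x7

ptr, tag = unpack(pack(0x7F00A440, 0b101))
```

This is the same idea as LLVM's PointerIntPair, which ilist-based containers use to avoid a separate metadata field per node.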
Diagnostic Strings
Diagnostic strings recovered from p2-B08-printf-lowering.txt and p1.7-04-sweep-0x1B00000-0x1CFFFFF.txt.
| String | Source | Category | Trigger |
|---|---|---|---|
"DataLayout must be available for lowering printf!" | sub_1CB1E60 (module-level pass) | Assertion/Error | Module lacks DataLayout; fatal guard at module pass entry |
"The first argument for printf must be a string literal!" | sub_1CB1E60 (module-level pass) | Error | Format string argument is not a constant string; validation failure |
"vprintf" | sub_1632190 / sub_12992B0 | Symbol | Target function name looked up or created in the module (literal string, length 7) |
"vprintfBuffer.local" | sub_1CB1E60 (module-level pass) | IR name | Name of the packed argument buffer alloca in the module-level pass |
"bufIndexed" | sub_1CB1E60 (module-level pass) | IR name | Name of GEP instructions into the argument buffer in the module-level pass |
"tmp" | sub_12992B0 (AST-level lowering) | IR name | Name of the packed argument buffer alloca in the AST-level lowering; cached at a1[19] |
"buf.indexed" | sub_12992B0 (AST-level lowering) | IR name | Name of GEP instructions into the argument buffer in the AST-level lowering |
"casted" | sub_12992B0 (AST-level lowering) | IR name | Name of bitcast instructions when GEP result type differs from argument type (opcode 47) |
"nvvm-lower-printf" | ctor_269 | Knob | Enable knob for the printf lowering pass |
The two lowering stages produce different IR names for the same conceptual entities ("vprintfBuffer.local" vs "tmp" for the alloca, "bufIndexed" vs "buf.indexed" for the GEPs), confirming they are distinct codepaths.
IRGenState Layout
The codegen context object used by the AST-level lowering:
| Offset | Field | Purpose |
|---|---|---|
| a1[4] | Module* | The LLVM module |
| a1[5] | Return type | Function return type / type context |
| a1[6] | DebugLoc | Current debug location |
| a1[7] | BasicBlock* | Current insertion block |
| a1[8] | Iterator | Insertion point in BB's instruction list |
| a1[9] | AS context | Address space context for alloca type creation |
| a1[19] | AllocaInst* | Cached "tmp" alloca (reused across printf calls) |
ipmsp -- Inter-Procedural Memory Space Propagation
The IPMSP pass resolves generic (address space 0) pointer arguments to concrete NVIDIA address spaces by analyzing call sites across the entire module. When all callers of a function agree that a pointer argument points to a specific memory space (global, shared, local, constant), the pass either specializes the function in place or clones it with narrowed pointer types. This enables downstream passes to emit space-specific load/store instructions (e.g., ld.shared instead of generic ld) and eliminates addrspacecast overhead.
Disabling this pass (-disable-MemorySpaceOptPass) causes 2--20x performance regressions on real workloads. The pass is automatically disabled in OptiX IR mode (--emit-optix-ir routes -do-ip-msp=0).
| Pass name | ipmsp |
| Class | llvm::IPMSPPass |
| Scope | Module pass |
| Registration | New PM slot 125, line 1111 in sub_2342890 |
| Main function | sub_2CBBE90 (71 KB) -- MemorySpaceCloning worklist driver |
| LIBNVVM variant | sub_1C6A6C0 (54 KB) |
| Inference engine | sub_2CE96D0 -> sub_2CE8530 |
| Cloning engine | sub_F4BFF0 (CloneFunction) |
| Callee matching | sub_2CE7410 |
| Propagation | sub_2CF5840 -> sub_2CF51E0 |
| Pipeline control | do-ip-msp NVVMPassOption (default: enabled) |
NVPTX Address Spaces
The pass resolves generic (AS 0) pointers to specific address spaces: global (AS 1), shared (AS 3), constant (AS 4), local (AS 5), or param (AS 101). Generic pointers require a runtime address space check on every access; resolving them statically eliminates this overhead. See Address Spaces for the complete table with hardware mapping, pointer widths, aliasing rules, and the MemorySpaceOpt bitmask encoding.
Algorithm Overview
The pass operates as a worklist-driven inter-procedural fixed-point analysis. The top-level loop:
function IPMSP_Run(Module M):
worklist = deque<Function*>{}
argSpaceMap = map<Value*, int>{} // formal arg -> resolved AS
returnSpaceMap = map<Function*, int>{} // function -> return AS
calleeInfoMap = map<Function*, set<Function*>>{} // reverse call graph
// Phase 1: seed
for each F in M.functions():
if shouldProcess(F):
worklist.push_back(F)
for each caller of F:
calleeInfoMap[F].insert(caller)
debug("Initial work list size : %d", worklist.size())
// Phase 2: fixed-point iteration
while worklist not empty:
F = worklist.pop_back()
// Analyze and specialize F's callee arguments
changed = analyzeAndSpecialize(F, argSpaceMap, calleeInfoMap)
if changed:
// Propagate to F's callees
propagateSpacesToCallees(F, argSpaceMap)
for each callee C of F in calleeInfoMap:
if shouldProcess(C):
worklist.push_back(C)
debug("%d callees are affected")
// Check return space
if resolveReturnSpace(F, returnSpaceMap):
debug("%s : return memory space is resolved : %d")
// propagate to callers and push them onto worklist
Phase 1: Build Worklist
The pass iterates all functions in the module. A function enters the worklist if sub_2CBA650 returns true, meaning:
- The function is not a declaration or available_externally
- Its linkage is not extern_weak or common
- It is not an intrinsic (sub_B2DDD0 filter)
- It has at least one formal argument that is a generic pointer not yet in the resolved-space map
Specifically, sub_2CBA650 checks:
function shouldProcess(this, F):
if F has no users (F[16] == 0): return false
linkage = F.linkage & 0xF
if (linkage + 14) & 0xF <= 3: return false // available_externally, appending
if (linkage + 7) & 0xF <= 1: return false // common, extern_weak
if isIntrinsic(F): return false
retType = F.getReturnType()
if retType is pointer with AS 0 and not in returnSpaceMap:
return true
return hasUnresolvedPointerArgs(this, F)
sub_2CBA520 (hasUnresolvedPointerArgs) walks the formal arg list (stride 40 bytes) and returns true if any arg has type byte 14 (pointer) and is not already in the arg-space map.
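The two masked comparisons in shouldProcess are a classic branch-free range test: (x + k) & 0xF <= n checks whether x falls in a contiguous window modulo 16 with a single compare. A small sketch verifying which numeric linkage values each check rejects (the mapping of those numbers to named linkage kinds is the decompiler annotation above, not something the arithmetic itself proves):

```python
def rejected_by_first_check(linkage):
    # (linkage + 14) mod 16 <= 3 selects a 4-value window
    return ((linkage + 14) & 0xF) <= 3

def rejected_by_second_check(linkage):
    # (linkage + 7) mod 16 <= 1 selects a 2-value window
    return ((linkage + 7) & 0xF) <= 1

first = [l for l in range(16) if rejected_by_first_check(l)]    # {2, 3, 4, 5}
second = [l for l in range(16) if rejected_by_second_check(l)]  # {9, 10}
```

The same trick appears later in the specialization decision, where (linkage & 0xF) - 7 <= 1 selects linkage values 7 and 8 (internal/private).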
A reverse call graph is also constructed: for each callee, the pass records which callers invoke it.
Debug output (when dump-ip-msp is enabled): "Initial work list size : N"
Phase 2: Per-Function Analysis
For each function popped from the worklist:
- Classify arguments: allocate a per-arg array initialized to 1000 ("unresolved"). Non-pointer args and already-resolved args are marked 2000 ("skip").
- Walk call sites: for each call instruction, examine each actual argument:
  - If the actual's address space is non-zero (already specific), record it.
  - If the actual is generic (AS 0), first check the callee-space map for a cached result. If not found, invoke the dataflow inference engine sub_2CE96D0 to trace the pointer's provenance.
  - If this is the first call site for this arg, record the space. If a subsequent call site disagrees, mark 2000 ("conflicting -- give up").
- Count resolved arguments: any arg where all call sites agree on a single address space is a candidate for specialization.
function analyzeArgSpaces(F, argSpaceMap, calleeSpaceMap):
numArgs = F.arg_size()
spaces[numArgs] = {1000, ...} // 1000 = unresolved
for i in 0..numArgs:
arg = F.getArg(i)
if arg.type != pointer:
spaces[i] = 2000 // not a pointer, skip
else if arg in argSpaceMap:
spaces[i] = 2000 // already resolved
for each CallInst CI using F:
calledFn = CI.getCalledFunction()
for i in 0..numArgs:
if spaces[i] == 2000: continue
actual = CI.getOperand(i)
if actual == F.getArg(i): continue // passthrough
as = actual.type.addrspace
if as == 0:
// Check cache first
if actual in calleeSpaceMap:
as = calleeSpaceMap[actual]
else:
ok = inferAddressSpace(calledFn, actual, &as, ...)
if !ok:
spaces[i] = 2000
continue
if spaces[i] == 1000:
spaces[i] = as // first call site
else if spaces[i] != as:
spaces[i] = 2000 // conflict
return count(s for s in spaces if s != 1000 and s != 2000)
Debug output: "funcname : changed in argument memory space (N arguments)"
Phase 3: Specialization Decision
The pass chooses between two strategies based on linkage:
| Linkage | Strategy | Mechanism |
|---|---|---|
| Internal / Private (7, 8) | In-place specialization | Modify the function's arg types directly. No clone needed since all callers are visible. |
| External / Linkonce / Weak | Clone | Create a new function with specialized arg types and internal linkage. Rewrite matching call sites to target the clone. Keep the original for external callers. |
The decision at line 1114 in sub_2CBBE90:
if (F.linkage & 0xF) - 7 <= 1:
// Internal/Private: specialize in place
for each resolved arg:
argSpaceMap[arg] = resolvedAS
else:
// External: must clone
if resultsTree is empty:
debug("avoid cloning of %s")
else:
createClone(F, resolvedArgs)
The clone is created by sub_F4BFF0 (CloneFunction):
- Builds a new FunctionType with specific-space pointer arg types
- Allocates a new Function object (136 bytes via sub_BD2DA0)
- Copies the body via a ValueMap-based cloner (sub_F4BB00)
- For each specialized arg, inserts an addrspacecast from specific back to generic at the clone's entry (these fold away in later optimization)
- Sets clone linkage to internal (0x4007)
Debug output: "funcname is cloned"
Phase 4: Transitive Propagation
After specializing a function, the pass propagates resolved spaces to its callees via sub_2CF5840. This function:
- Creates an analysis context similar to sub_2CE96D0
- Calls sub_2CF51E0, which walks F's body
- For each call instruction in F that targets a known function, determines if the called function's args now have resolved spaces
- Updates the arg-space map accordingly
Affected callees are pushed back onto the worklist. This enables bottom-up resolution through call chains: if A -> B -> C, specializing A's args may resolve B's args, which in turn resolves C's args.
Debug output: "N callees are affected"
Phase 5: Return Space Resolution
After argument processing, the pass checks return values:
- If the function returns a generic pointer, walk all ret instructions.
- Follow the def chain through GEPs to the base pointer.
- If all returns agree on a single address space, record it in the return-space map and propagate to callers.
Debug output: "funcname : return memory space is resolved : N"
The Dataflow Inference Engine
The inference engine is the core analysis that determines what address space a generic pointer actually points to. It is invoked when a call-site argument has address space 0 (generic) and the pass needs to determine the concrete space.
Entry Point: sub_2CE96D0
function inferAddressSpace(calledFn, actualArg, &result, module, symtab, argSpaceMap):
as = actualArg.type.addrspace
if as != 0:
*result = as
return true // trivially resolved
// Generic pointer: need full analysis
context = alloca(608) // 608-byte stack context
// Initialize 6 tracking sets:
// [0] visited set (bitset for cycle detection in PHI chains)
// [1] user-list collector
// [2] callee mapping
// [3] load tracking (when track-indir-load)
// [4] inttoptr tracking (when track-int2ptr)
// [5] alloca tracking
return coreDataflowWalker(context, calledFn, actualArg,
&loadsVec, &callsVec, result)
The 608-byte context is allocated on the stack and contains all working state for the backward dataflow walk.
Core Backward Dataflow Walker: sub_2CE8530
The walker traces the pointer's provenance backward through the SSA def chain. It uses a worklist plus visited-set to handle cycles (primarily PHI nodes).
IR nodes handled:
| IR node | Action |
|---|---|
| getelementptr | Transparent: follow the base pointer operand |
| bitcast | Transparent: follow the source operand |
| addrspacecast | Extract target address space, record it |
| phi | Add all incoming values to the worklist |
| select | Add both arms to the worklist (result = OR of both) |
| call / invoke | Look up callee in return-space map; if found, use that |
| load | If track-indir-load enabled: follow the loaded pointer; otherwise opaque |
| inttoptr | If track-int2ptr enabled: follow the integer source; otherwise opaque |
| alloca | If process-alloca-always: immediately resolve to AS 5 (local) |
| argument | If in arg-space map: use the recorded space |
Inference rules (lattice):
The engine collects candidate address spaces from all reachable definitions. The resolution follows these rules:
// All sources agree: resolved to that space
// Sources disagree: unresolvable (return false)
// param bit set + param-always-point-to-global: resolve to global (AS 1)
// alloca found + process-alloca-always: resolve to local (AS 5)
// __builtin_assume(__isGlobal(p)) + process-builtin-assume: resolve to global
The walker collects three separate vectors during traversal:
- loads: pointers loaded from memory (indirect provenance)
- GEPs: getelementptr instructions encountered along the chain
- calls: function calls whose return values contribute to the pointer
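A minimal model of the backward walk: transparent nodes forward to their base operand, phi/select fan out through a worklist guarded by a visited set, and addrspacecast leaves contribute candidate spaces that must all agree. This is an illustrative Python sketch; the def-use graph encoding is invented for the example:

```python
def infer_space(defs, start):
    """defs maps a value name to (kind, operands). Returns the address
    space if all reachable sources agree, else None (unresolvable)."""
    worklist, visited, candidates = [start], set(), set()
    while worklist:
        v = worklist.pop()
        if v in visited:
            continue                     # cycle guard for phi loops
        visited.add(v)
        kind, ops = defs[v]
        if kind in ("gep", "bitcast"):   # transparent: follow the base
            worklist.append(ops[0])
        elif kind in ("phi", "select"):  # fan out to all incoming values
            worklist.extend(ops)
        elif kind == "addrspacecast":    # concrete source space found
            candidates.add(ops[0])
        else:                            # opaque definition: give up
            return None
    return candidates.pop() if len(candidates) == 1 else None

defs = {
    "p":  ("phi", ["g1", "g2"]),
    "g1": ("gep", ["c1"]),
    "g2": ("bitcast", ["c2"]),
    "c1": ("addrspacecast", [3]),
    "c2": ("addrspacecast", [3]),
}
space = infer_space(defs, "p")           # both phi arms trace to AS 3
```

Changing either leaf to a different space makes the candidate set ambiguous and the walk reports failure, matching the "sources disagree: unresolvable" rule.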
Per-Callee Space Propagation: sub_2CE8CB0
This function is the heavyweight driver called from the worklist loop for each function. It processes a function's call graph entries and determines concrete address spaces for callees by examining actual arguments at all call sites.
Architecture:
- A global limit at qword_3CE3528 caps maximum analysis depth to prevent explosion on large call graphs.
- The function iterates the BB instruction list (offset +328, linked list). For each callee encountered:
  - Check the visited set. The set has two representations:
    - Small set: flat array at object offsets +32..+52 (checked when the flag at +52 is set)
    - Large set: hash-based DenseSet at offset +24 (checked via sub_18363E0)
  - If the callee has no body (*(_DWORD *)(callee + 120) == 0): collect it as a leaf and record its argument address spaces via sub_2CE80A0
  - Otherwise: skip (it will be processed when popped from the worklist)
- For each collected callee, a DenseMap cache at offset +160 is checked:
  - Hash function: (ptr >> 9) ^ (ptr >> 4), linear probing
  - Empty sentinel: -4096 (0xFFFFFFFFFFFFF000)
  - If found in cache: skip re-analysis (use the cached result)
- After collecting all callees: invoke sub_2CE88B0 for merge/commit.
- For single-entry results (exactly one callee entry in the vector): special fast path via sub_2CE2F10 that commits directly through a vtable dispatch.
function perCalleePropagate(this, F):
if this.firstVisit:
// Reset tracking vectors
clearUserVectors()
// Walk BB instruction list
for each BB in F.body():
if BB in visitedSet: continue
if BB.isDeclaration(): continue
collectCalleeInfo(BB) // -> sub_2CE80A0
addToVisitedSet(BB)
// Check depth limit
if userVector.size() > depthLimit:
return false
// Merge phase
if userVector.size() > 1:
return mergeAndCommit(this, F) // sub_2CE88B0
elif userVector.size() == 1:
commitSingleResult(this) // fast path
return false
Callee Matching Engine: sub_2CE7410
When multiple call instructions target the same callee, this function determines the best pair to use for space inference. This is critical for correctness -- the pass must ensure that the inferred space is valid for all uses.
Algorithm:
- Parallel operand walk: for each pair of call instructions to the same callee, walk their operand use-chains in parallel. Compare the instructions at each position via the instruction equivalence DenseMap at offset +80.
- Coverage scoring: count the number of matching operands (variable v95). Higher coverage means more confidence in the match.
- Dominance check: call sub_2403DE0(A, B) to test if BB A dominates BB B. Both directions are checked:
  - If A dominates B and B dominates A (same BB or trivial loop): strong match.
  - If only one direction holds: check whether the non-dominating one is the entry BB's first instruction.
- Loop membership gate: sub_24B89F0 checks whether both call instructions are in the same loop. If both are in the same loop and the coverage score > 1, the match is accepted even without strict dominance (loops create natural fixed-point convergence).
- Attribute check: for each matched pair, sub_245A9B0 verifies metadata flags (at instruction offset +44) to ensure the transformation is legal.
- Output: the best-scoring pair is written into the results vector for subsequent instruction rewriting.
Post-Inference Merge: sub_2CE88B0
After the per-callee analysis produces a list of (instruction, resolved_space) entries:
function mergeAndCommit(this, F):
entries = this.resultVector
if entries.size() > 1:
qsort(entries, comparator=sub_2CE2BD0) // sort by callee ID
changed = false
while entries.size() > 1:
entry = entries.back()
calleeId = entry.calleeId
// Find best match for this callee
matchScore = sub_2CE7410(this, calleeId, ...)
if matchScore > 0:
// Commit via instruction specialization
sub_2CE4830(this, matchedCallee) // edge weight
sub_2CE3B60(this, bestMatchIdx) // commit space
// Propagate to other entries sharing this callee
for each other entry with same callee:
if other != bestMatch:
sub_2CE3780(this, other.users, matchedCallee)
// Compact the entries vector
changed = true
else:
// No match: fallback propagation
sub_2CE3A70(this, calleeId, ...)
return changed
Instruction Specialization: sub_2CE8120
Once a callee's address space is determined, this function creates a specialized copy of the instruction:
- Legality check: vtable dispatch at offset +408 (sub_25AE460 default). Returns false if the instruction cannot be legally specialized (e.g., volatile operations, intrinsics with fixed types).
- Create specialized instruction: sub_244CA00 creates a new instruction with the modified pointer type (generic -> specific address space).
- Insert into BB: sub_24056C0 places the new instruction in the basic block's instruction list.
- Rewrite use chain: all uses of the old instruction are updated to reference the new specialized version.
- Update DenseMap caches:
  - Instruction-to-space map at offset +80: insert mapping from new instruction to resolved space
  - Edge count at offset +72: update via sub_24D8EE0
  - If nested clone tracking is enabled (offset +131 flag): update debug info via sub_2D2DBE0
Handling Recursion and Clone Limits
- Transitive: clones are pushed back onto the worklist, so chains A -> B -> C are handled iteratively.
- Mutual recursion: already-resolved args are detected via the map (marked 2000), preventing infinite re-processing.
- Self-recursion: after the first pass resolves args, re-processing finds agreement and applies specialization.
- Clone limit: do-clone-for-ip-msp (default -1 = unlimited) caps the total number of clones. Each clone increments a counter at this[200]. When the limit is exceeded, cloning stops but in-place specialization continues for internal functions.
- Analysis depth limit: qword_3CE3528 limits the per-function callee analysis depth to prevent explosion on large modules.
The LIBNVVM Variant
A second implementation at sub_1C6A6C0 (54 KB) serves the LIBNVVM/module-pass path. Key differences:
- Uses DenseMap-style hash tables (empty sentinel = -8, tombstone = -16, 16-byte entries)
- Includes loop-induction analysis via sub_1BF8310 with maxLoopInd tracking (debug: "phi maxLoopInd = N: Function name")
- Three processing phases controlled by globals:
  - Phase A (dword_4FBD1E0, default=4): call-site collection, threshold dword_4FBC300 = 500
  - Phase B (dword_4FBD2C0, default=2): address space resolution. If dword_4FBCAE0 is set (special mode), picks the callee with the smallest constant value (minimum address space ID).
  - Phase C (dword_4FBCD80, default=2): WMMA-specific sub-pass via sub_1C5FDC0, called with wmma_mode=1 first (WMMA-specific), then wmma_mode=0
- Threshold: v302 > 5 triggers sub_1C67780 for deeper analysis
- Pre/post analysis toggle: byte_4FBC840 controls calls to sub_1C5A4D0
Interaction with memory-space-opt
The ipmsp and memory-space-opt passes are complementary:
- ipmsp is inter-procedural: it analyzes call graphs, infers address spaces across function boundaries, and specializes function signatures via cloning.
- memory-space-opt is intra-procedural: it resolves generic pointers within a single function body using backward dataflow analysis and bitmask accumulation.
The typical pipeline flow:
- ipmsp runs first (module pass) to propagate address spaces across function boundaries
- memory-space-opt runs in first-time mode to resolve obvious intra-procedural cases
- Further optimization passes run (these may create new generic pointers via inlining, SROA, etc.)
- memory-space-opt runs in second-time mode to clean up remaining generic pointers and fold isspacep intrinsics to constants
Both passes share the same set of knobs (with ias- prefixed mirrors for the IAS variant). The inference engine sub_2CE96D0 is shared between IPMSP and the alternate algorithm selected by mem-space-alg.
Knobs
IPMSP-Specific Knobs
| Knob | Default | Storage | Description |
|---|---|---|---|
| dump-ip-msp | 0 | qword_5013548 | Enable debug tracing |
| do-clone-for-ip-msp | -1 (unlimited) | qword_5013468 | Max clones allowed |
| do-ip-msp | 1 (enabled) | NVVMPassOption | Enable/disable the entire pass |
Shared Inference Knobs (MemorySpaceOpt variant)
| Knob | Default | Storage | Description |
|---|---|---|---|
| param-always-point-to-global | true | unk_4FBE1ED | Parameter pointers always resolve to global (AS 1) |
| strong-global-assumptions | true | (adjacent) | Assume constant buffer pointers always point to globals |
| process-alloca-always | true | unk_4FBE4A0 | Treat alloca-derived pointers as local (AS 5) unconditionally |
| wmma-memory-space-opt | true | unk_4FBE3C0 | Specialize WMMA call args to shared memory (AS 3) |
| track-indir-load | true | byte_4FBDE40 | Track indirect loads during inference |
| track-int2ptr | true | byte_4FBDC80 | Track inttoptr in inference |
| mem-space-alg | 2 | dword_4FBDD60 | Algorithm selection for address space optimization |
| process-builtin-assume | -- | (ctor_531_0) | Process __builtin_assume(__is*(p)) for space deduction |
IAS Variant Knobs (IPMSPPass path, ctor_610)
Each shared knob has an ias- prefixed mirror that controls the InferAddressSpaces-based code path (sub_2CBBE90):
| Knob | Mirrors |
|---|---|
| ias-param-always-point-to-global | param-always-point-to-global |
| ias-strong-global-assumptions | strong-global-assumptions |
| ias-wmma-memory-space-opt | wmma-memory-space-opt |
| ias-track-indir-load | track-indir-load |
| ias-track-int2ptr | track-int2ptr |
The unprefixed versions control the LIBNVVM variant (sub_1C6A6C0). The ias- prefixed versions control the New PM / IAS variant (sub_2CBBE90).
LIBNVVM Variant Globals
| Global | Default | Description |
|---|---|---|
| dword_4FBD1E0 | 4 | Phase A call-site collection level |
| dword_4FBD2C0 | 2 | Phase B resolution level |
| dword_4FBCD80 | 2 | Phase C WMMA sub-pass level |
| dword_4FBC300 | 500 | Max analysis depth threshold |
| dword_4FBCAE0 | -- | Special minimum-selection mode |
| byte_4FBC840 | -- | Pre/post analysis toggle |
| dword_4FBD020 | -- | Debug: maxLoopInd dump |
Debug Dump Knobs
| Knob | Description |
|---|---|
| dump-ir-before-memory-space-opt | Dump IR before MemorySpaceOpt runs |
| dump-ir-after-memory-space-opt | Dump IR after MemorySpaceOpt completes |
| dump-process-builtin-assume | Dump __builtin_assume processing |
| msp-for-wmma | Enable Memory Space Optimization for WMMA (tensor core) |
Data Structures
Worklist
The worklist is a std::deque<Function*> with 512-byte pages (64 pointers per page). Push-back via sub_2CBB610 (extends the deque when the current page is full). Pop-back from the last page.
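The paged push/pop behavior can be sketched as follows. This is an illustrative model, not std::deque's actual layout (real implementations differ in bookkeeping): the worklist keeps a list of fixed-capacity pages of 64 pointer slots and allocates a fresh page only when the last one fills.

```python
PAGE_SLOTS = 64                      # 512-byte page / 8-byte pointer

class PagedDeque:
    """Minimal model of a paged deque: push_back grows by whole pages."""
    def __init__(self):
        self.pages = [[]]            # list of fixed-capacity pages

    def push_back(self, fn):
        if len(self.pages[-1]) == PAGE_SLOTS:
            self.pages.append([])    # current page full: extend the deque
        self.pages[-1].append(fn)

    def pop_back(self):
        fn = self.pages[-1].pop()
        if not self.pages[-1] and len(self.pages) > 1:
            self.pages.pop()         # drop the emptied trailing page
        return fn

wl = PagedDeque()
for i in range(65):                  # the 65th push spills onto a second page
    wl.push_back(i)
num_pages = len(wl.pages)
```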
Red-Black Tree Maps
The cloning engine uses red-black trees (std::map) for four separate maps:
| Map | Key | Value | Purpose |
|---|---|---|---|
| Return-space | Function* | Resolved AS | Return value address space |
| Arg-space | Value* | Resolved AS | Per-argument address space |
| Callee-space | Value* | Resolved AS | Callee pointer spaces (cached inference results) |
| Callee-info | Function* | Sub-tree | Reverse call graph (which callers invoke this callee) |
Red-black tree nodes are 0x58 bytes with the standard {left, right, parent, color, key} layout at offsets 16, 24, 8, 0, 32.
DenseMap Caches
The inference engine and per-callee propagation use DenseMap hash tables with LLVM-layer sentinels (-4096 / -8192) and 16-byte entries (key + value). Growth is handled by sub_240C8E0. See Hash Table and Collection Infrastructure for the hash function, probing, and growth policy.
Three independent DenseMaps are used:
- Offset +80: instruction -> resolved space (per-function analysis cache)
- Offset +160: callee -> inference result (cross-function cache)
- Offset +232: edge weight tracking (call graph weights for profitability)
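The probing scheme for these caches can be modeled as a small open-addressing table. This sketch uses the recovered hash and sentinel values with linear probing as described for the offset +160 cache; the capacity, helper names, and example keys are invented, and LLVM's stock DenseMap actually probes quadratically, so treat this strictly as a model:

```python
EMPTY, TOMBSTONE = -4096, -8192      # sentinel keys recovered from the binary
CAP = 64                             # power-of-two capacity (assumed)

def bucket_hash(ptr):
    """Recovered hash: mixes page and alignment bits of the pointer."""
    return ((ptr >> 9) ^ (ptr >> 4)) & (CAP - 1)

def insert(table, key, value):
    i = bucket_hash(key)
    while table[i][0] not in (EMPTY, TOMBSTONE, key):
        i = (i + 1) & (CAP - 1)      # linear probe with wraparound
    table[i] = (key, value)

def lookup(table, key):
    """Probe until the key or an EMPTY slot; tombstones must not stop
    the probe, or entries inserted past a deletion would be lost."""
    i = bucket_hash(key)
    while True:
        k, v = table[i]
        if k == key:
            return v
        if k == EMPTY:
            return None              # never inserted
        i = (i + 1) & (CAP - 1)

table = [(EMPTY, None)] * CAP
insert(table, 0x7F1200, 3)           # callee ptr -> resolved AS 3
found = lookup(table, 0x7F1200)
```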
Visited Sets
Two representations depending on set size:
- Small set (flag at offset +52): flat array at offsets +32..+44, capacity at +40, count at +44. Linear scan for membership test.
- Large set (default): hash-based DenseSet at offset +24, accessed via sub_18363E0 for both insert and membership test.
Inference Context
The 608-byte stack-allocated context for sub_2CE8530 contains:
| Offset range | Content |
|---|---|
| 0--23 | Result vector (pointer, size, capacity) |
| 24--47 | Loads vector (indirect pointer sources) |
| 48--71 | GEPs vector (getelementptr chains) |
| 72--95 | Calls vector (call instructions returning pointers) |
| 96--127 | Worklist for PHI traversal |
| 128--607 | Visited bitset, callee tracking, metadata |
Sentinel Values
| Value | Meaning | Used in |
|---|---|---|
| 1000 | Unresolved pointer argument (not yet seen at any call site) | Per-arg analysis array |
| 2000 | Non-pointer, already resolved, or conflicting (skip) | Per-arg analysis array |
| -4096 | DenseMap empty slot | All DenseMap caches |
| -8192 | DenseMap tombstone (deleted entry) | All DenseMap caches |
Diagnostic Messages
| Message | Source | Condition |
|---|---|---|
"Initial work list size : %d" | sub_2CBBE90 | Always (when dump-ip-msp) |
"funcname : changed in argument memory space (N arguments)" | sub_2CBBE90 | Args resolved |
"funcname is cloned" | sub_2CBBE90 | Clone created |
"avoid cloning of funcname" | sub_2CBBE90 | External linkage, empty results |
"N callees are affected" | sub_2CBBE90 | After propagation |
"funcname : return memory space is resolved : N" | sub_2CBBE90 | Return space resolved |
"phi maxLoopInd = N: Function name" | sub_1C6A6C0 | LIBNVVM loop-ind analysis |
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| MemorySpaceCloning | sub_2CBBE90 | 71 KB | Worklist driver (New PM variant) |
| IPMSPPass | sub_1C6A6C0 | 54 KB | LIBNVVM variant |
| inferAddressSpace | sub_2CE96D0 | -- | Inference entry point |
| coreDataflowWalker | sub_2CE8530 | -- | Backward dataflow analysis |
| perCalleePropagate | sub_2CE8CB0 | -- | Per-callee space propagation |
| mergeAndCommit | sub_2CE88B0 | -- | Post-inference merge (qsort) |
| rewriteCalleePair | sub_2CE85D0 | -- | Instruction rewriting for matched pairs |
| calleeMatchingEngine | sub_2CE7410 | -- | Dominance + coverage scoring |
| pushInferenceResult | sub_2CE80A0 | -- | Append to result vector |
| vectorRealloc | sub_2CE7E60 | -- | Grow inference result vector |
| computeEdgeWeight | sub_2CE4830 | -- | Call graph edge weight |
| commitSpace | sub_2CE3B60 | -- | Commit resolved space to callee |
| fallbackPropagate | sub_2CE3A70 | -- | Propagate unmatched entries |
| propagateToAlternate | sub_2CE3780 | -- | Propagate to alternate callee users |
| commitSingleCallee | sub_2CE2F10 | -- | Single-callee commit via vtable |
| singlePredecessorCheck | sub_2CE2DE0 | -- | Check single-predecessor property |
| qsortComparator | sub_2CE2BD0 | -- | Compare callee entries for sorting |
| mergeSmallVectors | sub_2CE2A70 | -- | Merge small vector pairs |
| extractAddressSpace | sub_2CE27A0 | -- | Extract AS from Value's type |
| cloneInstruction | sub_2CE8120 | -- | Clone instruction + DenseMap update |
| populateUserSet | sub_2CE97F0 | -- | Build per-arg user list |
| propagateSpacesToCallees | sub_2CF5840 | -- | Post-specialization propagation |
| bodyWalker | sub_2CF51E0 | -- | Walk function body for propagation |
| shouldProcessFunction | sub_2CBA650 | -- | Worklist eligibility predicate |
| hasUnresolvedPointerArgs | sub_2CBA520 | -- | Check for unresolved generic ptr args |
| CloneFunction | sub_F4BFF0 | -- | Full function clone with arg rewriting |
| ValueMapCloner | sub_F4BB00 | -- | ValueMap-based body cloner |
| replaceAllUsesWith | sub_BD84D0 | -- | Redirect call sites to clone |
| mapInsertOrFind | sub_2CBB230 | -- | Red-black tree insert |
| mapLookup | sub_2CBB490 | -- | Red-black tree search |
| dequeGrow | sub_2CBB610 | -- | Worklist deque push_back |
| checkAttributeBundle | sub_245A9B0 | -- | Attribute flag membership test |
| instructionEquivalence | sub_245AA10 | -- | Test instruction equivalence |
| bbDominates | sub_2403DE0 | -- | BasicBlock dominance test |
| loopMembership | sub_24B89F0 | -- | Check if two instructions share a loop |
| createSpecializedInst | sub_244CA00 | -- | Create instruction with modified types |
| insertIntoBlock | sub_24056C0 | -- | Insert instruction into BB |
| updateDebugInfo | sub_2D2DBE0 | -- | Debug info update for cloned inst |
Cross-References
- memory-space-opt -- intra-procedural complement
- reference/address-spaces -- consolidated AS reference
- config/knobs -- complete knob inventory
- pipeline/optimizer -- pipeline position and do-ip-msp option
- pipeline/optix-ir -- OptiX disables IPMSP
- infra/alias-analysis -- cross-space NoAlias rules
Memory Space Optimization
The Memory Space Optimization pass (memory-space-opt) is NVIDIA's inter-procedural address space resolution engine. Its job is to convert generic (flat) pointers into specific address spaces -- global, shared, local, constant, or parameter -- so that the backend can emit specialized memory instructions (ld.shared, st.global, etc.) instead of generic ones (ld, st) that require address translation hardware at runtime. On NVIDIA GPUs, generic memory accesses go through an address translation unit that adds latency; resolving pointer provenance at compile time eliminates this overhead entirely and is one of the most impactful optimizations in the CUDA compilation pipeline.
The pass is implemented as a multi-function cluster totaling roughly 250KB of decompiled code, with two cooperating systems: an intra-procedural address space resolver and an inter-procedural function cloning engine.
Key Facts
| Property | Value |
|---|---|
| Pass name (pipeline) | memory-space-opt |
| Class | MemorySpaceOptPass |
| Pass type | Parameterized FunctionPass (NVIDIA-custom) |
| Registration | New PM #416, parameterized: first-time;second-time;no-warnings;warnings |
| Runtime positions | Tier 1/2/3 #65 (after DSE + DCE + LLVM standard pipeline); also runs early in "mid" path (see Pipeline) |
| Pass entry point | sub_1C70910 (2,427 lines) |
| Pass factory | sub_1C8E680 |
| NVVMPassOptions slot | Offset +2680 (disable), offset +3120 (mode parameter) |
| Binary size | ~250 KB total (multi-function cluster) |
| Upstream equivalent | None -- entirely NVIDIA-proprietary |
NVPTX Address Space Numbering
The pass operates on the standard NVPTX address spaces (0=generic, 1=global, 3=shared, 4=constant, 5=local, 101=param). See Address Spaces for the complete table with hardware mapping, pointer widths, and aliasing rules.
Internally, the pass encodes address spaces as a single-bit bitmask for efficient dataflow computation (0x01=global, 0x02=shared, 0x04=constant, 0x08=local, 0x10=param, 0x0F=unknown). When multiple pointer sources contribute different spaces, the bitmask is OR'd together. A singleton bit (popcount == 1) means the space is fully resolved; multiple bits set means ambiguous. See the MemorySpaceOpt Internal Bitmask section for the complete mapping and resolution algorithm.
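The bitmask encoding above can be sketched as a tiny lattice. This is an illustrative reconstruction (the enum and helper names are ours, not symbols recovered from the binary), assuming the bit assignments listed in the text:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical reconstruction of the per-value address space bitmask
// described above; constants follow the documented bit assignments.
enum : uint8_t {
    MS_GLOBAL = 0x01,
    MS_SHARED = 0x02,
    MS_CONST  = 0x04,
    MS_LOCAL  = 0x08,
    MS_PARAM  = 0x10,
};

// Meet operation of the lattice: contributions from multiple pointer
// sources are OR'd together, so a mask can only grow.
inline uint8_t mergeSpaces(uint8_t a, uint8_t b) { return uint8_t(a | b); }

// A mask is fully resolved exactly when one bit is set (popcount == 1).
inline bool isResolved(uint8_t mask) {
    return mask != 0 && (mask & (mask - 1)) == 0;
}
```

Because the merge is a monotone OR over a 5-bit domain, any fixed-point iteration over these masks terminates.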
IR Before/After Example
The following illustrates the core transformation: generic-pointer loads/stores are resolved to specific address spaces, enabling specialized PTX memory instructions.
Before (generic pointers, AS 0):
define void @kernel(ptr addrspace(0) %shared_buf, ptr addrspace(0) %global_out) {
%val = load float, ptr addrspace(0) %shared_buf, align 4
%add = fadd float %val, 1.0
store float %add, ptr addrspace(0) %global_out, align 4
%check = call i1 @llvm.nvvm.isspacep.shared(ptr %shared_buf)
br i1 %check, label %fast, label %slow
fast:
ret void
slow:
ret void
}
After (resolved address spaces):
define void @kernel(ptr addrspace(3) %shared_buf, ptr addrspace(1) %global_out) {
%val = load float, ptr addrspace(3) %shared_buf, align 4 ; -> ld.shared.f32
%add = fadd float %val, 1.0
store float %add, ptr addrspace(1) %global_out, align 4 ; -> st.global.f32
; isspacep.shared folded to true (phase 2), branch simplified by later DCE
br label %fast
fast:
ret void
}
The addrspacecast instructions are inserted during resolution and consumed by downstream passes. The isspacep folding (phase 2 only) eliminates runtime address space checks when the space is statically known.
Two-Phase Architecture
The pass entry point (sub_1C70910) accepts a mode parameter controlling execution:
| Mode | Name | Behavior |
|---|---|---|
| 0 | First-time | Conservative resolution via sub_1CA2920. Called early in the pipeline. |
| 1 | Second-time | Hash-table-based resolution via sub_1CA9E90. Called after IP-MSP propagation. |
| 2 | First-time, no warnings | Same as mode 0 but suppresses "Cannot tell what pointer points to" messages. |
| 3 | Second-time, no warnings | Same as mode 1 but silent. Used on re-runs where repeated warnings would be noise. |
Both phases share the same instruction dispatch structure, handling loads (opcode 0x36), stores (0x37), calls (0x4E), atomic loads (0x3A), and atomic stores (0x3B).
Phase 1 (first-time) resolves obvious cases where pointer origin is statically known. It uses sub_1C9F820 for dataflow analysis and sub_1C98370 for annotation-based resolution.
Phase 2 (second-time) runs after inter-procedural propagation has enriched the analysis context. It uses hash-table lookups (sub_1CA8350) and can fold isspacep intrinsics (builtins 0xFD0-0xFD5) to constants when the address space is already known, eliminating runtime space checks.
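The four mode values could plausibly decode into two independent flags, as the table suggests. This is a hedged sketch of that decoding; the struct, function name, and bit layout are assumptions, not recovered code:

```cpp
#include <cassert>

// Hypothetical decoding of the 4-valued mode parameter accepted by the
// pass entry point (sub_1C70910), matching the mode table above.
struct MsoMode {
    bool secondTime;       // modes 1/3: hash-table resolver after IP-MSP
    bool suppressWarnings; // modes 2/3: silence "Cannot tell what pointer..."
};

inline MsoMode decodeMode(int mode) {
    return MsoMode{(mode & 1) != 0, mode >= 2};
}
```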
Inter-Procedural Memory Space Propagation (IP-MSP)
Complexity. Let F = number of functions in the module, A = total number of pointer-typed arguments across all functions, E = total call-graph edges, and I = total instructions. The intra-procedural use-def chain walk is O(I) per function (bounded by visited-set to avoid cycles through PHI nodes). The IP-MSP worklist iterates until no argument's bitmask changes; since each of the A arguments has a 5-bit bitmask that can only grow (OR of incoming values), the worklist converges in at most O(A) rounds. Each round re-analyzes at most O(F) functions, and adding callers back to the worklist costs O(E) in total across all rounds. Worst-case: O(A * (F * I_avg + E)) where I_avg is average instructions per function. Function cloning adds at most O(F) clones (bounded by do-clone-for-ip-msp), each clone being O(I_f) to create. In practice, GPU modules have small call graphs (F < 200 after inlining) and the worklist converges in 2--4 rounds, making the pass effectively O(F * I_avg + E).
The IP-MSP driver in sub_1C70910 implements a fixed-point worklist algorithm that propagates address space information across function boundaries:
- Build a worklist of all functions in the module. Debug: "Initial work list size: %d".
- Pop a function from the worklist.
- Run intra-procedural resolution (phase 1 or 2).
- If argument memory spaces changed ("changed in argument memory space"), add all callers back to the worklist ("callees are affected").
- If the return memory space is resolved ("return memory space is resolved"), propagate to callers.
- Repeat until the worklist is empty.
A second IP-MSP implementation exists at sub_1C6A6C0 (54KB), which appears to be the LIBNVVM/module-pass variant. It uses DenseMap-style hash tables (sentinel -8 for empty, -16 for tombstone), has explicit loop-induction analysis (sub_1BF8310), and runs three sub-phases: call-site collection (level controlled by dword_4FBD1E0, default 4), address space resolution (level dword_4FBD2C0, default 2), and a WMMA-specific pass (sub_1C5FDC0).
Function Cloning for Specialization
When different call sites pass pointers from different address spaces to the same function argument, the pass clones the function so that each clone can be specialized for a single address space. This is the key mechanism that eliminates generic pointers at call boundaries.
The cloning engine (sub_2CBBE90, 71KB) uses two distinct strategies based on function linkage:
Strategy 1 -- In-place specialization (internal/private linkage): All call sites are visible within the module, so the function is modified directly. Pointer argument types are changed from generic (AS 0) to the resolved specific space. No clone is created. This is the cheaper path.
Strategy 2 -- Clone and specialize (external/linkonce/weak linkage): The function might have callers outside the module, so the original must be preserved. A clone is created with internal linkage (0x4007), its argument types are specialized, and internal call sites are rewritten to target the clone. The original remains for any remaining generic-pointer callers.
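The strategy selection reduces to a linkage predicate. A minimal sketch, assuming the linkage categories named above (the enum and function names are illustrative):

```cpp
#include <cassert>
#include <string>

enum class Linkage { Internal, Private, External, LinkOnce, Weak };

// True when every call site is visible inside the module, so the
// function can be specialized in place without creating a clone.
inline bool canSpecializeInPlace(Linkage l) {
    return l == Linkage::Internal || l == Linkage::Private;
}

inline std::string strategyFor(Linkage l) {
    return canSpecializeInPlace(l) ? "in-place" : "clone-and-specialize";
}
```

Inverting this predicate is pitfall 3 below: cloning internal functions wastes compile time, while mutating external ones breaks out-of-module callers.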
The cloning process (sub_F4BFF0):
- Iterate all formal args of the original function.
- For each arg whose address space was resolved, create a new function type with the specific address space.
- Allocate a new Function object via sub_BD2DA0(136).
- Copy linkage, attributes, and calling convention.
- Clone the body via sub_F4BB00 (ValueMap-based cloner).
- For specialized args, insert addrspacecast instructions at the clone's entry.
- Rewrite matching call sites via sub_BD84D0.
After cloning, the clone is pushed back onto the worklist, enabling recursive specialization through call chains: if A calls B calls C, each level's arguments resolve bottom-up as the worklist iterates.
Intra-Procedural Resolution Algorithm
Use-Def Chain Walking (sub_1CA5350)
The core resolver walks backward through use-def chains to find the original allocation a pointer derives from:
| IR Node | Behavior |
|---|---|
| GEP (H) | Transparent -- follow pointer operand |
| Bitcast (G) | Transparent -- follow source operand |
| PHI (O) | Follow all incoming values (adds all to worklist) |
| Call (M) | Check if returns a known-space pointer |
| Load (subcode 32) | Tracked if track-indir-load is enabled |
| inttoptr (subcode 47) | Tracked if track-int2ptr is enabled |
| ptrtoint (subcode 48) | Transparent |
| Alloca (8) | Resolves to local (AS 5) |
The walker uses a worklist with a visited bitset to handle cycles through phi nodes. It collects three separate vectors: loads (indirect pointers), GEPs, and calls returning pointers.
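The table and worklist behavior can be sketched on a toy def graph. This is an assumption-laden illustration (node layout, kinds, and masks are ours): GEP/bitcast are transparent, PHIs fan out to all incoming values, allocas contribute the local bit, and a visited set breaks PHI cycles.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

enum Kind { Alloca, GEP, Bitcast, Phi, Unknown };

struct Node {
    Kind kind;
    std::vector<int> operands; // indices of pointer operands to follow
};

uint8_t walkOrigin(std::vector<Node> const& g, int root) {
    uint8_t mask = 0;
    std::vector<int> work{root};
    std::set<int> visited; // breaks cycles through PHI nodes
    while (!work.empty()) {
        int n = work.back();
        work.pop_back();
        if (!visited.insert(n).second) continue;
        switch (g[n].kind) {
        case Alloca:  mask |= 0x08; break; // resolves to local (AS 5)
        case GEP:
        case Bitcast: // transparent: follow the pointer operand
        case Phi:     // follow every incoming value
            for (int op : g[n].operands) work.push_back(op);
            break;
        default:      mask |= 0x0F; break; // unknown contribution
        }
    }
    return mask;
}
```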
Resolution Decision
Once the bitmask is computed:
- Single bit set: resolved. Insert addrspacecast to the target space.
- Multiple bits set: ambiguous. If param-always-point-to-global is true and the param bit is set, resolve to global. Otherwise emit a warning and default to global.
- Zero bits: unreachable or error.
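The decision reduces to a mask-to-address-space mapping with global as the conservative fallback. A minimal sketch, assuming the NVPTX numbering from the table above (1=global, 3=shared, 4=constant, 5=local, 101=param); the function name is ours:

```cpp
#include <cassert>
#include <cstdint>

// Map a resolution bitmask to an NVPTX address space. Ambiguous
// (multi-bit) or unknown masks default to global -- the safe
// conservative choice per the text.
int resolveSpace(uint8_t mask) {
    switch (mask) {
    case 0x01: return 1;   // global
    case 0x02: return 3;   // shared
    case 0x04: return 4;   // constant
    case 0x08: return 5;   // local
    case 0x10: return 101; // param
    default:   return 1;   // ambiguous/unknown: warn, default to global
    }
}
```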
Address Space Inference Engine (sub_2CE96D0)
For generic-pointer arguments at call sites, the inference engine creates a 608-byte analysis context on the stack, sets up six independent tracking sets, and calls sub_2CE8530 for deep dataflow analysis tracing pointer provenance through GEPs, bitcasts, PHI nodes, and loads from known-space pointers.
Post-Resolution Optimizations
After resolving a pointer's address space, the pass performs several follow-up transformations:
- addrspacecast insertion: sub_1CA1B70 (first-time) / sub_1CA28F0 (second-time) inserts a cast from generic to the resolved space and replaces all uses of the generic pointer.
- Instruction rewriting: Loads and stores on generic pointers are rewritten to use the specific space, enabling the backend to emit ld.shared, st.global, etc.
- isspacep folding (second-time only): If a pointer's space is known, isspacep.shared(%p) folds to true or false.
- Dead cast elimination: Redundant addrspacecast chains (e.g., generic-to-shared followed by shared-to-generic) are simplified.
- Call site specialization: After cloning, call sites are rewritten to call the specialized version with casted arguments.
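The isspacep folding step can be modeled as a three-valued decision: fold to true, fold to false, or leave the runtime check in place when the space is still unknown. A hedged sketch (names are ours):

```cpp
#include <cassert>
#include <optional>

// Fold an isspacep.<space> query once the pointer's address space is
// statically resolved. nullopt models "leave the intrinsic call in the
// IR" when the pointer is still generic.
std::optional<bool> foldIsSpacep(int queriedAS, std::optional<int> resolvedAS) {
    if (!resolvedAS) return std::nullopt; // still generic: cannot fold
    return *resolvedAS == queriedAS;      // fold to a boolean constant
}
```

In the before/after IR example earlier, this is exactly what turns `@llvm.nvvm.isspacep.shared(ptr %shared_buf)` into `true` once `%shared_buf` is known to be AS 3.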
Error Handling for Illegal Operations
The pass detects and reports illegal address-space/operation combinations as soft warnings (compilation continues):
| Operation | Illegal Space | Warning Message |
|---|---|---|
| Atomic load/store | Constant | "Cannot do atomic operation on const memory" |
| Atomic load/store | Local | "Cannot do atomic on local memory" |
| WMMA | Constant | "Cannot do WMMA on constant memory" |
| WMMA | Local | "Cannot do WMMA on local memory" |
| Vector atomic | Shared | "Cannot to vector atomic on shared memory" |
| Vector atomic | Local | "Cannot to vector atomic on local memory" |
| Vector atomic | Constant | "Cannot to vector atomic on const memory" |
Note: The vector atomic messages contain a typo in NVIDIA's source -- "Cannot to" should read "Cannot do". This typo is present in all three vector atomic warning strings.
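The check reduces to a small (operation, address space) lookup. The sketch below keeps NVIDIA's messages verbatim, including the "Cannot to" typo; the enum and function name are illustrative:

```cpp
#include <cassert>
#include <string>

enum class Op { Atomic, WMMA, VectorAtomic };

// Returns the soft-warning text for an illegal combination, or "" when
// the combination is legal (compilation continues either way).
std::string illegalOpWarning(Op op, int addrSpace) {
    bool isShared = addrSpace == 3, isConst = addrSpace == 4, isLocal = addrSpace == 5;
    switch (op) {
    case Op::Atomic:
        if (isConst) return "Cannot do atomic operation on const memory";
        if (isLocal) return "Cannot do atomic on local memory";
        break;
    case Op::WMMA:
        if (isConst) return "Cannot do WMMA on constant memory";
        if (isLocal) return "Cannot do WMMA on local memory";
        break;
    case Op::VectorAtomic: // note: "Cannot to" typo preserved from the binary
        if (isShared) return "Cannot to vector atomic on shared memory";
        if (isLocal)  return "Cannot to vector atomic on local memory";
        if (isConst)  return "Cannot to vector atomic on const memory";
        break;
    }
    return "";
}
```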
Key Functions
| Function | Address | Size | Role |
|---|---|---|---|
| Pass entry / IP-MSP driver | sub_1C70910 | 2427 lines | Main entry point, worklist iteration, mode dispatch |
| First-time resolver | sub_1CA2920 | 1119 lines | Conservative address space resolution |
| Second-time resolver | sub_1CA9E90 | 933 lines | Hash-table-based resolution with isspacep folding |
| Use-def chain walker | sub_1CA5350 | 1641 lines | Backward pointer origin tracking |
| Per-BB scanner | sub_1CA8CD0 | 898 lines | Instruction scan, bitmask builder |
| Pass initialization | sub_1CAB590 | 1040 lines | Global registration, data structure setup |
| MemorySpaceCloning engine | sub_2CBBE90 | 71KB | Inter-procedural function cloning |
| IPMSPPass variant | sub_1C6A6C0 | 54KB | LIBNVVM module-pass variant |
| Address space inference | sub_2CE96D0 | -- | Dataflow analysis for single argument |
| CloneFunction | sub_F4BFF0 | -- | Full function clone with type rewriting |
| shouldProcessFunction | sub_2CBA650 | -- | Multi-condition filter for worklist eligibility |
| hasUnresolvedPointerArgs | sub_2CBA520 | -- | Checks if any arg is an unresolved generic pointer |
| replaceAllUsesWith | sub_BD84D0 | -- | Rewrites call sites to target the clone |
| propagateSpacesToCallees | sub_2CF5840 | -- | Propagates resolved spaces through call graph |
Alternate Algorithm
A parallel implementation exists at sub_2CBBE90 / sub_2CEAC10 / sub_2CF2C20, selected when mem-space-alg != 2. The default algorithm (value 2) is the one documented above; the alternate may be a simpler or older version optimized for different patterns.
Configuration Knobs
Primary Knobs (ctor_264 / ctor_267_0)
| Knob | Global | Type | Default | Description |
|---|---|---|---|---|
| dump-ip-msp | dword_4FBD480 | bool | false | Dump inter-procedural memory space propagation debug info |
| do-clone-for-ip-msp | dword_4FBD3A0 | int | -1 | Max number of clones (-1 = unlimited). Set to 0 to disable cloning. |
| param-always-point-to-global | unk_4FBE1ED | bool | true | Assume kernel parameters always point to global memory |
| dump-ir-before-memory-space-opt | byte_4FBE000 | bool | false | Dump IR before the pass runs |
| dump-ir-after-memory-space-opt | byte_4FBDF20 | bool | false | Dump IR after the pass completes |
| track-indir-load | byte_4FBDE40 | bool | true | Track pointers loaded from memory during use-def walking |
| mem-space-alg | dword_4FBDD60 | int | 2 | Algorithm selection for address space optimization |
| track-int2ptr | byte_4FBDC80 | bool | true | Track inttoptr casts during analysis |
Additional Knobs (ctor_267_0 / ctor_531_0)
| Knob | Default | Description |
|---|---|---|
| process-alloca-always | true | Treat alloca instructions as definite local (AS 5) regardless of context |
| wmma-memory-space-opt | true | Enable memory space optimization for WMMA operations |
| strong-global-assumptions | true | Assume const buffer pointers always point to globals |
| process-builtin-assume | -- | Process __builtin_assume(__is*(p)) assertions for space deduction |
IP-MSP Pass Knobs (ctor_528)
| Knob | Global | Default | Description |
|---|---|---|---|
| dump-ip-msp | qword_5013548 | 0 | Debug tracing for IPMSP variant |
| do-clone-for-ip-msp | qword_5013468 | -1 | Clone limit for IPMSP variant |
Optimization Level Behavior
| Level | Phase 1 (first-time) | Phase 2 (second-time) | IP-MSP Cloning |
|---|---|---|---|
| O0 | Runs (mode 0) -- address space resolution is required for correct PTX emission | Not run | Not run |
| Ofcmax | Runs (mode 0); LSA-Opt forced to 0, limiting resolution depth | Not run | Not run |
| Ofcmid | Runs (mode 0) | Runs (mode 1) after IP-MSP propagation | Enabled (do-clone-for-ip-msp=-1) |
| O1+ | Runs (mode 0) early in pipeline | Runs (mode 1) after IP-MSP propagation | Enabled; iterates to fixed point |
This pass is unusual in that it runs even at O0 -- address space resolution is a correctness requirement, not purely an optimization. Without it, all memory accesses would use generic (flat) addressing, which is functionally correct but significantly slower due to the address translation hardware penalty. At Ofcmax, the pass runs in a reduced mode with LSA-Opt disabled. See Optimization Levels for the complete pipeline structure.
Diagnostic Strings
"Initial work list size: %d"
"changed in argument memory space"
"is cloned"
"avoid cloning of"
"callees are affected"
"return memory space is resolved"
"Cannot tell what pointer points to, assuming global memory space"
"Cannot do atomic operation on const memory"
"Cannot do atomic on local memory"
"Cannot do WMMA on constant memory"
"Cannot do WMMA on local memory"
"Cannot to vector atomic on shared memory"
"Cannot to vector atomic on local memory"
"Cannot to vector atomic on const memory"
Multi-Pass Data Flow: MemorySpaceOpt / IP-MSP / Alias Analysis
The following diagram shows how three cooperating subsystems exchange data to resolve generic pointers into specific address spaces. The left column is MemorySpaceOpt (per-function), the center is IP-MSP (module-level), and the right is NVVM Alias Analysis (query service). Arrows show data produced (-->) and consumed (<--).
MemorySpaceOpt (per-function) IP-MSP (module-level) NVVM Alias Analysis
============================== ========================== ======================
1. EARLY RUN (mode 0)
+----------------------------+
| Use-def chain walker |
| (sub_1CA5350) |
| Walk: GEP, bitcast, PHI, |
| alloca, call returns |
| |
| Produces: |
| - per-arg bitmask |
| (0x01=global,0x02=shr, |
| 0x04=const,0x08=local, |
| 0x10=param) |
| - unresolved arg list |
+---+------------------------+
| +----------------------+
| per-arg bitmasks | Address space |
| (singleton bit = resolved, | disjointness table: |
| multi-bit = ambiguous) | |
v | AS 1 vs AS 3: NoAlias|
+---+------------------------+ | AS 1 vs AS 5: NoAlias|
| addrspacecast insertion | | AS 3 vs AS 5: NoAlias|
| (sub_1CA1B70) | | AS 0 vs any: MayAlias|
| Rewrites loads/stores to | | (stateless, trivial) |
| ld.shared / st.global etc. | +----------+-----------+
+---+------------------------+ |
| |
| Resolved pointer types on |
| function args + return values |
v |
+---+-----------------------------+ +--------------------------+ |
| Unresolved args remain generic | ---> | IP-MSP worklist driver | |
| Need cross-function evidence | | (sub_1C70910 / 2CBBE90) | |
+---+-----------------------------+ | | |
^ | For each function F: | |
| | 1. Collect all callers | |
| | 2. Intersect arg AS | |
| | across call sites | |
| | 3. If unanimous: | |
| | specialize or clone | |
| propagated arg spaces | | |
| (from callers) | Produces: | |
+------------------------------------+ - cloned functions | |
| with AS-specific args | |
| - updated call sites | |
| - "changed in argument | |
| memory space" events | |
+---+----------------------+ |
| |
2. LATE RUN (mode 1) | Enriched module with |
+----------------------------+ | resolved pointer types |
| Hash-table resolver | v |
| (sub_1CA9E90) | <--- cloned functions re-enter worklist |
| | |
| Additional capabilities: | Each resolved addrspacecast |
| - isspacep folding | feeds into... |
| (builtins 0xFD0-0xFD5) | |
| - Dead cast elimination | +----------v-----------+
| | | NVVM AA (nvptx-aa) |
| Consumes: | | |
| - IP-MSP propagated | | With resolved AS on |
| address spaces | | pointers, queries |
| - hash table of known | | return NoAlias for |
| pointer->space mappings | | cross-space pairs |
+---+------------------------+ | |
| | Enables downstream: |
| Fully resolved IR | - GVN load forward |
| (minimal generic ptrs) | - DSE elimination |
v | - LICM hoisting |
+---+------------------------+ | - MemorySSA queries |
| Downstream consumers: | +----------------------+
| - Instruction selection |
| (ld.shared, st.global) |
| - Backend PTX emission |
| - Register allocation |
| (no generic-ptr spills) |
+----------------------------+
Data flow summary:
| Producer | Data | Consumer |
|---|---|---|
| MemorySpaceOpt phase 1 | Per-arg address space bitmask | IP-MSP worklist |
| IP-MSP worklist | Cloned functions with specialized arg types | MemorySpaceOpt phase 2 |
| IP-MSP worklist | Call-site rewriting (addrspacecast at boundaries) | All downstream passes |
| MemorySpaceOpt phase 2 | isspacep folded to true/false | Dead code elimination |
| Both phases | Resolved pointer address spaces on all IR values | NVVM AA (nvptx-aa) |
| NVVM AA | NoAlias for cross-space pointer pairs | GVN, DSE, LICM, MemorySSA |
The feedback loop between MemorySpaceOpt and IP-MSP is the critical insight: phase 1 resolves locally-obvious cases, IP-MSP propagates those resolutions across call boundaries (cloning when necessary), and phase 2 picks up the newly-available information to resolve cases that were previously ambiguous. The worklist iterates until no more argument spaces change, guaranteeing a fixed point. NVVM AA is the downstream beneficiary -- every resolved pointer pair that previously required a conservative MayAlias answer can now return NoAlias, enabling more aggressive optimization in GVN, DSE, LICM, and scheduling.
Common Pitfalls
These are mistakes a reimplementor is likely to make when building an equivalent address space resolution engine.
1. Resolving ambiguous pointers to the wrong default space. When the bitmask has multiple bits set (e.g., 0x03 = global OR shared), the pass defaults to global if param-always-point-to-global is true. A reimplementation that defaults to shared instead will silently produce ld.shared instructions for what is actually global memory, causing out-of-bounds accesses on the shared memory aperture. The correct behavior is: ambiguous always resolves to global (the safe conservative choice), never to a more restrictive space.
2. Forgetting to re-run after inter-procedural propagation. The pass must run twice: once before IP-MSP to resolve locally-obvious cases, and again after IP-MSP to consume propagated information. A single-pass reimplementation will miss every case where a callee's argument space is only known from the caller's context. The second run (mode 1) is not optional -- it catches the majority of inter-procedural resolutions and performs isspacep folding that the first run cannot do.
3. Cloning functions with external linkage instead of specializing in-place. The pass uses two strategies: in-place specialization for internal/private functions (all call sites visible) and clone-and-specialize for external/weak linkage. Reversing this logic -- cloning internal functions or modifying external ones -- either wastes compile time on unnecessary clones or breaks callers outside the module who still pass generic pointers. The linkage check (0x4007 for internal) is the discriminator and must not be inverted.
4. Failing to handle the addrspacecast chain correctly. After resolving a pointer's space, the pass inserts addrspacecast from generic to the specific space and replaces all uses. A reimplementation that replaces the pointer type directly (without the cast) will break LLVM's type system invariants, causing assertion failures in downstream passes. The cast must exist in the IR even though it is semantically a no-op -- LLVM's type-based alias analysis and GEP arithmetic depend on it.
5. Not iterating the IP-MSP worklist to a fixed point. The worklist must iterate until no argument bitmask changes. A reimplementation that runs one pass over all functions and stops will miss transitive resolutions through call chains (A calls B calls C). The bitmask OR is monotone (can only grow), so convergence is guaranteed, but early termination produces incomplete resolutions that leave generic pointers in the IR and forfeit the performance benefit of specialized memory instructions.
Test This
The following minimal kernel exercises address space resolution. Compile with nvcc -ptx -arch=sm_90 and inspect the PTX output.
__global__ void memspace_test(float *global_out, int n) {
__shared__ float smem[64];
smem[threadIdx.x] = (float)threadIdx.x;
__syncthreads();
float val = smem[threadIdx.x];
global_out[threadIdx.x] = val + 1.0f;
}
What to look for in PTX:
- ld.shared.f32 for the read from smem -- confirms the pass resolved the shared pointer from generic (AS 0) to shared (AS 3). If you see a plain ld.f32 without the .shared qualifier, the access goes through the generic address translation unit at runtime.
- st.global.f32 for the write to global_out -- confirms global pointer resolution (AS 1).
- Absence of cvta.to.shared / cvta.to.global instructions. These cvta (convert address) instructions indicate the backend is converting generic pointers at runtime instead of using resolved address spaces at compile time. Their absence means the pass succeeded fully.
- Compare with -O0 to see the unresolved version where generic ld/st instructions dominate.
Reimplementation Checklist
- Address space bitmask dataflow engine. Implement the per-value bitmask lattice (0x01=global, 0x02=shared, 0x04=constant, 0x08=local, 0x10=param) with OR-based meet, use-def chain walking through GEP/bitcast/PHI/alloca/inttoptr, and a visited-set to handle cycles through PHI nodes.
- Two-phase resolution with mode dispatch. Build a mode-parameterized entry point: mode 0 (conservative first-time), mode 1 (hash-table-based second-time with isspacep folding), and warning-suppression variants (modes 2/3).
- Inter-procedural fixed-point worklist (IP-MSP). Implement the module-level worklist that propagates per-argument address space bitmasks across call boundaries, re-adding callers when an argument's bitmask changes, iterating until no bitmask grows.
- Function cloning for specialization. Implement two strategies: in-place specialization for internal-linkage functions (modify arg types directly) and clone-and-specialize for external-linkage functions (create internal clone, rewrite call sites, insert addrspacecast at clone entry).
- isspacep intrinsic folding (phase 2). When a pointer's address space is resolved, fold isspacep.shared/.global/etc. builtins (IDs 0xFD0--0xFD5) to true or false constants.
- Post-resolution cleanup. Insert addrspacecast instructions, rewrite loads/stores to specific address spaces, eliminate dead cast chains (generic-to-shared followed by shared-to-generic), and rewrite call sites to target specialized clones.
- Illegal operation detection. Check and warn on illegal address-space/operation combinations (atomics on constant/local, WMMA on constant/local, vector atomics on shared/local/constant) without aborting compilation.
Pipeline Interaction
The pass runs at two points in the CICC pipeline: once early (first-time, mode 0) to resolve obvious cases before optimization, and again after inter-procedural propagation (second-time, mode 1) to catch cases that became resolvable after inlining and constant propagation. The no-warnings variants (modes 2/3) suppress repeated diagnostics on re-runs. The pass feeds directly into instruction selection, where resolved address spaces determine which PTX memory instructions are emitted. It also interacts with the ipmsp module pass, which drives the inter-procedural cloning engine separately from the per-function resolver.
nvvm-peephole-optimizer
The NVVM Peephole Optimizer is an NVIDIA-proprietary function-level IR pass that performs NVVM-specific pattern matching and instruction simplification. It is distinct from both LLVM's standard InstCombine pass (which handles general-purpose peephole optimization across ~600 functions in the 0x1700000--0x17B0000 range) and the machine-level nvptx-peephole pass (sub_21DB090) that operates on MachineInstrs after instruction selection.
This page documents all three peephole layers in cicc, their pipeline positions, their transformations, and the satellite machine-level peephole passes that complement them.
| Pass name | nvvm-peephole-optimizer |
| Class | llvm::NVVMPeepholeOptimizerPass |
| Scope | Function pass (IR level) |
| Registration | New PM slot 382 in sub_2342890 |
| Serializer | sub_2314DA0 |
| Factory (legacy PM) | sub_1CEF8F0 |
| Pipeline parser line | 3534 in sub_233C410 |
| Enable knob | enable-nvvm-peephole (bool, default = true) |
| Knob address | ctor_358_0 @ 0x50E8D0 |
| NVVMPassOptions slot | nvvm-peephole-optimizer in 4512-byte options struct |
| Pipeline position | Function-level, runs after NVVMReflect + NVVMIntrinsicLowering |
Purpose
CUDA programs produce IR patterns that standard LLVM optimizations do not recognize or cannot legally transform. The NVVM peephole pass fills this gap by matching NVVM-specific idioms -- address space casts, intrinsic call sequences, convergent operation patterns, and GPU-specific type conversions -- and rewriting them into simpler, cheaper forms. It operates at the LLVM IR level before code generation, complementing the machine-level nvptx-peephole pass that runs later in the pipeline.
The pass is always paired with sub_215D9D0 (NVVMAnnotationsProcessor), which runs immediately after the peephole in every pipeline path. This companion pass processes NVVM annotations (e.g., tcgen05 tensor annotation metadata) on the IR that the peephole has just simplified.
Three Peephole Layers
CICC contains three distinct peephole optimization layers, each operating at a different abstraction level and targeting different pattern classes.
| Layer | Pass | Level | Address / Slot | Targets |
|---|---|---|---|---|
| LLVM InstCombine | instcombine | IR | 0x1700000+ (~600 funcs) | General-purpose: algebraic simplification, constant folding, dead instruction removal |
| NVVM Peephole | nvvm-peephole-optimizer | IR | slot 382, factory sub_1CEF8F0 | NVVM-specific: address space casts, intrinsic sequences, GPU type conversions |
| NVPTX Peephole | nvptx-peephole | MachineInstr | sub_21DB090 | PTX-specific: redundant cvta folding, predicate optimization, move elimination |
The NVVM peephole pass handles transformations that require knowledge of NVVM's address space model, intrinsic semantics, or GPU-specific type system -- patterns that InstCombine cannot match because they depend on NVPTX target information not available to target-independent passes. The machine-level NVPTX peephole then handles patterns that only emerge after instruction selection has lowered IR to MachineInstrs.
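One pattern class the text attributes to this layer, dead addrspacecast chains, can be illustrated on a toy cast list. This is our illustration of the pattern, not recovered code; casts are modeled as (source AS, destination AS) pairs and a round trip through generic (AS 0) cancels:

```cpp
#include <cassert>
#include <utility>
#include <vector>

using Cast = std::pair<int, int>; // (source AS, destination AS)

// Collapse adjacent X->Y followed by Y->X cast pairs (e.g. shared ->
// generic -> shared), which are semantic no-ops on the pointer value.
std::vector<Cast> foldCastChain(std::vector<Cast> casts) {
    std::vector<Cast> out;
    for (Cast c : casts) {
        if (!out.empty() && out.back().second == c.first &&
            out.back().first == c.second)
            out.pop_back(); // round trip: drop both casts
        else
            out.push_back(c);
    }
    return out;
}
```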
Pipeline Positions
IR-Level: nvvm-peephole-optimizer
The IR-level peephole (sub_1CEF8F0) is invoked from the legacy pipeline assembler (sub_12E54A0) in all three language-specific code paths. Its companion sub_215D9D0 always follows immediately.
Path A -- "ptx" language (lines 580--638 in sub_12E54A0):
sub_1CEF8F0() NVVMPeephole
sub_215D9D0() NVVMAnnotationsProcessor
sub_1857160() NVVMReflect (conditional)
sub_1A62BF0(1) LLVM standard pipeline #1
sub_1B26330() MemCpyOpt
sub_18DEFF0() DCE
...
Path B -- "mid" language (Ofcmid, lines 814--1075):
sub_184CD60() ConstantMerge / GlobalDCE
sub_1CB4E40(0) NVVMIntrinsicLowering
sub_1B26330() MemCpyOpt
sub_198E2A0() SROA / CorrelatedValuePropagation
sub_1CEF8F0() NVVMPeephole <<<
sub_215D9D0() NVVMAnnotationsProcessor
sub_17060B0(1,0) PrintModulePass
sub_198DF00(-1) JumpThreading / CVP
sub_1C6E800() GVN / LICM
...
Path C -- default/general (O2/O3, lines 1077--1371):
sub_1A62BF0(4) LLVM standard pipeline #4
sub_1857160() NVVMReflect
sub_1CB4E40(0) NVVMIntrinsicLowering
sub_1857160() NVVMReflect (second pass)
sub_1CEF8F0() NVVMPeephole <<<
sub_215D9D0() NVVMAnnotationsProcessor
sub_1A7A9F0() InstructionSimplify
sub_1A62BF0(5) LLVM standard pipeline #5
...
Late position (O3 tier finalization):
sub_1B7FDF0(n) BranchFolding / CFGSimplify
sub_1CEF8F0() NVVMPeephole <<<
sub_215D9D0() NVVMAnnotationsProcessor
sub_18B3080(f) Sinking2Pass (fast mode)
sub_1CC60B0() NVVMSinking
sub_18A3430() AggressiveInstCombine
...
In every path, the peephole runs after NVVMIntrinsicLowering (sub_1CB4E40) and NVVMReflect (sub_1857160) have resolved intrinsics and reflect calls. This ensures the peephole sees simplified IR where previously-opaque intrinsic call patterns have been reduced to simpler forms amenable to pattern matching.
Machine-Level: nvptx-peephole
The machine-level peephole (sub_21DB090) runs in addPreRegAlloc() (sub_2166ED0):
EarlyTailDuplicate
codegen DCE
Machine LICM + CSE + Sinking (conditional on enable-mlicm, enable-mcse)
PeepholeOptimizerPass (stock LLVM, slot 492, disable-peephole)
NVPTXPeephole (sub_21DB090) <<<
DeadMachineInstrElim
MachineCopyPropagation
The string "After codegen peephole optimization pass" in sub_2166ED0 marks the checkpoint after both the stock LLVM peephole and the NVPTX peephole have completed.
New PM Registration
The pass is registered as a function-level pass in the New Pass Manager at registration line 2242 in sub_2342890. It sits in the mid-optimization phase alongside other NVIDIA function passes:
| Slot | Pass | Class |
|---|---|---|
| 376 | basic-dbe | BasicDeadBarrierEliminationPass |
| 377 | branch-dist | BranchDistPass |
| 378 | byval-mem2reg | ByValMem2RegPass |
| 379 | bypass-slow-division | BypassSlowDivisionPass |
| 380 | normalize-gep | NormalizeGepPass |
| 381 | nvvm-reflect-pp | SimplifyConstantConditionalsPass |
| 382 | nvvm-peephole-optimizer | NVVMPeepholeOptimizerPass |
| 383 | old-load-store-vectorizer | OldLoadStoreVectorizerPass |
| 384 | print<merge-sets> | MergeSetsAnalysisPrinterPass |
| 385 | remat | RematerializationPass |
IR-Level Transformation Categories
Based on pipeline position (after NVVMReflect + NVVMIntrinsicLowering, before sinking and rematerialization) and the patterns visible in NVVM IR, the peephole optimizer targets several categories.
Address Space Cast Simplification
After memory-space-opt and ipmsp resolve generic pointers to specific address spaces, redundant addrspacecast chains remain in the IR. The peephole rewrites these:
; Before:
%p1 = addrspacecast ptr addrspace(3) %src to ptr ; shared -> generic
%p2 = addrspacecast ptr %p1 to ptr addrspace(3) ; generic -> shared
store i32 %val, ptr addrspace(3) %p2
; After:
store i32 %val, ptr addrspace(3) %src ; chain eliminated
; Before:
%p = addrspacecast ptr addrspace(1) %src to ptr addrspace(1) ; identity cast
; After:
; (use %src directly — identity addrspacecast removed)
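The two rewrites above can be sketched as a recursive fold over a toy cast representation. This is a minimal sketch of the described transform, not recovered code; the node shapes and names are illustrative.

```python
# Toy IR: ("ptr", addrspace, name) leaves and ("cast", dst_addrspace, inner) nodes.
def space_of(v):
    return v[1]  # both node shapes keep the address space at index 1

def fold_addrspacecast(v):
    """Collapse identity casts and A -> generic -> A round-trips."""
    if v[0] != "cast":
        return v
    dst, inner = v[1], fold_addrspacecast(v[2])
    if space_of(inner) == dst:
        return inner              # identity cast: already in the target space
    if inner[0] == "cast" and space_of(inner[2]) == dst:
        return inner[2]           # round-trip through generic collapsed
    return ("cast", dst, inner)
```

Folding the shared -> generic -> shared chain from the first example yields the original shared-space pointer directly.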
The validation function sub_21BEE70 ("Bad address space in addrspacecast", 4.1KB) ensures the peephole does not create illegal address space transitions. NVPTX address spaces are:
| AS | Name | Legal cast targets |
|---|---|---|
| 0 | Generic | All |
| 1 | Global | Generic |
| 3 | Shared | Generic |
| 4 | Constant | Generic |
| 5 | Local | Generic |
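The legality rule in the table reduces to a one-line predicate: specific spaces may only cast to or from generic. A sketch of that rule follows; the real check is the 4.1KB validator sub_21BEE70, which presumably handles far more detail.

```python
# NVPTX address space numbers, as listed in the table above.
GENERIC, GLOBAL, SHARED, CONSTANT, LOCAL = 0, 1, 3, 4, 5

def is_legal_addrspacecast(src: int, dst: int) -> bool:
    """Generic casts to/from everything; specific spaces only via generic.
    Identity casts appear pre-peephole and are accepted (then removed)."""
    return src == dst or src == GENERIC or dst == GENERIC
```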
Intrinsic Call Folding
After NVVMIntrinsicLowering has expanded NVVM intrinsics, some expansion sequences can be further simplified:
; Before (after intrinsic lowering, launch_bounds known):
%tid = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
%cmp = icmp ult i32 %tid, 256 ; blockDim.x known = 256
; After (when nvvm-intr-range has set !range {0, 256}):
; all uses of %cmp replaced by constant i1 true -- always true for valid threads
; Before:
call void @llvm.nvvm.barrier0()
; (no shared memory operations between barriers)
call void @llvm.nvvm.barrier0()
; After:
call void @llvm.nvvm.barrier0() ; redundant barrier removed
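The barrier-deduplication rule above can be modeled as a single sweep that tracks whether any shared-memory access has occurred since the last barrier. This is a conservative toy model over illustrative instruction strings, not the pass's actual representation.

```python
def drop_redundant_barriers(insts):
    """Drop a barrier0 when no shared-memory access happened since the last one."""
    out, shared_touched = [], True   # keep the first barrier unconditionally
    for inst in insts:
        if inst == "barrier0":
            if not shared_touched:
                continue             # nothing to synchronize: drop it
            out.append(inst)
            shared_touched = False
        else:
            out.append(inst)
            if inst.startswith("shared."):
                shared_touched = True
    return out
```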
Type Conversion Cleanup
GPU-specific type representations (bf16, tf32, fp8) produce conversion chains not present in standard LLVM IR:
; Before (roundtrip through wider type):
%wide = fpext half %x to float
%back = fptrunc float %wide to half
; After:
; (use %x directly — roundtrip eliminated when no precision loss)
; Before (bf16 roundtrip):
%f32 = call float @llvm.nvvm.bf16.to.f32(i16 %bf)
%bf2 = call i16 @llvm.nvvm.f32.to.bf16(float %f32)
; After:
; (use %bf directly)
Post-Reflect Dead Code Cleanup
The companion pass nvvm-reflect-pp (SimplifyConstantConditionalsPass) runs immediately before the peephole in the pipeline. It resolves __nvvm_reflect() calls and simplifies constant conditionals:
; Before (after nvvm-reflect-pp resolves __nvvm_reflect("__CUDA_FTZ") = 1):
%ftz = call i32 @__nvvm_reflect(ptr @"__CUDA_FTZ") ; resolved to 1
%cmp = icmp ne i32 %ftz, 0 ; always true
br i1 %cmp, label %ftz_path, label %no_ftz_path
; After nvvm-reflect-pp:
br label %ftz_path ; unconditional
; The peephole then cleans up dead instructions in %no_ftz_path
; and simplifies any resulting phi nodes at merge points
Convergent Operation Canonicalization
CUDA's convergent operations (__syncwarp, __ballot_sync, etc.) have specific semantic constraints that standard InstCombine cannot reason about because it must treat convergent calls as opaque. The peephole, with knowledge of NVVM semantics, can simplify convergent call sequences when the mask or participating threads can be determined at compile time.
Machine-Level NVPTXPeephole (sub_21DB090)
The machine-level peephole operates on MachineInstr objects after instruction selection has converted LLVM IR to PTX pseudo-instructions. It targets patterns specific to the PTX instruction set.
Redundant cvta Folding
The cvta (convert address) instruction converts between generic and specific address spaces. Address space lowering often inserts redundant conversions:
// Before:
cvta.to.global %rd1, %rd2 ; convert generic -> global
cvta.global %rd3, %rd1 ; convert global -> generic (redundant pair)
// After:
mov.b64 %rd3, %rd2 ; direct copy, cvta pair eliminated
The companion pass sub_21DA810 ("NVPTX optimize redundant cvta.to.local instruction") handles the remaining cvta.to.local instructions that survive to late post-RA:
// Before (late pipeline):
cvta.to.local %rd1, %rd2 ; redundant when %rd2 is already local-space
// After (sub_21DA810 removes it):
; (use %rd2 directly)
Predicate Pattern Optimization
PTX uses predicate registers for conditional execution. The peephole simplifies predicate sequences:
// Before:
setp.ne.s32 %p1, %r1, 0;
@%p1 bra target;
// After (folds setp into branch when pattern is recognized):
// Combined compare-and-branch
PTX Move Elimination
sub_2204E60 ("Remove redundant moves") eliminates identity moves:
// Before:
mov.b32 %r5, %r5; ; identity move
// After:
; (deleted)
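The identity-move rule is simple enough to sketch over textual PTX lines. The real pass operates on MachineInstr objects; this only mirrors the pattern it matches.

```python
import re

# Matches e.g. "mov.b32 %r5, %r5;" and captures destination and source.
_MOV = re.compile(r"mov\.\w+\s+(%\w+),\s*(%\w+);")

def drop_identity_moves(lines):
    kept = []
    for line in lines:
        m = _MOV.match(line.strip())
        if m and m.group(1) == m.group(2):
            continue                 # mov %r, %r -- delete
        kept.append(line)
    return kept
```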
Satellite Machine Peephole Passes
Three additional machine-level passes perform specialized peephole transformations adjacent to the main NVPTXPeephole:
param-opt (sub_2203290)
| Pass name | param-opt |
| Entry point | sub_2203290 |
| Description | "Optimize NVPTX ld.param" |
Optimizes parameter load patterns. In PTX, kernel parameters are loaded via ld.param instructions into registers. When the same parameter is loaded multiple times (e.g., after inlining or loop unrolling), param-opt consolidates them:
// Before:
ld.param.u32 %r1, [_param_0];
...
ld.param.u32 %r7, [_param_0]; ; redundant reload of same parameter
// After:
ld.param.u32 %r1, [_param_0];
...
mov.b32 %r7, %r1; ; reuse previous load
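The consolidation shown above can be sketched as a cache of the first register loaded per parameter. This assumes straight-line code where the first destination register is never clobbered, and it hard-codes mov.b32 for simplicity; the real pass must verify both on the machine IR.

```python
import re

# Matches e.g. "ld.param.u32 %r1, [_param_0];".
_LDPARAM = re.compile(r"ld\.param\.(\w+)\s+(%\w+),\s*\[(\w+)\];")

def consolidate_ld_param(lines):
    first_reg, out = {}, []
    for line in lines:
        m = _LDPARAM.match(line.strip())
        if m:
            _, reg, param = m.groups()
            if param in first_reg:
                # redundant reload: reuse the register from the first load
                out.append(f"mov.b32 {reg}, {first_reg[param]};")
                continue
            first_reg[param] = reg
        out.append(line)
    return out
```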
nvptx-trunc-opts (sub_22058E0)
| Pass name | nvptx-trunc-opts |
| Entry point | sub_22058E0 |
| Description | "Optimize redundant ANDb16ri instrunctions" [sic] |
Eliminates redundant AND operations on b16 (16-bit) registers. When type legalization widens a sub-16-bit value to 16 bits, it inserts an AND with a mask to preserve the original width. If the value is already correctly masked (e.g., from a load that zero-extends), the AND is redundant:
// Before:
ld.u8 %rs1, [%rd1]; ; loads 8-bit, zero-extended to 16
and.b16 %rs2, %rs1, 0xFF; ; redundant mask — already 8-bit clean
// After:
ld.u8 %rs1, [%rd1];
// (AND deleted, use %rs1 directly)
The binary contains the string with a typo: "instrunctions" instead of "instructions".
Remove Redundant Moves (sub_2204E60)
| Entry point | sub_2204E60 |
| Description | "Remove redundant moves" |
Eliminates move instructions where source and destination are the same register, or where the move is immediately dead. This complements the stock LLVM MachineCopyPropagation pass with PTX-specific move patterns.
Knobs
| Knob | Type | Default | Scope | Effect |
|---|---|---|---|---|
| enable-nvvm-peephole | bool | true | IR + Machine | Master switch for both the IR-level nvvm-peephole-optimizer and the machine-level nvptx-peephole. Registered at ctor_358_0 (0x50E8D0). |
| disable-peephole | bool | false | Machine only | Disables the stock LLVM PeepholeOptimizerPass (slot 492). Does not affect the NVIDIA-specific passes. Registered at ctor_314 (0x502360). |
| aggressive-ext-opt | bool | (varies) | Machine only | Controls aggressive extension optimization in stock LLVM peephole. |
| disable-adv-copy-opt | bool | false | Machine only | Disables advanced copy optimization in stock LLVM peephole. |
| rewrite-phi-limit | int | (varies) | Machine only | Limits PHI rewriting in stock LLVM peephole. |
| recurrence-chain-limit | int | (varies) | Machine only | Limits recurrence chain analysis in stock LLVM peephole. |
The enable-nvvm-peephole description string recovered from the binary is "Enable NVVM Peephole Optimizer". Its default-on status suggests the pass is considered mature enough to run without opt-in.
Optimization Level Behavior
The IR-level peephole runs in all optimization paths except -O0:
| Level | Path | NVVMPeephole invocations |
|---|---|---|
| Ofcmin | "ptx" path | 1 (early) |
| Ofcmid | "mid" path | 1 (after SROA/CVP) |
| O2/O3 | "default" path | 1 (after NVVMReflect + IntrinsicLowering) |
| O3 (late) | Tier finalization | 1 (after BranchFolding/CFGSimplify) |
At -O0, the peephole is likely skipped along with most optimization passes. The factory function sub_1CEF8F0 appears only in code paths that are active at O1 and above.
End-to-End Peephole Pipeline
The complete peephole optimization flow through cicc, from IR to PTX:
Source CUDA
|
v
[LLVM IR after clang/EDG frontend]
|
v
InstCombine (0x1700000+) General algebraic simplification
| ~600 functions, target-independent
v
NVVMReflect (sub_1857160) Resolve __nvvm_reflect() calls
|
v
nvvm-reflect-pp Simplify constant conditionals from reflect
|
v
NVVMIntrinsicLowering (sub_1CB4E40) Expand NVVM intrinsics
|
v
nvvm-peephole-optimizer NVVM-specific IR patterns:
(sub_1CEF8F0 factory) - addrspacecast chain folding
| - intrinsic sequence simplification
v - type conversion roundtrip elimination
NVVMAnnotationsProcessor - post-reflect dead code cleanup
(sub_215D9D0 companion)
|
v
[Further IR optimization: GVN, LICM, Sinking2, etc.]
|
v
[Instruction Selection: DAGToDAG (sub_2200150, 78KB)]
| Hash-table pattern matching: hash = (37*idx) & (tableSize-1)
v
PeepholeOptimizerPass (slot 492) Stock LLVM machine peephole:
| - redundant copy folding
v - compare-and-branch simplification
NVPTXPeephole (sub_21DB090) PTX-specific machine peephole:
| - cvta pair elimination
v - predicate folding
param-opt (sub_2203290) - ld.param consolidation
|
v
nvptx-trunc-opts (sub_22058E0) - ANDb16ri elimination
|
v
Remove Redundant Moves (sub_2204E60) - identity move deletion
|
v
[Register Allocation]
|
v
ProxyRegErasure (sub_21DA810) Late cvta.to.local removal
|
v
[PTX Emission]
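The instruction-selection stage in the diagram uses hash-table pattern matching with the recovered probe formula hash = (37*idx) & (tableSize-1). A minimal sketch of that probe (the function name and table contents are illustrative, not recovered):

```python
def dag_pattern_slot(idx: int, table_size: int) -> int:
    """Recovered probe formula; the masking trick requires a power-of-two table."""
    assert table_size & (table_size - 1) == 0, "table size must be a power of two"
    return (37 * idx) & (table_size - 1)
```

Multiplying by the odd constant 37 scatters consecutive pattern indices before the power-of-two mask truncates them, a common cheap alternative to modulo by a prime.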
Function Map
| Name | Function | Size | Role |
|---|---|---|---|
| -- | sub_1CEF8F0 | small | NVVMPeephole factory (legacy PM) |
| -- | sub_215D9D0 | -- | NVVMAnnotationsProcessor (companion, always paired) |
| -- | sub_2314DA0 | small | NVVMPeepholeOptimizerPass serializer (New PM) |
| -- | sub_2342890 | -- | New PM registration function (slot 382) |
| -- | sub_233C410 | -- | Pipeline text parser (line 3534) |
| -- | sub_21DB090 | small | NVPTXPeephole machine pass registration |
| -- | sub_2166ED0 | 1.6KB | addPreRegAlloc() -- hosts NVPTXPeephole |
| -- | sub_21DA810 | -- | ProxyRegErasure (cvta.to.local removal) |
| -- | sub_2203290 | small | param-opt (ld.param optimization) |
| -- | sub_2204E60 | small | Remove Redundant Moves |
| -- | sub_22058E0 | small | nvptx-trunc-opts (ANDb16ri elimination) |
| -- | sub_21BEE70 | 4.1KB | "Bad address space in addrspacecast" validation |
| -- | sub_20DA7F0 | 30KB | DAG combine / peephole on MachineInstrs |
| -- | sub_37E1AE0 | 18KB | Late-stage machine optimization (peephole or copy prop) |
Differences from Upstream LLVM
Upstream LLVM (as of LLVM 17/18) contains NVPTXPeephole.cpp in llvm/lib/Target/NVPTX/, which implements a small machine-level pass that:
- Folds `cvta` address-space-conversion pseudo-instructions
- Removes `NVPTX::PROXY_REG` pseudo-instructions (now split into a separate `NVPTXProxyRegErasure` pass in cicc)
CICC v13.0 extends this significantly:
- The IR-level pass (`nvvm-peephole-optimizer`) has no upstream counterpart. It is entirely NVIDIA-proprietary, filling a gap between target-independent InstCombine and target-specific machine peephole.
- Three satellite machine passes (`param-opt`, `nvptx-trunc-opts`, `Remove Redundant Moves`) have no upstream equivalents.
- The machine-level `nvptx-peephole` is larger than upstream, likely incorporating additional pattern rules for newer PTX features (tensor core operations, cluster operations, etc.).
- ProxyRegErasure is separated from NVPTXPeephole into its own pass (`sub_21DA810`) and runs late post-RA rather than inline with the peephole.
Evidence Summary
The pass's existence and classification are confirmed through multiple independent sources:
| Source | Address / Location | Evidence |
|---|---|---|
| Pipeline parser | sub_233C410 line 3534 | Registers "nvvm-peephole-optimizer" as function-level NVIDIA custom pass |
| New PM registration | sub_2342890 slot 382 | Maps string to llvm::NVVMPeepholeOptimizerPass |
| Serializer | sub_2314DA0 | Produces "nvvm-peephole-optimizer" text for pipeline printing |
| Legacy PM factory | sub_1CEF8F0 | Called 2x from sub_12E54A0 (pipeline assembler) |
| Companion pairing | sub_215D9D0 | Always immediately follows sub_1CEF8F0 in all paths |
| Knob sweep | 0x50E8D0 (ctor_358_0) | enable-nvvm-peephole = "Enable NVVM Peephole Optimizer", default true |
| Knob duplicate | 0x560000 sweep line 292 | Confirmed with identical description |
| NVVMPassOptions | p2a.3-03-passoptions.txt | Listed as nvvm-peephole-optimizer in option table |
| Machine pass | sub_21DB090 | "NVPTX Peephole" / "nvptx-peephole" registration string |
| Machine pipeline | sub_2166ED0 | "After codegen peephole optimization pass" checkpoint string |
Confidence note. The pass registration, knobs, pipeline position, and factory function are confirmed at HIGH confidence from binary evidence. The specific transformation patterns described above are at MEDIUM confidence -- inferred from pipeline position (runs after NVVMReflect + NVVMIntrinsicLowering), NVVM IR semantics, and address space validation code, but the actual NVVMPeepholeOptimizerPass::run() body has not been individually decompiled. The factory sub_1CEF8F0 creates the pass object; the run method is dispatched through the object's vtable.
Cross-References
- Scalar Passes (InstCombine) -- stock LLVM InstCombine that handles general-purpose peephole
- NVVM Intrinsic Lowering -- runs before the peephole, expands intrinsics
- NVVMReflect -- resolves `__nvvm_reflect()` before the peephole cleans up
- Machine-Level Passes -- documents the full pre-RA / post-RA machine pass pipeline
- Minor NVIDIA Passes -- brief entries for `nvptx-peephole`, `proxy-reg-erasure`, and other small passes
- Address Spaces -- NVPTX address space numbering and cast rules
- NVVMPassOptions -- the 4512-byte options struct that gates this pass
- Optimization Levels -- which paths invoke the peephole at each -O level
- Pipeline Assembler -- the master `sub_12E54A0` function that builds the pass pipeline
Sinking2 (NVIDIA Code Sinking)
sinking2 is an NVIDIA-proprietary instruction sinking pass that moves instructions closer to their uses, with specific awareness of GPU texture and surface memory operations. It is entirely distinct from LLVM's stock sink pass: while both perform code sinking, Sinking2 is tailored for NVIDIA's memory hierarchy and iterates to a fixed point rather than making a single pass. The primary motivation is reducing register pressure by deferring computation of values until just before they are consumed, which is especially impactful on GPUs where register files are shared across hundreds of concurrent threads.
The pass is particularly focused on sinking instructions into texture load blocks. Texture operations on NVIDIA GPUs have high latency but are served by a dedicated cache; by sinking the address computation and other operands into the block that performs the texture fetch, the compiler reduces the live range of those values and frees registers for other warps. This directly improves occupancy -- the number of warps that can execute simultaneously on an SM.
Pipeline Position
| Field | Value |
|---|---|
| Pass name (pipeline) | sinking2 |
| Pass ID | sink2 |
| Display name | Code sinking |
| Pass type | FunctionPass (NVIDIA-custom) |
| Class | llvm::Sinking2Pass |
| Registration | New PM #390, line 2282 in sub_2342890 |
| Runtime positions | Tier 1/2/3 #81 (NVVMSinking2 via sub_1CC60B0, gated by opts[3328] && !opts[2440]); see Pipeline |
| Legacy PM entry | sub_1CCA270 |
| New PM entry | sub_2D1C160 (19KB) |
| Legacy PM registration | sub_1CC7010 |
| New PM registration | sub_2D1B410 |
| Knob constructor | ctor_275 at 0x4F7750 |
| Vtable (Legacy) | off_49F8BC0 |
| Vtable (New PM) | off_4A260F0 |
Relationship to All Sink Passes in cicc
CICC v13.0 contains five distinct sinking mechanisms. Understanding which is which is essential when reading the pipeline or debugging register pressure issues:
| Pass ID / Factory | Class | Origin | Key Difference |
|---|---|---|---|
| sink / sub_1A634D0 | LLVM SinkingPass | Upstream LLVM | Stock single-pass sinking, uses MemorySSA for alias safety |
| sink2 / sub_1CCA270 | llvm::Sinking2Pass | NVIDIA | Texture-aware, iterative fixpoint, custom AA layer |
| sink<rp-aware> | Parameterized variant | LLVM + NVIDIA | Register-pressure-aware sinking (stock sink with rp-aware-sink=true) |
| NVVMSinking2 / sub_1CC60B0 | NVIDIA late sinking | NVIDIA | Late-pipeline SM-specific sinking, gated by opts[3328] |
| MachineSink | LLVM MachineSinking | LLVM | MIR-level sinking, opt-in for NVPTX via nvptx-enable-machine-sink |
The stock LLVM sink (sub_1869C50, called with params (1,0,1)) uses MemorySSA for alias queries and makes a single pass. Sinking2 uses its own alias analysis layer routed through sub_13575E0 and iterates to convergence. NVVMSinking2 (sub_1CC60B0) is a separate NVIDIA pass that runs late in the pipeline after barrier lowering and warp-level optimizations, gated by the SM-specific pass group flag opts[3328].
IR Before/After Example
The pass sinks address computation closer to texture/surface use sites, reducing register pressure by shortening live ranges.
Before (address computation in preheader, live across loop body):
preheader:
%base = getelementptr float, ptr addrspace(1) %tex_ptr, i64 %offset
%addr = getelementptr float, ptr addrspace(1) %base, i64 %stride
br label %loop
loop:
%i = phi i64 [ 0, %preheader ], [ %i.next, %loop ]
; ... many instructions using registers, %base and %addr are live ...
%tex_addr = getelementptr float, ptr addrspace(1) %addr, i64 %i
%h = ptrtoint ptr addrspace(1) %tex_addr to i64
%val = call float @llvm.nvvm.tex.unified.1d.v4f32.f32(i64 %h)
%i.next = add i64 %i, 1
%cmp = icmp slt i64 %i.next, %n
br i1 %cmp, label %loop, label %exit
After (address computation sunk into loop, next to texture use):
preheader:
br label %loop
loop:
%i = phi i64 [ 0, %preheader ], [ %i.next, %loop ]
; ... many instructions, but %base and %addr are no longer live here ...
%base = getelementptr float, ptr addrspace(1) %tex_ptr, i64 %offset
%addr = getelementptr float, ptr addrspace(1) %base, i64 %stride
%tex_addr = getelementptr float, ptr addrspace(1) %addr, i64 %i
%h = ptrtoint ptr addrspace(1) %tex_addr to i64
%val = call float @llvm.nvvm.tex.unified.1d.v4f32.f32(i64 %h)
%i.next = add i64 %i, 1
%cmp = icmp slt i64 %i.next, %n
br i1 %cmp, label %loop, label %exit
The GEP instructions now execute inside the loop (higher execution count) but free registers in the rest of the loop body. This is a deliberate tradeoff: extra ALU work for reduced register pressure, which typically improves occupancy and net throughput.
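The register-pressure effect can be quantified with a simple live-interval sweep. This is a toy model (positions and names are illustrative): each value contributes a live interval from its definition to its last use, and the peak overlap approximates register demand.

```python
def max_live(intervals):
    """intervals: list of (name, def_pos, last_use_pos).
    Returns the peak number of simultaneously live values."""
    events = []
    for _, d, u in intervals:
        events.append((d, 1))        # value becomes live at its definition
        events.append((u + 1, -1))   # value dies after its last use
    live = peak = 0
    for _, delta in sorted(events):
        live += delta
        peak = max(peak, live)
    return peak
```

Before sinking, %base and %addr defined in the preheader stay live across the whole body; after sinking, their intervals shrink to a few instructions next to the texture fetch, lowering the peak.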
Algorithm
Entry Point
The legacy PM entry sub_1CCA270 performs these steps:
- Fetches `DominatorTree` analysis (via `DominatorTreeWrapperPass` at `unk_4F9E06C`)
- Fetches `LoopInfo` analysis (via `LoopInfoWrapperPass` at `unk_4F96DB4`)
- Reads the `sink-into-texture` knob (`qword_4FBF2C0[20]`) -- must be non-zero (enabled)
- Reads the `sink-limit` knob (`qword_4FBF1E0[20]`) -- must be greater than zero
- Calls the main worklist driver `sub_1CC9110`
The New PM entry sub_2D1C160 (19KB) performs the same logic using AnalysisManager to fetch analyses, then dispatches to sub_2D1CFB0 (13KB).
The pass does not require ScalarEvolution (SCEV), MemorySSA, or PostDominatorTree, keeping it simpler and cheaper than loop-oriented or MemorySSA-dependent passes.
Main Worklist Driver (sub_1CC9110, 22KB)
The core algorithm is a fixpoint iteration over the dominator tree:
function SinkingWorklist(F, DT, LI, textureLevel, sinkLimit):
changed = false
do:
roundChanged = false
sinkCount = 0
// Walk dominator tree in DFS preorder
for BB in DT.dfs_preorder():
// Skip loop headers to avoid creating loop-carried deps
if LI.isLoopHeader(BB):
continue
// Process instructions bottom-up within each block
for I in reverse(BB.instructions()):
if sinkCount >= sinkLimit:
break // complexity limiter
if I.mayHaveSideEffects() or I.isTerminator():
continue // unsinkable
if I.use_empty():
continue // dead, leave for DCE
// Level 3: consider instructions used only outside BB
if textureLevel < 3 and allUsesInSameBlock(I, BB):
continue
targetBB = findBestSinkTarget(I, DT, LI) // sub_1CC7510
if targetBB == BB:
continue // already in best position
// Profitability: prefer texture/surface blocks
if textureLevel >= 1:
if not blockContainsTextureOps(targetBB):
if not dominatesTextureBlock(targetBB, DT):
continue // not profitable
// Safety: alias analysis check
if not isSafeToSink(I, BB, targetBB): // sub_1CC8920
continue
// Safety: memory dependency check
if not checkMemDep(I, BB, targetBB): // sub_1CC8CA0
continue
I.moveBefore(targetBB.firstNonPHI())
roundChanged = true
sinkCount++
changed |= roundChanged
while roundChanged // iterate until no more changes
return changed
Key design points:
- DFS preorder ensures parent blocks are processed before children. Instructions sunk from a parent into a child on one iteration may expose further sinking opportunities for grandchild blocks on the next iteration -- hence the fixpoint loop.
- Bottom-up within each block processes the last instruction first. This is important because sinking an instruction may make an earlier instruction's operands dead, which DCE will clean up later.
- Loop headers are skipped to prevent creating loop-carried dependencies (a value defined in the header, consumed in the latch, sunk into the latch would create a cycle).
Instruction Processing (sub_1CC7510, 16KB)
For each candidate instruction, this function:
- Walks the use chain to find all consumers (via `sub_15F4D60`, multi-use check)
- For each user, determines the containing basic block
- Computes the lowest common dominator (LCD) of all user blocks using the dominator tree
- If LCD == current block, no benefit from sinking -- the instruction is already as close to its uses as possible while dominating all of them
- Builds a sink mapping: instruction to target block
- Checks memory safety via alias analysis (`sub_13575E0`)
- Validates that sinking does not violate memory ordering constraints
- Respects PHI nodes (LLVM opcode `PHI`) as sink boundaries -- an instruction cannot be sunk past a PHI insertion point
The target block selection algorithm effectively finds the nearest common dominator of all uses that is strictly dominated by the current block. If the instruction has a single use, the target is trivially the use's block (or its immediate dominator if the use is a PHI operand).
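The nearest-common-dominator step can be sketched by walking immediate-dominator links; a fold of this function over all use blocks yields the sink target. This is a standard textbook formulation, not the binary's recovered code; `idom` maps each block to its immediate dominator (None for the entry block).

```python
def nearest_common_dominator(idom, a, b):
    """Nearest block dominating both a and b, via idom-chain walking."""
    ancestors = set()
    while a is not None:
        ancestors.add(a)
        a = idom.get(a)
    while b not in ancestors:
        b = idom[b]   # climb b's dominator chain until it meets a's
    return b
```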
Dominance Ordering (sub_1CC8170, 13KB)
Implements a hash-based ordering of basic blocks for comparing sink profitability. Uses DFS numbering from the dominator tree to determine which block comes "earlier" in the program. This ordering ensures:
- Instructions are only sunk toward uses, never away from them
- When multiple sink targets exist (multi-use instruction), the lowest common dominator is chosen
- The ordering is consistent across iterations of the fixpoint loop
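DFS numbering over the dominator tree supports both the ordering and O(1) dominance queries via entry/exit timestamps. This is the standard technique; the binary's exact encoding is not recovered, so names here are illustrative.

```python
def dfs_intervals(tree, root):
    """Assign entry/exit timestamps over a tree given as {node: [children]}."""
    tin, tout, t = {}, {}, 0
    stack = [(root, False)]
    while stack:
        node, done = stack.pop()
        if done:
            tout[node] = t; t += 1
            continue
        tin[node] = t; t += 1
        stack.append((node, True))          # close node after its subtree
        for child in tree.get(node, []):
            stack.append((child, False))
    return tin, tout

def dominates(tin, tout, a, b):
    """a dominates b iff b's interval nests inside a's."""
    return tin[a] <= tin[b] and tout[b] <= tout[a]
```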
Alias Checking (sub_1CC8920, 4KB)
Validates that moving instruction I from block From to block To does not reorder I past any conflicting memory access:
function isSafeToSink(I, From, To):
if not I.mayReadOrWriteMemory():
return true // pure computation, always safe
// Walk all instructions on the domtree path From -> To
for BB in pathBlocks(From, To):
for J in BB.instructions():
if J == I: continue
if AA.getModRefInfo(I, J) != NoModRef:
return false // conflict: I aliases with J
return true
This is not MemorySSA-based (unlike stock LLVM sink). The pass invokes the traditional AliasAnalysis query interface through sub_13575E0. This is less precise than MemorySSA but avoids the cost of building and maintaining the MemorySSA graph, which matters because Sinking2 iterates to fixpoint and would need to update MemorySSA on every move.
Memory Dependency Checking (sub_1CC8CA0, 6KB)
Additional memory safety layer beyond alias checking:
- Store-load forwarding: if `I` is a load and there is a store between `From` and `To` that may alias the loaded location, sinking would change the value loaded
- Store ordering: if `I` is a store, moving it past another store to a potentially-aliasing location changes program semantics
- Volatile/atomic barrier: volatile loads/stores and atomic operations are never sunk (treated as having side effects)
- Synchronization intrinsics: barrier calls (`__syncthreads`, `bar.sync`) are treated as memory fences; no instruction may be sunk past them
Texture/Surface Awareness
The pass identifies "texture blocks" -- basic blocks containing calls to texture/surface intrinsics (the tex.*, suld.*, sust.* family). Address computations that feed these intrinsic calls are the primary sink candidates, because texture address computation chains (GEP + index arithmetic) produce intermediate values that are consumed only at the texture fetch site. Without sinking, these intermediates occupy registers across potentially many instructions.
The sink-into-texture knob controls aggressiveness:
| Level | Behavior |
|---|---|
| 0 | Disabled -- no texture-aware sinking |
| 1 | Cross-block only: move instructions across block boundaries into texture blocks |
| 2 | Cross-block + intra-block: also reorder instructions within a block to position them immediately before their texture use |
| 3 (default) | All of the above + outside-only: consider instructions whose only uses are in blocks other than where the instruction is defined |
Level 3 catches the important case where a GEP in a preheader feeds a texture load inside a loop -- the GEP has no uses in its own block, only "outside" uses.
Address space checks for NVPTX (see reference/address-spaces):
- AS 1 (global): may alias with texture reads in some configurations
- AS 3 (shared): texture operations never access shared memory, so shared-space stores are not barriers to texture sinking
- AS 4 (const): texture/surface descriptors typically live in constant memory
- AS 5 (local): thread-local, no cross-thread interference
Loop Considerations
Sinking2 is loop-aware but conservative:
- Never sinks OUT of a loop: moving an instruction from a loop body to an exit block would change its execution count. The pass skips this entirely.
- May sink INTO loop bodies: when an instruction in a loop preheader feeds only uses inside the loop (particularly texture fetches), sinking it into the loop is profitable despite increasing execution count -- the register pressure reduction from shorter live ranges outweighs the extra computation.
- Skips loop headers: prevents creating loop-carried dependencies.
- Runs after LoopSimplify: the early instance (`sub_18B1DE0`) runs after LoopSimplify/LCSSA have canonicalized loop structure, so preheaders, latches, and exit blocks are well-formed.
This creates a deliberate tension with LICM:
- LICM hoists loop-invariant code into the preheader (reducing execution count)
- Sinking2 sinks non-invariant address computation out of the preheader and into the loop body (reducing register pressure)
The two passes run at different pipeline positions and balance each other. LICM runs first; Sinking2 runs after GVN and CGSCC inlining, when texture patterns are fully exposed.
Barrier Awareness
Sinking2 itself does not contain explicit __syncthreads / bar.sync detection logic. Instead, it relies on the LLVM side-effect model:
- Barrier intrinsics are marked as having side effects, so they are never sunk
- Barrier intrinsics are treated as memory fences by alias analysis, so no memory instruction may be sunk past them
The late NVVMSinking2 (sub_1CC60B0) runs after barrier lowering (sub_1CB73C0) and warp-level optimization passes. By that point, barriers have been lowered to their final form. The pipeline ordering is:
NVVMBranchDist -> NVVMWarpShuffle -> NVVMReduction -> NVVMSinking2
This sequence ensures NVVMSinking2 can sink past warp-level operations that are no longer opaque barriers, while still respecting the lowered barrier representation.
Multi-Run Pipeline Pattern
Sinking2 appears at three to four pipeline positions. Each run has different context and different opportunities:
| Position | Factory | Mode | Context |
|---|---|---|---|
| Early (pass ~39) | sub_18B1DE0() | Standard | After stock Sink, GVN, and CGSCC inlining. Texture patterns are exposed. |
| Post-peephole | sub_18B3080(1) | Fast (flag=1) | After NVVMPeephole. Peephole may create new sinking opportunities. Reduced iteration budget. |
| Late SM-specific | sub_1CC60B0() | SM-gated | After barrier lowering and warp shuffle. Gated by opts[3328] && !opts[2440]. |
For fast-compile mode (Ofcmax), only sub_18B3080(1) runs -- the single Sinking2 in fast mode with reduced iteration budget. No stock Sink, no NVVMSinking2.
The rationale for multiple runs:
- Run 1 (stock Sink) handles straightforward cases using MemorySSA's precise alias information
- Run 2 (Sinking2 early) performs texture-aware sinking now that GVN/CGSCC have simplified the IR
- Run 3 (Sinking2 fast) cleans up opportunities created by peephole optimization
- Run 4 (NVVMSinking2) performs SM-specific late sinking after barrier and warp-level transforms
NVVMPassOptions Gating
| Offset | Type | Effect |
|---|---|---|
| opts[1040] | bool | Disable stock Sink/MemSSA |
| opts[2440] | bool | Disable NVVMSinking2 (sub_1CC60B0) |
| opts[3328] | bool | Enable SM-specific warp/reduction/sinking pass group (gates NVVMSinking2) |
Cost Model (New PM)
The New PM object (176 bytes) contains floating-point thresholds at offsets +88 and +144, both initialized to 1065353216 (IEEE 754 1.0f). These thresholds suggest the New PM implementation has a more sophisticated cost model than the Legacy PM version:
- Profitability threshold (`+88`): minimum benefit score for a sink to be accepted. A value of 1.0 means the benefit must at least equal the cost.
- Cost threshold (`+144`): maximum acceptable cost for the sinking motion itself. A value of 1.0 means the movement cost must not exceed the baseline.
The Legacy PM version uses a simpler boolean profitability model (is the target a texture block? yes/no).
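The raw constant 1065353216 decodes to 1.0f under IEEE 754 single precision, which can be verified directly:

```python
import struct

# Decode the 32-bit pattern found at offsets +88 and +144 of the pass object.
bits = 1065353216
value = struct.unpack("<f", struct.pack("<I", bits))[0]
assert value == 1.0   # IEEE 754 single-precision 1.0f
```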
Configuration Knobs
Sinking2-Specific (ctor_275 at 0x4F7750)
| Knob | Type | Default | Storage | Description |
|---|---|---|---|---|
| sink-into-texture | int | 3 | qword_4FBF2C0 | Texture sinking aggressiveness (0=off, 1=cross-block, 2=+intra, 3=+outside-only) |
| sink-limit | int | 20 | qword_4FBF1E0 | Max instructions to sink per invocation (complexity limiter) |
| dump-sink2 | bool | false | qword_4FBF100 | Dump debug information during sinking |
Related Sinking Knobs (other passes, NOT Sinking2)
| Knob | Type | Default | Owner | Description |
|---|---|---|---|---|
| sink-check-sched | bool | true | stock Sink | Check scheduling effects of sinking |
| sink-single-only | bool | true | stock Sink | Only sink single-use instructions |
| rp-aware-sink | bool | false | stock Sink | Consider register pressure (controls sink<rp-aware> variant) |
| max-uses-for-sinking | int | (default) | stock Sink | Don't sink insts with too many uses |
| sink-ld-param | bool | (default) | NVPTX backend | Sink one-use ld.param to use point |
| hoist-load-param | bool | (default) | NVPTX backend | Hoist all ld.param to entry block (counterpart to sink-ld-param) |
| enable-andcmp-sinking | bool | (default) | CodeGenPrepare | Sink and/cmp into branches |
| aggressive-no-sink | bool | (default) | (unknown) | Sink all generated instructions |
| instcombine-code-sinking | bool | (default) | InstCombine | Enable code sinking within instcombine |
| nvptx-enable-machine-sink | bool | (default) | NVPTX backend | Enable MIR-level MachineSink |
| SinkRematEnable | bool | (default) | ptxas | Enable sink+rematerialization in ptxas |
Analysis Dependencies
| Legacy PM | New PM | Purpose |
|---|---|---|
| DominatorTreeWrapperPass (unk_4F9E06C) | DominatorTreeAnalysis (sub_CF6DB0) | Dominator tree for sink legality and ordering |
| LoopInfoWrapperPass (unk_4F96DB4) | LoopAnalysis (sub_B1A2E0) | Avoid sinking out of loops; skip loop headers |
Does not require: SCEV, MemorySSA, PostDominatorTree, BranchProbabilityInfo.
This is a key difference from stock LLVM SinkingPass, which requires MemorySSAAnalysis. Sinking2 uses its own alias analysis queries through helpers sub_1CC8920 and sub_1CC8CA0, routed through the traditional AA interface at sub_13575E0. This avoids the overhead of building/maintaining MemorySSA across fixpoint iterations.
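The legality question that sub_1CC8920/sub_1CC8CA0 answer can be sketched abstractly as follows. This is our abstraction, not the recovered code: the `Inst` record and the integer alias-set stand-in replace the real AA queries routed through sub_13575E0.

```cpp
#include <vector>

// Sketch: a load may be sunk past the instructions between its current
// position and its use only if none of them is a store that may alias it,
// a volatile access, or a call with unknown side effects (the recovered
// helpers additionally handle store-load forwarding and store ordering).
struct Inst {
    bool isStore = false;
    bool isVolatile = false;
    bool isCall = false;
    int aliasSet = -1;  // same aliasSet => may alias (toy stand-in for AA)
};

bool canSinkLoadPast(const std::vector<Inst> &path, int loadAliasSet) {
    for (const Inst &I : path) {
        if (I.isVolatile || I.isCall) return false;   // ordering hazards
        if (I.isStore && I.aliasSet == loadAliasSet)  // may-alias store
            return false;
    }
    return true;
}
```

Because this check is re-run on demand inside the fixpoint loop, no MemorySSA needs to be kept up to date across iterations, which is exactly the overhead the custom AA path avoids.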
Pass Object Layout
Legacy PM (160 bytes):
| Offset | Type | Content |
|---|---|---|
| +0 | ptr | Vtable pointer (off_49F8BC0) |
| +8 | ptr | Pass link (next pass in chain) |
| +16 | ptr | Pass ID pointer (&unk_4FBF0F4) |
| +24 | int32 | Mode (default=3, from sink-into-texture) |
| +28 | int32 | Sink limit (default=20, from sink-limit) |
| +32--48 | ptr[3] | Worklist data (head, tail, size) |
| +56 | ptr | DominatorTree* (set during runOnFunction) |
| +64 | ptr | List head 1 (self-referential sentinel) |
| +72--80 | ptr[2] | List next/prev 1 |
| +96 | int64 | Counter (sink count for current iteration) |
| +104 | ptr | LoopInfo* (set during runOnFunction) |
| +112 | ptr | List head 2 (self-referential sentinel) |
| +120--128 | ptr[2] | List next/prev 2 |
| +144 | int64 | Data field |
| +152 | byte | Changed flag (for fixpoint termination) |
New PM (176 bytes): two embedded worklists and float thresholds at offsets +88 and +144 (value 1065353216 = 1.0f IEEE 754).
Differences from Upstream LLVM
| Aspect | Upstream LLVM sink | NVIDIA sinking2 |
|---|---|---|
| Alias analysis backend | MemorySSA | Custom AA layer (sub_13575E0) |
| Iteration strategy | Single pass | Fixpoint iteration |
| Texture awareness | None | 3-level configurable |
| Address space awareness | Generic | NVPTX-specific (AS 1,3,4,5) |
| Complexity limiter | None | sink-limit knob (default=20) |
| Intra-block reordering | No | Level >= 2 |
| Outside-only pattern | No | Level == 3 |
| Debug dump | Standard LLVM debug | dump-sink2 knob |
| Cost model | Boolean (profitable or not) | Float thresholds in New PM |
| Pipeline occurrences | 1 | 3--4 (multi-run strategy) |
| Fast-compile variant | Same pass | Dedicated fast=1 mode |
Diagnostic Strings
| String | Context |
|---|---|
| "llvm::Sinking2Pass]" | RTTI name at sub_2315E20 |
| "sink2" | Pipeline parser ID |
| "Code sinking" | Display name (shared with stock LLVM sink) |
| "sinking2" | New PM pipeline string match |
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| -- | sub_1CC7010 | -- | Legacy PM pass registration |
| -- | sub_1CC7100 | -- | Legacy PM factory |
| -- | sub_1CC71E0 | -- | Legacy PM alternate factory |
| -- | sub_1CC7510 | 16KB | processInstruction: sink candidate evaluation, use-chain walk, LCD computation |
| -- | sub_1CC8170 | 13KB | Dominance ordering: DFS numbering for block comparison |
| -- | sub_1CC8920 | 4KB | Alias checking helper: validates no conflicting memory accesses on path |
| -- | sub_1CC8CA0 | 6KB | Memory dependency helper: store-load forwarding, store ordering, volatile |
| -- | sub_1CC9110 | 22KB | Main worklist driver: fixpoint iteration over dominator tree |
| -- | sub_1CCA270 | -- | Legacy PM runOnFunction entry |
| -- | sub_2D1B410 | -- | New PM pass registration |
| -- | sub_2D1BC50 | -- | New PM factory |
| -- | sub_2D1C160 | 19KB | New PM run() entry |
| -- | sub_2D1CFB0 | 13KB | New PM core logic |
| -- | sub_2D1D770 | 7KB | New PM helper |
| -- | sub_2D1DCF0 | 7KB | New PM helper |
| -- | sub_2315E20 | -- | RTTI name printer |
| -- | 0x4F7750 | -- | Knob constructor (ctor_275) |
Related pipeline factories:
| Address | Role |
|---|---|
| sub_18B1DE0 | Sinking2 early-pipeline factory |
| sub_18B3080 | Sinking2 fast-mode factory (accepts fast flag parameter) |
| sub_1CC60B0 | NVVMSinking2 late-pipeline factory |
| sub_1A634D0 | Stock LLVM Sink legacy PM registration |
| sub_29776B0 | Stock LLVM Sink New PM registration |
| sub_1B51110 | Stock Sink core (51KB, creates .sink.split / .sink blocks) |
| sub_1869C50 | Stock Sink pipeline factory (called with params 1,0,1) |
Total code size: ~80KB (Legacy PM) + ~65KB (New PM) = ~145KB
GPU-Specific Motivation
Register pressure directly determines occupancy -- each additional live register per thread reduces the number of warps available for latency hiding, with discrete cliff boundaries where a single register can drop an entire warp group.
Sinking instructions closer to their uses shortens live ranges and reduces the peak number of simultaneously live registers. This is especially valuable for texture load sequences, which typically involve address computation (GEP chains, index arithmetic) that produces values consumed only at the texture fetch site. Without sinking, these intermediate values occupy registers across potentially many instructions, bloating register pressure unnecessarily.
The three-level sink-into-texture design reflects a graduated approach to this optimization: level 1 handles the common case (cross-block sinking), level 2 adds intra-block reordering for tighter packing, and level 3 (the default) handles the edge case where an instruction's only uses are in blocks other than where it is defined, enabling more aggressive motion.
The multi-run pattern (early Sinking2, post-peephole fast Sinking2, late NVVMSinking2) ensures that sinking opportunities created by other optimization passes are captured throughout the pipeline, rather than relying on a single sinking point that may miss opportunities not yet exposed.
Cross-References
- Dead Synchronization Elimination -- runs earlier, removes barriers that Sinking2 would otherwise treat as memory fences
- LICM -- counterpart: hoists loop-invariant code into preheaders; Sinking2 sinks address computation out of preheaders
- NVVMPeephole -- runs before late Sinking2, may create new sinking opportunities
- Rematerialization -- runs after all sinking; rematerialization + sinking together minimize register pressure (ptxas SinkRematEnable knob)
- MemorySpaceOpt -- changes address spaces, which affects sinking profitability
- NVVMPassOptions -- opts[1040] disables stock Sink; opts[2440] disables NVVMSinking2
- Register Allocation -- ultimate consumer of the register pressure reduction that sinking provides
- Optimization Levels -- Ofcmax runs only fast-mode Sinking2; O2/O3 run full multi-run pattern
Loop Index Split
loop-index-split is a loop transformation pass that splits or peels loops when a condition inside the loop body depends on the loop induction variable. The pass was originally part of upstream LLVM 2.x (circa 2008--2009) but was removed around LLVM 3.0 due to correctness concerns and limited applicability. NVIDIA revived and heavily modified it for CUDA workloads, where loops with index-dependent conditionals are extremely common -- boundary handling in stencil computations, tile edge processing, and index-based predication are pervasive GPU kernel patterns. The NVIDIA version is substantially more sophisticated than the original, implementing three distinct transformation modes with full SCEV-based analysis.
By eliminating index-dependent branches from loop bodies, the pass reduces warp divergence on NVIDIA GPUs. When threads in a warp take different paths through a branch, the GPU must serialize both paths (predicated execution or divergent branch), wasting throughput. Splitting the loop so that each resulting loop has a uniform body eliminates this divergence entirely within the split regions, restoring full SIMT efficiency.
Pipeline Position
| Field | Value |
|---|---|
| Pass name (pipeline) | loop-index-split |
| Display name | Index Split Loops |
| Pass type | LoopPass (NVIDIA-custom, revived from LLVM 2.x) |
| Class | llvm::LoopIndexSplitPass |
| Legacy PM registration | sub_1C76080 |
| New PM registration | sub_2CBEC60 |
| Pass ID | dword_4FBD4A8 / unk_4FBD4AC |
| New PM vtable | off_4A25510 |
Transformation Modes
The pass implements three transformation strategies, attempted in priority order. When the first applicable transformation is found, it is applied and the pass moves on.
Mode A: All-But-One Iteration Peel (processAllButOneIterationLoop)
When: The loop body contains a condition that is true for all iterations except exactly one (typically i == K for a constant K).
What: The pass peels the single exceptional iteration out of the loop and removes the condition from the remaining iterations.
Before:
for (i = 0; i < N; i++) {
if (i == K) special();
else normal();
}
After:
for (i = 0; i < K; i++) normal();
special();
for (i = K+1; i < N; i++) normal();
This eliminates the branch from both resulting loops entirely. On a GPU, this means warps executing the pre-K or post-K loops never diverge on this condition.
Implementation: sub_2CC3FF0 (13KB, New PM) / part of sub_1C77080 (46KB, Legacy PM).
Mode B: Only-One-Iteration Collapse (processOnlyOneIterationLoop)
When: The condition is true for exactly one iteration, and the loop body does nothing useful on other iterations.
What: The pass replaces the entire loop with a guarded single execution of the body.
Before:
for (i = 0; i < N; i++) {
if (i == K) doWork();
}
After:
if (K >= 0 && K < N) doWork();
This transforms an O(N) loop into O(1) code -- a dramatic optimization when the original loop's only purpose was to find and execute a single iteration.
Implementation: sub_2CC4A70 (19KB, New PM) / part of sub_1C77080 (46KB, Legacy PM).
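The Mode B rewrite is easy to check for semantic equivalence. The function names below are ours; `loopForm` counts how many times the guarded body would execute, which the guarded form computes directly:

```cpp
// Illustration of the Mode-B collapse: both forms agree for any K and N.
int loopForm(int K, int N) {
    int calls = 0;
    for (int i = 0; i < N; i++)
        if (i == K) calls++;              // stand-in for doWork()
    return calls;
}

int guardedForm(int K, int N) {
    return (K >= 0 && K < N) ? 1 : 0;     // O(1) replacement
}
```

The guard `K >= 0 && K < N` is what makes the replacement safe when K lies outside the iteration range, which is why the pass must prove or emit exactly this range check.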
Mode C: Range Split (processSplitRangeLoop)
When: The condition splits the iteration space into two contiguous ranges (e.g., i < M vs i >= M).
What: The pass splits the loop at the boundary point so each resulting loop has a simpler, branch-free body.
Before:
for (i = 0; i < N; i++) {
if (i < M) a(); else b();
}
After:
for (i = 0; i < min(M, N); i++) a();
for (i = M; i < N; i++) b();
This is the most common transformation for GPU boundary handling code, where the first/last few iterations of a tile perform padding or clamping.
Implementation: sub_2CC5900 (68KB, New PM) / sub_1C7B2C0 (84KB, Legacy PM). The loop cloning and rewiring logic is in sub_2CC1B10 (42KB), with split point computation in sub_2CC0040 and sub_2CC0CC0 (7KB each).
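The Mode C rewrite can likewise be checked for equivalence. This sketch is ours: the clamping via `max(M, 0)` models the bounds adjustment the pass must establish (the pseudocode above elides it for the common case `0 <= M`):

```cpp
#include <algorithm>
#include <utility>

// Illustration of the Mode-C range split: the fused loop and the two split
// loops perform the same number of a() and b() executions for any M, N.
std::pair<int,int> fused(int M, int N) {
    int a = 0, b = 0;
    for (int i = 0; i < N; i++) (i < M ? a : b)++;
    return {a, b};
}

std::pair<int,int> split(int M, int N) {
    int a = 0, b = 0;
    int lo = std::max(M, 0);                      // clamp split point at 0
    for (int i = 0; i < std::min(lo, N); i++) a++; // first range: i < M
    for (int i = lo; i < N; i++) b++;              // second range: i >= M
    return {a, b};
}
```
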
Algorithm Detail
The main driver (sub_2CC5900, 68KB) proceeds as follows:
- Verify loop structure: The loop must have exactly one exit, a preheader, a latch block, and an identifiable header.
- Initialize SCEV analysis: Obtains the ScalarEvolution result for the loop to identify the induction variable and compute trip counts.
- Find the induction variable and exit condition from the loop's back-edge.
- Scan the loop body for ICmp or Select instructions that compare the IV against a loop-invariant value.
- Validate the comparison uses constant integer bounds (checked via APInt extraction at multiple points).
- Safety checks (lines 760--830 of sub_2CC5900) -- iterate all loop BBs, checking each instruction:
  - Opcode 85 (Call): reject if callee may have side effects
  - Opcodes 34--85: checked against bitmask 0x8000000000041 for safe operations
  - Store instructions: checked for non-interference with the split
  - No volatile loads permitted
  - No memory operations that prevent reordering
- Determine which transformation applies:
  - Try processAllButOneIterationLoop first
  - Try processOnlyOneIterationLoop second
  - Fall back to processSplitRangeLoop
- For range splits: Compute the split point, clone the loop (including all basic blocks, PHI nodes, and branch conditions), adjust iteration bounds, and rewire predecessors/successors.
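The bitmask screen decodes mechanically: 0x8000000000041 has bits 0, 6, and 51 set, corresponding to opcodes 34, 40, and 85 in the 34--85 range. A sketch of the test as we read it from the decompilation (the opcode numbering is LLVM-internal, and Call -- opcode 85 -- additionally undergoes the callee side-effect check noted above):

```cpp
#include <cstdint>

// Opcodes 34..85 are screened by testing bit (opcode - 34) of the mask.
constexpr uint64_t kSafeOpcodeMask = 0x8000000000041ULL; // bits 0, 6, 51

bool isSafeOpcode(int opcode) {
    if (opcode < 34 || opcode > 85) return false;  // outside screened range
    return (kSafeOpcodeMask >> (opcode - 34)) & 1; // bit set => safe
}
```
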
Comparison Classifiers
Four small functions classify how the ICmp operands relate to the induction variable:
| Function | Purpose |
|---|---|
sub_2CBED80 | Determine which operand is the IV |
sub_2CBED00 | Determine which operand is the bound |
sub_2CBEE00 | Classify comparison direction (ascending/descending) |
sub_2CBEE80 | Extended classification for range splits |
Legality Validation
| Function | Size | Purpose |
|---|---|---|
sub_2CBFC80 | — | Validate split is legal (check exit conditions) |
sub_2CBF770 | — | Validate loop structure for splitting |
sub_2CBF180 | — | Create new loop preheader for split result |
Diagnostic Strings
Diagnostic strings recovered from p2b.4-5-sinking2-loopindexsplit.txt. The pass emits optimization remarks via the standard LLVM OptimizationRemark system.
| String | Source | Category | Trigger |
|---|---|---|---|
"LoopIndexSplit: performed processAllButOneIterationLoop" | sub_2CC3FF0 (New PM) / sub_1C77080 (Legacy PM) | Remark | Mode A transformation applied: single exceptional iteration peeled |
"LoopIndexSplit: performed processOnlyOneIterationLoop" | sub_2CC4A70 (New PM) / sub_1C77080 (Legacy PM) | Remark | Mode B transformation applied: entire loop replaced with guarded single body |
"LoopIndexSplit: performed processSplitRangeLoop" | sub_2CC5900 (New PM) / sub_1C7B2C0 (Legacy PM) | Remark | Mode C transformation applied: loop split at range boundary |
"Index Split Loops" | sub_1C76080 / sub_2CBEC60 | Registration | Display name used in both Legacy PM and New PM pass registration |
"loop-index-split" | Pipeline parser (sub_2377300 line 3768, sub_2368220 line 5081) | Registration | Pipeline ID string (16 characters) |
"LoopSplitIndex" / "LoopIndexSplit" | Remark infrastructure | Remark tag | Optimization remark tag names (both variants observed in binary) |
Configuration Knobs
No dedicated cl::opt knobs were found for LoopIndexSplit. The pass is enabled or disabled at the pipeline level via the pass name loop-index-split in the pipeline string or by including/excluding it during pipeline assembly. It can also be controlled by the global pass-control and disable-passno mechanisms.
Analysis Dependencies
| Legacy PM | New PM | Purpose |
|---|---|---|
| DominatorTreeWrapperPass (sub_15CD350) | DominatorTreeAnalysis (sub_D4AA90) | Dominance checks for loop cloning |
| LoopInfoWrapperPass (sub_13FBE20) | LoopAnalysis (sub_B1A2E0) | Loop structure and nesting |
| ScalarEvolutionWrapperPass (sub_1AE1AE0) | ScalarEvolutionAnalysis (sub_11CDF60) | IV identification, trip count, range proofs |
| LoopAccessAnalysis (sub_1AF93A0) | LoopAccessAnalysis (sub_F67EE0) | Memory dependence in loops |
SCEV is the critical dependency: it provides induction variable identification, trip count computation, and the mathematical proofs needed to establish that split points are correct and that bounds do not overflow.
Pass Object Layout
Legacy PM: 80-byte pass descriptor.
New PM: 176-byte pass object with embedded worklists and float thresholds. Key fields during execution:
| Offset (QWORDs) | Content |
|---|---|
| 0 | Vtable / loop pointer |
| 1--3 | Sub-loop tracking |
| 4 | Sinkable instruction count |
| 5 | Exit condition block |
| 6 | Split condition (ICmp/FCmp instruction) |
| 7 | Loop bound (lower) |
| 8 | Loop bound (upper) |
| 9 | Split instruction |
| 10 | Instruction counter / worklist |
| 11--13 | DenseSet for tracking visited blocks |
| 14 | Iteration counter |
| 18--24 | Computed values (preheader, header, latch, exitBB, etc.) |
| 25 | SCEV analysis result pointer |
| 26 | New loop blocks array (for split range) |
Function Map
New PM Implementation
| Function | Address | Size | Role |
|---|---|---|---|
| -- | 0x2CBEC60 | — | New PM pass registration |
| -- | 0x2CBFF20 | — | New PM factory |
| -- | 0x2CC3FF0 | 13KB | processAllButOneIterationLoop (Mode A) |
| -- | 0x2CC4A70 | 19KB | processOnlyOneIterationLoop (Mode B) |
| -- | 0x2CC5900 | 68KB | Main driver + processSplitRangeLoop (Mode C) |
| -- | 0x2CC1B10 | 42KB | Loop cloning and CFG rewiring |
| -- | 0x2CC0040 | 7KB | Split boundary computation |
| -- | 0x2CC0CC0 | 7KB | Alternate split boundary computation |
| -- | 0x2CC9AA0 | 18KB | Helper |
| -- | 0x2CCB3B0 | 25KB | Helper |
| -- | 0x2CCCE20 | 13KB | Helper |
| -- | 0x2CCDD70 | 15KB | Helper |
| -- | 0x2CCED30 | 8KB | Helper |
| -- | 0x2CCF450 | 57KB | Large helper / alternate path |
| -- | 0x2CBED80 | — | Comparison classifier (IV operand) |
| -- | 0x2CBED00 | — | Comparison classifier (bound operand) |
| -- | 0x2CBEE00 | — | Comparison direction classifier |
| -- | 0x2CBEE80 | — | Extended comparison classifier |
| -- | 0x2CBFC80 | — | Split legality validation |
| -- | 0x2CBF770 | — | Loop structure validation |
| -- | 0x2CBF180 | — | Create new preheader |
Legacy PM Implementation
| Function | Address | Size | Role |
|---|---|---|---|
| -- | 0x1C76080 | — | Legacy PM pass registration |
| -- | 0x1C76180 | — | Legacy PM factory |
| -- | 0x1C76260 | — | Alternate factory |
| -- | 0x1C76340 | 7KB | Hash table management for visited set |
| -- | 0x1C768C0 | 4KB | Helper |
| -- | 0x1C76B50 | 4KB | Block cloning helper |
| -- | 0x1C76EB0 | 2.5KB | Recursive loop tree walker |
| -- | 0x1C77080 | 46KB | processAllButOneIterationLoop + processOnlyOneIterationLoop |
| -- | 0x1C797A0 | 15KB | Split legality checking |
| -- | 0x1C7A300 | 21KB | Loop body cloning |
| -- | 0x1C7B2C0 | 84KB | processSplitRangeLoop + main driver |
Total code size: ~180KB (Legacy PM) + ~260KB (New PM) = ~440KB. This is one of the largest individual passes in cicc.
GPU-Specific Motivation
Index-dependent conditionals inside loops are ubiquitous in GPU kernels:
- Boundary handling: Threads at tile edges must check whether their index falls within the valid data range, leading to if (threadIdx.x + blockIdx.x * blockDim.x < N) patterns inside processing loops.
- Stencil codes: Halo region processing requires different behavior for the first and last few iterations of a tile.
- Reduction patterns: The final iteration of a reduction loop often has special aggregation logic.
- Predicated execution: CUDA warp-level programming frequently uses index-based predicates to assign work to specific lanes.
Each of these patterns introduces a branch that causes warp divergence: threads in the same warp take different paths, forcing the GPU to serialize both sides. By splitting the loop at the index boundary, the pass ensures that within each resulting loop, all threads in a warp execute the same path. This eliminates divergence entirely within the split regions, recovering full SIMT throughput.
The pass's large code size (~440KB) reflects the complexity of correct loop cloning on GPU IR, where PHI nodes, memory dependencies, and SCEV invariants must all be preserved across the transformation.
Branch Distribution (Dead Synchronization Elimination)
Despite its name, the branch-dist pass does not distribute or restructure branches. It is a GPU-specific dead synchronization elimination pass that removes __syncthreads() barriers and fence intrinsics when no actual memory hazard exists across the barrier boundary. In CUDA kernels, programmers often insert barriers conservatively to guarantee correctness, but many of these barriers protect code regions that have no conflicting read/write patterns on shared or global memory. Removing them eliminates warp serialization points and reduces the latency cost of unnecessary thread coordination.
The pass works by classifying every instruction in the function as a shared/global memory read, a write, or neither. It then propagates this information through the control flow graph using a standard dataflow fixed-point iteration. For each synchronization instruction, it examines the memory access patterns above and below the barrier; if no read-after-write, write-after-read, or write-after-write hazard exists, the barrier is dead and is deleted. Because removing one barrier may expose others as redundant, the entire analysis restarts after each deletion until no more dead barriers remain.
Pipeline Position
| Field | Value |
|---|---|
| Pass name | branch-dist |
| Pass type | FunctionPass (NVIDIA-custom, not in upstream LLVM) |
| Registration | New PM #377, line 2217 in sub_2342890 |
| Runtime positions | Tier 1/2/3 #78, #82 (NVVMBranchDist via sub_1CB73C0, gated by !opts[2080] && !opts[2120]); see Pipeline |
| Core function | sub_1C47810 (2357 lines) |
| Pass wrapper | sub_1C49D10 (179 lines) |
| Knob constructor | ctor_525_0 at 0x563730 (493 lines) |
| Global enable flag | byte_4FBB6C0 (initialized to 0 in ctor_261) |
The pass runs during the NVIDIA IR optimization pipeline. The global enable flag at byte_4FBB6C0 is set by the pipeline setup when appropriate for the current optimization level.
IR Before/After Example
The pass removes __syncthreads() barriers that protect no actual shared/global memory hazard.
Before (conservative barrier placement):
define void @kernel(ptr addrspace(3) %smem) {
entry:
%x = add i32 %tid, 1 ; pure register computation
%y = mul i32 %x, 42 ; pure register computation
call void @llvm.nvvm.barrier0() ; __syncthreads() -- no shared/global R/W above
%z = add i32 %y, %x ; pure register computation
ret void
}
After (dead barrier removed):
define void @kernel(ptr addrspace(3) %smem) {
entry:
%x = add i32 %tid, 1
%y = mul i32 %x, 42
; barrier removed: no shared/global reads or writes above or below
%z = add i32 %y, %x
ret void
}
When the dataflow analysis determines that neither side of the barrier accesses shared or global memory, the barrier is dead and removed. The pass restarts after each removal since deleting one barrier may expose another as redundant.
Algorithm
Phase 1: Instruction Classification (sub_1C46330)
The classifier (sub_1C45690, 117 lines) examines each instruction's opcode byte at offset +16 and determines whether it reads or writes shared/global memory:
| Opcode | Hex | Meaning | Action |
|---|---|---|---|
| 0x36 | '6' | Load | Check address space; mark as read if shared/global |
| 0x37 | '7' | Store | Check address space; mark as write |
| 0x3A | ':' | Memory op | Check address space |
| 0x3B | ';' | Memory op | Check address space |
| 0x4E | 'N' (78) | Call | Complex analysis: filter sync intrinsics, check callee attributes |
The classifier is invoked twice per basic block:
- Forward scan (a3=1): iterates from the last instruction backward to the first sync instruction. Everything after the sync is classified as "above" the barrier.
- Backward scan (a3=0): iterates from the first instruction forward to the first sync instruction. Everything before the sync is classified as "below" the barrier.
This produces four boolean flags per block, stored in red-black tree maps: reads_above, writes_above, reads_below, writes_below.
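The opcode dispatch can be sketched as a simple classifier. This is our reconstruction: the `RW` record is invented, and treating the 0x3A/0x3B memory ops as both read and write is an assumption (the table above only says their address space is checked); calls are handled by the separate complex-analysis path.

```cpp
// Map an instruction's opcode byte (offset +16) to its read/write effect
// on shared/global memory; local/private accesses never count.
struct RW { bool read = false, write = false; };

RW classify(unsigned char opcode, bool sharedOrGlobal) {
    RW rw;
    if (!sharedOrGlobal) return rw;          // per-thread memory: no effect
    switch (opcode) {
    case 0x36: rw.read = true; break;        // Load
    case 0x37: rw.write = true; break;       // Store
    case 0x3A:                               // other memory ops: assumed
    case 0x3B: rw.read = rw.write = true; break; // conservative (both)
    default: break;                          // calls handled separately
    }
    return rw;
}
```
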
Phase 2: CFG Propagation (sub_1C46620)
A classic dataflow fixed-point iteration propagates memory access information through successor edges. For each basic block, the read/write flags from its successors' "below" maps are OR-combined into the current block's "above" maps. The iteration repeats until no flags change (convergence). This ensures that a barrier's necessity accounts for memory accesses reachable through any control flow path, not just the local block.
The branch-dist-norm knob modifies the dataflow meet operator: the default (0) uses OR-propagation (conservative), while a non-zero value likely switches to AND-normalization (more aggressive, requiring all paths to access memory before considering a sync necessary).
Phase 3: Dead Sync Identification and Removal
After propagation, the main function (sub_1C47810) iterates over all blocks and instructions. For each synchronization intrinsic, it looks up the four per-instruction flags:
ra = inst_read_above[I] wa = inst_write_above[I]
rb = inst_read_below[I] wb = inst_write_below[I]
A sync is dead (removable) when any of these conditions holds:
| Condition | Meaning |
|---|---|
| !ra && !wa | Nothing above the barrier accesses shared/global memory |
| !rb && !wb | Nothing below the barrier accesses shared/global memory |
| !ra && !wb | No read-after-write or write-after-write hazard |
| !wa && !rb | No write-after-read or write-after-write hazard |
When a sync is removed, the pass calls sub_15F20C0 to delete it from the IR, then restarts the entire algorithm (goto LABEL_2). This restart is necessary because removing one barrier may cause another to become dead.
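The removability test transcribes directly from the four conditions in the table above (only the function name is ours):

```cpp
// A sync is dead when any one of the four hazard-absence conditions holds.
bool syncIsDead(bool ra, bool wa, bool rb, bool wb) {
    return (!ra && !wa)    // nothing above touches shared/global memory
        || (!rb && !wb)    // nothing below touches shared/global memory
        || (!ra && !wb)    // no read-after-write / write-after-write hazard
        || (!wa && !rb);   // no write-after-read / write-after-write hazard
}
```
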
Special Cases
Barrier variants that carry data -- __syncthreads_count, __syncthreads_and, __syncthreads_or (intrinsic IDs 3734--3736) -- are explicitly excluded from removal. Their return values encode lane participation information, so they cannot be elided even when no memory hazard exists.
Address Space Filtering
The pass only considers memory accesses to shared and global address spaces as relevant for synchronization. The address space check in sub_1C45690:
- Address space IDs <= 0x1FF (511) or in the 0x300 range: considered local/private -- do not require synchronization.
- Address space IDs > 511 and not in the 0x3xx range: considered shared/global -- these are the accesses that justify keeping a barrier.
This distinction is critical: local memory is per-thread and never visible to other threads in the warp, so barriers protecting only local accesses are always dead.
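The screen reduces to a small predicate. This is a sketch under our reading of the decompilation: we interpret "the 0x3xx range" as 0x300--0x3FF, which is an assumption about the exact mask used.

```cpp
// True iff an access in this address space can justify keeping a barrier.
bool requiresSync(unsigned addrSpace) {
    if (addrSpace <= 0x1FF) return false;             // local/private
    if ((addrSpace & ~0xFFu) == 0x300) return false;  // 0x300..0x3FF: local
    return true;                                      // shared/global
}
```
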
Intrinsic Classification
Two predicates classify synchronization-related intrinsics:
sub_1C301F0 (is-sync-intrinsic): Returns true for intrinsic IDs representing barrier operations:
| ID | Likely Mapping |
|---|---|
| 34 | llvm.nvvm.barrier0 (basic __syncthreads) |
| 3718--3720 | barrier.sync / bar.warp.sync variants |
| 3731--3736 | __syncthreads_count/and/or, bar.arrive |
sub_1C30240 (is-fence-intrinsic): Returns true for IDs 4046 and 4242, which are memory fence/membar intrinsics. These are excluded from the sync test -- they impose memory ordering but are not full barriers that can be elided by this pass.
Configuration Knobs
All registered in ctor_525_0 at 0x563730. All are cl::opt<> with hidden visibility.
| Knob | Type | Default | Description |
|---|---|---|---|
| dump-branch-dist | bool | false | Emit diagnostic output on each removed sync |
| ignore-call-safety | bool | true | Treat function calls as non-memory-accessing (aggressive) |
| ignore-variance-cond | int | 0 | Ignore warp divergence on branch conditions |
| ignore-address-space-check | int | 0 | Treat all memory accesses as requiring sync (conservative) |
| ignore-phi-overhead | int | 0 | Ignore PHI node overhead from sync removal in cost model |
| disable-complex-branch-dist | int | 0 | Disable inter-block CFG propagation (Phase 2) |
| no-branch-dist | string | (empty) | Comma-separated list of function names to skip |
| branch-dist-func-limit | int | -1 | Max functions to process (-1 = unlimited) |
| branch-dist-block-limit | int | -1 | Max blocks per function (-1 = unlimited) |
| branch-dist-norm | int | 0 | Dataflow meet operator mode (0 = OR, non-zero = AND) |
The default for ignore-call-safety is notably true (aggressive): device function calls are assumed not to access shared/global memory unless proven otherwise. This is reasonable for typical CUDA kernels where helper functions operate on registers and local memory.
Diagnostic Strings
Diagnostic strings recovered from p2b.3-01-branchdist.txt. All runtime diagnostics are gated by the dump-branch-dist knob (default false).
| String | Source | Category | Trigger |
|---|---|---|---|
| "[filename:line] Removed dead synch: Read above: X, Write above: Y, Read below: Z, Write below: W in function NAME" | sub_1C47810 phase 3 | Debug | dump-branch-dist enabled and a barrier is removed; prints the four read/write flags and the function name |
| "Dump information from Branch Distribution" | ctor_525_0 at 0x563730 | Knob | dump-branch-dist knob description |
| "Ignore calls safety in branch Distribution" | ctor_525_0 | Knob | ignore-call-safety knob description |
| "Ignore variance condition in branch Distribution" | ctor_525_0 | Knob | ignore-variance-cond knob description |
| "Ignore address-space checks in branch Distribution" | ctor_525_0 | Knob | ignore-address-space-check knob description |
| "Ignore the overhead due to phis" | ctor_525_0 | Knob | ignore-phi-overhead knob description |
| "Disable more complex branch Distribution" | ctor_525_0 | Knob | disable-complex-branch-dist knob description |
| "Do not do Branch Distribution on some functions" | ctor_525_0 | Knob | no-branch-dist knob description (value format: "function1,function2,...") |
| "Control number of functions to apply" | ctor_525_0 | Knob | branch-dist-func-limit knob description |
| "Control number of blocks to apply" | ctor_525_0 | Knob | branch-dist-block-limit knob description |
| "Control normalization for branch dist" | ctor_525_0 | Knob | branch-dist-norm knob description |
Data Structures
The pass allocates a large state object (~696 bytes, 87 QWORDs) containing 13 red-black tree maps organized in three tiers:
| Maps | Keys | Values | Purpose |
|---|---|---|---|
| a1[3..14] (2 maps) | Block pointer | bool | Has-sync-above/below per block |
| a1[15..38] (4 maps) | Block pointer | bool | Propagated read/write above/below (Phase 2 output) |
| a1[39..62] (4 maps) | Block pointer | bool | Initial read/write above/below (Phase 1 output) |
| a1[63..86] (4 maps) | Instruction pointer | bool | Per-instruction read/write above/below (Phase 3) |
All maps are std::map-like red-black trees with 48-byte nodes (left/right/parent pointers + key + 1-byte boolean value at offset 40). Tree operations are implemented in sub_1C46280 (insert-or-find for block maps), sub_1C47760 (insert-or-find for instruction maps), sub_1C45B10 (erase), and sub_1C45C70/sub_1C45940 (recursive destructors).
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| -- | 0x1C47810 | 2357L | Core algorithm: classify + propagate + remove |
| -- | 0x1C49D10 | 179L | Pass wrapper: init state, call core, cleanup |
| -- | 0x1C46330 | 197L | Phase 1: forward/backward instruction scan |
| -- | 0x1C46620 | 1157L | Phase 2: CFG successor propagation (fixed-point) |
| -- | 0x1C45690 | 117L | Instruction classifier: determines R/W flags |
| -- | 0x1C458C0 | 28L | Helper: classify all instructions in a block |
| -- | 0x1C46280 | 38L | Map insert-or-find (block-level maps) |
| -- | 0x1C47760 | 37L | Map insert-or-find (instruction-level maps) |
| -- | 0x1C475C0 | 43L | Map lower_bound lookup |
| -- | 0x1C47660 | 50L | Map find with hint |
| -- | 0x1C45B10 | 113L | Map erase operation |
| -- | 0x1C45C70 | 133L | Tree destructor (recursive free) |
| -- | 0x1C45940 | 133L | Tree destructor (recursive free, alt type) |
| -- | 0x1C301F0 | 15L | Is-sync-intrinsic predicate |
| -- | 0x1C30240 | 13L | Is-fence-intrinsic predicate |
| -- | 0x563730 | 493L | CLI knob registration (ctor_525_0) |
Common Pitfalls
These are mistakes a reimplementor is likely to make when building an equivalent dead barrier elimination pass using CFG dataflow.
1. Using address-level tracking instead of boolean per-category flags. The pass tracks four boolean flags per block (reads_above, writes_above, reads_below, writes_below) for shared/global memory, not specific addresses. A reimplementation that attempts to track precise addresses ("smem[0] is only written above, smem[1] is only read below") will appear to find more dead barriers but is fundamentally unsound for GPU execution. Different threads access different addresses through the same pointer expression (smem[tid] vs smem[tid-1]), making address-based alias analysis across threads impossible at compile time. The boolean-per-category approach is the correct conservative abstraction.
2. Not excluding __syncthreads_count/and/or (IDs 3734--3736) from removal. These barrier variants return a value that encodes lane participation information (__syncthreads_count returns the number of threads that passed a non-zero predicate). Even when no memory hazard exists across the barrier, the return value carries data that the program depends on. A reimplementation that removes these barriers based solely on memory analysis will break programs that use the return value for algorithmic purposes (e.g., warp-level voting patterns, early-exit counting).
3. Treating the ignore-call-safety default as conservative. The default for ignore-call-safety is true (aggressive): function calls are assumed not to access shared/global memory. This is correct for typical CUDA helper functions that operate on registers and local memory, but a reimplementation that uses false as the default will retain nearly all barriers in code that calls device functions, defeating the optimization. Conversely, a reimplementation that uses true but does not also check the callee's isSharedMemoryAccess attribute when available will miss cases where a called function does access shared memory through a pointer argument.
4. Not restarting the analysis after removing a barrier. The pass restarts from Phase 1 (goto LABEL_2) after each barrier deletion because removing one barrier merges the regions it separated, potentially exposing adjacent barriers as dead. A reimplementation that collects all dead barriers in one pass and removes them simultaneously will miss cascading redundancies. Worse, it may remove barriers in the wrong order: if barrier B2 is dead only because barrier B1 separates it from a hazard, removing both simultaneously removes B1's protection while the hazard still exists.
5. Conflating address space filtering with memory visibility. The pass considers only shared and global memory accesses (address spaces > 511 and not in the 0x3xx range) as relevant for barrier justification. Local/private memory (per-thread, invisible to other threads) is correctly excluded. A reimplementation that includes local memory accesses in the analysis will never remove any barrier in code that uses local arrays, since every function with local variables would show "read+write above and below." The address space filter is essential for the optimization to have any effect.
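The filter described in pitfall 5 can be written as a one-line predicate. This is a hedged reconstruction from the text above; the exact boundary checks in the binary may differ.

```python
def is_hazard_relevant(addrspace: int) -> bool:
    """Sketch of the branch-dist address-space filter: only accesses in
    spaces numbered above 511, excluding the 0x3xx band, can justify a
    barrier (shared/global); everything else is thread-private."""
    return addrspace > 511 and not (0x300 <= addrspace <= 0x3FF)

assert not is_hazard_relevant(0)      # generic/local: never justifies a barrier
assert not is_hazard_relevant(0x345)  # 0x3xx band: excluded by the filter
assert is_hazard_relevant(0x400)      # above 511 and outside the band: relevant
```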
GPU-Specific Motivation
On NVIDIA GPUs, __syncthreads() forces all threads in a thread block to reach the barrier before any can proceed. This is one of the most expensive control flow operations in CUDA -- it serializes warp execution and creates a pipeline stall. In practice, CUDA programmers insert barriers conservatively (every shared memory access pattern gets a barrier "just in case"), leading to significant over-synchronization. This pass recovers the performance lost to unnecessary barriers by proving, through static dataflow analysis, that specific barriers protect no actual memory hazard.
The ignore-variance-cond knob connects to warp divergence analysis: when a branch condition is provably uniform (all lanes take the same path), synchronization across that branch is trivially unnecessary regardless of memory access patterns. This is a common case in well-structured CUDA code where control flow depends on blockIdx or compile-time constants.
Dead Barrier Elimination
CICC contains three independent passes that eliminate redundant __syncthreads() barriers from CUDA kernels. This page documents the lightweight basic-dbe pass -- a single-pass, intra-block pattern matcher that removes trivially dead barriers without dataflow analysis. The two heavyweight engines are covered on their own pages: Dead Synchronization Elimination (sub_2C84BA0, 96KB, full bidirectional fixed-point dataflow) and Branch Distribution (sub_1C47810, 63KB, NVVM-IR-level fixed-point with restart). All three target the same goal -- eliminating barriers that provably do not order any memory hazard -- but at different cost/precision tradeoffs.
Key Facts: basic-dbe
| Property | Value |
|---|---|
| Pass name | basic-dbe |
| Class | llvm::BasicDeadBarrierEliminationPass |
| Scope | Function pass (LLVM IR level) |
| Registration | New PM #376, line 2212 in sub_2342890 (first NVIDIA function pass registered) |
| Runtime positions | Inserted via pipeline extension callbacks; not in the Tier 0/1/2/3 tables (see Pipeline) |
| Parameters | None (non-parameterized pass) |
| Knob constructor | ctor_261 (below 5KB, in 0x4F0000--0x51FFFF range) |
| Enable global | byte_4FBB6C0 (initialized to 0 in ctor_261, set to 1 by pipeline setup) |
| Binary size | Small (< 5KB compiled) |
| Upstream equivalent | None -- entirely NVIDIA-proprietary |
Why a Lightweight Pass Exists
The full dead synchronization elimination engine at sub_2C84BA0 is 96KB of code implementing bidirectional fixed-point dataflow with complete restart after each removal. That is expensive. For the common cases -- consecutive barriers with no intervening memory operations, barriers at function entry/exit with no shared memory traffic in the block, or barriers immediately followed by another barrier -- the heavyweight engine is overkill.
basic-dbe exists as a cheap pre-filter: it handles the trivially dead cases in a single linear scan per function, eliminating the low-hanging fruit before the full engine (if scheduled) performs its expensive inter-block analysis. By removing obvious dead barriers early, basic-dbe also reduces the iteration count of the heavyweight pass, since fewer barriers remain for it to analyze.
Algorithm
basic-dbe operates as a single-pass function pass with no dataflow propagation, no fixed-point iteration, and no restart-on-removal. It scans each basic block once and applies local pattern matching to identify barriers that are trivially dead.
Barrier Identification
The pass reuses the same barrier predicate logic as the full engine. An instruction is a synchronization barrier if all of the following hold:
- Opcode == 85 (internal call opcode for intrinsics)
- The callee pointer at offset -32 is non-null
- The callee's byte at offset 0 == 0 (intrinsic, not user-defined function)
- The `convergent` attribute flag (bit `0x20` at byte +33) is set
- `sub_CEA1A0(callee.field[36])` confirms the intrinsic ID falls within the known barrier ID range
This is the same check implemented by sub_2C83D20 in the full engine.
Elimination Patterns
basic-dbe identifies four categories of trivially dead barriers, all detectable without inter-block analysis:
Pattern 1: Consecutive Barriers
Two or more __syncthreads() calls with no intervening instructions (or only non-memory instructions between them). The second and subsequent barriers are redundant because the first already forces all threads to synchronize.
; Before basic-dbe:
call void @llvm.nvvm.barrier0() ; barrier A
call void @llvm.nvvm.barrier0() ; barrier B -- DEAD (consecutive)
; After basic-dbe:
call void @llvm.nvvm.barrier0() ; barrier A retained
Pattern 2: Barrier in Empty Block
A basic block whose only non-terminator instructions are barriers and non-memory operations (debug info, metadata). If no instruction in the block reads or writes shared/global memory, every barrier in the block is dead -- there is nothing to order.
; Before basic-dbe:
bb_empty:
call void @llvm.nvvm.barrier0() ; DEAD -- no memory ops in block
br label %bb_next
; After basic-dbe:
bb_empty:
br label %bb_next
Pattern 3: Barrier at Function Entry
A barrier at the start of a kernel (or device function) with no memory operations between function entry and the barrier. Since no thread has performed any shared memory access yet, the barrier orders nothing.
Pattern 4: Barrier Before Return
A barrier immediately before a return with no memory operations between the barrier and the function exit. The barrier would order accesses that have already been performed, but since no subsequent access follows, no hazard exists in the forward direction.
Pseudocode
function BasicDeadBarrierEliminationPass::run(F):
if not byte_4FBB6C0: // global enable flag
return PreservedAnalyses::all()
changed = false
for each BB in F:
barriers = []
has_memory_op = false
for each inst in BB:
if isSyncBarrier(inst):
if not has_memory_op:
// Pattern 2/3: barrier with no preceding memory op
// Also handles Pattern 1: consecutive barriers
// (first barrier is not a memory op, so second is dead)
mark inst for deletion
changed = true
else:
barriers.append(inst)
has_memory_op = false // reset for next segment
else if classifyMemoryAccess(inst) has read or write:
has_memory_op = true
// Pattern 4: check trailing barrier before terminator
if not barriers.empty() and not has_memory_op:
mark barriers.back() for deletion
changed = true
// Delete all marked instructions
for each marked inst:
inst.eraseFromParent()
if changed:
return PreservedAnalyses::none() // IR modified
else:
return PreservedAnalyses::all()
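The pseudocode above can be exercised as a small runnable sketch. This is a toy model, not the recovered code: `Inst.kind` stands in for the opcode classification, and the enable-flag check is omitted.

```python
from dataclasses import dataclass

@dataclass
class Inst:
    kind: str  # "barrier", "mem", or "other" -- toy stand-in for opcodes

def basic_dbe_block(block):
    """Single-block scan mirroring the pseudocode: a barrier with no
    memory op since the previous barrier (or block entry) is dead, and
    a kept barrier with no memory op before block exit is dead too."""
    dead, kept_barriers = set(), []
    has_memory_op = False
    for i, inst in enumerate(block):
        if inst.kind == "barrier":
            if not has_memory_op:
                dead.add(i)              # Patterns 1/2/3
            else:
                kept_barriers.append(i)
            has_memory_op = False        # a new segment starts after any barrier
        elif inst.kind == "mem":
            has_memory_op = True
    if kept_barriers and not has_memory_op:
        dead.add(kept_barriers[-1])      # Pattern 4: barrier before exit
    return [inst for i, inst in enumerate(block) if i not in dead]

# Consecutive barriers: only the first survives.
bb = [Inst("mem"), Inst("barrier"), Inst("barrier"), Inst("mem")]
out = basic_dbe_block(bb)
assert sum(i.kind == "barrier" for i in out) == 1
```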
The key design choice: basic-dbe treats each basic block as an isolated unit. It does not look at predecessor or successor blocks. This means it will miss cases where a barrier is dead because all reaching paths lack memory accesses -- those cases require the full inter-block dataflow of sub_2C84BA0 or sub_1C47810.
Memory Access Classification
Within the basic block scan, basic-dbe must determine which instructions constitute memory operations that could create cross-thread hazards. The classification mirrors the logic in sub_2C83AE0 (the full engine's classifier):
| Opcode | Value | Instruction | Classification |
|---|---|---|---|
| 61 | 0x3D | Store | Memory write |
| 62 | 0x3E | Load | Memory read |
| 65 | 0x41 | Atomic | Memory read + write |
| 66 | 0x42 | AtomicCmpXchg | Memory write |
| 85 | 0x55 | Call/Intrinsic | Read+Write if callee accesses shared/global memory |
Non-memory instructions (arithmetic, comparisons, PHI nodes, debug info, branches) do not set the has_memory_op flag.
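The classification can be captured as a small lookup. This is a sketch: the call case is reduced to a single callee-accesses-shared flag, whereas the binary consults `sub_B49E00` and callee attributes.

```python
# Opcode -> (reads, writes), per the table above.
CLASSIFY = {
    61: (False, True),   # Store
    62: (True,  False),  # Load
    65: (True,  True),   # Atomic
    66: (False, True),   # AtomicCmpXchg
}

def classify_memory_access(opcode, callee_touches_shared=False):
    if opcode == 85:  # Call/Intrinsic: depends on the callee
        return (callee_touches_shared, callee_touches_shared)
    return CLASSIFY.get(opcode, (False, False))  # non-memory: no flags set

assert classify_memory_access(62) == (True, False)       # Load reads
assert classify_memory_access(85, True) == (True, True)  # call touching shared mem
assert classify_memory_access(17) == (False, False)      # e.g. arithmetic
```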
The byte_4FBB6C0 Enable Flag
The global byte at byte_4FBB6C0 serves as a shared enable flag initialized to 0 in ctor_261. The pipeline setup code sets it to 1 when the optimization level and target configuration warrant running barrier elimination. This same flag gates branch-dist (sub_1C49D10 checks it before invoking sub_1C47810), confirming that ctor_261 initializes shared state for the barrier elimination subsystem as a whole, not just basic-dbe.
Relationship to Other Dead-Sync Passes
CICC's three barrier elimination passes form a layered strategy:
| Property | basic-dbe | branch-dist | Dead Sync Elimination |
|---|---|---|---|
| Entry point | llvm::BasicDeadBarrierEliminationPass | sub_1C47810 | sub_2C84BA0 |
| PM slot | 376 (New PM function pass) | 377 (New PM function pass) | None (module-level caller) |
| Scope | Intra-block only | Inter-block (CFG propagation) | Inter-block (full restart) |
| Dataflow | None (pattern match) | Fixed-point, 13 RB-tree maps | Fixed-point, 12 RB-tree maps |
| Restart on removal | No | Yes (goto LABEL_2) | Yes (goto LABEL_2) |
| IR level | LLVM IR (opcodes 61/62/65/66/85) | NVVM IR (opcodes 0x36/0x37/0x3A/0x3B/0x4E) | LLVM IR (opcodes 61/62/65/66/85) |
| Binary size | < 5KB | 63KB core + helpers | 96KB core + helpers |
| Knobs | byte_4FBB6C0 enable flag | 10 knobs (ctor_525) | None known (controlled by caller) |
| Complexity | O(n_instructions) | O(B * F * C) | O(B * F * C) |
| Typical runtime | Microseconds | Milliseconds | Milliseconds |
The intended execution order:
1. `basic-dbe` runs first in the function pass pipeline, eliminating trivially dead barriers in O(n) time.
2. `branch-dist` runs next (slot 377, immediately after basic-dbe at slot 376), performing full inter-block analysis on the reduced barrier set using NVVM IR opcodes.
3. Dead Sync Elimination (`sub_2C84BA0`) runs later from module-level callers (`sub_2C88020`, `sub_2C883F0`), performing the most aggressive analysis using LLVM IR opcodes with the element-size gate and special intrinsic ID handling.
Configuration
| Knob | Type | Default | Effect |
|---|---|---|---|
| byte_4FBB6C0 | bool (global) | 0 (disabled) | Master enable for basic-dbe and branch-dist |
No dedicated per-pass knobs (threshold, dump flags, or limits) have been identified for basic-dbe itself. The pass is controlled entirely by its enable flag. This is consistent with its role as a lightweight pre-filter -- there is nothing to tune.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| -- | sub_2342890 line 2212 | -- | New PM registration: maps "basic-dbe" to llvm::BasicDeadBarrierEliminationPass |
| -- | ctor_261 (0x4F range) | -- | Global constructor: initializes byte_4FBB6C0 to 0, registers basic-dbe knob string |
| -- | byte_4FBB6C0 | -- | Global enable flag (shared with branch-dist) |
| -- | sub_2C83D20 | -- | isSyncBarrier predicate (shared with full engine) |
| -- | sub_2C83AE0 | -- | classifyMemoryAccess (shared with full engine) |
| -- | sub_CEA1A0 | -- | Barrier intrinsic ID confirmation |
| -- | sub_B49E00 | -- | isSharedMemoryAccess -- CUDA address space check |
| -- | sub_B43D60 | -- | Instruction::eraseFromParent -- barrier deletion |
Cross-References
- Dead Synchronization Elimination -- the full 96KB bidirectional dataflow engine
- Branch Distribution -- the NVVM-IR-level dead-sync pass (63KB, 13 RB-tree maps)
- NVIDIA Custom Passes: Inventory -- registry entry
- LLVM Optimizer: Pipeline -- pipeline context showing `basic-dbe` at slot 376
- GPU Execution Model -- why `__syncthreads()` exists and when it matters
Dead Synchronization Elimination
The dead synchronization elimination engine at sub_2C84BA0 is the largest NVIDIA-custom pass in cicc at 96KB (~3,400 decompiled lines). It removes __syncthreads() barriers that provably do not order any memory hazard, reducing warp stall cycles in CUDA kernels without affecting correctness. The algorithm performs a bidirectional fixed-point dataflow analysis across the entire function's CFG, tracking four memory access categories per basic block through eight red-black tree maps. After convergence, it evaluates every barrier against the computed access sets and deletes those that protect no actual hazard. Each deletion triggers a full restart of the analysis, handling cascading redundancies at the cost of quadratic worst-case complexity.
This pass is distinct from the lightweight basic-dbe pass (slot 376, llvm::BasicDeadBarrierEliminationPass) and from the branch-dist pass. All three target dead barriers, but only this engine performs full inter-block dataflow with complete restart -- the other two handle simpler local or single-pass cases.
Key Facts
| Property | Value |
|---|---|
| Entry point | sub_2C84BA0 |
| Binary size | 96KB (~3,400 decompiled lines) |
| Pass type | Module-level NVIDIA custom (not registered in New PM) |
| Callers | sub_2C88020, sub_2C883F0, self-recursive |
| Barrier predicate | sub_2C83D20 |
| Access classifier | sub_2C83AE0 |
| Per-BB analysis | sub_2C84640 (bidirectional, parameterized by direction) |
| State object | 12 red-black tree maps at known offsets in a1 |
| Diagnostic | " Removed dead synch: " with per-category read/write counts |
| Upstream equivalent | None -- entirely NVIDIA-proprietary |
Five-Phase Algorithm
Phase 1: Barrier Identification (sub_2C83D20)
The helper sub_2C83D20 classifies whether a given instruction is a synchronization barrier. The check is a conjunction of five conditions:
function isSyncBarrier(inst) -> bool:
if inst.opcode != 85: // internal call opcode
return false
callee = inst.field[-32] // callee pointer at offset -32
if callee == null:
return false
if callee.byte[0] != 0: // byte 0 == 0 means intrinsic (not user-defined)
return false
if callee.field[24] != inst.field[80]: // scope match
return false
if !(callee.byte[33] & 0x20): // convergent attribute flag
return false
return CEA1A0(callee.field[36]) // confirm barrier intrinsic ID
The convergent attribute flag (bit 0x20 at byte+33) is the key discriminator. LLVM marks barrier intrinsics as convergent to prevent optimizations from moving them across control flow boundaries. The final sub_CEA1A0 call validates that the intrinsic ID falls within the known barrier ID range, distinguishing barriers from other convergent intrinsics (e.g., warp vote operations).
Phase 2: Memory Access Classification (sub_2C83AE0)
For every non-barrier instruction, sub_2C83AE0 determines whether it reads from or writes to memory that could create a hazard across a barrier. It outputs two boolean flags via pointer parameters a2 (read) and a3 (write).
| Opcode | Value | Instruction | Classification |
|---|---|---|---|
| 61 | 0x3D | Store | Write, if element size > 0x1FF bits |
| 62 | 0x3E | Load | Read, with same large-type gate |
| 65 | 0x41 | Atomic | Read + Write |
| 66 | 0x42 | AtomicCmpXchg | Write |
| 85 | 0x55 | Call/Intrinsic | Context-dependent (see below) |
For call instructions (opcode 85), the classifier applies recursive analysis:
- Check if the callee has intrinsic flag `0x20` set.
- For barrier-like intrinsics with opcode 25 and `field+96 == 0`: classify as Read only.
- For general calls: invoke `sub_B49E00` (isSharedMemoryAccess) to determine whether the callee accesses shared/global memory. If yes: Read + Write.
The element size gate (> 0x1FF bits, i.e., > 511 bits) filters out trivially small memory operations that target scalar types in registers rather than actual memory-backed storage. Loads and stores of types narrower than 512 bits are assumed to operate on register-promoted values and do not participate in cross-thread hazards.
Phase 3: Bidirectional Fixed-Point Dataflow
Complexity. Let B = number of basic blocks, S = number of barrier instructions, and I = total instructions across all blocks. Phase 1 (barrier identification) is O(S). Phase 2 (access classification) is O(I). The dataflow fixed-point iterates until no boolean in the 4 * B * 2 lattice positions flips from 0 to 1; since the lattice has height 1, convergence is bounded by O(B) iterations, each costing O(B + I) for the forward and backward scans, giving O(B * (B + I)) per convergence cycle. Phase 4 (elimination decision) is O(S). Phase 5 restarts the entire analysis from Phase 3 on each removal, yielding a worst-case total of O(S * B * (B + I)). In practice, CUDA kernels have B < 100, S < 20, and convergence in 2--3 iterations, so the pass behaves as near-linear in typical use. The red-black tree maps contribute O(log B) per insert/lookup, but this is dominated by the iteration cost.
This is the core of the pass and accounts for the majority of its 96KB size. The algorithm maintains eight red-black tree maps organized into forward and backward analysis sets, plus four bridge maps for the final elimination decision.
Map Layout
| Offset range | Direction | Contents |
|---|---|---|
| a1[15..20] | Forward | ReadAbove per basic block |
| a1[21..26] | Forward | WriteAbove per basic block |
| a1[27..32] | Forward | ReadBelow per basic block |
| a1[33..38] | Forward | WriteBelow per basic block |
| a1[39..44] | Backward | ReadAbove per basic block |
| a1[45..50] | Backward | WriteAbove per basic block |
| a1[51..56] | Backward | ReadBelow per basic block |
| a1[57..62] | Backward | WriteBelow per basic block |
| a1[63..68] | Bridge | ReadAbove crossing barrier |
| a1[69..74] | Bridge | WriteAbove crossing barrier |
| a1[75..80] | Bridge | ReadBelow crossing barrier |
| a1[81..86] | Bridge | WriteBelow crossing barrier |
Each map is a std::map-style red-black tree (48-byte nodes: left/right/parent pointers, key = basic block pointer, value = 1-byte boolean at offset 40). The helper sub_2C84590 performs map insertion; sub_2C84AF0 is a variant for a different node type used in the bridge maps.
Iteration Algorithm
The analysis loop is implemented as a goto-based iteration between labels LABEL_2 and LABEL_178 in the decompiled output:
function analyzeBarriers(F, state):
LABEL_2: // restart point after barrier removal
// --- Forward pass ---
for each BB in F:
sub_2C84640(state, BB, direction=1) // scan BB forward
// For each instruction from BB start toward first barrier:
// classify as read/write via sub_2C83AE0
// OR the flags into forward maps [15..38]
// Propagate successor BBs' flags backward if they
// contain already-analyzed barriers
// --- Forward convergence check ---
changed_fwd = false
for each BB in F:
if forward_maps[BB] != previous_forward_maps[BB]:
changed_fwd = true
break
// --- Backward pass ---
for each BB in F:
sub_2C84640(state, BB, direction=0) // scan BB backward
// For each instruction from BB end toward last barrier:
// classify as read/write
// OR into backward maps [39..62]
// Propagate predecessor BBs' flags forward
// --- Backward convergence check ---
changed_bwd = false
for each BB in F:
if backward_maps[BB] != previous_backward_maps[BB]:
changed_bwd = true
break
// If either direction changed, iterate
if changed_fwd or changed_bwd:
goto LABEL_2_inner // re-run dataflow (not full restart)
// Both converged -- proceed to Phase 4
goto elimination_phase
The sub_2C84640 helper is the per-BB analysis workhorse. It takes a direction parameter:
- direction=1 (forward): scans from block entry toward the first barrier, accumulating ReadAbove/WriteAbove. Propagates read/write information from successor blocks.
- direction=0 (backward): scans from block exit toward the last barrier, accumulating ReadBelow/WriteBelow. Propagates information from predecessor blocks.
The convergence check compares the entire map contents (all four categories for every BB) against their values from the previous iteration. If any single boolean flipped from 0 to 1, the changed flag is set. Since the analysis is monotone (booleans can only transition from 0 to 1, never back), convergence is guaranteed in at most O(|BB|) iterations, though in practice it converges in 2--3 iterations for typical CUDA kernels.
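The monotone OR-propagation can be demonstrated on a three-block toy CFG. This is illustrative only; the block names and the single ReadAbove category are invented for the example.

```python
# Toy CFG A -> B -> C; only C locally reads memory above its barrier.
cfg = {"A": ["B"], "B": ["C"], "C": []}           # successor lists
read_above = {"A": False, "B": False, "C": True}  # local classification

iterations = 0
changed = True
while changed:            # fixed-point loop: booleans only flip 0 -> 1
    changed = False
    iterations += 1
    for bb, succs in cfg.items():
        merged = read_above[bb] or any(read_above[s] for s in succs)
        if merged and not read_above[bb]:
            read_above[bb] = merged
            changed = True

# The read in C propagates to every block that reaches it.
assert read_above == {"A": True, "B": True, "C": True}
assert iterations <= 4    # height-1 lattice: at most |BB| + 1 rounds here
```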
Phase 4: Elimination Decision
After the dataflow converges, the pass examines every barrier instruction and checks the bridge maps (a1[63..86]) which represent the combined read/write sets crossing barrier boundaries.
A barrier is redundant (dead) if any of the following holds:
| Condition | Interpretation |
|---|---|
| ReadAbove == 0 AND WriteAbove == 0 | No shared-memory accesses reach this barrier from above; the barrier orders nothing |
| ReadBelow == 0 AND WriteBelow == 0 | No accesses reach from below |
| ReadAbove == 0 AND WriteBelow == 0 | No RAW or WAW hazard across the barrier |
| WriteAbove == 0 AND ReadBelow == 0 | No WAR or WAW hazard across the barrier |
The first two conditions capture the case where one side of the barrier has no memory traffic at all. The latter two capture the case where both sides access memory, but the access patterns cannot conflict.
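The four conditions collapse into a single predicate over the per-barrier bridge flags. This is a direct transcription of the table above; the hazard names in the comments follow the document's own convention.

```python
def barrier_is_dead(ra, wa, rb, wb):
    """ra/wa/rb/wb are the ReadAbove/WriteAbove/ReadBelow/WriteBelow
    booleans from the bridge maps for one barrier."""
    return ((not ra and not wa) or   # no traffic above at all
            (not rb and not wb) or   # no traffic below at all
            (not ra and not wb) or   # no RAW or WAW hazard across
            (not wa and not rb))     # no WAR or WAW hazard across

assert barrier_is_dead(ra=False, wa=False, rb=True, wb=True)    # nothing above
assert not barrier_is_dead(ra=True, wa=True, rb=True, wb=True)  # full traffic: keep
# Reads on both sides match none of the four conditions, so such a
# barrier is (conservatively) kept:
assert not barrier_is_dead(ra=True, wa=False, rb=True, wb=False)
```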
Special Case: Intrinsic IDs 8260--8262
For call instructions (opcode 85) where the callee's intrinsic ID satisfies (ID - 8260) <= 2 (i.e., IDs 8260, 8261, or 8262), the pass applies an additional test via sub_BD3660 (hasOneUse). If the barrier-like intrinsic has only a single use, it is considered removable even if the standard dataflow check would keep it. These IDs likely correspond to specialized barrier variants (__syncthreads_count, __syncthreads_and, __syncthreads_or) where the return value is used as data. When the return value has only one use, the compiler can reason that the data-carrying aspect is trivially handled and the barrier itself may still be dead from a memory ordering perspective.
Phase 5: Removal and Complete Restart
When a barrier is identified as dead, the pass:
- Emits a diagnostic string (if the controlling dump flag is enabled): `Removed dead synch: [filename:line] in function <name> Read above: N, Write above: N, Read below: N, Write below: N`, where N is 0 or 1 for each category.
- Calls `sub_B43D60` (Instruction::eraseFromParent) to delete the barrier instruction from the IR.
- Restarts from Phase 3 (goto LABEL_2) -- a complete re-analysis of the entire function.
The restart is not optional. Removing a barrier changes the memory access pattern visible between adjacent barriers: what was previously two separate "above/below" regions separated by a barrier now becomes a single merged region. This merging may cause an adjacent barrier to lose its hazard justification, making it dead as well. The cascading effect can propagate through a chain of barriers.
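The removal/restart structure has this shape. This is illustrative Python: `run_dataflow` and `find_dead_barrier` stand in for Phases 3 and 4 and are not recovered function names.

```python
def eliminate_dead_barriers(func, run_dataflow, find_dead_barrier):
    """Phase 3-5 driver shape: after every single removal, restart the
    whole analysis (the decompiled goto back to LABEL_2)."""
    removed = 0
    while True:
        flags = run_dataflow(func)               # Phase 3: fixed-point dataflow
        victim = find_dead_barrier(func, flags)  # Phase 4: elimination decision
        if victim is None:
            return removed                       # no dead barrier left
        func.remove(victim)                      # Phase 5: erase the barrier ...
        removed += 1                             # ... then restart from Phase 3

# Toy cascade: barrier 3 only becomes dead once barrier 2 is gone.
class ToyFunc:
    def __init__(self, barriers): self.barriers = list(barriers)
    def remove(self, b): self.barriers.remove(b)

def toy_find(f, _flags):
    if 2 in f.barriers:
        return 2
    if 3 in f.barriers:   # only reachable after 2 has been removed
        return 3
    return None

f = ToyFunc([1, 2, 3])
assert eliminate_dead_barriers(f, lambda f: None, toy_find) == 2
assert f.barriers == [1]
```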
Dataflow Analysis Detail
The Eight Red-Black Tree Maps
The forward and backward map sets track identical categories but are populated from different CFG traversal directions. This bidirectional approach ensures that the analysis captures memory accesses reachable from any control flow path, not just straight-line code.
Forward maps (a1[15..38]):
- Populated by scanning each BB from entry toward the first barrier.
- ReadAbove[BB] = 1 if any instruction before the first barrier in BB reads shared/global memory, OR if any successor BB contributes a read.
- WriteAbove[BB] = same for writes.
- ReadBelow/WriteBelow[BB] = propagated from successor blocks' analysis.
Backward maps (a1[39..62]):
- Populated by scanning each BB from exit toward the last barrier.
- ReadBelow[BB] = 1 if any instruction after the last barrier in BB reads memory, OR if any predecessor BB contributes a read.
- WriteBelow[BB] = same for writes.
- ReadAbove/WriteAbove[BB] = propagated from predecessor blocks.
Bridge maps (a1[63..86]):
- Keyed by barrier instruction pointer (not BB pointer).
- Represent the combined access sets that cross the specific barrier boundary.
- Populated during the final pass over barrier instructions after dataflow convergence.
Monotone Dataflow Framework
The analysis is a classic monotone dataflow problem on a Boolean lattice:
- Domain: {0, 1} per (basic-block, category) pair.
- Transfer function: OR of local classification with propagated values.
- Meet operator: OR (any path contributing an access sets the flag).
- Direction: Bidirectional (forward pass propagates from successors, backward pass propagates from predecessors).
- Convergence: Guaranteed because the lattice has height 1 (a value can only change from 0 to 1, never back). The fixed point is reached when no additional propagation changes any value.
In the worst case, each iteration may set one new bit, and there are 4 * |BB| bits per direction, so convergence takes at most 4 * |BB| iterations per direction. In practice, CUDA kernels have shallow CFGs and the iteration converges in 2--3 rounds.
Cascading Restart Logic
The most expensive aspect of the algorithm is the complete restart after each barrier removal. Consider a function with N barriers:
B0 -- barrier_1 -- B1 -- barrier_2 -- B2 -- barrier_3 -- B3
If barrier_2 is removed first, blocks B1 and B2 merge into a single region. If B1 contained only writes and B2 contained only reads, barrier_1 was previously justified by the WAR hazard between B0's writes and B1's reads. But after merging, B1+B2 now contains both reads and writes, and barrier_3 might become dead if B3 has no memory accesses. This cascading effect requires full re-analysis.
Worst-case complexity: O(N_barriers * N_BBs * convergence_iterations), where convergence_iterations is bounded by 4 * |BB| but is typically 2--3. For a kernel that removes B barriers in sequence, the total work is O(B * F * C), where B is the number of barriers removed, F is the per-iteration cost of the dataflow, and C is the convergence bound.
In practice, CUDA kernels rarely have more than 10--20 barriers, and cascading removals are uncommon (typically 0--3 restarts), so the theoretical quadratic cost is not a bottleneck.
Relationship to basic-dbe and branch-dist
CICC contains three passes that eliminate dead synchronization barriers. They differ in scope, cost, and the cases they handle:
| Property | basic-dbe | branch-dist | Dead Sync Elimination |
|---|---|---|---|
| Pass name | basic-dbe | branch-dist | (unnamed, called from module pass) |
| Entry point | llvm::BasicDeadBarrierEliminationPass | sub_1C47810 | sub_2C84BA0 |
| Registration | New PM slot 376 | New PM slot (function pass) | Module-level caller |
| Scope | Single BB / local | Function-level with CFG propagation | Function-level with full restart |
| Dataflow | None (pattern match) | Fixed-point, 13 rb-tree maps | Fixed-point, 12 rb-tree maps |
| Restart on removal | No | Yes (goto LABEL_2) | Yes (goto LABEL_2) |
| Binary size | Small (ctor_261) | 63KB core + helpers | 96KB core + helpers |
| Knobs | byte_4FBB6C0 enable flag | 10 knobs (ctor_525) | None known (controlled by caller) |
basic-dbe handles trivially dead barriers detectable without dataflow analysis -- cases where the barrier is immediately adjacent to another barrier, or where the enclosing block contains no memory operations at all. It runs in the standard function pass pipeline and is cheap.
branch-dist performs full CFG propagation with 13 red-black tree maps and restart-on-removal, but it uses NVVM IR opcodes (0x36/0x37/0x3A/0x3B/0x4E) rather than the generic LLVM IR opcodes (61/62/65/66/85) used by the full engine. It also has its own address space filtering logic and 10 configurable knobs.
The full dead synchronization elimination engine (sub_2C84BA0) is the most aggressive of the three. It uses the LLVM IR opcode set, applies the element-size gate for loads/stores, and handles the special intrinsic IDs 8260--8262. It runs separately from the New PM function pass pipeline, invoked from module-level callers sub_2C88020 and sub_2C883F0.
Configuration
No dedicated knobs have been identified for the full engine at sub_2C84BA0. Its behavior is controlled entirely by its callers (sub_2C88020, sub_2C883F0), which determine when and whether the engine runs. This is in contrast to branch-dist, which has 10 knobs, and basic-dbe, which has at least an enable flag.
The diagnostic output is gated by an internal condition in the caller, not by a standalone dump knob.
Diagnostic Strings
" Removed dead synch: "
"Read above: "
", Write above: "
", Read below: "
", Write below: "
" in function "
"dbg"
The complete diagnostic message, assembled from these fragments:
Removed dead synch: [filename:line] in function <name>
Read above: 0, Write above: 0, Read below: 1, Write below: 1
The numeric values are the boolean (0/1) access flags for each category. When the pass removes a barrier, the diagnostic shows exactly why it was safe: which of the four access categories was absent.
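A quick reconstruction of how the fragments concatenate. The filename, line, and function name here are invented placeholders; the real pass pulls them from debug info via `sub_B91420`.

```python
# Recovered string fragments, in the order they appear above.
FRAGS = [" Removed dead synch: ", "Read above: ", ", Write above: ",
         ", Read below: ", ", Write below: ", " in function "]

def format_diag(loc, func, ra, wa, rb, wb):
    # loc/func are illustrative placeholders, not recovered values.
    return (FRAGS[0] + "[" + loc + "]" + FRAGS[5] + func + "\n"
            + FRAGS[1] + str(ra) + FRAGS[2] + str(wa)
            + FRAGS[3] + str(rb) + FRAGS[4] + str(wb))

msg = format_diag("kernel.cu:42", "my_kernel", 0, 0, 1, 1)
assert "Removed dead synch: [kernel.cu:42] in function my_kernel" in msg
assert "Read above: 0, Write above: 0, Read below: 1, Write below: 1" in msg
```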
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| -- | sub_2C84BA0 | 96KB (3,400 lines) | Main engine: 5-phase algorithm |
| -- | sub_2C83D20 | small | isSyncBarrier predicate |
| -- | sub_2C83AE0 | small | classifyMemoryAccess (read/write classification) |
| -- | sub_2C84640 | medium | Per-BB analysis (bidirectional, direction parameter) |
| -- | sub_2C84590 | small | Red-black tree insert (forward/backward maps) |
| -- | sub_2C84AF0 | small | Red-black tree insert (bridge maps, different node type) |
| -- | sub_2C84080 | small | Map lookup / convergence check helper |
| -- | sub_2C83F20 | small | Map initialization / clear helper |
| -- | sub_2C83D50 | small | Map destructor / cleanup |
| -- | sub_BD3660 | small | hasOneUse -- used for intrinsic IDs 8260--8262 special case |
| -- | sub_CEA1A0 | small | Barrier intrinsic ID confirmation |
| -- | sub_B49E00 | small | isSharedMemoryAccess -- CUDA address space check |
| -- | sub_B43D60 | small | Instruction::eraseFromParent -- barrier deletion |
| -- | sub_B46E30 | small | getNumSuccessors -- CFG successor count |
| -- | sub_B46EC0 | small | getSuccessor(i) -- i-th successor retrieval |
| -- | sub_CB6200 | small | raw_ostream::write -- diagnostic string output |
| -- | sub_B91420 | small | Debug location extraction (filename/line) |
| -- | sub_B91F50 | small | Debug info accessor |
| -- | sub_BD5D20 | small | Type/value accessor |
| -- | sub_22409D0 | small | IR utility (instruction manipulation) |
| -- | sub_CB59D0 | small | raw_ostream integer write |
| -- | sub_CB59F0 | small | raw_ostream integer write (variant) |
| -- | sub_2C88020 | -- | Caller: module-level pass invoking the engine |
| -- | sub_2C883F0 | -- | Caller: module-level pass invoking the engine (variant) |
Common Pitfalls
These are mistakes a reimplementor is likely to make when building an equivalent dead synchronization elimination engine.
1. Removing a barrier that protects a cross-thread shared memory hazard invisible to single-thread analysis. The most dangerous mistake is treating the analysis as a single-thread dataflow problem. The pass classifies memory accesses as read/write per thread, but the barrier's purpose is to order accesses across threads. If thread A writes to smem[tid] above the barrier and thread B reads smem[tid-1] below it, a single-thread view sees no RAW hazard (different addresses). The correct analysis must conservatively assume that any shared memory write above and any shared memory read below constitutes a hazard -- the pass uses boolean flags (not address tracking) precisely because aliasing across threads is unknowable at compile time. A reimplementation that attempts to be "smarter" by tracking addresses will remove barriers that are needed.
2. Not restarting the full analysis after each barrier removal. When a barrier is deleted, the two regions it separated merge into one. This merged region may expose an adjacent barrier as dead (it no longer has memory accesses on one side). A reimplementation that removes all identified dead barriers in a single pass and then stops will miss these cascading redundancies. The restart is mandatory: the pass deliberately uses a goto back to Phase 3 after each removal, re-analyzing the entire function from scratch.
3. Incorrectly classifying call instructions as non-memory-accessing. The access classifier (sub_2C83AE0) must recursively analyze callees to determine if they access shared/global memory. A reimplementation that conservatively marks all calls as read+write will be correct but will retain too many barriers (poor optimization). Conversely, one that ignores calls entirely will remove barriers protecting memory accesses hidden inside called functions. The correct behavior checks the isSharedMemoryAccess predicate on the callee and falls back to read+write if the callee is opaque.
4. Treating __syncthreads_count/and/or (IDs 8260--8262) the same as plain __syncthreads. These barrier variants return a value (lane participation count/and/or). Even when the barrier is dead from a memory-ordering perspective, the return value may be used as data by the program. The pass applies a special hasOneUse check for these IDs. A reimplementation that blindly removes them when the dataflow says "no hazard" will break programs that depend on the return value for algorithmic purposes.
5. Applying the element-size gate too aggressively. The pass filters out loads/stores of types narrower than 512 bits (> 0x1FF), assuming they are register-promoted scalars. A reimplementation that raises this threshold (e.g., to 1024 bits) will miss legitimate memory operations that should keep a barrier alive. Conversely, lowering it to 0 will make the analysis overly conservative, retaining dead barriers for trivial register operations.
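Pitfalls 1 and 2 can be illustrated with a deliberately simplified model (assumptions: a straight-line function body, boolean flags with no address tracking, and only the two "one side has no accesses" clauses of the recovered deadness test; the cross-category clauses and the per-block bridge maps are omitted):

```python
# Simplified sketch of the removal loop: boolean access flags per side
# (pitfall 1 -- no address tracking) and a full restart after every
# removal (pitfall 2). Events model one straight-line function body.
def barrier_is_dead(above, below):
    no_above = "read" not in above and "write" not in above
    no_below = "read" not in below and "write" not in below
    return no_above or no_below

def eliminate_dead_barriers(events):
    events = list(events)
    restart = True
    while restart:              # full restart after each removal
        restart = False
        for i, ev in enumerate(events):
            if ev == "barrier" and barrier_is_dead(events[:i], events[i+1:]):
                del events[i]   # back to Phase 3: re-analyze from scratch
                restart = True
                break
    return events

# write ; barrier ; read ; barrier ; barrier -> only the first barrier
# has accesses on both sides, so only it survives.
print(eliminate_dead_barriers(
    ["write", "barrier", "read", "barrier", "barrier"]))
```

Note how the second barrier only becomes removable, and then the third, through the restart: each deletion merges regions and exposes the next dead barrier.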
Test This
The following kernel contains consecutive __syncthreads() barriers with no shared memory accesses between them. The dead synchronization elimination pass should remove the redundant barriers.
```cuda
__global__ void dead_sync_test(float* out, int n) {
    __shared__ float smem[256];
    smem[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();  // barrier 1: needed (write above, read below)
    float val = smem[threadIdx.x ^ 1];
    __syncthreads();  // barrier 2: dead -- no smem access between barrier 1 and 2's "below"
    __syncthreads();  // barrier 3: consecutive with barrier 2 -- trivially dead
    out[threadIdx.x] = val;
}
```
What to look for in PTX:
- Count the number of `bar.sync 0;` instructions. The kernel has three `__syncthreads()` calls in source, but only one should survive: barrier 1 (which orders the write to `smem` against the read from `smem[tid^1]`). Barriers 2 and 3 have no shared memory hazard to protect.
- The diagnostic `"Removed dead synch:"` (visible with internal dump flags) shows the per-category access flags that justified removal: `Read above: 0, Write above: 0` means no memory accesses reach the barrier from above.
- To verify the pass preserves necessary barriers, move the `float val = smem[...]` read to between barriers 2 and 3. Now barrier 2 orders the write against this read and must survive -- expect two `bar.sync` instructions.
- The cascading restart behavior is observable with 5 consecutive `__syncthreads()` with no memory accesses between them. The pass removes one, restarts the analysis, removes the next, and repeats until only one remains.
Reimplementation Checklist
- Barrier identification predicate. Implement the five-condition conjunction: opcode == 85 (internal call), non-null callee, byte[0] == 0 (intrinsic flag), scope match (callee.field[24] == inst.field[80]), convergent attribute (bit 0x20 at byte+33), and barrier intrinsic ID confirmation.
- Memory access classifier. Classify every non-barrier instruction as read/write/both/neither based on opcode (store=0x3D, load=0x3E, atomic=0x41, cmpxchg=0x42, call=0x55), with the element-size gate (>511 bits) for loads/stores and recursive analysis for call instructions including shared-memory-access checks.
- Bidirectional fixed-point dataflow. Maintain eight red-black tree maps (forward ReadAbove/WriteAbove/ReadBelow/WriteBelow per BB, backward same) populated by scanning each BB in both directions, propagating from successors (forward) and predecessors (backward), iterating until no boolean flips from 0 to 1.
- Bridge map construction. After dataflow convergence, populate four bridge maps keyed by barrier instruction pointer, representing the combined read/write access sets crossing each specific barrier boundary.
- Elimination decision logic. A barrier is dead if: (ReadAbove==0 AND WriteAbove==0), OR (ReadBelow==0 AND WriteBelow==0), OR (ReadAbove==0 AND WriteBelow==0), OR (WriteAbove==0 AND ReadBelow==0). Handle the special case for intrinsic IDs 8260--8262 (`__syncthreads_count`/`and`/`or`) where single-use return values allow additional removal.
- Complete restart after removal. After each barrier deletion, restart the entire dataflow analysis from scratch to handle cascading redundancies where removing one barrier makes adjacent barriers dead.
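The bidirectional fixed-point dataflow from the checklist can be sketched as a boolean flag propagation over the CFG (a simplified model: plain dicts stand in for the eight red-black tree maps, and the block access summaries and CFG shape are illustrative):

```python
# Sketch (not recovered code): propagate ReadAbove/WriteAbove flags from
# predecessors and ReadBelow/WriteBelow flags from successors, iterating
# until no flag flips from 0 to 1.
def solve_flags(blocks, preds, succs):
    # blocks: {name: {"read": bool, "write": bool}} local access summary
    flags = {b: dict(ra=False, wa=False, rb=False, wb=False) for b in blocks}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            new = dict(flags[b])
            for p in preds.get(b, []):
                # anything accessed in or above a predecessor is "above"
                new["ra"] |= blocks[p]["read"] or flags[p]["ra"]
                new["wa"] |= blocks[p]["write"] or flags[p]["wa"]
            for s in succs.get(b, []):
                new["rb"] |= blocks[s]["read"] or flags[s]["rb"]
                new["wb"] |= blocks[s]["write"] or flags[s]["wb"]
            if new != flags[b]:
                flags[b] = new
                changed = True
    return flags

# Diamond CFG: entry writes smem, exit reads it; both middle blocks see
# the write above and the read below.
blocks = {"entry": {"read": False, "write": True},
          "left":  {"read": False, "write": False},
          "right": {"read": False, "write": False},
          "exit":  {"read": True,  "write": False}}
preds = {"left": ["entry"], "right": ["entry"], "exit": ["left", "right"]}
succs = {"entry": ["left", "right"], "left": ["exit"], "right": ["exit"]}
f = solve_flags(blocks, preds, succs)
print(f["left"])
```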
Cross-References
- Dead Barrier Elimination -- overview page covering both `basic-dbe` and this engine
- Branch Distribution -- the other full dead-sync pass using NVVM IR opcodes
- NVIDIA Custom Passes: Inventory -- registry entry for Dead Synchronization Elimination
- LLVM Optimizer: Pipeline -- pipeline context and Phase I/II interaction
Rematerialization
NVIDIA's rematerialization infrastructure in CICC operates at two levels: an IR-level pass (nvvmrematerialize / "Legacy IR Remat") that reduces register pressure before instruction selection, and a machine-level pass (nv-remat-block / "Do Remat Machine Block") that performs the same transformation on MachineIR after register allocation decisions have been made. Both passes share the same fundamental strategy -- recompute cheap values at their use sites rather than keeping them live across long spans -- but they differ significantly in their cost models, candidate selection criteria, and interaction with the surrounding pipeline.
On NVIDIA GPUs, register pressure directly determines occupancy -- the number of concurrent warps per SM -- with discrete cliff boundaries where a single additional register can drop an entire warp group. Rematerialization trades extra ALU work for reduced register count, a tradeoff that is almost always profitable on GPUs where compute throughput vastly exceeds register file bandwidth.
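The cliff effect can be made concrete with a toy occupancy model (the register file size, warp size, and allocation granularity below are illustrative assumptions, not values recovered from the binary):

```python
# Toy occupancy model (assumed parameters, for illustration only):
# a 64K-register SM register file, 32 threads per warp, warps allocated
# at a granularity of 4.
REGFILE = 65536
WARP = 32
GROUP = 4

def warps_per_sm(regs_per_thread, max_warps=48):
    warps = REGFILE // (regs_per_thread * WARP)
    warps -= warps % GROUP          # allocation granularity
    return min(warps, max_warps)

for r in (42, 43):                  # one extra register crosses a cliff
    print(r, warps_per_sm(r))
```

In this model a single extra register per thread (42 to 43) drops the SM from 48 resident warps to 44, which is exactly the tradeoff rematerialization targets.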
Key Facts
| Property | Value |
|---|---|
| Pass name (New PM) | remat |
| Pass name (Legacy PM) | nvvmrematerialize / "Legacy IR Remat" |
| Class | RematerializationPass |
| Registration | New PM #385, line 2257 in sub_2342890 |
| Runtime positions | Tier 0 #34 (NVVMRematerialization via sub_1A13320); Tier 1/2/3 #55 (gated by !opts[2320]); see Pipeline |
| Pass factory | sub_1A13320 |
| Machine-level companion | nv-remat-block / "Do Remat Machine Block" at sub_2186D90 |
| Upstream equivalent | None -- entirely NVIDIA-proprietary |
IR-Level Rematerialization (nvvmrematerialize)
Registration and Dependencies
The pass is registered at sub_1CD0BE0 with pass ID "nvvmrematerialize" and entry point sub_1CD0CE0. Before running, it initializes five analysis passes:
| Analysis | Function | Purpose |
|---|---|---|
| Dominator tree | sub_15CD350 | Dominance queries for instruction placement |
| Loop info | sub_1440EE0 | Loop nest structure for cost scaling |
| Unknown | sub_13FBE20 | Possibly alias analysis |
| Live variable analysis | sub_1BFC830 | Builds live-in/live-out bitvector sets |
| Unknown | sub_1BFB430 | Possibly register pressure estimation |
Main Algorithm (sub_1CE7DD0, 67KB)
Complexity (IR level). Let B = number of basic blocks, I = total instructions, and L = number of live-in values. The live-in analysis uses hardware popcnt on bitvectors of size ceil(I / 64) per block, giving O(B * I / 64) per iteration; the intersection of live-in sets (bitwise AND) is likewise O(B * I / 64). The rematerializability check for each candidate walks its def chain in O(D), where D is the def-chain depth (bounded by max-recurse-depth). The pull-in cost model (sub_1CE3AF0) scores each candidate in O(U * D), where U = uses per candidate. Candidate sorting is O(K^2) via selection sort, where K = candidates selected, and the block executor clones instructions in O(K * B). With the outer loop capped at 5 iterations, the overall IR-level cost is O(5 * (B * I / 64 + K * U * D + K * B)).
Complexity (machine level, sub_2186D90). Max-live computation is O(I) per block via a reverse walk, O(I) total across blocks. Candidate classification is O(I) for the initial scan, plus O(K * 50) for recursive pullability checks (depth bounded at 50). The second-chance heuristic iterates until convergence, bounded by the candidate count K. With the outer loop capped at nv-remat-max-times (default 10) iterations, the overall machine-level cost is O(10 * (I + K^2)).
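For scale, a quick back-of-the-envelope calculation of the dominant bitvector term (the block and instruction counts here are illustrative, not measured):

```python
import math

# Illustrative scale estimate for the IR-level live-in analysis:
# B blocks, I instructions, bitvectors stored as 64-bit words.
B, I = 200, 12_000
words_per_block = math.ceil(I / 64)        # bitvector words per block
and_popcnt_ops = B * words_per_block       # one intersection + count pass
print(words_per_block, and_popcnt_ops)
```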
The driver implements an iterative register pressure reduction loop with up to 5 iterations. The high-level flow:
- Function exclusion check: The `no-remat` knob stores a comma-separated list of function names. If the current function matches, the pass prints `"Skip rematerialization on <funcname>"` and bails.
- Master gate: If all three sub-passes are disabled (`do-remat`, `remat-iv`, `remat-load` all zero), return immediately.
- Live-in/live-out analysis: For each basic block, the pass looks up the block's live-in bitvector from the analysis (`sub_1BFDF20`), counts live-in values via hardware `popcnt` (`sub_39FAC40`), and stores per-block counts in a hash map. The maximum live-in across all blocks becomes the pressure target baseline. At `dump-remat >= 2`, the pass prints `"Block %s: live-in = %d"`.
- Register target computation: The algorithm computes how many registers it wants to reduce to:
  - If `remat-maxreg-ceiling` is set and lower than the actual register count, cap at that value.
  - If `remat-for-occ` is non-zero (default 120): call `sub_1BFBA30` for register usage, then `sub_1C01730` for an occupancy-based target. Apply heuristic adjustments based on occupancy level.
  - Otherwise: target = 80% of the current register count.
- Iterative loop (up to 5 iterations):
  - If max live-in is already at or below the target, skip to the IV/load phases.
  - Compute the intersection of live-in bitvectors across blocks (bitwise AND). Values that are live-in everywhere are the best rematerialization candidates because pulling them in at each use site eliminates a register everywhere.
  - Walk the intersection bitvector. For each candidate, check rematerializability via `sub_1CD06C0`. Partition into rematerializable and non-rematerializable sets.
  - Call `sub_1CE3AF0` (pull-in cost analysis) to rank candidates by cost.
  - Build a per-block rematerialization plan and execute via `sub_1CE67D0`.
  - Recompute max live-in. If it decreased, continue iterating.
- Post-remat phases: After the main loop, run IV demotion (`sub_1CD74B0`) if `remat-iv` is enabled, then load rematerialization (`sub_1CDE4D0`) if `remat-load` is enabled, then cleanup (`sub_1CD2540`).
- Expression factoring: When `remat-add` is non-zero, the pass also performs strength reduction on chains of `add`/`mul`/`GEP` instructions, factoring common sub-expressions into `"factor"`-named values. This is a mini-pass embedded within rematerialization.
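The shape of the iterative loop can be sketched as follows (`reduce_round` is a hypothetical stand-in for one analyze/plan/execute round, i.e. the work done by `sub_1CE3AF0` plus `sub_1CE67D0`; the cost model here is a toy):

```python
# Sketch of the iterative pressure-reduction driver: loop up to 5 times,
# stop early when the target is met or no progress is made.
def remat_driver(max_live_in, target, reduce_round, max_iters=5):
    """reduce_round(pressure) -> new pressure after one remat round."""
    history = [max_live_in]
    for _ in range(max_iters):
        if max_live_in <= target:
            break
        new_pressure = reduce_round(max_live_in)
        if new_pressure >= max_live_in:   # no progress: stop iterating
            break
        max_live_in = new_pressure
        history.append(max_live_in)
    return history

# Toy model: each round removes roughly a third of the remaining gap.
trace = remat_driver(100, 64, lambda p: p - max(1, (p - 64) // 3))
print(trace)
```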
Block-Level Executor (sub_1CE67D0, 32KB)
This function processes one basic block at a time, creating two kinds of instruction clones distinguished by their name prefixes:
remat_ prefix: The value was live-in to the block and is being recomputed from scratch. The defining instruction is duplicated via sub_15F4880, named with the "remat_" prefix via sub_164B780, and inserted at the use site. This is full rematerialization.
uclone_ prefix: The value already has a definition in the block's dominance chain, but a local copy is needed to shorten the live range. The instruction is cloned and named "uclone_". This is a use-level clone for live range splitting, not pure rematerialization.
After cloning, both variants update use-def chains via sub_1648780 and set debug locations via sub_15F22F0.
Pull-In Cost Model (sub_1CE3AF0, 56KB)
The cost model evaluates each candidate for rematerialization by computing:
pull_in_cost = base_cost * use_factor
Where base_cost is the sum of per-instruction costs along the value's def chain (sub_1CD0460), and use_factor is accumulated from per-use costs (sub_1CD3A10), with different cost tables for uses in different loop nests.
Candidates are filtered by three thresholds:
| Filter | Condition | Default |
|---|---|---|
| Use limit | use_count > remat-use-limit AND use_factor >= remat-loop-trip | 10 uses, 20 trips |
| GEP cost | cost > remat-gep-cost AND opcode is GEP | 6000 |
| Single cost | cost > remat-single-cost-limit (unless remat-ignore-single-cost) | 6000 |
After scoring, candidates are sorted by cost (cheapest first via selection sort), and the cheapest N are selected where N is the target reduction count. At dump-remat >= 4, the pass prints "Total pull-in cost = %d".
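The filter-and-sort stage can be sketched like this (a hedged model: the candidate fields and the `score` shorthand are assumptions; the real model walks def chains and per-use cost tables rather than reading precomputed numbers):

```python
# Sketch of pull-in candidate filtering and selection. Threshold defaults
# are taken from the knob table; everything else is illustrative.
REMAT_USE_LIMIT = 10
REMAT_LOOP_TRIP = 20
REMAT_SINGLE_COST_LIMIT = 6000

def score(c):
    return c["base_cost"] * c["use_factor"]

def select_candidates(cands, n):
    kept = []
    for c in cands:
        if c["uses"] > REMAT_USE_LIMIT and c["use_factor"] >= REMAT_LOOP_TRIP:
            continue                      # too many uses in hot loops
        if score(c) > REMAT_SINGLE_COST_LIMIT:
            continue                      # too expensive to recompute
        kept.append(c)
    kept.sort(key=score)                  # cheapest first
    return kept[:n]

cands = [
    {"name": "a", "base_cost": 100, "use_factor": 4,  "uses": 3},
    {"name": "b", "base_cost": 300, "use_factor": 25, "uses": 12},  # filtered
    {"name": "c", "base_cost": 50,  "use_factor": 2,  "uses": 2},
]
print([c["name"] for c in select_candidates(cands, 2)])
```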
NLO -- Simplify Live Output (sub_1CE10B0 + sub_1CDC1F0)
The NLO sub-pass normalizes live-out values at block boundaries to reduce register pressure. Controlled by simplify-live-out (default 2):
- Level 1: Basic normalization only.
- Level 2 (default): Full normalization. Walks each block's live-out set and replaces values with simpler expressions.
- Level 3+: Extended patterns.
NLO creates two kinds of synthetic instructions:
- `nloNewBit`: A bit-level operation (AND, extract, truncation) to reduce a live-out value to its actually-used bit width.
- `nloNewAdd`: A local add instruction to recompute an address/offset that was previously live-out, replacing it with a local computation.
IV Demotion (sub_1CD74B0, 75KB)
The induction variable demotion sub-pass reduces register pressure by narrowing wide IVs (typically 64-bit to 32-bit). Controlled by remat-iv (default 4, meaning full demotion):
| Level | Behavior |
|---|---|
| 0 | Disabled |
| 1-2 | Basic IV demotion |
| 3 | Extended IV demotion |
| 4 | Full demotion including complex patterns (default) |
| 5+ | Aggressive mode |
The algorithm identifies PHI nodes at loop headers, checks whether the IV's value range fits in a smaller type (for 64-bit IVs: (val + 0x80000000) <= 0xFFFFFFFF), and creates narrower replacements:
- `demoteIV`: A truncation of the original IV to a narrower type.
- `newBaseIV`: A new narrow PHI node to replace the wide loop IV.
- `iv_base_clone_`: A clone of the IV's base value for use in comparisons that need the original width.
- `substIV`: Replaces uses of the old IV with the demoted version.
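The range test is a standard signed-32-bit containment check; in Python it can be reproduced by emulating the unsigned 64-bit addition with a mask:

```python
# Sketch of the 64-bit -> 32-bit demotion range test: a value fits in a
# signed 32-bit IV iff (val + 0x80000000) <= 0xFFFFFFFF in unsigned
# 64-bit arithmetic (the mask emulates C's unsigned wraparound).
def fits_in_i32(val):
    return ((val + 0x80000000) & 0xFFFFFFFFFFFFFFFF) <= 0xFFFFFFFF

print(fits_in_i32(2**31 - 1), fits_in_i32(2**31), fits_in_i32(-2**31))
```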
Multi-Pass Data Flow: Rematerialization / IV Demotion / NLO
The IR-level rematerialization pass (nvvmrematerialize) contains three cooperating sub-passes that execute in a fixed sequence within a single pass invocation. The following diagram shows the data each sub-pass produces and consumes, and the feedback loop that drives iterative pressure reduction.
Live Variable Analysis (prerequisite)
+------------------------------------+
| Builds per-block live-in/live-out |
| bitvector sets via sub_1BFDF20 |
| Produces: |
| - live-in bitvector per BB |
| - live-out bitvector per BB |
| - max live-in count (pressure) |
+------------------+-----------------+
|
v
+===============================================================+
| MAIN REMATERIALIZATION LOOP (sub_1CE7DD0, up to 5 iterations)|
| |
| Inputs: |
| - live-in bitvectors (from analysis above) |
| - register target (from occupancy model or 80% heuristic) |
| - remat cost thresholds (knobs) |
| |
| +----------------------------------------------------------+ |
| | Step 1: Compute intersection of live-in sets | |
| | (bitwise AND across all blocks) | |
| | --> Values live everywhere = best candidates | |
| +---------------------------+------------------------------+ |
| | |
| | candidate value set |
| v |
| +---------------------------+------------------------------+ |
| | Step 2: Pull-In Cost Analysis (sub_1CE3AF0) | |
| | For each candidate: | |
| | cost = base_cost(def chain) * use_factor(loop nesting) | |
| | Filter by: remat-use-limit, remat-gep-cost, | |
| | remat-single-cost-limit | |
| | Sort by cost (cheapest first) | |
| | Produces: ranked list of N cheapest candidates | |
| +---------------------------+------------------------------+ |
| | |
| | remat plan per block |
| v |
| +---------------------------+------------------------------+ |
| | Step 3: Block Executor (sub_1CE67D0) | |
| | For each selected candidate in each block: | |
| | "remat_" clone: full rematerialization at use site | |
| | "uclone_" clone: live range split within dom chain | |
| | Produces: | |
| | - cloned instructions at use sites | |
| | - reduced live-in counts per block | |
| +---------------------------+------------------------------+ |
| | |
| | updated IR |
| v |
| Recompute max live-in. If decreased and < 5 iters, loop. |
+=======================+=====================================+
|
| IR with reduced register pressure
v
+=======================+=====================================+
| IV DEMOTION (sub_1CD74B0, controlled by remat-iv) |
| |
| Consumes: |
| - Loop header PHI nodes (from LoopInfo) |
| - Type widths (from DataLayout) |
| - post-remat IR (live ranges already shortened) |
| |
| Algorithm: |
| for each loop L: |
| for each 64-bit PHI in L.header: |
| if (val + 0x80000000) <= 0xFFFFFFFF: |
| create "demoteIV" (trunc to i32) |
| create "newBaseIV" (narrow PHI replacement) |
| rewrite uses with "substIV" |
| |
| Produces: |
| - narrowed IVs (64->32 bit, halving register cost) |
| - "iv_base_clone_" values for comparisons needing |
| original width |
| - updated loop exit conditions |
+=======================+=====================================+
|
| IR with narrowed IVs
v
+=======================+=====================================+
| NLO -- SIMPLIFY LIVE OUTPUT (sub_1CE10B0, simplify-live-out)|
| |
| Consumes: |
| - per-block live-out bitvector sets |
| - post-IV-demotion IR |
| |
| For each block's live-out set: |
| - If a value is live-out but only its low bits are used |
| downstream: create "nloNewBit" (AND/extract/trunc) |
| - If a value is an address live-out that can be recomputed |
| locally in successors: create "nloNewAdd" (local add) |
| |
| Produces: |
| - "nloNewBit" bit-narrowing instructions |
| - "nloNewAdd" local recomputation instructions |
| - reduced live-out register count at block boundaries |
+=======================+=====================================+
|
| Final IR: pressure-reduced,
| IVs narrowed, live-outs simplified
v
+-------------------------------------------------------+
| Downstream consumers: |
| - Instruction selection (register model now concrete) |
| - Machine-level remat (nv-remat-block, second pass) |
| - Register allocation (lower pressure = higher occ.) |
+-------------------------------------------------------+
Data flow summary:
| Producer | Data | Consumer |
|---|---|---|
| Live Variable Analysis | Per-block live-in/live-out bitvectors | Main remat loop |
| Occupancy model (sub_1C01730) | Register pressure target | Main remat loop |
| Main remat loop | remat_/uclone_ cloned instructions | Updated IR for IV demotion |
| IV Demotion | demoteIV, newBaseIV, substIV narrowed values | NLO and downstream |
| NLO | nloNewBit, nloNewAdd local recomputations | Final IR for instruction selection |
| All three sub-passes | Cumulative register pressure reduction | Machine-level remat (nv-remat-block) |
The sequencing is important: the main loop reduces cross-block live-in pressure first (the broadest and cheapest wins), IV demotion then halves the cost of loop induction variables (converting two registers to one), and NLO cleans up block-boundary live-out values that survived both earlier phases. The machine-level nv-remat-block pass runs much later in the pipeline (after instruction selection and register allocation) as a final safety net, operating on concrete register assignments rather than abstract SSA values.
Machine-Level Block Rematerialization (nv-remat-block)
Registration
Registered at ctor_361_0 (address 0x5108E0) with pass name "nv-remat-block" and description "Do Remat Machine Block". Main entry point: sub_2186D90 (47KB, ~1742 lines).
Algorithm Overview
The machine-level pass implements a sophisticated iterative pull-in algorithm operating on MachineIR after instruction selection:
- Measure: Compute max-live register pressure across all blocks via `sub_2186590`. Prints `"Max-Live-Function(<num_blocks>) = <max_live>"`.
- Identify: For each block where pressure exceeds the target, enumerate live-out registers.
- Classify: For each live-out register, determine pullability:
  - MULTIDEF check (`sub_217E810`): The register must have exactly one non-dead, non-debug definition. Registers with multiple definitions print `"MULTIDEF"` and are rejected.
  - Opcode exclusion: A large switch/comparison tree excludes memory ops, atomics, barriers, texture ops, surface ops, and other side-effecting instructions. Specific exclusions exist for sm_62 (opcodes 380-396).
  - Operand safety: Instructions that define additional tied registers beyond the target are rejected.
  - Recursive verification (`sub_2181550`): All operands of the defining instruction must themselves be pullable, checked recursively up to depth 50.
- Second-chance heuristic (`sub_2181870`): Registers initially rejected because one of their operands was non-pullable are re-evaluated when those operands become pullable. This iterates until convergence, using a visit-count mechanism to prevent infinite loops. The hash function throughout is `h(regID) = 37 * regID`. Debug: `"After pre-check, <N> good candidates, <N> given second-chance"`, `"ADD <N> candidates from second-chance"`.
- Cost analysis (`sub_2183E30`): Each candidate receives a clone cost. Candidates with cost 0 are non-rematerializable.
- Selection: Sort candidates by cost (ascending). Greedily select the cheapest candidates until pressure is reduced to target. Double-wide register classes (size > 32) count as 2 for pressure purposes and have their cost doubled. Debug: `"Really Final Pull-in: <count> (<total_cost>)"`.
- Execute: For each selected register:
  - Clear it from the live-out bitmap via `sub_217F620`.
  - Propagate backward through predecessors via `sub_2185250`.
  - Clone the defining instruction at use sites via `sub_217E1F0`.
  - Replace register references via `sub_21810D0`.
  - Remove now-dead original definitions.
- Iterate: Repeat up to `nv-remat-max-times` (default 10) iterations until max pressure is at or below target, or no further progress is made.
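The cost-sorted greedy selection step, including the double-wide counting rule, can be sketched as follows (candidate tuples and the pressure bookkeeping are simplified assumptions, not the recovered data structures):

```python
# Sketch of cost-sorted greedy pull-in selection: double-wide register
# classes count as 2 toward pressure and have their cost doubled;
# cost 0 means "not rematerializable" and is skipped.
def select_pull_ins(cands, pressure, target):
    # cands: list of (regID, clone_cost, is_double_wide)
    scored = [(c * (2 if wide else 1), r, wide)
              for (r, c, wide) in cands if c > 0]
    scored.sort()                         # cheapest effective cost first
    chosen = []
    for cost, reg, wide in scored:
        if pressure <= target:
            break
        chosen.append(reg)
        pressure -= 2 if wide else 1      # double-wide frees 2 units
    return chosen, pressure

chosen, p = select_pull_ins(
    [(5, 3, False), (9, 0, False), (7, 2, True), (3, 8, False)],
    pressure=74, target=70)
print(chosen, p)
```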
Instruction Replacement (sub_21810D0)
When replacing a rematerialized register:
- Create a new virtual register of the same class via `sub_1E6B9A0`.
- Call the target's `replaceRegWith` method (vtable offset 152).
- Walk all uses of the original register ID and rewrite operands via `sub_1E310D0`.
- Handle special cases: `DBG_VALUE` (opcode 45) and NOP/PHI (opcode 0) instructions use stride-2 operand scanning.
Register Pressure Computation (sub_2186590)
Per-block pressure is computed by starting with the live-out set size, walking instructions in reverse, tracking register births (defs) and deaths (last uses), and recording the peak pressure point. The maximum across all blocks is returned.
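That reverse walk can be sketched in a few lines (register-ID sets stand in for concrete MachineOperands; illustrative only):

```python
# Sketch of per-block max-live computation: start from the live-out set,
# walk instructions in reverse; a def removes the register from the live
# set going upward, a use adds it; the peak set size is the pressure.
def block_max_live(insts, live_out):
    # insts: list of (defs, uses) register-ID sets, in program order
    live = set(live_out)
    peak = len(live)
    for defs, uses in reversed(insts):
        live -= set(defs)        # above its def, a register is not live
        live |= set(uses)        # a use extends liveness upward
        peak = max(peak, len(live))
    return peak

insts = [({1}, set()),       # r1 = ...
         ({2}, {1}),         # r2 = f(r1)
         ({3}, {1, 2})]      # r3 = g(r1, r2)
print(block_max_live(insts, live_out={3}))
```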
Key Functions
IR-Level
| Function | Address | Size | Role |
|---|---|---|---|
| Pass registration | sub_1CD0BE0 | -- | Registers "nvvmrematerialize" |
| Main driver | sub_1CE7DD0 | 67KB | Iterative live-in reduction loop |
| Block executor | sub_1CE67D0 | 32KB | "remat_" / "uclone_" creation |
| Pull-in cost | sub_1CE3AF0 | 56KB | Cost model and candidate selection |
| NLO main | sub_1CE10B0 | 48KB | Live-out normalization |
| NLO helper | sub_1CDC1F0 | 35KB | Inter-block NLO propagation |
| IV demotion | sub_1CD74B0 | 75KB | Induction variable narrowing |
| Load remat | sub_1CDE4D0 | -- | Load rematerialization sub-pass |
| Per-function init | sub_1CDA600 | -- | Data structure initialization |
| Rematerializability check | sub_1CD06C0 | -- | Determines if a value can be recomputed |
Machine-Level
| Function | Address | Size | Role |
|---|---|---|---|
| Main engine | sub_2186D90 | 47KB | Iterative pull-in algorithm |
| Max-live computation | sub_2186590 | -- | Per-block pressure analysis |
| MULTIDEF check | sub_217E810 | ~230 lines | Single-definition verification |
| Recursive pullability | sub_2181550 | ~110 lines | Operand chain verification (depth 50) |
| Second-chance | sub_2181870 | ~800 lines | Re-evaluation of rejected candidates |
| Cost evaluator | sub_2183E30 | -- | Clone cost computation |
| Liveness propagation | sub_2185250 | ~650 lines | Backward propagation + cloning |
| Instruction replacement | sub_21810D0 | ~290 lines | Register use rewriting |
| Remat allocation helper | sub_2184890 | ~477 lines | Pressure simulation |
Configuration Knobs
IR-Level Knobs (ctor_277_0 at 0x4F7BE0)
| Knob | Global | Default | Description |
|---|---|---|---|
| do-remat | dword_4FC05C0 | 3 | Master control. 0=off, 1=conservative, 2=normal, 3=full. |
| no-remat | qword_4FC0440 | (empty) | Comma-separated function exclusion list |
| remat-iv | dword_4FBFB40 | 4 | IV demotion level. 0=off, 4=full. |
| remat-load | dword_4FBFA60 | 1 | Load rematerialization. 0=off, 1=on. |
| remat-add | dword_4FBF980 | 0 | Add/GEP factoring. 0=off. |
| remat-single-cost-limit | dword_4FC0080 | 6000 | Max cost per single live-in reduction |
| remat-loop-trip | dword_4FBFFA0 | 20 | Default assumed loop trip count |
| remat-gep-cost | dword_4FBFEC0 | 6000 | Max cost for GEP rematerialization |
| remat-use-limit | dword_4FBFDE0 | 10 | Max number of uses for a candidate |
| remat-max-live-limit | dword_4FBFD00 | 10 | Max live-in limit for rematerialization |
| remat-maxreg-ceiling | dword_4FBF600 | 0 | Register ceiling (0 = uncapped) |
| remat-for-occ | dword_4FBF8A0 | 120 | Occupancy-driven rematerialization target |
| remat-lli-factor | dword_4FC0320 | 10 | Long-latency instruction cost factor |
| remat-ignore-single-cost | byte_4FBFC20 | false | Bypass per-value cost filter |
| remat-move | byte_4FC0400 | false | Remat move instructions |
| simplify-live-out | dword_4FBF520 | 2 | NLO level. 0=off, 2=full. |
| dump-remat | dword_4FC0240 | 0 | Debug dump level (0-4+) |
| dump-remat-iv | dword_4FC0160 | 0 | IV remat debug dump |
| dump-remat-load | dword_4FBF720 | 0 | Load remat debug dump |
| dump-remat-add | dword_4FBF640 | 0 | Add remat debug dump |
| dump-simplify-live-out | byte_4FBF400 | false | NLO debug dump |
Machine-Level Knobs (ctor_361_0 at 0x5108E0)
| Knob | Global | Default | Description |
|---|---|---|---|
| nv-remat-block | dword_4FD3820 | 14 | Bitmask controlling remat modes (bits 0-3) |
| nv-remat-max-times | dword_4FD3740 | 10 | Max outer loop iterations |
| nv-remat-block-single-cost | dword_4FD3660 | 10 | Max cost per single live value pull-in |
| nv-remat-block-map-size-limit | dword_4FD3580 | 6 | Map size limit for single pull-in |
| nv-remat-block-max-cost | dword_4FD3040 | 100 | Max total clone cost per live value reduction |
| nv-remat-block-liveout-min-percentage | dword_4FD3120 | 70 | Min liveout % for special consideration |
| nv-remat-block-loop-cost-factor | unk_4FD3400 | 20 | Loop cost multiplier |
| nv-remat-default-max-reg | unk_4FD3320 | 70 | Default max register pressure target |
| nv-remat-block-load-cost | unk_4FD2EC0 | 10 | Cost assigned to load instructions |
| nv-remat-threshold-for-spec-reg | unk_4FD3860 | 20 | Threshold for special register remat |
| nv-dump-remat-block | byte_4FD2E80 | false | Debug dump toggle |
| nv-remat-check-internal-live | byte_4FD2DA0 | false | Check internal liveness during MaxLive |
| max-reg-kind | qword_4FD2C20 | 0 | Kind of max register pressure info |
| no-mi-remat | qword_4FD2BE0 | (empty) | Skip remat for named functions |
| load-remat | word_4FD32F0 | true | Enable load rematerialization |
| vasp-fix1 | word_4FD3210 | false | VASP fix for volatile/addsp |
Complementary ptxas-side Knobs
The assembler (ptxas) has its own rematerialization controls that complement the CICC passes:
- `RegAllocRematEnable=1`
- `RegAllocEnableOptimizedRemat=1`
- `RematEnable=1`
- `SinkRematEnable=1`
- `RematBackOffRegTargetFactor=N`
Optimization Level Behavior
| Level | IR-Level Remat (nvvmrematerialize) | Machine-Level Remat (nv-remat-block) |
|---|---|---|
| O0 | Not run | Not run |
| Ofcmax | Not run | Not run |
| Ofcmid | Runs with do-remat=3 (full) | Not run |
| O1 | Runs with do-remat=3, remat-iv=4, remat-load=1 | Runs with nv-remat-block=14 (default bitmask) |
| O2 | Same as O1 | Same as O1 |
| O3 | Same as O1; may see more candidates due to additional inlining/unrolling | Same as O1; operates on more aggressively optimized MIR |
The do-remat master control (default 3) enables all rematerialization sub-phases at O1+. The machine-level pass is gated by its own NVVMPassOptions slot and runs only when the codegen pipeline includes the full register allocation sequence. At Ofcmax, neither pass runs because the fast-compile pipeline skips the full optimization and codegen stack. See Optimization Levels for the complete pipeline tier structure.
Diagnostic Strings
"Skip rematerialization on <funcname>"
"Block %s: live-in = %d"
"Total pull-in cost = %d"
"remat_"
"uclone_"
"nloNewBit"
"nloNewAdd"
"demoteIV"
"newBaseIV"
"iv_base_clone_"
"substIV"
"factor"
"Max-Live-Function(<num_blocks>) = <max_live>"
"Really Final Pull-in: <count> (<total_cost>)"
"MULTIDEF"
"Skip machine-instruction rematerialization on <name>"
"After pre-check, <N> good candidates, <N> given second-chance"
"ADD <N> candidates from second-chance"
"Pullable: <count>"
"live-out = <count>"
"Total Pullable before considering cost: <count>"
Reimplementation Checklist
- Live-in/live-out bitvector analysis. Build per-basic-block bitvector sets tracking which values are live-in and live-out, compute max live-in via hardware `popcnt`, and maintain a hash map of per-block counts.
- Occupancy-driven register target. Query the occupancy model to compute a target register count (default: `remat-for-occ=120`), apply heuristic adjustments based on occupancy cliff boundaries, and cap at `remat-maxreg-ceiling` when set.
- Candidate selection and cost model. Compute the live-in intersection across all blocks (bitwise AND), check rematerializability of each candidate via def-chain walking (bounded by `max-recurse-depth`), score candidates as `base_cost * use_factor` with loop-nesting scaling, filter by `remat-use-limit` / `remat-gep-cost` / `remat-single-cost-limit`, and sort cheapest-first.
- Block-level instruction cloning. Implement two clone types: `remat_`-prefix clones (full rematerialization of live-in values at use sites) and `uclone_`-prefix clones (use-level copies for live range splitting within the dominance chain), with proper use-def chain and debug location updates.
- IV demotion sub-pass. Identify 64-bit loop-header PHI nodes whose value range fits in 32 bits (`(val + 0x80000000) <= 0xFFFFFFFF`), create narrowed PHI replacements (`demoteIV`/`newBaseIV`/`substIV`), and rewrite loop exit conditions.
- NLO live-out simplification. Walk each block's live-out set, create `nloNewBit` instructions (AND/extract/trunc to actual used bit-width) and `nloNewAdd` instructions (local address recomputations) to reduce live-out register count at block boundaries.
- Machine-level pull-in algorithm (`nv-remat-block`). Implement the iterative MachineIR rematerialization engine: max-live computation via reverse instruction walk, MULTIDEF verification, recursive pullability checking (depth 50), second-chance heuristic for re-evaluating rejected candidates, cost-sorted greedy selection, and liveness propagation with instruction cloning at use sites.
- Iterative convergence loop. Wrap the IR-level pass in an up-to-5-iteration loop (recompute max live-in after each round, stop when the target is met) and the machine-level pass in an up-to-`nv-remat-max-times` loop.
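The first checklist item can be sketched in isolation. This is a minimal illustration assuming GCC/Clang builtins; the `BlockLiveness` type and function names are ours, not recovered from the binary -- it only shows the packed-bitvector shape that makes the `popcnt`-based max-live computation and the bitwise-AND intersection cheap:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One bitvector per basic block: bit v set => value v is live-in to the
// block. Packing into 64-bit words reduces the live-in count to one
// popcount per word, matching the checklist's popcnt-based max-live step.
struct BlockLiveness {
    std::vector<uint64_t> live_in;

    void set(unsigned value_id) {
        size_t word = value_id / 64;
        if (word >= live_in.size()) live_in.resize(word + 1, 0);
        live_in[word] |= uint64_t{1} << (value_id % 64);
    }
    unsigned count() const {
        unsigned n = 0;
        for (uint64_t w : live_in) n += __builtin_popcountll(w);
        return n;
    }
};

// Max live-in over all blocks: the quantity compared against the
// occupancy-derived register target on each iteration of the driver loop.
unsigned max_live_in(const std::vector<BlockLiveness>& blocks) {
    unsigned mx = 0;
    for (const auto& b : blocks) mx = std::max(mx, b.count());
    return mx;
}

// Live-in intersection (bitwise AND) across blocks, used to find values
// that are live everywhere and therefore worth rematerializing.
std::vector<uint64_t>
live_in_intersection(const std::vector<BlockLiveness>& blocks) {
    if (blocks.empty()) return {};
    std::vector<uint64_t> acc = blocks[0].live_in;
    for (size_t i = 1; i < blocks.size(); ++i) {
        acc.resize(std::min(acc.size(), blocks[i].live_in.size()));
        for (size_t w = 0; w < acc.size(); ++w) acc[w] &= blocks[i].live_in[w];
    }
    return acc;
}
```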
Architecture-Specific Behavior
The machine-level MULTIDEF checker (sub_217E810) contains architecture-specific opcode exclusions: opcodes 380-396 are rejected only when the target SM is sm_62 (GP10B, embedded Pascal), suggesting these instructions have rematerialization hazards specific to that microarchitecture. All other opcode exclusions apply uniformly across SM targets.
Test This
The following kernel creates high register pressure by keeping many independent values alive simultaneously. Compile with nvcc -ptx -arch=sm_90 -maxrregcount=32 to force a low register cap and observe rematerialization in action.
__global__ void remat_test(const float* __restrict__ in, float* __restrict__ out, int n) {
int tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid >= n) return;
float a = in[tid];
float b = in[tid + n];
float c = in[tid + 2*n];
float d = in[tid + 3*n];
float e = in[tid + 4*n];
float f = in[tid + 5*n];
float g = in[tid + 6*n];
float h = in[tid + 7*n];
float r0 = a * b + c;
float r1 = d * e + f;
float r2 = g * h + a;
float r3 = b * c + d;
float r4 = e * f + g;
out[tid] = r0 + r1;
out[tid + n] = r2 + r3;
out[tid + 2*n] = r4 + r0;
}
What to look for in PTX:
- Address recomputation: the expressions `tid + k*n` are cheap to recompute. With `-maxrregcount=32`, the pass should rematerialize these address calculations at use sites rather than keeping them in registers. Look for repeated `mad.lo.s32` or `add.s32` instructions computing the same offset near each `ld.global` instead of a single computation early on.
- Compare the `.nreg` directive value between `-maxrregcount=32` and the default. The rematerialization pass trades extra ALU instructions for fewer registers to hit the lower target.
- With `-Xcicc -dump-remat=4`, cicc prints `"Total pull-in cost = %d"` for each candidate, showing the cost/benefit analysis.
- The `remat_` prefix on SSA names in LLVM IR dumps identifies rematerialized values.
Pipeline Interaction
The IR-level pass runs after live variable analysis has been computed and before instruction selection. Its register pressure reduction directly influences the occupancy achievable by the final kernel. The machine-level pass runs later, after instruction selection and register allocation, providing a second opportunity to reduce pressure on MachineIR where the register model is concrete rather than abstract. Together, the two passes form a layered rematerialization strategy: the IR pass makes broad, cost-effective reductions early, and the machine pass performs precise, targeted reductions late. Both passes interact with the register pressure analysis (rpa / machine-rpa) that feeds pressure estimates into scheduling and allocation decisions throughout the pipeline.
IV Demotion
IV demotion is NVIDIA's proprietary induction variable narrowing sub-pass, embedded within the IR-level rematerialization pass (nvvmrematerialize). It reduces register pressure by converting wide induction variables -- typically 64-bit -- to narrower types, typically 32-bit. On NVIDIA GPUs this is a high-impact optimization: the NVPTX ISA provides native 32-bit integer arithmetic in a single instruction, while 64-bit operations require multi-instruction sequences (add.cc + addc for a single 64-bit add, for example). A 64-bit loop induction variable that provably fits in 32 bits wastes two registers where one would suffice, and every arithmetic operation on it costs roughly twice the instruction count.
The sub-pass is large -- 75KB of compiled code, larger than the main rematerialization driver itself -- reflecting the complexity of proving that narrowing is safe across all uses of an IV, rewriting PHI nodes, adjusting comparisons, and handling edge cases where some uses require the original width while others can consume the narrowed version.
Key Facts
| Property | Value |
|---|---|
| Entry point | sub_1CD74B0 (75KB, ~2500 lines) |
| Parent pass | nvvmrematerialize (IR-level rematerialization) |
| Invocation site | sub_1CE7DD0 line ~2276 (post-remat phase) |
| Primary knob | remat-iv (default 4 = full demotion) |
| Debug knob | dump-remat-iv (default 0) |
| Gate condition | dword_4FBFB40 != 0 (non-zero enables the sub-pass) |
| Helper: IV analysis | sub_1CD5F30 |
| Helper: IV base lookup | sub_1CD5400 |
| Helper: cleanup | sub_1CD0600 |
| IR builder | sub_15FB440 (opcode, type, operand, name, insertpt) |
| Width query | sub_127FA20 (DataLayout::getTypeStoreSize) |
Demotion Levels
The remat-iv knob controls five demotion aggressiveness levels:
| Level | Behavior | Gate in binary |
|---|---|---|
| 0 | Disabled -- IV demotion entirely skipped | dword_4FBFB40 == 0 |
| 1--2 | Basic IV demotion. Only simple induction variables with constant step and all uses in the same loop body. | Default path |
| 3 | Extended IV demotion. Enables demotion of IVs whose uses extend to loop-exit comparisons and address computations outside the innermost loop. | line 1380: if (dword > 3) |
| 4 | Full demotion (default). Includes complex patterns: IVs used in GEP chains, IVs with multiple PHI consumers, and IVs that feed into both narrow and wide downstream computations. | line 1546: if (dword <= 4) |
| 5+ | Aggressive mode. Relaxes safety margins on range proofs, allowing demotion when the range check is tight (no headroom). | -- |
Level 4 is the default because it captures the vast majority of profitable demotion opportunities in real CUDA kernels without the correctness risk of aggressive mode.
Algorithm
Phase 1: Loop Iteration and PHI Identification
The algorithm iterates over every loop in the function (obtained from LoopInfo, sub_1440EE0). For each loop, it examines the loop header block's PHI nodes. Each PHI node is a candidate induction variable. The pass checks the PHI's type width via sub_127FA20 (DataLayout::getTypeStoreSize).
for each loop L in function:
header = L.getHeader()
for each PHI in header:
width = getTypeStoreSize(PHI.getType()) // sub_127FA20
if width != 64:
continue // only demote 64-bit IVs to 32-bit
Phase 2: Increment Pattern Analysis
For each 64-bit PHI, the pass identifies the increment pattern -- the value feeding back from the latch block. It verifies the pattern is a simple add/sub by a constant. The helper sub_1CD5F30 (IV analysis helper) walks the def-use chain of the PHI's backedge operand to extract the step value and verify linearity.
backedge_val = PHI.getIncomingValueForBlock(latch)
if backedge_val is not (PHI + constant) and
backedge_val is not (PHI - constant):
skip this PHI // non-linear IV, cannot demote
step = extract_constant(backedge_val)
Phase 3: Value Range Fitting
The critical safety check. The pass must prove that the IV's value never exceeds the 32-bit signed range throughout the loop's execution. The check uses an unsigned comparison trick:
(val + 0x80000000) <= 0xFFFFFFFF
This is equivalent to checking -2^31 <= val <= 2^31 - 1 (the signed i32 range). Adding 0x80000000 shifts the signed range to [0, 0xFFFFFFFF], which can be checked with a single unsigned comparison. The pass evaluates this condition on:
- The initial value (from the preheader incoming edge of the PHI).
- The final value (derived from the loop trip count and step).
- Conservatively, any intermediate values if the step is not +1/-1.
The initial value and trip count information come from the loop analysis infrastructure. The pass does not directly invoke SCEV (ScalarEvolution) -- it operates on NVIDIA's own IR-level live variable analysis and loop info passes (sub_1440EE0 for loop structure, sub_1BFC830 for live variable analysis). However, upstream LLVM's IndVarSimplify (sub_1945A50) may have already widened or simplified IVs using SCEV before this pass runs. The IV demotion pass operates on whatever IV structure remains after the main optimization pipeline.
If the range check fails, the IV is skipped. There is no speculative demotion with runtime guards.
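The shifted-range comparison is easy to verify directly. A hedged sketch (the helper name `fits_in_i32` is ours, not the binary's) of the exact check the pass applies:

```cpp
#include <cassert>
#include <cstdint>

// The binary's range proof: (val + 0x80000000) <= 0xFFFFFFFF, evaluated
// in 64-bit unsigned arithmetic. Adding 2^31 maps the signed i32 range
// [-2^31, 2^31 - 1] onto [0, 0xFFFFFFFF], so a single unsigned compare
// suffices. (Assumes val is far enough from INT64_MAX that the 64-bit
// addition itself cannot overflow -- true for loop bounds in practice.)
bool fits_in_i32(int64_t val) {
    return (uint64_t)(val + 0x80000000LL) <= 0xFFFFFFFFULL;
}
```

The pass applies this to the initial value, the final value, and (when the step is not unit) conservatively to intermediate values, exactly as listed above.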
Phase 4: Use Analysis and Classification
Before rewriting, the pass classifies every use of the original 64-bit IV:
- Narrow-safe uses: Arithmetic (add, sub, mul, shift), array indexing within the loop body. These can consume the 32-bit value directly.
- Comparison uses: Loop exit conditions (`icmp`). These need a narrow comparison instruction (`newICmp`).
- Address uses: GEP instructions that use the IV as an index. At level 4+, these are handled by cloning the base address computation (`iv_base_clone_`).
- Escape uses: Uses outside the loop (LCSSA PHIs, return values). These require sign/zero extension back to 64-bit.
The level knob gates which use categories are eligible:
| Use category | Minimum level |
|---|---|
| Same-block arithmetic | 1 |
| Loop exit comparisons | 2 |
| Cross-block GEP indexing | 3 |
| Multi-PHI consumers | 4 |
| Tight-range speculation | 5 |
Phase 5: Instruction Generation
Once an IV is approved for demotion, the pass generates four types of synthetic instructions:
demoteIV -- Truncation
v475 = "demoteIV";
v366 = sub_15FB440(11, destg, v401, &v475, v115);
// opcode 11 = trunc
Creates a trunc i64 %iv to i32 instruction, inserted at the point where the original IV was defined. This is the primary demotion: the new 32-bit value replaces the old 64-bit value for all narrow-safe uses.
IR before:
%iv = phi i64 [ %init, %preheader ], [ %iv.next, %latch ]
%iv.next = add i64 %iv, 1
IR after (demoteIV inserted):
%iv = phi i64 [ %init, %preheader ], [ %iv.next, %latch ]
%demoteIV = trunc i64 %iv to i32
newBaseIV -- Narrow PHI Replacement
v475 = "newBaseIV";
desth = sub_15FB440(11, v289, v427, &v475, destd);
When the entire loop can use a 32-bit IV, the pass creates a completely new PHI node with i32 type in the loop header. The old 64-bit PHI is not simply truncated -- a new narrow induction cycle is constructed:
- A narrow initial value: `%newInit = trunc i64 %init to i32`
- A narrow PHI: `%newBaseIV = phi i32 [ %newInit, %preheader ], [ %newInc, %latch ]`
- A narrow increment: `%newInc = add i32 %newBaseIV, <step32>`
The old 64-bit IV becomes dead if all uses are successfully rewritten.
IR after (full base IV replacement):
%newInit = trunc i64 %init to i32
%newBaseIV = phi i32 [ %newInit, %preheader ], [ %newInc, %latch ]
%newInc = add i32 %newBaseIV, 1
iv_base_clone_ -- Comparison Clone
v475 = "iv_base_clone_";
v214 = sub_15F4880(v210); // clone instruction
sub_164B780(v214, &v475); // set name
sub_15F2120(v395, v198); // insert into block
When some uses of the IV require the original 64-bit width -- typically the loop exit comparison or an address computation that cannot be narrowed -- the pass clones the IV's base value. The clone instruction preserves the original semantics while allowing the primary loop computation to proceed with the narrow type. The clone is placed at the specific use site rather than at the loop header, avoiding the register pressure cost of keeping the wide value live across the entire loop body.
substIV -- Use Replacement
After generating the narrow IV infrastructure, the pass walks all uses of the original wide IV and replaces them with the demoted version. This is the final rewriting step:
- Arithmetic uses: replaced with uses of `%newBaseIV` or `%demoteIV`.
- Comparison uses: replaced with narrow comparisons (`newICmp`) on the demoted value.
- PHI uses at LCSSA boundaries: a `sext`/`zext` is inserted to restore 64-bit width for consumers outside the loop.
The pass also creates newICmp instructions -- narrower comparison instructions that compare i32 values instead of i64 values, rewriting the loop exit condition to match the demoted IV.
After all use replacement, sub_1CD0600 performs dead code cleanup: if the original 64-bit IV has no remaining uses, the wide PHI and its increment chain are deleted.
GPU Motivation: 32-bit vs. 64-bit Performance
The performance gap between 32-bit and 64-bit integer operations on NVIDIA GPUs is substantial and architectural, not merely a throughput difference:
Instruction count. 64-bit integer addition on PTX compiles to two machine instructions (add.cc.u32 + addc.u32) because the hardware ALU is 32-bit wide. A 64-bit multiply is even worse: it decomposes into multiple 32-bit multiplies and adds. Every loop iteration with a 64-bit IV pays this tax on the increment alone.
Register pressure. A single i64 value occupies a pair of 32-bit registers. In a loop with 3 IVs, demoting all three frees 3 registers -- enough to cross an occupancy cliff and gain an entire warp group on many kernels.
Address arithmetic. CUDA uses 64-bit pointers (nvptx64 target), so loop index computations are promoted to i64 by default during LLVM IR generation. But most CUDA kernels operate on arrays smaller than 4 GB, making the upper 32 bits of the index perpetually zero. The IV demotion pass recovers this wasted precision.
Pipeline utilization. GPU SM pipelines have limited integer execution units. Halving the instruction count for IV arithmetic directly translates to higher utilization of other functional units (FP, memory) in the same warp cycle.
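The two-instruction tax on 64-bit adds can be made concrete in scalar code. A hedged sketch (names ours) of the decomposition that the `add.cc.u32` + `addc.u32` pair performs in hardware:

```cpp
#include <cassert>
#include <cstdint>

// A 64-bit add decomposed onto a 32-bit ALU, mirroring the PTX pair
// add.cc.u32 (add low halves, set carry) + addc.u32 (add high halves
// with carry-in): two instructions and two register pairs where a
// demoted 32-bit IV would need one of each.
uint64_t add64_via_32(uint64_t a, uint64_t b) {
    uint32_t alo = (uint32_t)a, ahi = (uint32_t)(a >> 32);
    uint32_t blo = (uint32_t)b, bhi = (uint32_t)(b >> 32);
    uint32_t lo = alo + blo;         // add.cc.u32: low halves, carry out
    uint32_t carry = (lo < alo);     // the carry flag the hardware tracks
    uint32_t hi = ahi + bhi + carry; // addc.u32: high halves plus carry
    return ((uint64_t)hi << 32) | lo;
}
```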
Configuration
Knobs (registered at ctor_277_0, address 0x4F7BE0)
| Knob | Global | Default | Description |
|---|---|---|---|
| `remat-iv` | dword_4FBFB40 | 4 | IV demotion level. 0=off, 1-2=basic, 3=extended, 4=full, 5+=aggressive. |
| `dump-remat-iv` | dword_4FC0160 | 0 | Debug dump verbosity for IV demotion. Non-zero enables diagnostic output. |
The remat-iv knob is read by the main rematerialization driver (sub_1CE7DD0) at the post-remat phase gate. When non-zero, sub_1CD74B0 is invoked. The level value is then read inside sub_1CD74B0 to control which demotion patterns are attempted.
Interaction with ptxas
The ptxas assembler has its own rematerialization controls (--knob RegAllocRematEnable, RematEnable, etc.) but does not have an IV demotion equivalent. IV demotion is purely an IR-level transformation -- by the time ptxas sees the code, the IVs are already narrow. The ptxas knob --advanced-remat (0/1/2) controls machine-level rematerialization but does not perform type narrowing.
Diagnostic Strings
All strings emitted by sub_1CD74B0:
"phiNode" -- PHI node identification during loop header scan
"demoteIV" -- Truncation instruction creation
"newInit" -- Narrow initial value for new base IV
"newInc" -- Narrow increment for new base IV
"argBaseIV" -- Base IV argument lookup
"newBaseIV" -- New narrow PHI node creation
"newICmp" -- Narrow comparison instruction creation
"iv_base_clone_" -- Clone of IV base for original-width uses
"substIV" -- Use replacement pass
These strings are set as instruction name prefixes via sub_164B780 (for cloned instructions) or passed directly to the IR builder sub_15FB440. They appear in IR dumps when dump-remat-iv is non-zero or when the module is printed after the rematerialization pass.
Differences from Upstream LLVM
Upstream LLVM's IndVarSimplify pass (indvars) performs IV widening and narrowing through SCEV-based analysis. NVIDIA's IV demotion sub-pass is a completely separate implementation with several key differences:
| Aspect | Upstream IndVarSimplify | NVIDIA IV Demotion |
|---|---|---|
| Analysis framework | SCEV (ScalarEvolution) | NVIDIA live variable analysis + LoopInfo |
| Direction | Primarily widens narrow IVs to canonical form | Narrows wide IVs to reduce register pressure |
| Motivation | Canonical form for other optimizations | Register pressure reduction for GPU occupancy |
| Placement | Early in optimization pipeline | Late, inside rematerialization (post-optimization) |
| Range proof | SCEV range analysis | Direct (val + 0x80000000) <= 0xFFFFFFFF check |
| IV creation | SCEV expander | Direct IR builder calls (sub_15FB440) |
| Configuration | indvars-widen-indvars (bool) | remat-iv (6-level integer knob) |
The two passes are complementary. IndVarSimplify runs early and may widen IVs for canonical form. Later, IV demotion runs inside rematerialization and narrows them back when the wide form causes excessive register pressure. This is not redundant work -- the early widening enables other optimizations (loop vectorization, strength reduction), and the late narrowing recovers the register cost after those optimizations have completed.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| IV demotion entry | sub_1CD74B0 | 75KB | Main algorithm: PHI scan, range check, rewrite |
| IV analysis helper | sub_1CD5F30 | -- | Walks def-use chain to extract step/linearity |
| IV base lookup | sub_1CD5400 | -- | Finds base value of induction variable |
| Dead IV cleanup | sub_1CD0600 | -- | Removes unreferenced wide IVs after demotion |
| IR builder | sub_15FB440 | -- | Creates instruction (opcode, type, operand, name, insertpt) |
| Clone instruction | sub_15F4880 | -- | Clones an IR instruction (for iv_base_clone_) |
| Set name prefix | sub_164B780 | -- | Sets name string on cloned instruction |
| Insert into block | sub_15F2120 | -- | Inserts instruction at specified position |
| Replace uses | sub_1648780 | -- | Rewrites all uses of a value to a new value |
| Delete dead instr | sub_15F20C0 | -- | Erases instruction from parent block |
| Type store size | sub_127FA20 | -- | DataLayout::getTypeStoreSize -- returns bit width |
Cross-References
- Rematerialization -- parent pass; IV demotion is invoked in the post-remat phase
- ScalarEvolution -- upstream SCEV framework; not used directly by IV demotion but related
- IndVarSimplify -- upstream IV canonicalization pass
- LLVM Optimizer -- pipeline context showing where rematerialization runs
- Knobs -- central knob inventory
Base Address Strength Reduction
Address computation is a disproportionately expensive category of work on NVIDIA GPUs. The integer ALU units that compute memory addresses are a scarce resource relative to the FP/tensor throughput the hardware is designed to maximize. A typical unrolled loop body touching four arrays at A[tid + i], B[tid + i], C[tid + i], D[tid + i] -- where tid is a function of threadIdx.x, blockIdx.x, and blockDim.x -- may emit four independent 64-bit multiply-add chains per iteration, each recomputing the same base expression base_ptr + tid * element_size. Reducing those four chains to one base computation plus three cheap constant-offset additions can halve the integer instruction count in the loop body and free address registers that would otherwise stay live across the entire loop.
Base Address Strength Reduction (BASR) is an NVIDIA-proprietary IR-level pass that performs exactly this transformation. It scans loop bodies for memory operations that share a common base pointer expression, finds the one with the minimum constant offset (the "anchor"), hoists the anchor computation, and rewrites all remaining addresses as (anchor + relative_offset). The pass is confirmed by the string "BaseAddressStrengthReduce" at decompiled line 457 of sub_1C67780.
Key Facts
| Property | Value |
|---|---|
| Pass name | BaseAddressStrengthReduce |
| Entry point | sub_1C67780 (Legacy PM), sub_2CA4A10 (New PM) |
| Binary size | 58 KB (~1,400 decompiled lines) |
| Pass type | NVIDIA-proprietary, IR-level, loop body transform |
| Primary knobs | do-base-address-strength-reduce (two levels: 1 = no conditions, 2 = with conditions) |
| Chain variant | do-base-address-strength-reduce-chain (separate boolean toggle) |
| Negative offset control | dword_4FBCAE0 (aggressiveness for negative-offset patterns) |
| IV limit | base-address-strength-reduce-iv-limit (parametric) |
| Max IV | base-address-strength-reduce-max-iv (parametric) |
| Debug dump | dump-base-address-strength-reduce |
| Required analyses | LoopInfo (sub_1632FA0), DataLayout |
| Option registration | ctor_263_0 at 0x4F36F0 (shared with SCEV-CGP, 44 strings total) |
| Companion pass | Common Base Elimination (sub_1C5DFC0) |
| Helper | Bitcast helper at sub_1C637F0 (28 KB, strings "baseValue", "bitCastEnd") |
Algorithm
The pass operates in six phases, executing once per function. It processes all loop bodies simultaneously using worklists seeded from LoopInfo.
Phase 1 -- Initialization (lines 452-497)
The entry function retrieves LoopInfo via sub_1632FA0 and extracts the module's DataLayout from the function object (path: (a1+184)->field+24->field+40). It then allocates bookkeeping state:
- Eight hash maps at stack offsets `v374`-`v399`, keyed by `Value*` (the base pointer). Each map entry holds a linked list of memory instructions that share that base.
- Multiple worklists for basic blocks containing loads vs. stores.
- Threshold: `v429 = 2` -- the minimum number of uses of the same base before the pass considers strength reduction worthwhile.
- Pass counter: `v438 = 1` -- the initial pass number (the pass may iterate).
Phase 2 -- Address Pattern Collection (lines 518-600)
For each instruction in the target basic blocks (drawn from the a4 worklist):
- `sub_1C57390` classifies the address expression, extracting its structural form.
- `sub_1CCB2B0` computes alignment information from the DataLayout.
- `sub_1456040` extracts the base pointer from the address expression.
The base pointer is then categorized into one of two buckets:
| Category | Condition | Hash map | Worklist | Description |
|---|---|---|---|---|
| Non-pointer-type base | type_id != 15 | v382 | v363 | Integer/GEP-derived base addresses |
| Pointer-type base | type_id == 15 | v378 | v360 | Bases that are raw pointers to globals |
For pointer-type bases, sub_1CCDC20 further extracts the underlying global variable, allowing grouping of addresses to the same global even when accessed through different local pointer variables.
Hash map insertion uses sub_1C50900. If the base pointer is new (not yet in the map), the instruction list is initialized and the base is appended to the corresponding worklist. Otherwise, the instruction is appended to the existing list for that base.
for each instruction I in target BBs:
addr_info = classify_address(I) // sub_1C57390
alignment = compute_alignment(addr_info) // sub_1CCB2B0
base_ptr = extract_base(addr_info) // sub_1456040
if type_of(base_ptr) != POINTER_TYPE:
map_insert(hash_map_v382, base_ptr, I) // sub_1C50900
if is_new_entry:
worklist_v363.append(base_ptr)
else:
global = extract_global(base_ptr) // sub_1CCDC20
map_insert(hash_map_v378, global, I)
if is_new_entry:
worklist_v360.append(global)
Phase 3 -- Anchor Finding (lines 430-470)
For each base pointer that has accumulated at least v429 (2) uses, the pass determines the "anchor" -- the use with the minimum constant offset. This is the instruction whose address computation will be hoisted and shared.
For each candidate base:
- `sub_1C53170` decomposes each address expression into a `(base, constant_offset)` pair.
- The pass iterates over all uses and finds the one with the smallest constant offset:
  - For offsets that fit in 64 bits: direct integer comparison via sign-extended values.
  - For offsets wider than 64 bits: reads from extended-precision word arrays and compares word-by-word.
- The minimum-offset use becomes the anchor.
function find_anchor(base_ptr, use_list):
min_offset = +INF
anchor = null
for each use U in use_list:
(base, offset) = decompose_address(U) // sub_1C53170
if bit_width(offset) <= 64:
val = sign_extend_64(offset)
else:
val = read_extended_precision(offset)
if val < min_offset:
min_offset = val
anchor = U
return (anchor, min_offset)
Phase 4 -- Address Rewriting (lines 578-600)
Once the anchor is identified:
- `sub_13A5B00` creates a new base address instruction from the anchor's address computation. This instruction is placed at the loop preheader or the dominating point of all uses.
- For every other instruction sharing the same base, the pass computes the relative offset: `relative_offset = original_offset - anchor_offset`.
- `sub_14806B0` creates a new address expression `(new_base + relative_offset)` and replaces the original address operand.
function rewrite_addresses(anchor, anchor_offset, use_list):
new_base = create_base_instruction(anchor) // sub_13A5B00
for each use U in use_list:
if U == anchor:
replace_address(U, new_base)
else:
(_, orig_offset) = decompose_address(U)
rel_offset = orig_offset - anchor_offset
new_addr = create_offset_add(new_base, rel_offset) // sub_14806B0
replace_address(U, new_addr)
After this transformation, a loop body that previously contained:
load (base + tid*stride + 0) // original: full GEP chain
load (base + tid*stride + 16) // original: full GEP chain
store (base + tid*stride + 32) // original: full GEP chain
store (base + tid*stride + 48) // original: full GEP chain
Becomes:
anchor = base + tid*stride + 0 // hoisted once
load anchor // offset 0: use anchor directly
load (anchor + 16) // cheap add
store (anchor + 32) // cheap add
store (anchor + 48) // cheap add
The three 64-bit multiply-add chains are replaced by three 64-bit immediate additions.
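The anchor-and-rewrite scheme of Phases 3-4 can be modeled on plain offset data. A hedged sketch under assumed types -- the `Use` struct and function name are ours; the binary operates on IR instructions via `sub_1C53170`/`sub_14806B0`, not integers:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Each memory op sharing a base decomposes to a constant offset.
// The anchor is the minimum-offset use; every other use is rewritten
// as (anchor + relative_offset).
struct Use {
    int64_t offset;         // constant offset from the shared base
    int64_t rel = 0;        // offset relative to the anchor, after rewrite
    bool is_anchor = false;
};

void anchor_and_rewrite(std::vector<Use>& uses) {
    if (uses.size() < 2) return; // below the v429 = 2 use threshold
    auto it = std::min_element(uses.begin(), uses.end(),
        [](const Use& a, const Use& b) { return a.offset < b.offset; });
    it->is_anchor = true;        // this use's address gets hoisted once
    for (auto& u : uses) u.rel = u.offset - it->offset;
}
```

Running this on the four offsets from the example (0, 16, 32, 48) marks the offset-0 use as the anchor and leaves relative offsets 0/16/32/48, matching the rewritten loop body shown above.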
Phase 5 -- Negative Offset Handling (lines 512-520)
When dword_4FBCAE0 > 1 (the aggressiveness knob is set above default), the pass also considers address groups where the maximum offset has a negative sign bit. These represent patterns like:
load (base + tid*stride - 32)
load (base + tid*stride + 0)
load (base + tid*stride + 32)
Without this phase, the anchor would be the instruction at offset -32, producing negative relative offsets for the first use. Some hardware addressing modes handle negative offsets less efficiently, so this phase is gated separately.
For negative-offset candidates, the pass:
- Checks whether the base is loop-invariant via `sub_1C51340`.
- If loop-invariant, creates a separate common base via `sub_1C55CE0` that absorbs the negative component.
Phase 6 -- Red-Black Tree Tracking
The pass uses a red-black tree infrastructure (sub_220F040 for insertion, sub_220EF80 for lookup) shared with other NVIDIA passes. This provides O(log n) sorted-set operations for maintaining collections of instruction pointers and efficiently checking membership during the rewriting phase.
Hash Map Implementation
The address pattern hash maps use the standard DenseMap growth policy (75% load factor, 12.5% tombstone compaction) with NVVM-layer sentinels (-8 / -16). The resize/rehash logic lives in sub_1C54050 -- the same function used by Common Base Elimination. Hash keys are Value* pointers with linear probing. See Hash Table and Collection Infrastructure for the hash function and probing strategy.
Relationship with Common Base Elimination
BASR and Common Base Elimination (sub_1C5DFC0) attack the same problem -- redundant address computation -- but at different scopes and with different strategies:
| Dimension | Base Address Strength Reduction | Common Base Elimination |
|---|---|---|
| Scope | Intra-loop: operates within a single loop body | Inter-block: operates across the CFG using dominance |
| Grouping | Groups addresses by shared induction-variable-based base | Groups addresses by shared base pointer to the same global |
| Placement | Anchor placed at loop preheader | Anchor placed at common dominator of all uses |
| Offset model | Constant offsets relative to IV-derived base | Constant offsets relative to global-derived base |
| Entry point | sub_1C67780 | sub_1C5DFC0 |
| Size | 58 KB | 38 KB |
The two-pass approach is deliberate. Common Base Elimination runs first at the IR level, hoisting shared base expressions across control flow boundaries. BASR then runs within loop bodies, strength-reducing the IV-dependent address chains that CBE cannot handle because the IV changes each iteration.
Both passes share the same address decomposition helper (sub_1C53170), the same hash map infrastructure (sub_1C50900, sub_1C54050), and the same instruction creation routines (sub_13A5B00, sub_14806B0).
Relationship with SCEV-CGP
The BASR knobs are registered together with SCEV-CGP (Scalar-Evolution-based CodeGenPrepare) in ctor_263_0 at 0x4F36F0. This constructor registers 44 option strings total, covering both SCEV-CGP and BASR. The do-base-address-strength-reduce and do-scev-cgp knobs are stored in the same ctor_526_0 option block.
SCEV-CGP is a broader pass that performs SCEV-based address optimization using thread ID as an induction variable (scev-cgp-tid-max-value controls the maximum thread ID value for analysis). BASR is a sub-transformation within this address optimization framework -- it handles the specific case of multiple memory operations sharing a base, while SCEV-CGP handles the broader case of rewriting address expressions using scalar evolution.
Related SCEV-CGP knobs that interact with BASR:
| Knob | Purpose |
|---|---|
| `scev-cgp-old-base` | Controls whether SCEV-CGP creates new base expressions |
| `ignore-bad-base` | Bypasses validity checks on base pointer classification |
| `ignore-32-bit-overflow` | Skips 32-bit overflow checks in address arithmetic |
| `ignore-signed-32-bit-overflow` | Skips signed 32-bit overflow checks |
| `topo-sort-begin` | Controls topological sort start point for address chains |
| `special-reassociate-for-threadid` | Prevents reassociation from moving threadId-dependent expressions |
Configuration
Boolean Knobs
| Knob | Default | Description |
|---|---|---|
| `do-base-address-strength-reduce` | Enabled (level 2) | Master enable. Level 1 = unconditional; level 2 = with conditions (default); 0 = disabled. |
| `do-base-address-strength-reduce-chain` | Enabled | Enables the chain variant, which strength-reduces chains of dependent address computations |
| `dump-base-address-strength-reduce` | false | Prints diagnostic output when set |
Parametric Knobs
| Knob | Description |
|---|---|
| `base-address-strength-reduce-iv-limit` | Maximum number of induction variables to consider per loop |
| `base-address-strength-reduce-max-iv` | Maximum IV value for strength reduction eligibility |
Global Variables
| Global | Purpose |
|---|---|
| `dword_4FBCAE0` | Negative offset aggressiveness. When > 1, enables strength reduction of address groups with negative offsets. Also used as a special minimum-selection mode in MemorySpaceOpt. |
Diagnostic Strings
"BaseAddressStrengthReduce" -- Pass identification (line 457)
"baseValue" -- Bitcast helper: base value operand name (sub_1C637F0)
"bitCastEnd" -- Bitcast helper: end-of-chain marker (sub_1C637F0)
When dump-base-address-strength-reduce is enabled, the pass emits additional diagnostic output showing which base pointers were grouped, which anchor was selected, and which addresses were rewritten.
Key Functions
| Function | Address (Legacy) | Size | Role |
|---|---|---|---|
| Main entry | sub_1C67780 | 58 KB | Pass driver: initialization, collection, anchor finding, rewriting |
| Main entry (New PM) | sub_2CA4A10 | 62 KB | New Pass Manager variant |
| Address classifier | sub_1C57390 | -- | Classifies address expression structure |
| Address decomposer | sub_1C53170 | -- | Decomposes address into (base, constant_offset) pairs |
| Hash map insert | sub_1C50900 | -- | Inserts base pointer into pattern hash map |
| Hash map resize | sub_1C54050 | -- | Load-factor-based resize/rehash |
| Loop invariance check | sub_1C51340 | -- | Tests whether a value is loop-invariant |
| Negative offset handler | sub_1C55CE0 | -- | Creates common base for negative-offset patterns |
| Base instruction creation | sub_13A5B00 | -- | Creates the hoisted anchor address instruction |
| Offset rewriting | sub_14806B0 | -- | Creates (base + relative_offset) replacement |
| Base extraction | sub_1456040 | -- | Extracts base pointer from address expression |
| Global extraction | sub_1CCDC20 | -- | Extracts underlying global variable from pointer chains |
| Alignment computation | sub_1CCB2B0 | -- | Computes alignment from DataLayout |
| Bitcast helper | sub_1C637F0 | 28 KB | Handles bitcast chains in base address expressions |
| RB-tree insert | sub_220F040 | -- | Red-black tree insertion (shared infrastructure) |
| RB-tree lookup | sub_220EF80 | -- | Red-black tree membership check |
| LoopInfo retrieval | sub_1632FA0 | -- | Gets LoopInfo analysis for the function |
Cross-References
- Common Base Elimination -- the complementary inter-block pass
- Pass Overview & Inventory -- master pass listing
- Optimizer Pipeline -- pipeline position and option registration
- Rematerialization -- another pass trading computation for register pressure
- SCEV -- the scalar evolution analysis that SCEV-CGP (and indirectly BASR) depends on
Common Base Elimination
The Common Base Elimination pass hoists shared base address expressions to dominating points in the control flow graph, eliminating redundant recomputations of the same base pointer across multiple basic blocks. Where Base Address Strength Reduction targets intra-loop patterns driven by induction variables, Common Base Elimination operates at the inter-block level: it groups memory operations that share the same base pointer regardless of loop structure, finds their common dominator, and creates a single base computation at that dominator. Every original address is then rewritten as (hoisted_base + relative_offset).
This is a strictly GPU-motivated optimization. NVIDIA GPUs have limited integer ALU throughput relative to their floating-point pipelines, so any reduction in address arithmetic directly translates to freed execution slots for other work. On a typical CUDA kernel performing strided accesses across multiple branches (e.g., different cases of a switch over tile indices), the pass can eliminate dozens of redundant GEP chains that independently recompute the same base address.
The two-pass approach -- Common Base Elimination first at the IR level for inter-block redundancies, then Base Address Strength Reduction for intra-loop induction-variable patterns -- ensures comprehensive coverage of GPU address computation overhead.
Key Facts
| Property | Value |
|---|---|
| Pass name | "Common Base Elimination" |
| Entry point | sub_1C5DFC0 |
| Binary offset | 0x1C5DFC0 |
| Binary size | 38 KB (~850 decompiled lines) |
| Scope | Function-level |
| IR level | LLVM IR (pre-codegen) |
| Upstream equivalent | None -- entirely NVIDIA-proprietary |
| Complementary pass | Base Address Strength Reduction (sub_1C67780) |
| Primary knobs | scev-cgp-cross-block-limit -- limits common bases from a single block |
| Required analysis | Dominator tree (a1[23]), DataLayout |
Algorithm
The pass has four major phases: address decomposition, base pointer grouping, dominator-based hoisting, and address rewriting.
Phase 1 -- Address Expression Decomposition
For every memory operation (load, store, GEP-based address) in the function, the pass calls sub_1C53170 to decompose the address into a structured form:
struct AddressExpr {
Value *base_ptr; // The root pointer (alloca, global, argument)
Operand operands[]; // List of (index, constant_offset) pairs
unsigned operand_count; // Number of index terms
};
The result is stored as a (base_ptr, operand_list, operand_count) tuple. The decomposition strips away GEP chains to expose the underlying base pointer and accumulates constant offset terms separately from variable index terms. This is the same decomposition helper used by BASR (sub_1C67780), ensuring both passes reason about addresses in a compatible representation.
Phase 2 -- Base Pointer Grouping
The pass maintains two hash maps for grouping addresses:
Non-pointer-type bases (hash map at v382, keyed by base pointer value):
- Each memory operation whose decomposed base is not a pointer type (type_id != 15) is inserted via sub_1C50900.
- The hash entry accumulates a list of all instructions sharing that base.
- New bases are appended to worklist v363.
Pointer-to-global bases (hash map at v378, keyed by underlying global variable):
- For pointer-type bases, sub_1CCDC20 extracts the underlying global variable by walking through bitcast and GEP chains.
- This allows grouping addresses to the same global even when accessed through different local pointer variables.
- New globals are appended to worklist v360.
The hash maps use the standard DenseMap growth policy (75% load factor, 12.5% tombstone compaction) with NVVM-layer sentinels (-8 / -16). sub_1C54050 handles both resize and in-place rehash. See Hash Table and Collection Infrastructure for the complete specification.
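The described growth policy can be modeled compactly. The following Python sketch is illustrative only: the probing and slot layout are simplified, and the sentinel constants merely stand in for the NVVM-layer -8/-16 markers mentioned above.

```python
EMPTY, TOMB = -8, -16  # stand-ins for the NVVM-layer empty/tombstone sentinels

class BaseMap:
    """Toy open-addressing map modeling the described growth policy:
    resize at the 75% load factor; if tombstones exceed 12.5% of
    capacity, rehash in place instead of doubling."""

    def __init__(self, capacity=8):
        self.slots = [EMPTY] * capacity   # capacity is a power of two
        self.live = 0
        self.tomb = 0

    def _find(self, key):
        mask = len(self.slots) - 1
        i = hash(key) & mask
        while self.slots[i] != EMPTY and self.slots[i] != key:
            i = (i + 1) & mask            # linear probe
        return i

    def _rehash(self, new_cap):
        keys = [s for s in self.slots if s != EMPTY and s != TOMB]
        self.slots = [EMPTY] * new_cap
        self.tomb = 0
        for k in keys:
            self.slots[self._find(k)] = k

    def insert(self, key):
        cap = len(self.slots)
        if 4 * (self.live + self.tomb + 1) > 3 * cap:      # 75% threshold
            # mostly tombstones -> compact in place, else double capacity
            self._rehash(cap if 8 * self.tomb > cap else cap * 2)
        i = self._find(key)
        if self.slots[i] == EMPTY:
            self.slots[i] = key
            self.live += 1

    def remove(self, key):
        i = self._find(key)
        if self.slots[i] == key:
            self.slots[i] = TOMB
            self.live -= 1
            self.tomb += 1
```

The in-place rehash branch models why sub_1C54050 handles "both resize and in-place rehash": when most occupied slots are tombstones, compaction at the same capacity restores probe-chain quality without growing the table.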
Phase 3 -- Dominator Walk and Base Hoisting
For each base pointer group containing two or more uses, the pass:
1. Finds the anchor. Among all constant offsets in the group, the operand with the minimum constant offset becomes the anchor. For offsets up to 64 bits, the constant is extracted directly from the GEP operand. For wider offsets (> 64 bits), the pass reads from extended-precision word arrays. Sign-extended comparisons determine the minimum.
2. Computes the common dominator. The pass reads the function's dominator tree from a1[23] and walks it to find the nearest block that dominates all use sites. This is the standard findNearestCommonDominator operation -- iteratively walk both paths toward the root until they meet.
3. Inserts the hoisted base. sub_13A5B00 creates a new base address computation (a GEP or add instruction) at the terminator insertion point of the common dominator block. The hoisted instruction computes base_ptr + min_offset, which is the anchor's address.
4. Rewrites all uses. For each original memory operation in the group, sub_14806B0 rewrites the address as (hoisted_base + (original_offset - min_offset)). Since the anchor's own relative offset is zero, it becomes a direct use of the hoisted base.
In pseudocode:
fn run_common_base_elimination(F: &Function):
let dom_tree = F.dominator_tree // a1[23]
let data_layout = F.module.data_layout
// Phase 1+2: decompose and group
let base_groups: HashMap<Value*, Vec<(Instruction*, ConstantOffset)>> = {}
let global_groups: HashMap<GlobalVariable*, Vec<(Instruction*, ConstantOffset)>> = {}
for bb in F.basic_blocks():
for inst in bb.instructions():
if !is_memory_op(inst): continue
let (base, offsets, count) = sub_1C53170(inst)
if base.type_id != POINTER_TYPE:
base_groups[base].push((inst, offsets))
else:
let gv = sub_1CCDC20(base) // extract global
global_groups[gv].push((inst, offsets))
// Phase 3+4: hoist and rewrite
for (base, uses) in chain(base_groups, global_groups):
if uses.len() < 2: continue
let min_offset = uses.iter().map(|u| u.offset).min()
let anchor_inst = uses.find(|u| u.offset == min_offset).inst
// Find common dominator of all use blocks
let dom_block = uses[0].inst.parent
for use in uses[1..]:
dom_block = dom_tree.find_nearest_common_dominator(
dom_block, use.inst.parent)
// Hoist: create base+min_offset at dominator
let hoisted = sub_13A5B00(dom_block, base, min_offset)
// Rewrite all uses
for (inst, offset) in uses:
let relative = offset - min_offset
sub_14806B0(inst, hoisted, relative)
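The anchor-selection arithmetic in steps 1 and 4 can be checked with a tiny runnable model (Python; the function and operand names here are illustrative, not recovered symbols):

```python
def rewrite_group(uses):
    """uses: [(name, constant_offset)] for memory ops sharing one base.
    Returns the hoisted offset (base + min_offset becomes the anchor
    address) and each op's relative offset against that anchor."""
    min_offset = min(off for _, off in uses)
    return min_offset, {name: off - min_offset for name, off in uses}

# Three loads at byte offsets 16, 4, 40 from the same base pointer:
hoisted, rel = rewrite_group([("ld0", 16), ("ld1", 4), ("ld2", 40)])
# hoisted == 4: ld1 is the anchor, its relative offset is 0, so it
# becomes a direct use of the hoisted base; ld0 and ld2 pay one add each.
```

Choosing the minimum offset as the anchor guarantees every relative offset is non-negative, which keeps the rewritten addresses within the same overflow regime as the originals.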
Phase 4 -- Pointer-to-Global Grouping
The global-variable grouping deserves special attention. Consider two local pointers p and q that both derive from the same global array g:
%p = getelementptr [1024 x float], ptr @g, i64 0, i64 %tid
%q = getelementptr [1024 x float], ptr @g, i64 0, i64 %tid2
Without the global extraction step, these would be in different groups (keyed by %p vs %q). The sub_1CCDC20 helper walks through the pointer chain to find the underlying @g, allowing the pass to recognize that both addresses target the same global and can share a hoisted base.
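A minimal model of that extraction step (Python; the node classes are hypothetical stand-ins for LLVM value kinds, not recovered structures):

```python
# Strip bitcast/GEP wrappers until a GlobalVariable (or a non-strippable
# base) is reached -- the grouping key described for sub_1CCDC20.
class GlobalVariable:
    def __init__(self, name):
        self.name = name

class BitCast:
    def __init__(self, src):
        self.src = src

class GEP:
    def __init__(self, src):
        self.src = src

def extract_global(ptr):
    while isinstance(ptr, (BitCast, GEP)):
        ptr = ptr.src                      # walk down the pointer chain
    return ptr if isinstance(ptr, GlobalVariable) else None

g = GlobalVariable("g")
p = GEP(BitCast(g))    # %p derived from @g through a cast and a GEP
q = GEP(g)             # %q derived directly from @g
# Both resolve to the same underlying global, so both memory operations
# land in the same group despite having different immediate base values.
```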
Cost-Benefit Analysis
The pass trades register pressure at the dominator for reduced address computation at use sites. This trade-off is particularly favorable on GPUs for two reasons:
Benefit -- Reduced integer ALU pressure. Each eliminated GEP chain frees integer ALU slots. On SM architectures, integer instructions compete for the same warp scheduler slots as floating-point instructions. A kernel with N memory operations sharing the same base saves up to (N-1) complete base address recomputations. For a kernel doing 8 loads from the same struct through different control-flow paths, this eliminates 7 redundant address computations.
Cost -- Extended live range at the dominator. The hoisted base must remain live from the dominator block down to every use site. On GPUs, each additional live register reduces occupancy (the number of concurrent warps per SM). The pass implicitly relies on the subsequent rematerialization pass (sub_1CE7DD0) to undo any hoisting decisions that prove too costly for register pressure -- if the hoisted value's live range crosses too many basic blocks, rematerialization will re-derive it closer to the use point.
The SCEV-CGP knob scev-cgp-cross-block-limit provides an explicit limit on how many common bases can be created from a single block, acting as a safety valve against excessive register pressure growth. The related scev-cgp-idom-level-limit constrains how far up the dominator tree the pass is willing to hoist.
Relationship with Base Address Strength Reduction
The two passes operate at different granularities and are intentionally complementary:
| Aspect | Common Base Elimination | Base Address Strength Reduction |
|---|---|---|
| Scope | Inter-block (dominator-based) | Intra-loop (induction-variable-based) |
| Target pattern | Multiple BBs accessing the same base | Loop body with base + stride * iv |
| Mechanism | Hoist to common dominator | Factor out common base, use incremented pointer |
| Key helper | sub_1C53170 (address decomposition) | sub_1C53170 (same decomposition) |
| Offset handling | Minimum-offset anchor | Minimum-offset anchor (same strategy) |
| Pipeline order | Runs first | Runs after CBE |
The shared address decomposition helper (sub_1C53170) and the shared rewriting infrastructure (sub_13A5B00 for creating new base computations, sub_14806B0 for rewriting addresses) confirm that these passes were designed as a coordinated pair. Common Base Elimination runs first to eliminate inter-block redundancies, leaving BASR to focus on the remaining intra-loop stride patterns. Without CBE running first, BASR would encounter more diverse base expressions in loop bodies, reducing its grouping effectiveness.
Both passes share the same 0x1C50000-0x1CCFFFF address range in the binary, and BASR's helper functions (e.g., sub_1C637F0 -- base address bitcast helper, strings "baseValue", "bitCastEnd") are directly adjacent to CBE's entry point.
Configuration
Direct Knobs
No CBE-specific enable/disable knob has been identified in the binary. The pass appears to be unconditionally enabled when the SCEV-CGP subsystem is active.
Related SCEV-CGP Knobs
| Knob | Type | Description |
|---|---|---|
| scev-cgp-cross-block-limit | int | Maximum number of common bases that can be created from a single block. Limits the register pressure increase from hoisting. |
| scev-cgp-idom-level-limit | int | Maximum dominator tree depth for hoisting. Prevents hoisting too far from use sites. |
| do-scev-cgp | bool | Master enable for the SCEV-CGP subsystem. Disabling this may also disable CBE. |
| do-base-address-strength-reduce | int | Two levels: 1 = basic, 2 = with conditions. Controls the companion BASR pass. |
| do-base-address-strength-reduce-chain | bool | Enables chained strength reduction in BASR. |
| base-address-strength-reduce-iv-limit | int | IV limit for BASR. |
| base-address-strength-reduce-max-iv | int | Maximum IVs considered by BASR. |
BASR Aggressiveness Knob
The global dword_4FBCAE0 controls aggressiveness for negative-offset handling in the BASR companion pass. When dword_4FBCAE0 > 1, BASR also considers base groups where the maximum offset has a negative sign bit, checking via sub_1C51340 whether the base is loop-invariant before creating a separate common base via sub_1C55CE0. This knob does not directly affect CBE but influences how much address redundancy remains for CBE to handle.
Diagnostic Strings
"Common Base Elimination"
The pass registers a single diagnostic string (its name). No additional debug/dump strings have been identified. The pass does not appear to have a dedicated dump knob analogous to dump-base-address-strength-reduce for BASR.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| CommonBaseElimination::run | sub_1C5DFC0 | 38 KB | Main entry point -- orchestrates all four phases |
| decomposeAddress | sub_1C53170 | -- | Decomposes a memory address into (base, offset_list, count) tuple. Shared with BASR. |
| hashMapGrowOrRehash | sub_1C54050 | -- | Hash map resize/rehash with load-factor policy |
| hashMapInsertOrLookup | sub_1C50900 | -- | Insert into or look up in the base-pointer hash map |
| extractGlobalFromPointerChain | sub_1CCDC20 | -- | Walks bitcast/GEP chains to find the underlying GlobalVariable |
| createCommonBaseForNegativeOffsets | sub_1C55CE0 | -- | Creates a separate common base when the max offset is negative. Used by BASR, available to CBE. |
| isBaseLoopInvariant | sub_1C51340 | -- | Checks whether a base address is loop-invariant |
| classifyAddressExpression | sub_1C57390 | -- | Classifies an instruction's address expression type |
| createNewBaseInstruction | sub_13A5B00 | -- | Creates a new base address computation at the insertion point |
| rewriteAddressAsBaseOffset | sub_14806B0 | -- | Rewrites an address as (new_base + relative_offset) |
| extractBasePointer (SCEV helper) | sub_1456040 | -- | Extracts the base pointer from an address expression (SCEV getStart/getOperand(0)) |
Cross-References
- Base Address Strength Reduction -- the companion intra-loop pass
- SCEV-CGP knobs -- knobs controlling cross-block limits and IDOM depth
- NVIDIA Custom Passes Overview -- pass inventory and registration
- Rematerialization -- downstream pass that can undo costly hoisting by re-deriving values closer to use sites
- Other NVIDIA Passes -- summary entries for CBE and BASR
- LLVM Optimizer -- two-phase pipeline where CBE runs
CSSA -- Conventional SSA for GPU Divergence
Standard SSA form assumes that a PHI node selects its incoming value based solely on the control flow edge along which execution arrived. On a scalar CPU, exactly one predecessor edge is taken per dynamic execution of the PHI, so this assumption holds trivially. On an NVIDIA GPU, it does not. A warp of 32 threads executes in lockstep, and when control flow diverges -- different threads take different branches -- all paths are eventually serialized and the warp reconverges. At the reconvergence point, a standard PHI node cannot correctly select a single incoming value because the warp carries live values from multiple predecessors simultaneously. The wrong thread could see the wrong value.
CSSA (Conventional SSA) is NVIDIA's transformation that rewrites the IR so that every PHI node is safe under warp-divergent execution. It does this by inserting explicit copy instructions at points where threads reconverge, ensuring that each thread's value is materialized into its own copy before the PHI merges anything. The name "Conventional SSA" comes from the SSA literature: a program is in CSSA form when every PHI node's operands can be simultaneously live without interfering -- the PHI web has no overlapping live ranges. This property is exactly what GPU divergence demands.
| Pass location | sub_3720740 (22KB, ~800 lines decompiled) |
| Address range | 0x3720740--0x3721501 (3521 bytes) |
| Gate knob | do-cssa (NVVMPassOptions boolean toggle) |
| Coalesce knob | cssa-coalesce (controls copy coalescing aggressiveness) |
| Debug knobs | cssa-verbosity, dump-before-cssa |
| Container knob | CSSACoalescing (NVVM container format, parsed at sub_CD9990) |
| Debug string | "IR Module before CSSA:\n" |
| Helper cluster | sub_371F790 (27KB), sub_371F160, sub_371EDF0, sub_371CDC0 |
| Pass-option slot | One of the 221 NVVMPassOptions slots (boolean do/don't pair) |
| Pipeline position | Late IR, after optimization, before SelectionDAG lowering |
| Upstream equivalent | None. LLVM has no concept of warp-divergent PHI semantics. |
GPU Divergence Background
The Warp Execution Model
NVIDIA GPUs execute threads in warps of 32 under the SIMT model: all threads share a program counter, and divergent branches serialize both paths before the warp reconverges. The full warp execution model and its implications for cicc are documented in the GPU Execution Model.
Why Standard SSA Breaks
Consider a diamond CFG:
entry
/ \
then else
\ /
join <-- PHI(%x = [then: %a], [else: %b])
On a CPU, the PHI at join works correctly: execution came from exactly one predecessor, so the PHI selects the corresponding value. On a GPU warp where threads 0-15 took then and threads 16-31 took else, both paths executed sequentially. When the warp reconverges at join, the PHI must produce %a for threads 0-15 and %b for threads 16-31 simultaneously in the same register. A naive lowering of the PHI to a simple register copy is incorrect -- whichever path executed last would overwrite the value from the first path.
The CSSA Solution
CSSA transforms the IR so that the PHI web has non-interfering live ranges. Concretely, it inserts copy instructions at the end of each predecessor block so that each thread's value is written into a dedicated copy before the warp reconverges:
entry
/ \
then else
%a_copy %b_copy <-- inserted copies (one per predecessor)
\ /
join
%x = PHI [then: %a_copy], [else: %b_copy]
Now the PHI's operands occupy distinct virtual registers. During later register allocation, the allocator can assign them the same physical register only when their live ranges truly do not overlap -- which is the correct condition for divergent execution. The copies give the allocator the freedom to keep the values separate when divergence requires it.
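The failure mode and the fix can be demonstrated with a toy lockstep simulation in Python (illustrative only -- real divergence is handled by hardware active masks, not lists):

```python
# Toy SIMT model: 8 threads in one warp; threads 0-3 take 'then',
# threads 4-7 take 'else'. Divergent paths execute serially.
THREADS = 8
took_then = [t < 4 for t in range(THREADS)]

# Scalar-style PHI lowering: the PHI becomes a whole-register move at the
# end of each path. The paths are serialized, so the second move clobbers
# the first for ALL lanes -- threads 0-3 wrongly observe "b".
x = ["a"] * THREADS          # 'then' path executes first
x = ["b"] * THREADS          # 'else' path executes second, overwriting all
scalar_result = x

# CSSA lowering: each predecessor materializes its value into a dedicated
# copy under its own active mask; the copies occupy distinct registers and
# never interfere, so the merge selects the right value per lane.
a_copy = ["a" if took_then[t] else None for t in range(THREADS)]
b_copy = [None if took_then[t] else "b" for t in range(THREADS)]
cssa_result = [a_copy[t] if took_then[t] else b_copy[t]
               for t in range(THREADS)]
```

The scalar lowering is exactly what a CPU-oriented PHI elimination would emit; the per-predecessor copies are what CSSA's non-interference property buys.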
Algorithm
The sub_3720740 function implements CSSA in several phases:
Phase 1: Basic Block Ordering and Numbering
The function begins by iterating over all basic blocks in the LLVM function (accessed via [r15], the LLVM Module/Function pointer) and assigning sequential numbering. Each basic block receives an ordinal stored at offset +0x48 (preorder index) and +0x4C (reverse postorder index). These indices are used later for dominance and reconvergence queries. The block list is walked via the standard LLVM doubly-linked list at function offsets +0x48/+0x50 (begin/end sentinels), with a secondary worklist stored in a dynamic array at [rbp-0x240] that grows via the standard SmallVector growth function sub_C8D5F0.
After ordering, the function sets byte [r8+0x70] = 1 and dword [r8+0x74] = 0 on the pass state object (at [r15+8]), marking the ordering phase as complete. If the ordering was already done (byte [r8+0x70] is non-zero on entry), the function skips directly to phase 2.
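A sketch of that numbering step on a diamond CFG (Python; the traversal is standard DFS, while the block offsets and worklist mechanics described above are not modeled):

```python
# Assign preorder and reverse-postorder indices over a toy CFG given as
# an adjacency dict -- the two indices CSSA stores per basic block.
def number_blocks(cfg, entry):
    pre, post, seen = {}, [], set()

    def dfs(block):
        seen.add(block)
        pre[block] = len(pre)            # preorder: numbered on entry
        for succ in cfg[block]:
            if succ not in seen:
                dfs(succ)
        post.append(block)               # postorder: recorded on exit

    dfs(entry)
    rpo = {b: i for i, b in enumerate(reversed(post))}
    return pre, rpo

cfg = {"entry": ["then", "else"], "then": ["join"],
       "else": ["join"], "join": []}
pre, rpo = number_blocks(cfg, "entry")
# In RPO, 'join' is numbered after both 'then' and 'else', which is the
# property dominance and reconvergence queries rely on.
```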
Phase 2: PHI Node Scanning and Hash Map Population
The function iterates over every basic block (outer loop at 0x37208C0) and within each block walks the instruction use-list (inner loop at 0x3720930). Instructions are identified by checking byte [rbx-0x18] (the LLVM Value tag / opcode byte) against 0x54 (decimal 84), which is the LLVM PHI node opcode. Non-PHI instructions are skipped.
For each PHI node found, the function:
- Increments a monotonic counter at
[r15+0x78]to assign a unique PHI ID. - Computes a hash of the PHI's pointer value using the standard NVIDIA hash:
h = (ptr >> 4) ^ (ptr >> 9), masked by(table_size - 1). This is the same hash function used across CICC's DenseMap infrastructure. - Inserts the PHI (or looks it up) in the hash map at
[r15+0x60]with metadata fields: key at[slot+0], PHI ID at[slot+8]. The hash table uses LLVM-layer sentinels (-4096 / -8192); see Hash Table and Collection Infrastructure for the probing and growth policy. - Calls
sub_A41E30to resize the hash table when the load factor exceeds the 75% threshold.
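The recovered hash is easy to reproduce; the Python sketch below illustrates why the shift amounts make sense for 16-byte-aligned heap pointers (the example addresses are arbitrary, not taken from the binary):

```python
def nvvm_ptr_hash(ptr, table_size):
    """h = (ptr >> 4) ^ (ptr >> 9), masked to a power-of-two table size."""
    assert table_size & (table_size - 1) == 0
    return ((ptr >> 4) ^ (ptr >> 9)) & (table_size - 1)

# 16-byte-aligned pointers carry no entropy in their low 4 bits; shifting
# by 4 discards them, and xoring in higher bits (>> 9) mixes allocations
# that differ only in mid-order bits.
slots = [nvvm_ptr_hash(0x7F00001000 + 16 * i, 64) for i in range(4)]
# Four consecutive 16-byte-spaced allocations land in distinct buckets.
```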
Phase 3: Copy Insertion at Reconvergence Points
After populating the PHI map, the function enters the copy-insertion phase. For each basic block that contains PHI nodes, it:
- Walks the PHI's incoming values (the use-list at offset +0x18 through the instruction's operand chain at 32-byte stride).
- For each incoming value, calls sub_371F160 with r8d=1 (the "insert copy" flag). This helper creates a copy instruction at the end of the predecessor block, before the terminator. The copy is named with the "pcp" (PHI copy propagation) prefix string, as evidenced by the lea rax, aPcp instruction at 0x3720D34.
- Calls sub_ACA8A0 to set the name on the newly created copy instruction.
- Calls sub_371CDC0 with an instruction builder struct to create the actual copy/move IR instruction. The call passes opcode 0x22D7 (8919 decimal) as the first argument via edi -- this is likely an NVVM-internal opcode for a divergence-safe copy.
- Calls sub_371EDF0 to insert the new copy instruction into the predecessor block's instruction list. This is followed by sub_BD84D0 (the standard LLVM insertBefore/insertAfter) to splice the instruction into position.
- Updates the PHI node's use chain: the operand that previously pointed to the original value now points to the copy. This rewiring is done at 0x3720C87--0x3720CDD by manipulating the 32-byte use-def chain entries (pointer at [use+0], predecessor at [use+8], backlink at [use+0x10]).
Phase 4: Instruction-Level Copy Propagation
After copy insertion, the function iterates over all basic blocks a second time (0x3720A2F--0x3720A62). For each instruction in each block, it calls sub_371F790 (27KB, the "NVPTX intrinsic operand builder" / copy propagation helper). This function propagates the "pcp" copies through the instruction graph, replacing uses of the original values with uses of the copies where appropriate, and eliminating redundant copies where the original value and the copy provably carry the same value for all threads.
Phase 5: Dead Copy Cleanup
The final phase walks a linked list at [r15+0x28] (a cleanup worklist). For each entry, it checks whether the instruction at [entry+8] has zero remaining uses ([rdi+0x10] == 0). If so, it calls sub_B43D60 to erase the dead instruction. This removes copies that were rendered unnecessary by the propagation phase.
Copy Coalescing
The cssa-coalesce knob controls how aggressively the pass coalesces the inserted copies back together. Without coalescing, CSSA inserts one copy per PHI operand per predecessor -- potentially a large number of copies in control flow with many branches. Coalescing identifies cases where two or more copies carry the same value and can share a single register, reducing the copy overhead.
The CSSACoalescing knob in the NVVM container format (parsed by sub_CD9990 from the finalizer knobs structure) provides a separate control path for the same behavior. The container knob is categorized alongside register allocation and scheduling controls (AdvancedRemat, DisablePredication, DisableXBlockSched, ReorderCSE), confirming that CSSA coalescing is considered part of the register allocation subsystem.
deSSA Alternative
The usedessa knob (default value 2, registered at ctor_358_0 at 0x50E8D0, stored in dword_4FD26A0) selects an alternative path for PHI elimination during the transition from SSA to machine code. Despite its name suggesting "de-Static Single Assignment", analysis of the dispatch functions shows it controls the scheduling and PHI elimination pipeline:
| Mode | Pre-RA Scheduling | Post-RA Scheduling | Behavior |
|---|---|---|---|
| 1 | Skipped | Minimal (single pass) | Simple mode -- no pre-RA scheduling |
| 2 (default) | Full (&unk_4FC8A0C) | Three passes + StackSlotColoring | Full mode -- complete scheduling pipeline |
The deSSA mode and CSSA transformation are complementary. CSSA operates at the LLVM IR level, converting PHI nodes into a form safe for GPU divergence before instruction selection. The usedessa mode controls how PHI nodes are ultimately eliminated during the MachineIR lowering, after SelectionDAG has already consumed the CSSA-transformed IR. When usedessa=2 (default), the full scheduling pipeline runs, giving the register allocator maximum flexibility to handle the extra copies that CSSA introduced. When usedessa=1, the minimal scheduling mode may be appropriate for debugging or for kernels where scheduling causes regressions.
Configuration Knobs
NVVMPassOptions Knob
| Knob | Type | Description |
|---|---|---|
| do-cssa | bool | Master enable/disable for the CSSA pass |
Set via -opt "-do-cssa=0" to disable the pass entirely.
cl::opt Knobs (ctor_705 at 0x5BD430)
| Knob | Type | Default | Global | Description |
|---|---|---|---|---|
| cssa-coalesce | int | (unknown) | (ctor_705 data) | Controls PHI operand coalescing aggressiveness. Higher values = more aggressive coalescing = fewer copies but higher risk of incorrect merging under divergence. |
| cssa-verbosity | int | 0 | (ctor_705 data) | Verbosity level for diagnostic output during the CSSA transformation. |
| dump-before-cssa | bool | false | qword_5050A28 | When non-zero, dumps the entire IR module before CSSA runs. Triggers the "IR Module before CSSA:\n" output followed by sub_A69980 (Module::print). |
Container-Format Knob
| Knob | Parsed At | Category | Description |
|---|---|---|---|
| CSSACoalescing | sub_CD9990 | Register allocation / scheduling | Controls CSSA coalescing from the NVVM container format. Parsed alongside AdvancedRemat, DisablePredication, DisableXBlockSched. |
Related Knob
| Knob | Type | Default | Global | Description |
|---|---|---|---|---|
| usedessa | int | 2 | dword_4FD26A0 | Selects deSSA method / scheduling pipeline mode. Mode 1 = simple (no pre-RA scheduling), mode 2 = full. |
Diagnostic Strings
"IR Module before CSSA:\n" -- Module dump header (dump-before-cssa)
"pcp" -- PHI copy propagation instruction name prefix
The "pcp" prefix is assigned to all copy instructions created by the CSSA pass. These copies can be identified in IR dumps by their %pcp naming. After register allocation, these copies may be eliminated (coalesced into the same physical register) or materialized as actual move instructions in the final PTX.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| CSSA main | sub_3720740 | 22KB | BB ordering, PHI scanning, copy insertion, cleanup |
| PCP builder | sub_371F790 | 27KB | PHI copy propagation / intrinsic operand builder |
| Copy insertion helper | sub_371F160 | -- | Creates copy instruction in predecessor block |
| Copy instruction creator | sub_371EDF0 | -- | Inserts copy into instruction list |
| Copy IR builder | sub_371CDC0 | -- | Builds the copy instruction IR node |
| Hash table grow | sub_A41E30 | -- | DenseMap resize for PHI hash table |
| Module printer | sub_A69980 | -- | Module::print (for dump-before-cssa) |
| raw_ostream::write | sub_CB6200 | -- | String output for debug dump |
| Debug stream getter | sub_C5F790 | -- | Returns current debug output stream |
| Instruction eraser | sub_B43D60 | -- | Erases dead instruction from parent block |
| Instruction insert | sub_BD84D0 | -- | BasicBlock::insert (instruction splice) |
| Name setter | sub_ACA8A0 | -- | Value::setName for "pcp" prefix |
| Use chain rewrite | sub_B96E90 | -- | replaceAllUsesWith on operand |
| Use helper | sub_B91220 | -- | Use-list manipulation |
| DenseMap grow helper | sub_C8D5F0 | -- | SmallVector/DenseMap capacity growth |
| Knob registration | ctor_705 (0x5BD430) | 5.4KB | Registers cssa-coalesce, cssa-verbosity, dump-before-cssa |
| Container knob parser | sub_CD9990 | 31KB | Parses CSSACoalescing from NVVM container |
| deSSA dispatch (post-RA) | sub_21668D0 | -- | Scheduling pipeline mode selector |
| deSSA dispatch (pre-RA) | sub_2165850 | -- | Pre-RA scheduling mode selector |
Differences from Upstream LLVM
LLVM's standard PHI elimination pass (llvm::PHIEliminationPass, registered as "phi-node-elimination" at pipeline slot 493 in CICC's pass parser) lowers PHI nodes to machine copies during the SelectionDAG-to-MachineIR transition. It operates under the assumption that PHI semantics follow scalar control flow -- exactly one predecessor contributes a value at each dynamic execution.
NVIDIA's CSSA pass runs before instruction selection, at the LLVM IR level, and transforms the IR into a form where PHI elimination can proceed safely even when the underlying execution model is SIMT. The two passes are not alternatives -- CSSA runs first to prepare the IR, then standard PHI elimination runs later to lower the CSSA-safe PHI nodes to machine copies.
This is one of the fundamental semantic gaps between LLVM's CPU-centric IR model and GPU reality. LLVM assumes sequential scalar semantics; NVIDIA's CSSA pass bridges that gap by making the implicit thread-level parallelism explicit in the copy structure of the IR.
Common Pitfalls
These are mistakes a reimplementor is likely to make when building an equivalent CSSA transformation for GPU targets.
1. Inserting copies only at the merge block instead of at the end of each predecessor. The entire point of CSSA is that copies must be placed before the warp reconverges, not at the reconvergence point. If you insert the copy instruction at the beginning of the merge block (after the PHI), the warp has already reconverged and whichever path executed last has overwritten the register value for all threads. Copies must be at the terminator position of each predecessor block, before control leaves that block. This is the fundamental GPU-vs-CPU distinction: on a CPU, only one predecessor executes so placement does not matter; on a GPU, all predecessors may execute sequentially within the same warp.
2. Coalescing copies that have divergent live ranges. The cssa-coalesce knob controls how aggressively copies are merged back together. Over-aggressive coalescing can assign two copies to the same physical register when their live ranges overlap under divergence -- threads from different predecessor paths would see each other's values. The coalescer must verify that live ranges are truly non-interfering under the SIMT execution model, not just under the sequential CFG model. A reimplementation that reuses a standard LLVM register coalescer without divergence-aware interference checking will produce silent miscompilation on any kernel with divergent control flow.
3. Failing to insert copies for uniform PHI nodes that become divergent after later transformations. CSSA runs before instruction selection, but divergence analysis at that point may be imprecise. A PHI node classified as uniform (all threads agree on the incoming edge) may become effectively divergent after subsequent loop transformations or predication changes the control flow. The safe approach is to insert copies for all PHI nodes and let the coalescing phase remove unnecessary ones. A reimplementation that skips "uniform" PHI nodes based on divergence analysis risks correctness if that analysis is later invalidated.
4. Using a standard LLVM PHIElimination pass without the CSSA preprocessing step. LLVM's built-in PHI elimination assumes scalar control flow semantics (exactly one predecessor contributes at runtime). Running it directly on GPU IR without first converting to CSSA form will produce incorrect register assignments whenever a warp diverges at a branch leading to a PHI merge point. CSSA is not a replacement for PHI elimination -- it is a prerequisite that transforms PHI semantics into a form safe for the standard lowering.
5. Not propagating the "pcp" copy through the instruction graph after insertion. Phase 4 of the algorithm (copy propagation via sub_371F790) replaces uses of original values with uses of the inserted copies. A reimplementation that inserts copies but skips this propagation step will leave the PHI node still referencing the original value, making the copies dead. The subsequent dead-copy cleanup (Phase 5) will then erase them, and the transformation has no effect -- the original divergence-unsafe PHI remains.
Reimplementation Checklist
- Basic block ordering and numbering. Assign preorder and reverse-postorder indices to every basic block (stored at block offsets +0x48/+0x4C), used later for dominance and reconvergence queries.
- PHI node scanning and hash map population. Walk all instructions across all basic blocks, identify PHI nodes (opcode 0x54), assign monotonic IDs, and insert into a DenseMap using the hash `(ptr >> 4) ^ (ptr >> 9)` with LLVM-layer sentinels (-4096/-8192) and 75% load-factor growth.
- Copy insertion at reconvergence points. For each PHI node's incoming value, insert a `"pcp"`-prefixed copy instruction at the end of the predecessor block (before the terminator) using opcode 0x22D7 (divergence-safe copy), then rewire the PHI's use chain so the operand points to the copy instead of the original value.
- Copy propagation. Iterate all blocks a second time, invoking the PCP builder on each instruction to propagate inserted copies through the instruction graph, replacing uses of original values with uses of copies where appropriate and eliminating redundant copies where original and copy provably carry the same value for all threads.
- Dead copy cleanup. Walk the cleanup worklist, check each entry for zero remaining uses, and erase dead copy instructions via `eraseFromParent`.
- Copy coalescing (`cssa-coalesce`). Implement configurable coalescing that identifies cases where multiple `"pcp"` copies carry the same value and can share a single register, reducing copy overhead while preserving correctness under warp divergence.
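The copy-insertion step above can be sketched on a toy IR. This is a minimal illustration, not CICC's implementation: the `Block`/`Phi` structures, string-based instructions, and `insertCssaCopies` name are all hypothetical stand-ins for the binary's internal IR, and only the placement rule is modeled — copies land at the end of each predecessor (pitfall #1: never in the merge block), and the PHI is rewired to read the copy.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy IR: a block holds instruction strings; a PHI maps predecessor -> value.
// Hypothetical structures -- the real pass operates on cicc's internal IR.
struct Phi   { std::string name; std::map<std::string, std::string> incoming; };
struct Block { std::string name; std::vector<std::string> body; std::vector<Phi> phis; };

// Phases 2-3 of the CSSA transform: for every PHI operand, append a
// "pcp"-prefixed copy at the END of the predecessor (before its terminator
// in real IR, opcode 0x22D7 in cicc), then rewire the PHI so the operand
// points at the copy instead of the original value.
void insertCssaCopies(std::map<std::string, Block>& cfg, Block& merge) {
    for (Phi& phi : merge.phis) {
        for (auto& [pred, value] : phi.incoming) {
            std::string copy = "pcp." + phi.name + "." + pred;
            cfg[pred].body.push_back(copy + " = copy " + value);
            value = copy;  // rewire the use chain (Phase 4 would propagate further)
        }
    }
}
```

After the call, each predecessor carries its own copy, so whichever path a diverged warp executes last cannot clobber the value the other path's threads will read at the merge.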
Cross-References
- NVIDIA Custom Passes -- CSSA listed as `sub_3720740` with knobs `cssa-coalesce`, `cssa-verbosity`, `dump-before-cssa`
- Register Allocation -- greedy RA consumes CSSA-prepared IR
- Scheduling -- `usedessa` knob controls pre-RA/post-RA scheduling mode
- Code Generation Pipeline -- CSSA's position in the overall compilation flow
- StructurizeCFG -- related pass that ensures structured control flow for PTX
- Rematerialization -- CSSA copies may interact with remat decisions
- Configuration Knobs -- full knob inventory
Minor NVIDIA Passes
This page indexes NVIDIA-proprietary passes that are too small or insufficiently decompiled for dedicated pages. For the ten passes that were previously documented here and now have full pages, see the links below.
Passes with Dedicated Pages
| Pass | Page |
|---|---|
| NVVM IR Verifier | nvvm-verify (Deep Dive) |
| NVVM Intrinsic Lowering | nvvm-intrinsic-lowering |
| Dead Synchronization Elimination | dead-sync-elimination |
| IV Demotion | iv-demotion |
| Struct/Aggregate Splitting | struct-splitting |
| Base Address Strength Reduction | base-address-sr |
| Common Base Elimination | common-base-elim |
| CSSA (Conventional SSA) | cssa |
| FP128/I128 Emulation | fp128-emulation |
| Memmove Unrolling | memmove-unroll |
alloca-hoisting -- Entry Block Alloca Consolidation
| Field | Value |
|---|---|
| Pass ID | alloca-hoisting |
| Entry point | sub_21BC7D0 |
| Scope | Machine-level pass |
PTX requires all stack allocations to reside in the entry block. This pass moves alloca instructions inserted by inlining or loop transforms into the entry block, preserving order and alignment. Without it, non-entry-block allocas produce invalid PTX.
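The hoisting itself is mechanically simple. The sketch below works on a hypothetical flat instruction list (strings tagged by an `alloca` prefix) rather than CICC's machine IR, and models only the invariant the pass enforces: all allocas move to the front in their original relative order, so alignment and layout decisions remain stable.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy representation: one instruction per string. Hypothetical -- the real
// pass walks machine instructions across blocks and moves allocas into the
// entry block; here a single list stands in for the whole function.
using Inst = std::string;

// Partition allocas to the front, preserving relative order on both sides.
std::vector<Inst> hoistAllocas(const std::vector<Inst>& body) {
    std::vector<Inst> allocas, rest;
    for (const Inst& i : body)
        (i.rfind("alloca", 0) == 0 ? allocas : rest).push_back(i);
    allocas.insert(allocas.end(), rest.begin(), rest.end());
    return allocas;
}
```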
image-optimizer -- Texture/Surface Access Optimization
| Field | Value |
|---|---|
| Pass ID | nvptx-image-optimizer |
| Entry point | sub_21BCF10 |
| Scope | Machine-level pass (pre-emission) |
Groups related texture loads for cache utilization and merges redundant surface operations. Works in coordination with Replace Image Handles (below). See also Machine-Level Passes.
nvptx-peephole -- Machine-Level Peephole
| Field | Value |
|---|---|
| Pass ID | nvptx-peephole |
| Entry point | sub_21DB090 |
| Scope | Machine-level pass (pre-RA) |
| Knob | enable-nvvm-peephole (default: on) |
PTX-specific peephole that folds redundant cvta address space conversions, optimizes predicate patterns, and simplifies PTX-specific instruction sequences. Distinct from the IR-level NVVM Peephole. See Machine-Level Passes for pipeline position.
proxy-reg-erasure -- Redundant cvta.to.local Removal
| Field | Value |
|---|---|
| Pass ID | nvptx-proxy-reg-erasure |
| Entry point | sub_21DA810 |
| Scope | Machine-level pass (late post-RA) |
Removes redundant cvta.to.local instructions left by address space lowering. Runs late in the pipeline after register allocation. See Machine-Level Passes.
valid-global-names -- PTX Identifier Sanitization
| Field | Value |
|---|---|
| Pass ID | nvptx-assign-valid-global-names |
| Entry point | sub_21BCD80 |
| Scope | Machine-level pass (pre-emission) |
Rewrites global symbol names to comply with PTX naming rules, removing characters illegal in PTX identifiers (@, $, etc.). Runs immediately before PTX emission.
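A plausible sketch of the sanitization rule, assuming a conservative character set: keep `[A-Za-z0-9_]`, replace everything else with `_`, and prepend `_` if the result would start with a digit. The exact substitution scheme cicc uses has not been recovered; `sanitizePtxName` is a hypothetical name.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Rewrite a symbol name so it satisfies PTX identifier rules.
// Assumption: illegal characters are replaced by '_' rather than dropped.
std::string sanitizePtxName(const std::string& name) {
    std::string out;
    for (char c : name)
        out += (std::isalnum(static_cast<unsigned char>(c)) || c == '_') ? c : '_';
    if (out.empty() || std::isdigit(static_cast<unsigned char>(out[0])))
        out.insert(out.begin(), '_');  // identifiers may not start with a digit
    return out;
}
```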
replace-image-handles -- Texture/Surface Handle Substitution
| Field | Value |
|---|---|
| Pass ID | nvptx-replace-image-handles |
| Entry point | sub_21DBEA0 |
| Scope | Machine-level pass (pre-emission) |
Replaces IR-level texture/surface handle references with PTX-level .tex / .surf declarations. Paired with image-optimizer above. See Machine-Level Passes.
extra-mi-printer -- Register Pressure Diagnostics
| Field | Value |
|---|---|
| Pass ID | extra-machineinstr-printer |
| Entry point | sub_21E9E80 |
| Scope | Diagnostic (debug-only) |
Prints per-function register pressure statistics. Used for tuning pressure heuristics during development. Not active in release builds.
nvvm-intr-range -- Intrinsic Range Metadata
| Field | Value |
|---|---|
| Pass ID | nvvm-intr-range |
| Entry point | sub_216F4B0 |
| Scope | Function pass (IR level) |
| Knob | nvvm-intr-range-sm (ctor_359) |
Attaches !range metadata to NVVM intrinsics that return hardware-bounded values (threadIdx.x, blockIdx.x, etc.), enabling downstream known-bits analysis and range-based dead code elimination. Tightens ranges when __launch_bounds__ metadata is present. Documented in detail in KnownBits & DemandedBits.
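The range computation reduces to clamping an architectural cap by any `__launch_bounds__` annotation. A minimal sketch, assuming the sm_75+ limit of 1024 threads per block; `threadIdxRange` and the `Range` struct are illustrative names, not recovered symbols.

```cpp
#include <cassert>
#include <cstdint>

// Half-open interval [lo, hi), matching LLVM !range metadata semantics.
struct Range { uint64_t lo, hi; };

// Range for threadIdx.x: bounded by the hardware block-size limit, tightened
// when __launch_bounds__(maxntid) metadata is present (0 = annotation absent).
Range threadIdxRange(unsigned launchBoundsMaxNtid) {
    uint64_t cap = 1024;  // max threads per block on sm_75 and later
    if (launchBoundsMaxNtid && launchBoundsMaxNtid < cap)
        cap = launchBoundsMaxNtid;
    return {0, cap};
}
```

Downstream known-bits analysis can then prove, e.g., that `threadIdx.x < 256` comparisons are always true under `__launch_bounds__(256)`.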
GenericToNVVM -- Global Address Space Migration
| Field | Value |
|---|---|
| Pass ID | generic-to-nvvm |
| Entry point | sub_215DC20 |
| Size | 36 KB |
Moves global variables from generic address space (AS 0) to global address space (AS 1), inserting addrspacecast at use sites. Required because PTX globals must reside in .global memory. Documented in detail in PTX Emission.
Other Passes Documented Elsewhere
These passes appear in the NVPTX backend but have primary documentation on other pages:
| Pass | Entry | Primary Page |
|---|---|---|
| nvvm-pretreat | PretreatPass (New PM slot 128) | Optimizer Pipeline |
| NLO (Simplify Live Output) | sub_1CE10B0, sub_1CDC1F0 | Rematerialization |
| Prolog/Epilog | sub_21DB5F0 | Machine-Level Passes, PrologEpilogInserter |
| LDG Transform | sub_21F2780 (ldgxform) | Machine-Level Passes, Code Generation |
| Machine Mem2Reg | sub_21F9920 (nvptx-mem2reg) | Machine-Level Passes, Code Generation |
Pipeline & Pass Ordering
CICC v13.0 implements the LLVM New Pass Manager pipeline infrastructure, with NVIDIA injecting 33 custom passes into the registration table alongside approximately 493 standard LLVM passes. The master registration function at sub_2342890 populates a StringMap<PassInfo> hash table with every known pass name at startup, and a text-based pipeline parser allows the full pass ordering to be specified as a parenthesized string (e.g., module(function(instcombine,dse))). This page documents the complete pass inventory, the registration mechanism, the NVIDIA-specific additions, and — critically — the runtime pass execution order for each optimization level including the tier system and pass factory addresses.
| Master registration | sub_2342890 (0x2342890, ~2,816 lines) |
| Hash table insert | sub_E41FB0 (0xE41FB0) -- open-addressing, 48-byte entries |
| String equality | sub_9691B0 (0x9691B0) -- len==len && memcmp==0 |
| AA name resolver | sub_233BD40 (0x233BD40) -- chain of string comparisons |
| AA pipeline parser | sub_233C0C0 (0x233C0C0) -- splits on `,`, special-cases "default" |
| Extension callback | sub_233C300 (0x233C300) -- iterates [PassBuilder+2208], stride 32 |
| Option parser | sub_233A120 (0x233A120) -- splits on `;`, validates tokens |
| Help/listing | sub_233C410 (0x233C410) -- --print-pipeline-passes handler |
| Pipeline assembler | sub_12E54A0 (0x12E54A0, 49.8KB, 1,553 lines) |
| AddPass | sub_12DE0B0 (0x12DE0B0, hash-based pass insertion) |
| Tier 0 sub-pipeline | sub_12DE330 (0x12DE330, ~40 passes) |
| Tier 1/2/3 sub-pipeline | sub_12DE8F0 (0x12DE8F0, phase-conditional) |
| Codegen dispatch | sub_12DFE00 (0x12DFE00, 20.7KB) |
| Total passes | ~526 unique registrations |
| NVIDIA additions | 33 passes (12 module, 20 function, 1 loop) |
Registration Architecture
The pipeline infrastructure follows the standard LLVM New Pass Manager design. At startup, sub_2342890 is called once and inserts every known pass into a StringMap living at [PassBuilder+8]. The insertion function sub_E41FB0 uses open-addressing with linear probing; each entry occupies 48 bytes containing the key pointer, key length, value pointer, value length, and 16 bytes of inline storage for short class names.
Pass lookup during pipeline parsing uses the hash function at sub_C94890 (likely DJB/FNV-family). Parameterized passes are detected by the presence of <...> angle brackets after the pass name; the parameter string is extracted and forwarded to a pass-specific callback. The generic parameter validator sub_233A120 splits option strings on semicolons and compares each token to expected values, emitting "invalid {PassName} pass parameter '{token}'" on mismatch.
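The validator's behavior can be replayed in a few lines. This is a sketch of the recovered logic, not a decompilation: `validateParams` is a hypothetical name, and only the split-on-`;` / whitelist-check / error-format behavior is modeled.

```cpp
#include <cassert>
#include <set>
#include <string>

// sub_233A120-style validation: split the <...> option string on ';' and
// check each token against the pass's whitelist. The error text mirrors the
// recovered format string "invalid {PassName} pass parameter '{token}'".
// Returns the error message, or "" if every token is recognized.
std::string validateParams(const std::string& pass, const std::string& opts,
                           const std::set<std::string>& allowed) {
    size_t start = 0;
    while (start <= opts.size()) {
        size_t semi = opts.find(';', start);
        std::string tok = (semi == std::string::npos)
                              ? opts.substr(start)
                              : opts.substr(start, semi - start);
        if (!tok.empty() && !allowed.count(tok))
            return "invalid " + pass + " pass parameter '" + tok + "'";
        if (semi == std::string::npos) break;
        start = semi + 1;
    }
    return "";
}
```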
The alias analysis pipeline has its own parser at sub_233C0C0. It special-cases the string "default" (which calls sub_23A1380 then sub_23038C0 to build the default AA stack), and otherwise splits on commas, resolving each name through sub_233BD40:
| AA Name | Constructor |
|---|---|
| globals-aa | sub_2396EC0 |
| basic-aa | sub_2361CE0 |
| objc-arc-aa | sub_2361F60 |
| scev-aa | sub_2362040 |
| scoped-noalias-aa | sub_2362120 |
| tbaa | sub_2362200 |
Extension callbacks for target-specific pipeline customization are stored at [PassBuilder+2208] with a count at [PassBuilder+2216]. Each entry is 32 bytes with a guard at offset +16 (must be non-null) and the callback function pointer at offset +24. The string "all" in extension context triggers invalidate<all>.
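The 32-byte-entry layout can be exercised directly over a raw byte buffer. A sketch of the iteration loop attributed to sub_233C300, under the recovered offsets (guard at +16, callback pointer at +24, stride 32); the callback signature and `runExtensions` name are assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Assumed callback shape: mutates some pipeline-builder state.
using Callback = void (*)(int&);

// Sample callback for the test below (hypothetical).
void bumpExt(int& s) { ++s; }

// Iterate the extension table: entries are 32 bytes, the guard at offset
// +16 must be non-null, and the function pointer lives at offset +24.
int runExtensions(uint8_t* table, uint64_t count, int& state) {
    int fired = 0;
    for (uint64_t i = 0; i < count; ++i) {
        uint8_t* entry = table + i * 32;           // stride 32
        uint64_t guard;
        std::memcpy(&guard, entry + 16, 8);
        if (!guard) continue;                      // guard must be non-null
        Callback cb;
        std::memcpy(&cb, entry + 24, 8);           // callback at +24
        cb(state);
        ++fired;
    }
    return fired;
}
```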
Pipeline Text Parser
The pipeline text parser accepts a nesting grammar where each level specifies the pass manager scope:
module(
function(
instcombine<max-iterations=1>,
dse,
loop(indvars, loop-deletion)
),
globalopt
)
The parser splits on commas and parentheses, recognizing module(...), cgscc(...), function(...), and loop(...) as scope wrappers. Bare names are looked up in the StringMap built by sub_2342890. For parameterized passes, the <...> suffix is extracted and dispatched to per-pass option parsers. Several NVIDIA-specific parameter parsers are thin wrappers around sub_233A120:
| Parser | Pass | Recognized Options |
|---|---|---|
| sub_233A330 | process-restrict | propagate-only |
| sub_233A370 | lower-struct-args | opt-byval |
| sub_233A3B0 | lower-aggr-copies | lower-aggr-func-args |
More complex passes (GVN, SimplifyCFG, InstCombine) use chained sub_9691B0 string comparisons for multi-option parsing.
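The nesting grammar can be handled by a small depth-tracking splitter. This is a hypothetical sketch of the parsing strategy, not the binary's code: it splits on top-level commas, recurses into `module(...)`/`function(...)`-style scope wrappers, and strips `<...>` parameter suffixes to yield the flat list of leaf pass names.

```cpp
#include <cassert>
#include <string>
#include <vector>

void collectLeaves(const std::string& s, std::vector<std::string>& out);

// One top-level element: either a scope wrapper to recurse into, or a leaf
// pass name whose optional <params> suffix is dropped.
static void emitElement(const std::string& elem, std::vector<std::string>& out) {
    size_t paren = elem.find('(');
    if (paren != std::string::npos && !elem.empty() && elem.back() == ')') {
        // module(...), cgscc(...), function(...), loop(...): recurse into body
        collectLeaves(elem.substr(paren + 1, elem.size() - paren - 2), out);
    } else {
        out.push_back(elem.substr(0, elem.find('<')));
    }
}

// Split on commas at nesting depth 0; '(' and '<' open nesting, ')' and '>'
// close it, so commas inside scopes or parameter lists are not split points.
void collectLeaves(const std::string& s, std::vector<std::string>& out) {
    int depth = 0;
    size_t start = 0;
    for (size_t i = 0; i < s.size(); ++i) {
        if (s[i] == '(' || s[i] == '<') ++depth;
        else if (s[i] == ')' || s[i] == '>') --depth;
        else if (s[i] == ',' && depth == 0) {
            emitElement(s.substr(start, i - start), out);
            start = i + 1;
        }
    }
    if (start < s.size()) emitElement(s.substr(start), out);
}
```

Running it on the grammar example from above, `module(function(instcombine<max-iterations=1>,dse),globalopt)` yields the leaves `instcombine`, `dse`, `globalopt`; the real parser additionally resolves each name through the StringMap built by sub_2342890.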
The pipeline name strings recognized by the nvopt<> dispatch table are:
| Pipeline Name | CLI Source | Pass Count |
|---|---|---|
| nvopt<O0> | (no -O flag, no -Ofc) | ~5--8 |
| nvopt<O1> | -O1 | ~35 |
| nvopt<O2> | -O2 | ~35+ |
| nvopt<O3> | -O3 | ~35+ |
| nvopt<Ofcmax> | -Ofast-compile=max / -Ofc=max | ~12--15 |
| nvopt<Ofcmid> | -Ofast-compile=mid / -Ofc=mid | ~25--30 |
| nvopt<Ofcmin> | -Ofast-compile=min / -Ofc=min | ~30--35 |
Key addresses for pipeline name dispatch: sub_226C400 selects the pipeline name string, which is passed to sub_2277440 (pipeline text parser). The nvopt prefix is registered in sub_225D540 (new PM) and sub_12C35D0 (legacy PM), both calling into a pipeline builder class at vtable unk_4A08350.
Mutual exclusion: combining -O# with --passes= is an error: "Cannot specify -O#/-Ofast-compile=<min,mid,max> and --passes=/--foo-pass, use -passes='default<O#>,other-pass' or -passes='default<Ofcmax>,other-pass'".
Complete Pass Inventory
The following tables list every pass in exact registration order within sub_2342890. NVIDIA-specific passes are marked with bold names. Registration line numbers are from the decompiled output.
Module Analyses (18)
| # | Pass Name | LLVM Class | Reg. Line |
|---|---|---|---|
| 1 | callgraph | CallGraphAnalysis | 514 |
| 2 | collector-metadata | CollectorMetadataAnalysis | — |
| 3 | ctx-prof-analysis | CtxProfAnalysis | — |
| 4 | dxil-metadata | DXILMetadataAnalysis | — |
| 5 | dxil-resource-binding | DXILResourceBindingAnalysis | — |
| 6 | dxil-resource-type | DXILResourceTypeAnalysis | — |
| 7 | inline-advisor | InlineAdvisorAnalysis | — |
| 8 | ir-similarity | IRSimilarityAnalysis | — |
| 9 | last-run-tracking | via sub_2342820 | — |
| 10 | lcg | LazyCallGraphAnalysis | — |
| 11 | module-summary | ModuleSummaryIndexAnalysis | — |
| 12 | no-op-module | NoOpModuleAnalysis | — |
| 13 | pass-instrumentation | via sub_2342830 | — |
| 14 | profile-summary | ProfileSummaryAnalysis | — |
| 15 | reg-usage | PhysicalRegisterUsageAnalysis | — |
| 16 | stack-safety | StackSafetyGlobalAnalysis | — |
| 17 | verify | via sub_2342840 | 596 |
| 18 | globals-aa | GlobalsAA | — |
Module Passes (131)
Registration lines 599--1153 in sub_2342890. Entries 19--121 are standard LLVM; entries 122--131 are NVIDIA custom passes registered at lines 1096--1153.
Standard LLVM Module Passes (entries 19--121)
| # | Pass Name | LLVM Class |
|---|---|---|
| 19 | always-inline | AlwaysInlinerPass |
| 20 | annotation2metadata | Annotation2MetadataPass |
| 21 | assign-guid | AssignGUIDPass |
| 22 | attributor | AttributorPass |
| 23 | attributor-light | AttributorLightPass |
| 24 | called-value-propagation | CalledValuePropagationPass |
| 25 | canonicalize-aliases | CanonicalizeAliasesPass |
| 26 | check-debugify | NewPMCheckDebugifyPass |
| 27 | constmerge | ConstantMergePass |
| 28 | coro-cleanup | CoroCleanupPass |
| 29 | coro-early | CoroEarlyPass |
| 30 | cross-dso-cfi | CrossDSOCFIPass |
| 31 | ctx-instr-gen | PGOInstrumentationGen |
| 32 | ctx-prof-flatten | PGOCtxProfFlatteningPass |
| 33 | noinline-nonprevailing | NoinlineNonPrevailing |
| 34 | deadargelim | DeadArgumentEliminationPass |
| 35 | debugify | NewPMDebugifyPass |
| 36 | dfsan | DataFlowSanitizerPass |
| 37 | dot-callgraph | CallGraphDOTPrinterPass |
| 38 | dxil-upgrade | DXILUpgradePass |
| 39 | elim-avail-extern | EliminateAvailableExternallyPass |
| 40 | extract-blocks | BlockExtractorPass |
| 41 | expand-variadics | ExpandVariadicsPass |
| 42 | forceattrs | ForceFunctionAttrsPass |
| 43 | function-import | FunctionImportPass |
| 44 | global-merge-func | GlobalMergeFuncPass |
| 45 | globalopt | GlobalOptPass |
| 46 | globalsplit | GlobalSplitPass |
| 47 | hotcoldsplit | HotColdSplittingPass |
| 48 | inferattrs | InferFunctionAttrsPass |
| 49 | inliner-ml-advisor-release | via sub_2342850 (InlinerWrapper) |
| 50 | inliner-wrapper | via sub_2342850 (InlinerWrapper) |
| 51 | inliner-wrapper-no-mandatory-first | via sub_2342850 |
| 52 | insert-gcov-profiling | GCOVProfilerPass |
| 53 | instrorderfile | InstrOrderFilePass |
| 54 | instrprof | InstrProfilingLoweringPass |
| 55 | ctx-instr-lower | PGOCtxProfLoweringPass |
| 56 | print<ctx-prof-analysis> | CtxProfAnalysisPrinterPass |
| 57 | invalidate<all> | via sub_2342860 |
| 58 | iroutliner | IROutlinerPass |
| 59 | jmc-instrumenter | JMCInstrumenterPass |
| 60 | lower-emutls | LowerEmuTLSPass |
| 61 | lower-global-dtors | LowerGlobalDtorsPass |
| 62 | lower-ifunc | LowerIFuncPass |
| 63 | lowertypetests | LowerTypeTestsPass |
| 64 | fatlto-cleanup | FatLtoCleanup |
| 65 | pgo-force-function-attrs | PGOForceFunctionAttrsPass |
| 66 | memprof-context-disambiguation | MemProfContextDisambiguation |
| 67 | memprof-module | ModuleMemProfilerPass |
| 68 | mergefunc | MergeFunctionsPass |
| 69 | metarenamer | MetaRenamerPass |
| 70 | module-inline | ModuleInlinerPass |
| 71 | name-anon-globals | NameAnonGlobalPass |
| 72 | no-op-module | NoOpModulePass |
| 73 | nsan | NumericalStabilitySanitizerPass |
| 74 | objc-arc-apelim | ObjCARCAPElimPass |
| 75 | openmp-opt | OpenMPOptPass |
| 76 | openmp-opt-postlink | OpenMPOptPass |
| 77 | partial-inliner | PartialInlinerPass |
| 78 | pgo-icall-prom | PGOIndirectCallPromotion |
| 79 | pgo-instr-gen | PGOInstrumentationGen |
| 80 | pgo-instr-use | PGOInstrumentationUse |
| 81 | pre-isel-intrinsic-lowering | PreISelIntrinsicLoweringPass |
| 82 | print | PrintModulePass |
| 83 | print-callgraph | CallGraphPrinterPass |
| 84 | print-callgraph-sccs | CallGraphSCCsPrinterPass |
| 85 | print-ir-similarity | IRSimilarityAnalysisPrinterPass |
| 86 | print-lcg | LazyCallGraphPrinterPass |
| 87 | print-lcg-dot | LazyCallGraphDOTPrinterPass |
| 88 | print-must-be-executed-contexts | MustBeExecutedContextPrinterPass |
| 89 | print-profile-summary | ProfileSummaryPrinterPass |
| 90 | print-stack-safety | StackSafetyGlobalPrinterPass |
| 91 | print<dxil-metadata> | DXILMetadataAnalysisPrinterPass |
| 92 | print<dxil-resource-binding> | DXILResourceBindingPrinterPass |
| 93 | print<inline-advisor> | InlineAdvisorAnalysisPrinterPass |
| 94 | print<module-debuginfo> | ModuleDebugInfoPrinterPass |
| 95 | print<reg-usage> | PhysicalRegisterUsageInfoPrinterPass |
| 96 | pseudo-probe | SampleProfileProbePass |
| 97 | pseudo-probe-update | PseudoProbeUpdatePass |
| 98 | recompute-globalsaa | RecomputeGlobalsAAPass |
| 99 | rel-lookup-table-converter | RelLookupTableConverterPass |
| 100 | rewrite-statepoints-for-gc | RewriteStatepointsForGC |
| 101 | rewrite-symbols | RewriteSymbolPass |
| 102 | rpo-function-attrs | ReversePostOrderFunctionAttrsPass |
| 103 | rtsan | RealtimeSanitizerPass |
| 104 | sample-profile | SampleProfileLoaderPass |
| 105 | sancov-module | SanitizerCoveragePass |
| 106 | sanmd-module | SanitizerBinaryMetadataPass |
| 107 | scc-oz-module-inliner | via sub_2342850 (InlinerWrapper) |
| 108 | shadow-stack-gc-lowering | ShadowStackGCLoweringPass |
| 109 | strip | StripSymbolsPass |
| 110 | strip-dead-debug-info | StripDeadDebugInfoPass |
| 111 | strip-dead-prototypes | StripDeadPrototypesPass |
| 112 | strip-debug-declare | StripDebugDeclarePass |
| 113 | strip-nondebug | StripNonDebugSymbolsPass |
| 114 | strip-nonlinetable-debuginfo | StripNonLineTableDebugInfoPass |
| 115 | trigger-crash-module | TriggerCrashModulePass |
| 116 | trigger-verifier-error | TriggerVerifierErrorPass |
| 117 | tsan-module | ModuleThreadSanitizerPass |
| 118 | tysan | TypeSanitizerPass |
| 119 | verify | via sub_2342870 |
| 120 | view-callgraph | CallGraphViewerPass |
| 121 | wholeprogramdevirt | WholeProgramDevirtPass |
NVIDIA Module Passes (entries 122--131)
| # | Pass Name | LLVM Class | Reg. Line | Purpose |
|---|---|---|---|---|
| 122 | check-gep-index | CheckGepIndexPass | 1096 | Validates GEP index bounds |
| 123 | check-kernel-functions | NVPTXSetFunctionLinkagesPass | 1101 | Enforces kernel linkage |
| 124 | cnp-launch-check | CNPLaunchCheckPass | 1106 | Cooperative launch validation |
| 125 | ipmsp | IPMSPPass | 1111 | Inter-procedural memory space propagation |
| 126 | nv-early-inliner | via sub_2342850 | 1114 | NVIDIA early inlining heuristic |
| 127 | nv-inline-must | InlineMustPass | 1119 | Force-inlines __forceinline__ functions |
| 128 | nvvm-pretreat | PretreatPass | 1124 | IR canonicalization before optimization |
| 129 | nvvm-verify | NVVMIRVerifierPass | 1129 | NVVM IR constraint validation |
| 130 | printf-lowering | PrintfLoweringPass | 1134 | Lowers printf to vprintf ABI |
| 131 | select-kernels | SelectKernelsPass | 1139 | Selects kernels for compilation |
Parameterized Module Passes (entries 132--145)
| # | Pass Name | Class | Parameters |
|---|---|---|---|
| 132 | asan | AddressSanitizerPass | kernel |
| 133 | cg-profile | CGProfilePass | in-lto-post-link |
| 134 | global-merge | GlobalMergePass | group-by-use;ignore-single-use;max-offset=N |
| 135 | embed-bitcode | EmbedBitcodePass | thinlto;emit-summary |
| 136 | globaldce | GlobalDCEPass | in-lto-post-link |
| 137 | hwasan | HWAddressSanitizerPass | kernel;recover |
| 138 | internalize | InternalizePass | preserve-gv=GV |
| 139 | ipsccp | IPSCCPPass | no-func-spec;func-spec |
| 140 | loop-extract | LoopExtractorPass | single |
| 141 | memprof-use | MemProfUsePass | profile-filename=S |
| 142 | msan | MemorySanitizerPass | recover;kernel;eager-checks;track-origins=N |
| 143 | print<structural-hash> | StructuralHashPrinterPass | detailed;call-target-ignored |
| 144 | lower-ops | LowerOpsPass | enable-optimization |
| 145 | set-global-array-alignment | SetGlobalArrayAlignmentPass | modify-shared-mem;skip-shared-mem;modify-global-mem;skip-global-mem |
CGSCC Analyses and Passes (entries 146--158)
| # | Pass Name | LLVM Class | Level |
|---|---|---|---|
| 146 | no-op-cgscc | NoOpCGSCCAnalysis | Analysis |
| 147 | fam-proxy | FunctionAnalysisManagerCGSCCProxy | Analysis |
| 148 | pass-instrumentation | via sub_2342830 | Analysis |
| 149 | argpromotion | ArgumentPromotionPass | Pass |
| 150 | attributor-cgscc | AttributorCGSCCPass | Pass |
| 151 | attributor-light-cgscc | AttributorLightCGSCCPass | Pass |
| 152 | invalidate<all> | via sub_2342860 | Pass |
| 153 | no-op-cgscc | NoOpCGSCCPass | Pass |
| 154 | openmp-opt-cgscc | OpenMPOptCGSCCPass | Pass |
| 155 | coro-annotation-elide | CoroAnnotationElidePass | Pass |
| 156 | coro-split | CoroSplitPass | Param: reuse-storage |
| 157 | function-attrs | PostOrderFunctionAttrsPass | Param: skip-non-recursive-function-attrs |
| 158 | inline | InlinerPass | Param: only-mandatory |
Function Analyses (entries 159--201)
Registration lines 1208--1415 in sub_2342890.
| # | Pass Name | LLVM Class |
|---|---|---|
| 159 | aa | AAManager |
| 160 | access-info | LoopAccessAnalysis |
| 161 | assumptions | AssumptionAnalysis |
| 162 | bb-sections-profile-reader | BasicBlockSectionsProfileReaderAnalysis |
| 163 | block-freq | BlockFrequencyAnalysis |
| 164 | branch-prob | BranchProbabilityAnalysis |
| 165 | cycles | CycleAnalysis |
| 166 | da | DependenceAnalysis |
| 167 | debug-ata | DebugAssignmentTrackingAnalysis |
| 168 | demanded-bits | DemandedBitsAnalysis |
| 169 | domfrontier | DominanceFrontierAnalysis |
| 170 | domtree | DominatorTreeAnalysis |
| 171 | func-properties | FunctionPropertiesAnalysis |
| 172 | machine-function-info | MachineFunctionAnalysis |
| 173 | gc-function | GCFunctionAnalysis |
| 174 | inliner-size-estimator | InlineSizeEstimatorAnalysis |
| 175 | last-run-tracking | via sub_2342820 |
| 176 | lazy-value-info | LazyValueAnalysis |
| 177 | loops | LoopAnalysis |
| 178 | memdep | MemoryDependenceAnalysis |
| 179 | memoryssa | MemorySSAAnalysis |
| 180 | no-op-function | NoOpFunctionAnalysis |
| 181 | opt-remark-emit | OptimizationRemarkEmitterAnalysis |
| 182 | pass-instrumentation | via sub_2342830 |
| 183 | phi-values | PhiValuesAnalysis |
| 184 | postdomtree | PostDominatorTreeAnalysis |
| 185 | regions | RegionInfoAnalysis |
| 186 | scalar-evolution | ScalarEvolutionAnalysis |
| 187 | should-not-run-function-passes | ShouldNotRunFunctionPassesAnalysis |
| 188 | should-run-extra-vector-passes | ShouldRunExtraVectorPasses |
| 189 | ssp-layout | SSPLayoutAnalysis |
| 190 | stack-safety-local | StackSafetyAnalysis |
| 191 | target-ir | TargetIRAnalysis |
| 192 | target-lib-info | TargetLibraryAnalysis |
| 193 | uniformity | UniformityInfoAnalysis |
| 194 | verify | via sub_2342840 |
| 195 | rpa | RegisterPressureAnalysis |
| 196 | merge-sets | MergeSetsAnalysis |
Function AA Analyses (entries 197--201)
| # | Pass Name | LLVM Class |
|---|---|---|
| 197 | basic-aa | BasicAA |
| 198 | objc-arc-aa | objcarc::ObjCARCAA |
| 199 | scev-aa | SCEVAA |
| 200 | scoped-noalias-aa | ScopedNoAliasAA |
| 201 | tbaa | TypeBasedAA |
Function Passes (entries 202--419)
Registration lines 1420--2319 in sub_2342890. Entries 202--375 are standard LLVM; entries 376--392 are NVIDIA-specific; entries 393--419 are parameterized passes (both standard and NVIDIA).
Standard LLVM Function Passes (entries 202--375)
| # | Pass Name | LLVM Class |
|---|---|---|
| 202 | aa-eval | AAEvaluator |
| 203 | adce | ADCEPass |
| 204 | add-discriminators | AddDiscriminatorsPass |
| 205 | aggressive-instcombine | AggressiveInstCombinePass |
| 206 | alignment-from-assumptions | AlignmentFromAssumptionsPass |
| 207 | annotation-remarks | AnnotationRemarksPass |
| 208 | assume-builder | AssumeBuilderPass |
| 209 | assume-simplify | AssumeSimplifyPass |
| 210 | atomic-expand | AtomicExpandPass |
| 211 | bdce | BDCEPass |
| 212 | break-crit-edges | BreakCriticalEdgesPass |
| 213 | callbr-prepare | CallBrPreparePass |
| 214 | callsite-splitting | CallSiteSplittingPass |
| 215 | chr | ControlHeightReductionPass |
| 216 | codegenprepare | CodeGenPreparePass |
| 217 | complex-deinterleaving | ComplexDeinterleavingPass |
| 218 | consthoist | ConstantHoistingPass |
| 219 | constraint-elimination | ConstraintEliminationPass |
| 220 | coro-elide | CoroElidePass |
| 221 | correlated-propagation | CorrelatedValuePropagationPass |
| 222 | count-visits | CountVisitsPass |
| 223 | dce | DCEPass |
| 224 | declare-to-assign | AssignmentTrackingPass |
| 225 | dfa-jump-threading | DFAJumpThreadingPass |
| 226 | div-rem-pairs | DivRemPairsPass |
| 227 | dot-cfg | CFGPrinterPass |
| 228 | dot-cfg-only | CFGOnlyPrinterPass |
| 229 | dot-dom | DOTGraphTraitsPrinter<DominatorTree, false> |
| 230 | dot-dom-only | DOTGraphTraitsPrinter<DominatorTree, true> |
| 231 | dot-post-dom | DOTGraphTraitsPrinter<PostDominatorTree, false> |
| 232 | dot-post-dom-only | DOTGraphTraitsPrinter<PostDominatorTree, true> |
| 233 | dse | DSEPass |
| 234 | dwarf-eh-prepare | DwarfEHPreparePass |
| 235 | expand-large-div-rem | ExpandLargeDivRemPass |
| 236 | expand-large-fp-convert | ExpandLargeFpConvertPass |
| 237 | expand-memcmp | ExpandMemCmpPass |
| 238 | extra-vector-passes | ExtraFunctionPassManager<ShouldRunExtraVectorPasses> |
| 239 | fix-irreducible | FixIrreduciblePass |
| 240 | flatten-cfg | FlattenCFGPass |
| 241 | float2int | Float2IntPass |
| 242 | gc-lowering | GCLoweringPass |
| 243 | guard-widening | via sub_2342880 |
| 244 | gvn-hoist | GVNHoistPass |
| 245 | gvn-sink | GVNSinkPass |
| 246 | helloworld | HelloWorldPass |
| 247 | indirectbr-expand | IndirectBrExpandPass |
| 248 | infer-address-spaces | InferAddressSpacesPass |
| 249 | infer-alignment | InferAlignmentPass |
| 250 | inject-tli-mappings | InjectTLIMappings |
| 251 | instcount | InstCountPass |
| 252 | instnamer | InstructionNamerPass |
| 253 | instsimplify | InstSimplifyPass |
| 254 | interleaved-access | InterleavedAccessPass |
| 255 | interleaved-load-combine | InterleavedLoadCombinePass |
| 256 | invalidate<all> | via sub_2342860 |
| 257 | irce | IRCEPass |
| 258 | jump-threading | JumpThreadingPass |
| 259 | jump-table-to-switch | JumpTableToSwitchPass |
| 260 | kcfi | KCFIPass |
| 261 | kernel-info | KernelInfoPrinter |
| 262 | lcssa | LCSSAPass |
| 263 | libcalls-shrinkwrap | LibCallsShrinkWrapPass |
| 264 | lint | LintPass |
| 265 | load-store-vectorizer | LoadStoreVectorizerPass |
| 266 | loop-data-prefetch | LoopDataPrefetchPass |
| 267 | loop-distribute | LoopDistributePass |
| 268 | loop-fusion | LoopFusePass |
| 269 | loop-load-elim | LoopLoadEliminationPass |
| 270 | loop-simplify | LoopSimplifyPass |
| 271 | loop-sink | LoopSinkPass |
| 272 | loop-versioning | LoopVersioningPass |
| 273 | lower-atomic | LowerAtomicPass |
| 274 | lower-constant-intrinsics | LowerConstantIntrinsicsPass |
| 275 | lower-expect | LowerExpectIntrinsicPass |
| 276 | lower-guard-intrinsic | LowerGuardIntrinsicPass |
| 277 | lower-invoke | LowerInvokePass |
| 278 | lower-widenable-condition | LowerWidenableConditionPass |
| 279 | make-guards-explicit | MakeGuardsExplicitPass |
| 280 | mem2reg | PromotePass |
| 281 | memcpyopt | MemCpyOptPass |
| 282 | memprof | MemProfilerPass |
| 283 | mergeicmps | MergeICmpsPass |
| 284 | mergereturn | UnifyFunctionExitNodesPass |
| 285 | move-auto-init | MoveAutoInitPass |
| 286 | nary-reassociate | NaryReassociatePass |
| 287 | newgvn | NewGVNPass |
| 288 | no-op-function | NoOpFunctionPass |
| 289 | normalize | IRNormalizerPass |
| 290 | objc-arc | ObjCARCOptPass |
| 291 | objc-arc-contract | ObjCARCContractPass |
| 292 | objc-arc-expand | ObjCARCExpandPass |
| 293 | pa-eval | PAEvalPass |
| 294 | partially-inline-libcalls | PartiallyInlineLibCallsPass |
| 295 | pgo-memop-opt | PGOMemOPSizeOpt |
| 296 | place-safepoints | PlaceSafepointsPass |
| 297 | print | PrintFunctionPass |
| 298--338 | print<access-info> ... print-predicateinfo | (41 printer passes) |
| 339 | reassociate | ReassociatePass |
| 340 | redundant-dbg-inst-elim | RedundantDbgInstEliminationPass |
| 341 | reg2mem | RegToMemPass |
| 342 | safe-stack | SafeStackPass |
| 343 | sandbox-vectorizer | SandboxVectorizerPass |
| 344 | scalarize-masked-mem-intrin | ScalarizeMaskedMemIntrinPass |
| 345 | sccp | SCCPPass |
| 346 | select-optimize | SelectOptimizePass |
| 347 | separate-const-offset-from-gep | SeparateConstOffsetFromGEPPass |
| 348 | sink | SinkingPass |
| 349 | sjlj-eh-prepare | SjLjEHPreparePass |
| 350 | slp-vectorizer | SLPVectorizerPass |
| 351 | slsr | StraightLineStrengthReducePass |
| 352 | stack-protector | StackProtectorPass |
| 353 | strip-gc-relocates | StripGCRelocates |
| 354 | tailcallelim | TailCallElimPass |
| 355 | transform-warning | WarnMissedTransformationsPass |
| 356 | trigger-crash-function | TriggerCrashFunctionPass |
| 357 | trigger-verifier-error | TriggerVerifierErrorPass |
| 358 | tsan | ThreadSanitizerPass |
| 359 | unify-loop-exits | UnifyLoopExitsPass |
| 360 | vector-combine | VectorCombinePass |
| 361 | verify | via sub_2342870 |
| 362--368 | verify<cycles> ... verify<scalar-evolution> | (7 verifiers) |
| 369--374 | view-cfg ... view-post-dom-only | (6 viewers) |
| 375 | wasm-eh-prepare | WasmEHPreparePass |
NVIDIA Function Passes (entries 376--392)
Registered at lines 2212--2292 of sub_2342890.
| # | Pass Name | LLVM Class | Reg. Line | Purpose |
|---|---|---|---|---|
| 376 | basic-dbe | BasicDeadBarrierEliminationPass | 2212 | Removes dead bar.sync instructions |
| 377 | branch-dist | BranchDistPass | 2217 | Branch distribution for divergence control |
| 378 | byval-mem2reg | ByValMem2RegPass | 2222 | Promotes byval arguments to registers |
| 379 | bypass-slow-division | BypassSlowDivisionPass | 2227 | Fast-path for small-operand division |
| 380 | normalize-gep | NormalizeGepPass | 2232 | GEP canonicalization for address arithmetic |
| 381 | nvvm-reflect-pp | SimplifyConstantConditionalsPass | 2237 | Folds __nvvm_reflect results (post-processing) |
| 382 | nvvm-peephole-optimizer | NVVMPeepholeOptimizerPass | 2242 | NVVM-specific peephole rewrites |
| 383 | old-load-store-vectorizer | OldLoadStoreVectorizerPass | 2247 | Legacy load/store vectorization |
| 384 | print<merge-sets> | MergeSetsAnalysisPrinterPass | 2252 | Printer for merge-sets analysis |
| 385 | remat | RematerializationPass | 2257 | Register-pressure-aware rematerialization |
| 386 | print<rpa> | RegisterPressurePrinterPass | 2262 | Printer for register pressure analysis |
| 387 | propagate-alignment | PropagateAlignmentPass | 2267 | Propagates alignment through pointer chains |
| 388 | reuse-local-memory | ReuseLocalMemoryPass | 2272 | Shares local memory across kernels |
| 389 | set-local-array-alignment | SetLocalArrayAlignmentPass | 2277 | Aligns stack arrays for coalescing |
| 390 | sinking2 | Sinking2Pass | 2282 | Enhanced instruction sinking |
| 391 | d2ir-scalarizer | ScalarizerPass (NVIDIA alias) | 2287 | NVIDIA-branded scalarization |
| 392 | sink<rp-aware> | SinkingPass (variant) | 2292 | Register-pressure-aware sinking |
Parameterized Function Passes (entries 393--419)
| # | Pass Name | Class | Parameters |
|---|---|---|---|
| 393 | cfguard | CFGuardPass | check;dispatch |
| 394 | early-cse | EarlyCSEPass | memssa |
| 395 | ee-instrument | EntryExitInstrumenterPass | post-inline |
| 396 | function-simplification | (byte_3F871B3) | O1;O2;O3;Os;Oz |
| 397 | gvn | GVNPass | no-pre;pre;no-load-pre;load-pre;... |
| 398 | instcombine | InstCombinePass | no-aggressive-aggregate-splitting;...;max-iterations=N |
| 399 | loop-unroll | LoopUnrollPass | O0;O1;O2;O3;full-unroll-max=N;... |
| 400 | loop-vectorize | LoopVectorizePass | no-interleave-forced-only;... |
| 401 | lower-allow-check | LowerAllowCheckPass | (empty) |
| 402 | lower-matrix-intrinsics | LowerMatrixIntrinsicsPass | minimal |
| 403 | lower-switch | LowerSwitchPass | enable-jump-table |
| 404 | mldst-motion | MergedLoadStoreMotionPass | no-split-footer-bb;split-footer-bb |
| 405 | print<da> | DependenceAnalysisPrinterPass | normalized-results |
| 406 | print<memoryssa> | MemorySSAPrinterPass | no-ensure-optimized-uses |
| 407 | print<stack-lifetime> | StackLifetimePrinterPass | may;must |
| 408 | scalarizer | ScalarizerPass | load-store;no-load-store;variable-insert-extract;... |
| 409 | separate-const-offset-from-gep | SeparateConstOffsetFromGEPPass | lower-gep |
| 410 | simplifycfg | SimplifyCFGPass | simplify-unreachable;...;bonus-inst-threshold=N |
| 411 | speculative-execution | SpeculativeExecutionPass | only-if-divergent-target |
| 412 | sroa | SROAPass | preserve-cfg;modify-cfg |
| 413 | structurizecfg | StructurizeCFG | skip-uniform-regions |
| 414 | win-eh-prepare | WinEHPreparePass | demote-catchswitch-only |
| 415 | bounds-checking | BoundsCheckingPass (modified) | trap |
| 416 | memory-space-opt | MemorySpaceOptPass | first-time;second-time;no-warnings;warnings |
| 417 | lower-aggr-copies | LowerAggrCopiesPass | lower-aggr-func-args |
| 418 | lower-struct-args | LowerStructArgsPass | opt-byval |
| 419 | process-restrict | ProcessRestrictPass | propagate-only |
LoopNest Passes (entries 420--423)
| # | Pass Name | LLVM Class |
|---|---|---|
| 420 | loop-flatten | LoopFlattenPass |
| 421 | loop-interchange | LoopInterchangePass |
| 422 | loop-unroll-and-jam | LoopUnrollAndJamPass |
| 423 | no-op-loopnest | NoOpLoopNestPass |
Loop Analyses (entries 424--428)
| # | Pass Name | LLVM Class |
|---|---|---|
| 424 | ddg | DDGAnalysis |
| 425 | iv-users | IVUsersAnalysis |
| 426 | no-op-loop | NoOpLoopAnalysis |
| 427 | pass-instrumentation | via sub_2342830 |
| 428 | should-run-extra-simple-loop-unswitch | ShouldRunExtraSimpleLoopUnswitch |
Loop Passes (entries 429--455)
| # | Pass Name | LLVM Class |
|---|---|---|
| 429 | canon-freeze | CanonicalizeFreezeInLoopsPass |
| 430 | dot-ddg | DDGDotPrinterPass |
| 431 | guard-widening | via sub_2342880 |
| 432 | extra-simple-loop-unswitch-passes | ExtraLoopPassManager<...> |
| 433 | indvars | IndVarSimplifyPass |
| 434 | invalidate<all> | via sub_2342860 |
| 435 | loop-bound-split | LoopBoundSplitPass |
| 436 | loop-deletion | LoopDeletionPass |
| 437 | loop-idiom | LoopIdiomRecognizePass |
| 438 | loop-idiom-vectorize | LoopIdiomVectorizePass |
| 439 | loop-instsimplify | LoopInstSimplifyPass |
| 440 | loop-predication | LoopPredicationPass |
| 441 | loop-reduce | LoopStrengthReducePass |
| 442 | loop-term-fold | LoopTermFoldPass |
| 443 | loop-simplifycfg | LoopSimplifyCFGPass |
| 444 | loop-unroll-full | LoopFullUnrollPass |
| 445 | loop-versioning-licm | LoopVersioningLICMPass |
| 446 | no-op-loop | NoOpLoopPass |
| 447 | print | PrintLoopPass |
| 448--450 | print<ddg>, print<iv-users>, print<loop-cache-cost>, print<loopnest> | (printers) |
| 451 | loop-index-split | LoopIndexSplitPass |
Parameterized Loop Passes (entries 452--455)
| # | Pass Name | Class | Parameters |
|---|---|---|---|
| 452 | licm | LICMPass | allowspeculation;conservative-calls |
| 453 | lnicm | LNICMPass | allowspeculation |
| 454 | loop-rotate | LoopRotatePass | no-header-duplication;header-duplication;... |
| 455 | simple-loop-unswitch | SimpleLoopUnswitchPass | nontrivial;no-nontrivial;trivial;no-trivial |
Machine Function Analyses (entries 456--475)
| # | Pass Name | LLVM Class |
|---|---|---|
| 456 | edge-bundles | EdgeBundlesAnalysis |
| 457 | livedebugvars | LiveDebugVariablesAnalysis |
| 458 | live-intervals | LiveIntervalsAnalysis |
| 459 | live-reg-matrix | LiveRegMatrixAnalysis |
| 460 | live-stacks | LiveStacksAnalysis |
| 461 | live-vars | LiveVariablesAnalysis |
| 462 | machine-block-freq | MachineBlockFrequencyAnalysis |
| 463 | machine-branch-prob | MachineBranchProbabilityAnalysis |
| 464 | machine-cycles | MachineCycleAnalysis |
| 465 | machine-dom-tree | MachineDominatorTreeAnalysis |
| 466 | machine-loops | MachineLoopAnalysis |
| 467 | machine-opt-remark-emitter | MachineOptimizationRemarkEmitterAnalysis |
| 468 | machine-post-dom-tree | MachinePostDominatorTreeAnalysis |
| 469 | machine-trace-metrics | MachineTraceMetricsAnalysis |
| 470 | pass-instrumentation | via sub_2342830 |
| 471 | regalloc-evict | RegAllocEvictionAdvisorAnalysis |
| 472 | regalloc-priority | RegAllocPriorityAdvisorAnalysis |
| 473 | slot-indexes | SlotIndexesAnalysis |
| 474 | spill-code-placement | SpillPlacementAnalysis |
| 475 | virtregmap | VirtRegMapAnalysis |
Machine Function Passes (entries 476--526)
| # | Pass Name | LLVM Class |
|---|---|---|
| 476 | dead-mi-elimination | DeadMachineInstructionElimPass |
| 477 | detect-dead-lanes | DetectDeadLanesPass |
| 478 | early-ifcvt | EarlyIfConverterPass |
| 479 | early-machinelicm | EarlyMachineLICMPass |
| 480 | early-tailduplication | EarlyTailDuplicatePass |
| 481 | finalize-isel | FinalizeISelPass |
| 482 | fixup-statepoint-caller-saved | FixupStatepointCallerSavedPass |
| 483 | localstackalloc | LocalStackSlotAllocationPass |
| 484 | machine-cp | MachineCopyPropagationPass |
| 485 | machine-cse | MachineCSEPass |
| 486 | machine-latecleanup | MachineLateInstrsCleanupPass |
| 487 | machine-scheduler | MachineSchedulerPass |
| 488 | machinelicm | MachineLICMPass |
| 489 | no-op-machine-function | NoOpMachineFunctionPass |
| 490 | opt-phis | OptimizePHIsPass |
| 491 | patchable-function | PatchableFunctionPass |
| 492 | peephole-opt | PeepholeOptimizerPass |
| 493 | phi-node-elimination | PHIEliminationPass |
| 494 | post-RA-sched | PostRASchedulerPass |
| 495 | postmisched | PostMachineSchedulerPass |
| 496 | post-ra-pseudos | ExpandPostRAPseudosPass |
| 497 | print | PrintMIRPass |
| 498--510 | print<livedebugvars> ... print<virtregmap> | (13 MF printers) |
| 511 | reg-usage-collector | RegUsageInfoCollectorPass |
| 512 | reg-usage-propagation | RegUsageInfoPropagationPass |
| 513 | register-coalescer | RegisterCoalescerPass |
| 514 | rename-independent-subregs | RenameIndependentSubregsPass |
| 515 | remove-redundant-debug-values | RemoveRedundantDebugValuesPass |
| 516 | require-all-machine-function-properties | RequireAllMachineFunctionPropertiesPass |
| 517 | stack-coloring | StackColoringPass |
| 518 | stack-slot-coloring | StackSlotColoringPass |
| 519 | tailduplication | TailDuplicatePass |
| 520 | trigger-verifier-error | TriggerVerifierErrorPass |
| 521 | two-address-instruction | TwoAddressInstructionPass |
| 522 | verify | MachineVerifierPass |
| 523 | verify<machine-trace-metrics> | MachineTraceMetricsVerifierPass |
| 524 | machine-sink | MachineSinkingPass (parameterized) |
| 525 | regallocfast | RegAllocFastPass (parameterized) |
| 526 | greedy | RAGreedyPass (parameterized, LAST registered) |
No NVIDIA-specific machine function passes were identified in the registration table; NVIDIA's machine-level customizations are implemented through target hooks in the NVPTX backend rather than as separately registered passes.
Runtime Pass Execution Order
Registration order (above) describes what is known to the pipeline parser. Runtime execution order is determined by sub_12E54A0 (the pipeline assembler) and controlled by the tier system. The execution order varies dramatically depending on: (1) optimization level, (2) fast-compile mode, (3) language string, and (4) individual pass enable/disable flags in NVVMPassOptions.
The AddPass Mechanism -- sub_12DE0B0
All runtime pass insertion uses sub_12DE0B0 (0x12DE0B0), a hash-table-based function that:
- Hashes the pass pointer: `(pass >> 9) ^ (pass >> 4)`
- Probes an open-addressed hash table at `passMgr+80`
- Stores the pass pointer and a flags byte (`flags | 2` if barrier set)
- Appends the pass pointer to a dynamic array at `passMgr[0]`
- Increments the counter at `passMgr+8`
The third parameter encodes pass type: 0 = ModulePass/AnalysisPass, 1 = FunctionPass. The fourth parameter is a scheduling barrier hint.
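The recovered behavior can be modeled compactly. The sketch below is a hedged C++ reconstruction of the AddPass scheme -- the hash, linear probe, flags byte, and ordered side array mirror the description above, but `PassTable` and all member names are illustrative inventions, not recovered symbols:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hedged model of the recovered AddPass scheme (sub_12DE0B0): a pointer-hashed
// open-addressed table plus an ordered side array. All names are illustrative.
struct PassTable {
    static const std::size_t kSlots = 1024;   // real capacity not recovered
    const void* slots[kSlots] = {};
    std::uint8_t flags[kSlots] = {};
    std::vector<const void*> ordered;         // models the append array at passMgr[0]
    std::size_t count = 0;                    // models the counter at passMgr+8

    void addPass(const void* pass, bool isFunctionPass, bool barrier) {
        std::uintptr_t p = reinterpret_cast<std::uintptr_t>(pass);
        std::size_t idx = ((p >> 9) ^ (p >> 4)) % kSlots;  // recovered hash
        while (slots[idx] != nullptr && slots[idx] != pass)
            idx = (idx + 1) % kSlots;                      // linear probe
        slots[idx] = pass;
        std::uint8_t f = isFunctionPass ? 1 : 0;           // third parameter: pass type
        if (barrier) f |= 2;                               // recovered "flags | 2" barrier bit
        flags[idx] = f;
        ordered.push_back(pass);                           // execution order kept separately
        ++count;
    }
};
```

The separate `ordered` array is what lets the hash table answer "is this pass already registered?" in O(1) while still preserving insertion order for execution.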
Tier System Architecture
The tier system is NVIDIA's mechanism for interleaving custom passes with standard LLVM passes at precise points. The main optimization loop in sub_12E54A0 iterates over a plugin/extension pass array at opts[4488..4496] (16-byte stride: vtable + phase_id), and fires tier sub-pipelines when the accumulated phase counter exceeds their thresholds:
```c
// Pseudocode from sub_12E54A0, lines 481-553
for (entry = opts[4488]; entry < opts[4496]; entry += 16) {
    phase_id = entry[8];
    if (opts[4224] && phase_id > opts[4228]) {   // Tier 0
        sub_12DE330(PM, opts);                   // Full optimization
        opts[4224] = 0;                          // Fire once
    }
    if (opts[3528] && phase_id > opts[3532]) {   // Tier 1
        sub_12DE8F0(PM, 1, opts);
        opts[3528] = 0;
    }
    if (opts[3568] && phase_id > opts[3572]) {   // Tier 2
        sub_12DE8F0(PM, 2, opts);
        opts[3568] = 0;
    }
    if (opts[3608] && phase_id > opts[3612]) {   // Tier 3
        sub_12DE8F0(PM, 3, opts);
        opts[3608] = 0;
    }
    pass = entry->vtable[72]();                  // Plugin pass factory call
    sub_12DE0B0(PM, pass, 1, 0);                 // Insert plugin pass
    if (opts[3904])                              // Debug mode
        insert_verifier_after_each();
}
// Remaining unfired tiers fire unconditionally after loop
```
The tier control fields in the NVVMPassOptions struct:
| Offset | Type | Field |
|---|---|---|
| +3528 | bool | Tier 1 enable |
| +3532 | int | Tier 1 phase threshold |
| +3568 | bool | Tier 2 enable |
| +3572 | int | Tier 2 phase threshold |
| +3608 | bool | Tier 3 enable |
| +3612 | int | Tier 3 phase threshold |
| +4224 | bool | Tier 0 (full optimization) enable |
| +4228 | int | Tier 0 phase threshold |
Infrastructure Setup (Always Runs)
These five passes are always inserted first, regardless of optimization level:
| Pos | Factory | Identity | AddPass Flags |
|---|---|---|---|
| 1 | sub_149CCE0 (alloc 368B) | TargetLibraryInfoWrapperPass | (PM, TLI, 0, 0) Module |
| 2 | sub_1BFB520 (alloc 208B) | TargetTransformInfoWrapperPass | (PM, TTI, 1, 0) Function |
| 3 | sub_14A7550 | VerifierPass / BasicAliasAnalysis | (PM, _, 0, 0) Module |
| 4 | sub_1361950 | AssumptionCacheTracker | (PM, _, 0, 0) Module |
| 5 | sub_1CB0F50 | ProfileSummaryInfoWrapperPass | (PM, _, 1, 0) Function |
Tier 0 -- Full Optimization (sub_12DE330)
Called when opts[4224] (optimization enabled) is set and the phase threshold is exceeded. This is the primary optimization sub-pipeline for O1/O2/O3, adding ~40 passes. Address: 0x12DE330.
Confidence note: Pass identifications are based on diagnostic strings, factory-function signatures, and pipeline ordering. Most identifications are HIGH confidence (confirmed by unique string literals). Entries marked `[MEDIUM confidence]` are inferred from code structure, argument patterns, or address proximity rather than direct string evidence.
| Pos | Factory Address | Likely Pass | Guard Condition |
|---|---|---|---|
| 1 | sub_1654860(1) | BreakCriticalEdges | always |
| 2 | sub_1A62BF0(1,0,0,1,0,0,1) | LLVM standard pipeline #1 | always |
| 3 | sub_1B26330 | MemCpyOpt | always |
| 4 | sub_185D600 | IPConstantPropagation | always |
| 5 | sub_1C6E800 | GVN | always |
| 6 | sub_1C6E560 | NewGVN/GVNHoist [MEDIUM confidence] | always |
| 7 | sub_1857160 | NVVMReflect | always |
| 8 | sub_1842BC0 | SCCP | always |
| 9 | sub_17060B0(1,0) | PrintModulePass | opts[3160] |
| 10 | sub_12D4560 | NVVMVerifier | always |
| 11 | sub_18A3090 | NVVMPredicateOpt | always |
| 12 | sub_184CD60 | ConstantMerge | always |
| 13 | sub_1869C50(1,0,1) | Sink/MemSSA [MEDIUM confidence] -- three-arg factory matches Sink with MemSSA parameters, but could also be a custom sinking variant | !opts[1040] |
| 14 | sub_1833EB0(3) | TailCallElim/JumpThreading [MEDIUM confidence] -- integer arg=3 could be JumpThreading threshold or TailCallElim mode; no disambiguating string | always |
| 15 | sub_17060B0(1,0) | PrintModulePass | opts[3160] |
| 16 | sub_1952F90(-1) | LoopIndexSplit | always |
| 17 | sub_1A62BF0(1,...) | LLVM standard pipeline #1 | always |
| 18 | sub_1A223D0 | NVVMIRVerification | always |
| 19 | sub_17060B0(1,0) | PrintModulePass | opts[3160] |
| 20 | sub_1A7A9F0 | InstructionSimplify | always |
| 21 | sub_1A62BF0(1,...) | LLVM standard pipeline #1 | always |
| 22 | sub_1A02540 | GenericToNVVM | always |
| 23 | sub_198DF00(-1) | LoopSimplify | always |
| 24 | sub_1C76260 | ADCE | !opts[1320] |
| 25 | sub_195E880(0) | LICM | opts[2880] |
| 26 | sub_19C1680(0,1) | LoopUnroll | !opts[1360] |
| 27 | sub_17060B0(1,0) | PrintModulePass | opts[3160] |
| 28 | sub_19401A0 | InstCombine | always |
| 29 | sub_1968390 | SROA | always |
| 30 | sub_196A2B0 | EarlyCSE | always |
| 31 | sub_19B73C0(2,-1,-1,-1,-1,-1,-1) | LoopUnswitch | always |
| 32 | sub_17060B0(1,0) | PrintModulePass | opts[3160] |
| 33 | sub_190BB10(0,0) | SimplifyCFG | always |
| 34 | sub_1A13320 | NVVMRematerialization | always |
| 35 | sub_17060B0(1,1) | PrintModulePass (verified) | opts[3160] |
| 36 | sub_18F5480 | DSE | always |
| 37 | sub_18DEFF0 | DCE | always |
| 38 | sub_1A62BF0(1,...) | LLVM standard pipeline #1 | always |
| 39 | sub_18B1DE0 | NVVMLoopPass/BarrierOpt [MEDIUM confidence] -- address is in NVVM pass range, but dual name reflects ambiguity between loop optimization and barrier optimization roles | always |
| 40 | sub_1841180 | FunctionAttrs | always |
Tier 1/2/3 -- Phase-Specific Sub-pipeline (sub_12DE8F0)
Called with tier number (1, 2, or 3). Address: 0x12DE8F0. Stores the tier value to qword_4FBB410. When tier==3, sets qword_4FBB370 = 6 if BYTE4 was 0 (enables advanced barrier and memory space optimization features).
The pass sequence is significantly longer than Tier 0 and varies by tier. The following shows the superset of all passes that can be inserted; tier-based guards are annotated.
Confidence note: Same methodology as Tier 0 table above. Most identifications confirmed by diagnostic strings or NVVMPassOptions slot cross-references.
| Pos | Factory Address | Likely Pass | Guard |
|---|---|---|---|
| 1 | sub_1CB4E40(1) | NVVMIntrinsicLowering | !opts[2000] |
| 2 | sub_1A223D0 | NVVMIRVerification | !opts[2600] |
| 3 | sub_1CB4E40(1) | NVVMIntrinsicLowering (barrier) | !opts[2000] |
| 4 | sub_18E4A00 | NVVMBarrierAnalysis | opts[3488] |
| 5 | sub_1C98160(0) | NVVMLowerBarriers | opts[3488] |
| 6 | sub_17060B0(1,0) | PrintModulePass | opts[3160] && !opts[1080] |
| 7 | sub_12D4560 | NVVMVerifier | !opts[600] |
| 8 | sub_185D600 | IPConstPropagation | opts[3200] && !opts[920] |
| 9 | sub_1857160 | NVVMReflect | opts[3200] && !opts[880] |
| 10 | sub_18A3430 | NVVMPredicateOpt | opts[3200] && !opts[1120] |
| 11 | sub_1842BC0 | SCCP | opts[3200] && !opts[720] |
| 12 | sub_17060B0(1,0) | PrintModulePass | !opts[1080] |
| 13 | sub_12D4560 | NVVMVerifier | !opts[600] |
| 14 | sub_18A3090 | NVVMPredicateOpt variant | opts[3200] && !opts[2160] |
| 15 | sub_184CD60 | ConstantMerge | opts[3200] && !opts[1960] |
| 16 | sub_190BB10(1,0) | SimplifyCFG | tier!=1 && !opts[1040] && !opts[1200] |
| 17 | sub_1952F90(-1) | LoopIndexSplit | (same guard) && !opts[1160] |
| 18 | sub_12D4560 | NVVMVerifier | (same guard) && !opts[600] |
| 19 | sub_17060B0(1,0) | PrintModulePass | (same guard) && !opts[1080] |
| 20 | sub_195E880(0) | LICM | opts[3704] && opts[2880] && !opts[1240] |
| 21 | sub_1C8A4D0(v) | EarlyCSE | v=1 if opts[3704] |
| 22 | sub_1869C50(1,0,1) | Sink | tier!=1 && !opts[1040] |
| 23 | sub_1833EB0(3) | TailCallElim | tier==3 && !opts[320] |
| 24 | sub_1CC3990 | NVVMUnreachableBlockElim | !opts[2360] |
| 25 | sub_18EEA90 | CorrelatedValuePropagation | opts[3040] |
| 26 | sub_12D4560 | NVVMVerifier | !opts[600] |
| 27 | sub_1A223D0 | NVVMIRVerification | !opts[2600] |
| 28 | sub_1CB4E40(1) | NVVMIntrinsicLowering | !opts[2000] |
| 29 | sub_1C4B6F0 | Inliner | !opts[440] && !opts[480] |
| 30 | sub_17060B0(1,0) | PrintModulePass | opts[3160] && !opts[1080] |
| 31 | sub_1A7A9F0 | InstructionSimplify | !opts[2720] |
| 32 | sub_12D4560 | NVVMVerifier | !opts[600] |
| 33 | sub_1A02540 | GenericToNVVM | !opts[2200] |
| 34 | sub_198DF00(-1) | LoopSimplify | !opts[1520] |
| 35 | sub_1C76260 | ADCE | !opts[1320] && !opts[1480] |
| 36 | sub_17060B0(1,0) | PrintModulePass | (same guard) |
| 37 | sub_12D4560 | NVVMVerifier | (same guard) |
| 38 | sub_195E880(0) | LICM | opts[2880] && !opts[1240] |
| 39 | sub_1C98160(0/1) | NVVMLowerBarriers | opts[3488] |
| 40 | sub_19C1680(0,1) | LoopUnroll | !opts[1360] |
| 41 | sub_17060B0(1,0) | PrintModulePass | !opts[1080] |
| 42 | sub_19401A0 | InstCombine | !opts[1000] |
| 43 | sub_196A2B0 | EarlyCSE | !opts[1440] |
| 44 | sub_1968390 | SROA | !opts[1400] |
| 45 | sub_19B73C0(tier,...) | LoopUnswitch | tier!=1, SM-arch-dependent params |
| 46 | sub_17060B0(1,0) | PrintModulePass | opts[3160] && !opts[1080] |
| 47 | sub_19B73C0(tier,...) | LoopUnswitch (2nd) | !opts[2760] |
| 48 | sub_1A62BF0(1,...) | LLVM standard pipeline | !opts[600] |
| 49 | sub_1A223D0 | NVVMIRVerification | !opts[2600] |
| 50 | sub_1CB4E40(1) | NVVMIntrinsicLowering | !opts[2000] |
| 51 | sub_17060B0(1,0) | PrintModulePass | !opts[1080] |
| 52 | sub_190BB10(0,0) | SimplifyCFG | !opts[960] |
| 53 | sub_1922F90 | NVIDIA loop pass | opts[3080] |
| 54 | sub_195E880(0) | LICM | opts[2880] && !opts[1240] |
| 55 | sub_1A13320 | NVVMRematerialization | !opts[2320] |
| 56 | sub_1968390 | SROA | !opts[1400] |
| 57 | sub_17060B0(1,0) | PrintModulePass | opts[3160] && !opts[1080] |
| 58 | sub_18EEA90 | CorrelatedValuePropagation | opts[3040] |
| 59 | sub_18F5480 | DSE | !opts[760] |
| 60 | sub_18DEFF0 | DCE | !opts[280] |
| 61 | sub_1A62BF0(1,...) | LLVM standard pipeline | !opts[600] |
| 62 | sub_1AAC510 | NVIDIA-specific pass | !opts[520] && !opts[560] |
| 63 | sub_1A223D0 | NVVMIRVerification | !opts[2600] |
| 64 | sub_1CB4E40(1) | NVVMIntrinsicLowering | !opts[2000] |
| 65 | sub_1C8E680 | MemorySpaceOpt | !opts[2680], param from opts[3120] |
| 66 | sub_1A223D0 | NVVMIRVerification | opts[3120] && !opts[2600] |
| 67 | sub_17060B0(1,0) | PrintModulePass (barrier) | !opts[1080] |
| 68 | sub_1CC71E0 | NVVMGenericAddrOpt | !opts[2560] |
| 69 | sub_1C98270(1,opts[2920]) | NVVMLowerBarriers variant | opts[3488] |
| 70 | sub_17060B0(1,0) | PrintModulePass | opts[3160] && !opts[1080] |
| 71 | sub_1C6FCA0 | ADCE | opts[2840] && !opts[1840] |
| 72 | sub_18B1DE0 | LoopOpt/BarrierOpt | opts[3200] && !opts[2640] |
| 73 | sub_1857160 | NVVMReflect | opts[3200] && tier==3 && !opts[880] |
| 74 | sub_1841180 | FunctionAttrs | opts[3200] && !opts[680] |
| 75 | sub_1C46000 | NVVMLateOpt | tier==3 && !opts[360] |
| 76 | sub_1841180 | FunctionAttrs (2nd) | opts[3200] && !opts[680] |
| 77 | sub_1CBC480 | NVVMLowerAlloca | !opts[2240] && !opts[2280] |
| 78 | sub_1CB73C0 | NVVMBranchDist | !opts[2080] && !opts[2120] |
| 79 | sub_1C7F370(1) | NVVMWarpShuffle | opts[3328] && !opts[1640] |
| 80 | sub_1CC5E00 | NVVMReduction | opts[3328] && !opts[2400] |
| 81 | sub_1CC60B0 | NVVMSinking2 | opts[3328] && !opts[2440] |
| 82 | sub_1CB73C0 | NVVMBranchDist (2nd) | opts[3328] && !opts[2080] && !opts[2120] |
| 83 | sub_17060B0(1,0) | PrintModulePass | opts[3328] && !opts[1080] |
| 84 | sub_1B7FDF0(3) | Reassociate | opts[3328] && !opts[1280] |
| 85 | sub_17060B0(1,0) | PrintModulePass (final) | opts[3160] && !opts[1080] |
Optimization Level Summary
| Pipeline | Sub-pipeline called | lsa-opt | mem-space-opt | Approx. passes |
|---|---|---|---|---|
| nvopt<O0> | (minimal, sub_1C8A4D0(0) only) | off | off | ~5--8 |
| nvopt<Ofcmax> | Sinking2 + common tail only | forced 0 | forced 0 | ~12--15 |
| nvopt<Ofcmid> | mid-level pipeline | normal | enabled | ~25--30 |
| nvopt<Ofcmin> | close to full pipeline | normal | enabled | ~30--35 |
| nvopt<O1> | sub_12DE330 (Tier 0) | normal | enabled | ~35 |
| nvopt<O2> | sub_12DE330 + Tier 1/2 | normal | enabled | ~35+ |
| nvopt<O3> | sub_12DE330 + Tier 1/2/3 | normal | enabled | ~35+ |
O1/O2/O3 all route through the same sub_12DE330 (Tier 0). The difference manifests through the tiered pass inserter sub_12DE8F0: O1 only fires Tier 1, O2 fires Tiers 1--2, O3 fires all three tiers. Within the tiers, passes additionally vary by: loop unroll factor (parameter to sub_1833EB0), vectorizer width (parameters to sub_19B73C0), CGSCC iteration count (first parameter to sub_1A62BF0), and the SM-architecture-dependent late passes gated by opts[3328].
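That fan-out can be stated as a small function. This is a hedged model only -- the binary encodes the behavior through the tier enable/threshold fields in NVVMPassOptions, not through a function like `tiersFor` (an illustrative name):

```cpp
#include <vector>

// Hedged model of the O-level -> tier fan-out: O1/O2/O3 all run the Tier 0
// sub-pipeline (sub_12DE330); the tiered inserter (sub_12DE8F0) then fires
// tiers 1..level. Illustrative, not recovered code.
std::vector<int> tiersFor(int optLevel) {
    std::vector<int> fired;
    if (optLevel < 1)
        return fired;                        // O0: minimal pipeline, no tiers fire
    fired.push_back(0);                      // Tier 0 full optimization (sub_12DE330)
    for (int t = 1; t <= optLevel && t <= 3; ++t)
        fired.push_back(t);                  // O1 -> tier 1, O2 -> 1..2, O3 -> 1..3
    return fired;
}
```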
Ofcmax critical behavior: when fast-compile level == 2 (max), the libnvvm pipeline builder forces -lsa-opt=0 and -memory-space-opt=0 even if the user explicitly enables them. This is confirmed in both sub_9624D0 (line 1358) and sub_12CC750 (line 2025).
Codegen Dispatch -- sub_12DFE00
After all optimization tiers complete, sub_12DFE00 (0x12DFE00) performs codegen pass scheduling. This is NOT a simple pass adder -- it performs a full dependency graph construction:
- Reads optimization level from `opts[200]` (0 = minimal, >1 = enable dependency tracking)
- Iterates all passes already in the pass manager
- For each pass, calls `vtable+112` (isCodeGenOnly()) to filter
- Calls `vtable+16` (getAnalysisUsage()) to extract dependencies
- Builds a secondary hash table of ordering constraints
- Dispatches each pass to the codegen subsystem in topological order via the subtarget hook at `vtable+16`
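The dependency-ordered dispatch amounts to a topological sort over the constraint edges. The sketch below is a generic Kahn's-algorithm model under that assumption -- `topoDispatch` and the integer pass IDs are illustrative, not recovered structures:

```cpp
#include <queue>
#include <utility>
#include <vector>

// Generic model of dependency-ordered dispatch: constraint edges extracted via
// getAnalysisUsage() feed a Kahn topological sort. Illustrative, not recovered.
std::vector<int> topoDispatch(int numPasses,
                              const std::vector<std::pair<int, int>>& deps) {
    std::vector<std::vector<int>> succ(numPasses);
    std::vector<int> indeg(numPasses, 0);
    for (const auto& d : deps) {            // d.second must run after d.first
        succ[d.first].push_back(d.second);
        ++indeg[d.second];
    }
    std::queue<int> ready;
    for (int i = 0; i < numPasses; ++i)
        if (indeg[i] == 0) ready.push(i);   // passes with no unmet dependencies
    std::vector<int> order;
    while (!ready.empty()) {
        int p = ready.front(); ready.pop();
        order.push_back(p);                 // dispatch pass p to the codegen subsystem
        for (int s : succ[p])
            if (--indeg[s] == 0) ready.push(s);
    }
    return order;                           // shorter than numPasses => cycle detected
}
```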
Pass Classification Statistics
| Category | Count |
|---|---|
| Module analyses | 18 |
| Module passes | ~131 |
| CGSCC analyses | 3 |
| CGSCC passes | ~10 |
| Function analyses | ~39 |
| Function AA analyses | 5 |
| Function passes | ~219 |
| LoopNest passes | 4 |
| Loop analyses | 5 |
| Loop passes | ~26 |
| MachineFunction analyses | 20 |
| MachineFunction passes | ~50 |
| Total | ~526 |
| NVIDIA additions | 33 |
| Standard LLVM | ~493 |
Complete Pass Factory Address Map
Every unique pass factory address observed in sub_12E54A0, sub_12DE330, and sub_12DE8F0:
| Pass | Factory | Call Sites |
|---|---|---|
| NVVMVerifier | sub_12D4560 | many (tiers) |
| AssumptionCacheTracker | sub_1361950 | 1 |
| TargetLibraryInfoWrapperPass | sub_149CCE0 | 1 |
| VerifierPass/BasicAA | sub_14A7550 | 1 |
| BreakCriticalEdges | sub_1654860 | 2 |
| PrintModulePass (debug dump) | sub_17060B0 | ~30+ |
| InstructionCombining | sub_1832270 | 2 |
| TailCallElim/JumpThreading | sub_1833EB0 | 3 |
| FunctionAttrs | sub_1841180 | 3 |
| SCCP | sub_1842BC0 | 2 |
| NVVMReflect | sub_1857160 | ~8 |
| IPConstantPropagation | sub_185D600 | 3 |
| Sink (MemorySSA-based) | sub_1869C50 | 3 |
| NVVMPredicateOpt | sub_18A3090 | 2 |
| AggressiveInstCombine | sub_18A3430 | 2 |
| NVVMLoopOpt/BarrierOpt | sub_18B1DE0 | 3 |
| Sinking2Pass (fast-mode) | sub_18B3080 | 1 |
| DCE | sub_18DEFF0 | 4 |
| NVVMBarrierAnalysis | sub_18E4A00 | 1 |
| CorrelatedValuePropagation | sub_18EEA90 | 3 |
| DSE | sub_18F5480 | 2 |
| DeadArgElimination | sub_18FD350 | 5 |
| SimplifyCFG | sub_190BB10 | 4 |
| NVIDIA loop pass | sub_1922F90 | 1 |
| LoopIndexSplit | sub_1952F90 | 3 |
| LICM | sub_195E880 | 4 |
| SROA | sub_1968390 | 2 |
| EarlyCSE | sub_196A2B0 | 2 |
| LoopUnroll/Vectorize | sub_197E720 | 1 |
| LoopSimplify/IndVarSimplify | sub_198DF00 | 3 |
| CorrelatedValuePropagation | sub_198E2A0 | 1 |
| InstCombine | sub_19401A0 | 2 |
| LoopUnswitch | sub_19B73C0 | 3 |
| LoopUnroll | sub_19C1680 | 2 |
| NVIDIA pass (unknown) | sub_19CE990 | 1 |
| GenericToNVVM | sub_1A02540 | 1 |
| NVVMRematerialization | sub_1A13320 | 3 |
| NVVMIRVerification | sub_1A223D0 | 5+ |
| LLVM StandardPassPipeline | sub_1A62BF0 | ~9 |
| LoopIdiomRecognize | sub_1A68E70 | 1 |
| InstructionSimplify | sub_1A7A9F0 | 3 |
| NVIDIA-specific pass | sub_1AAC510 | 1 |
| MemCpyOpt | sub_1B26330 | 4 |
| Reassociate/Sinking | sub_1B7FDF0 | 3 |
| TTIWrapperPass | sub_1BFB520 | 1 |
| NVVMLateOpt | sub_1C46000 | 1 |
| Inliner/AlwaysInline | sub_1C4B6F0 | 2 |
| NewGVN/GVNHoist | sub_1C6E560 | 1 |
| GVN | sub_1C6E800 | 2 |
| ADCE (AggressiveDCE) | sub_1C6FCA0 | 2 |
| ADCE variant | sub_1C76260 | 2 |
| NVVMWarpShuffle | sub_1C7F370 | 1 |
| EarlyCSE/GVN variant | sub_1C8A4D0 | 3 |
| MemorySpaceOpt | sub_1C8E680 | 4 |
| NVVMLowerBarriers | sub_1C98160 | 4 |
| NVVMLowerBarriers variant | sub_1C98270 | 1 |
| ProfileSummaryInfo | sub_1CB0F50 | 1 |
| NVVMIntrinsicLowering | sub_1CB4E40 | ~10 |
| NVVMBranchDist | sub_1CB73C0 | 3 |
| NVVMLowerAlloca | sub_1CBC480 | 1 |
| NVVMUnreachableBlockElim | sub_1CC3990 | 1 |
| NVVMReduction | sub_1CC5E00 | 1 |
| NVVMSinking2 | sub_1CC60B0 | 3 |
| NVVMGenericAddrOpt | sub_1CC71E0 | 1 |
| NVVMFinalLowering | sub_1CEBD10 | 1 |
| NVVMPeephole | sub_1CEF8F0 | 2 |
| NVVMAnnotationsProcessor | sub_215D9D0 | 2 |
Total unique pass factories: ~65.
NVVMPassOptions Offset-to-Pass Guard Map
The NVVMPassOptions struct (4,512 bytes, 221 slots) controls which passes execute. The pipeline assembler reads boolean flags at specific offsets to gate pass insertion. See NVVMPassOptions for the full slot layout. Key offset-to-pass mappings:
| Offset | Slot | Type | Controls |
|---|---|---|---|
| +200 | 9 | int | Optimization level (0/1/2/3) |
| +280 | 15 | bool | DCE disable |
| +320 | 17 | bool | TailCallElim/JumpThreading disable |
| +360 | 19 | bool (default=1) | NVVMLateOpt disable |
| +600 | 31 | bool | NVVMVerifier disable |
| +720 | 37 | bool | SCCP disable |
| +760 | 39 | bool | DSE disable |
| +880 | 45 | bool | NVVMReflect disable |
| +920 | 47 | bool | IPConstantPropagation disable |
| +960 | 49 | bool | SimplifyCFG disable |
| +1000 | 51 | bool | InstCombine disable |
| +1040 | 53 | bool | Sink/MemSSA disable |
| +1080 | 55 | bool | PrintModulePass disable |
| +1160 | 59 | bool | LoopIndexSplit disable |
| +1240 | 63 | bool | LICM disable |
| +1280 | 65 | bool | Reassociate disable |
| +1320 | 67 | bool | ADCE disable |
| +1360 | 69 | bool | LoopUnroll disable |
| +1400 | 71 | bool | SROA disable |
| +1440 | 73 | bool | EarlyCSE disable |
| +1760 | 89 | bool | MemorySpaceOpt disable |
| +2000 | 101 | bool | NVVMIntrinsicLowering disable |
| +2320 | 117 | bool (default=1) | NVVMRematerialization disable |
| +2440 | 123 | bool | NVVMSinking2 disable |
| +2600 | 131 | bool | NVVMIRVerification disable |
| +2840 | 141 | bool (default=1) | ADCE enable (reversed logic) |
| +2880 | 143 | bool (default=1) | LICM enable (reversed logic) |
| +3120 | 155 | bool (default=1) | MemorySpaceOpt (2nd pass) enable |
| +3160 | 157 | bool (default=1) | PrintModulePass/debug dump enable |
| +3200 | 159 | bool (default=1) | Advanced NVIDIA passes group enable |
| +3328 | 165 | bool (default=1) | SM-specific late passes enable |
| +3488 | 175 | bool | Barrier optimization enable |
| +3648 | 181 | ptr | Language string ("ptx"/"mid"/"idn") |
| +3656 | — | int | Language string length |
| +3704 | 185 | bool | Late optimization / address-space flag |
| +4064 | 201 | bool | Concurrent compilation enable |
| +4104 | 203 | int (default=-1) | Thread count |
| +4224 | 211 | bool (default=1) | Master optimization enable |
| +4304 | 213 | bool | Device-code / separate-compilation flag |
| +4384 | 217 | bool | Fast-compile bypass (skip LLVM pipeline) |
| +4464 | 219 | bool (default=1) | Late CFG cleanup guard |
Infrastructure Functions
| Address | Function | Role |
|---|---|---|
| 0x2342890 | sub_2342890 | Master pass registration (~2,816 lines) |
| 0xE41FB0 | sub_E41FB0 | StringMap::insert (48-byte entries, open-addressing) |
| 0xE41C70 | sub_E41C70 | StringMap::grow (hash table resize) |
| 0xC94890 | sub_C94890 | String hash function (DJB/FNV-family) |
| 0x9691B0 | sub_9691B0 | String equality (len + memcmp) |
| 0xC931B0 | sub_C931B0 | StringRef::find_first_of (delimiter search) |
| 0x95CB50 | sub_95CB50 | StringRef::consume_front (strip llvm:: prefix) |
| 0x233C410 | sub_233C410 | Help listing (--print-pipeline-passes) |
| 0x233BD40 | sub_233BD40 | AA name resolver (chain of comparisons) |
| 0x233C0C0 | sub_233C0C0 | AA pipeline parser |
| 0x233C300 | sub_233C300 | Extension callback dispatch |
| 0x233A120 | sub_233A120 | Generic parameterized option parser |
| 0x12E54A0 | sub_12E54A0 | Master pipeline assembler (49.8KB) |
| 0x12DE0B0 | sub_12DE0B0 | AddPass (hash-table-based insertion) |
| 0x12DE330 | sub_12DE330 | Tier 0 full optimization sub-pipeline |
| 0x12DE8F0 | sub_12DE8F0 | Tier 1/2/3 phase-specific sub-pipeline |
| 0x12DFE00 | sub_12DFE00 | Codegen dispatch (dependency-ordered) |
| 0x226C400 | sub_226C400 | Pipeline name selector (nvopt<O#>) |
| 0x2277440 | sub_2277440 | Pipeline text parser entry |
| 0x225D540 | sub_225D540 | New PM nvopt registration |
| 0x12C35D0 | sub_12C35D0 | Legacy PM pipeline orchestrator |
| 0x2342820 | sub_2342820 | LastRunTrackingAnalysis factory |
| 0x2342830 | sub_2342830 | PassInstrumentationAnalysis factory |
| 0x2342840 | sub_2342840 | VerifierAnalysis factory |
| 0x2342850 | sub_2342850 | InlinerWrapper factory (shared by 4 inliner variants) |
| 0x2342860 | sub_2342860 | InvalidateAllAnalysesPass factory |
| 0x2342870 | sub_2342870 | VerifierPass factory |
| 0x2342880 | sub_2342880 | GuardWideningPass factory |
| 0x2339850 | sub_2339850 | PassBuilder destructor |
| 0x233B610 | sub_233B610 | PassBuilder::~PassBuilder cleanup |
Cross-References
- Optimizer -- runtime pipeline assembly, two-phase model, concurrent compilation
- NVVMPassOptions -- 221-slot option struct controlling pass enablement
- Optimization Levels -- O0/O1/O2/O3 and Ofcmin/Ofcmid/Ofcmax
- Concurrent Compilation -- Phase I/II, thread pool, GNU Jobserver
Scalar Passes: SROA, EarlyCSE & JumpThreading
Three LLVM scalar optimization passes play outsized roles in cicc's GPU pipeline. Each is a stock LLVM implementation with NVIDIA configuration overrides (and in EarlyCSE's case, binary-level modifications). Each appears multiple times in the pipeline at different tier levels, and each can be independently disabled via NVVMPassOptions flags.
SROA (Scalar Replacement of Aggregates)
SROA eliminates alloca instructions by decomposing aggregates into individual SSA values that the register allocator can place in registers. On a GPU this is existential: every surviving alloca becomes a spill to .local memory (DRAM-backed, 200-800 cycle latency on cache miss versus zero for a register). A single un-promoted alloca in a hot loop can degrade kernel throughput by 10-50x. SROA also eliminates the .param space copies generated for byval struct parameters, preventing round-trips through local memory.
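SROA's effect can be shown in source form. In this hedged C++ analogue (illustrative, not recovered code), `withAggregate` mirrors the pre-SROA shape -- the aggregate temporary is an alloca, i.e. a `.local` spill candidate on the GPU -- while `scalarReplaced` mirrors the result after decomposition into independent scalars:

```cpp
// SROA modeled in source form. withAggregate: aggregate alloca before the
// pass; scalarReplaced: each field promoted to an independent SSA value that
// the register allocator can keep in registers. Illustrative C++ only.
struct Pair {
    float x, y;
};

float withAggregate(float a, float b) {
    Pair p;                 // aggregate alloca before SROA
    p.x = a * 2.0f;
    p.y = b + 1.0f;
    return p.x + p.y;
}

float scalarReplaced(float a, float b) {
    float px = a * 2.0f;    // field x promoted to a scalar
    float py = b + 1.0f;    // field y promoted to a scalar
    return px + py;         // no alloca survives
}
```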
EarlyCSE (Early Common Subexpression Elimination)
Cicc's EarlyCSE is not stock LLVM. The binary contains four CUDA-specific extensions: barrier-aware memory versioning that prevents CSE across __syncthreads() and other synchronization points, shared memory address space 7 protection against unsafe store-to-load forwarding between threads, a dedicated NVVM intrinsic call CSE handler with fast-path recognition for thread-invariant special register reads (threadIdx.x, etc.), and a PHI operand limit of 5 for compile-time control. It also adds a fourth scoped hash table (store-forwarding) that upstream LLVM lacks.
JumpThreading
JumpThreading duplicates basic blocks so that predecessors with statically-determinable branch conditions jump directly to the correct successor, eliminating warp divergence. The pass is fundamentally at odds with PTX's requirement for reducible control flow: block duplication can create irreducible cycles. Cicc addresses this through loop header protection (jump-threading-across-loop-headers defaults to false), conservative duplication thresholds (6-instruction block limit), and a late-pipeline StructurizeCFG safety net that catches any irreducibility that slips through. NVIDIA provides a separate "disable-jump-threading" kill switch (distinct from upstream's "disable-JumpThreadingPass"), with an OCG experiment annotation suggesting architecture-specific cases where the CFG disruption outweighs the benefit.
Full JumpThreading analysis >>>
Cross-References
- Pipeline & Ordering -- tier-dependent scheduling of all three passes
- Register Allocation -- surviving allocas after SROA become register pressure; failed promotion leads to .local memory spills
- StructurizeCFG -- the safety net that catches irreducible CFG created by JumpThreading or other passes
- GVN -- GVN performs load CSE and redundancy elimination complementary to EarlyCSE, running later in the pipeline with more expensive analysis
- MemorySpaceOpt -- resolves generic pointers to specific address spaces; interacts with EarlyCSE's address-space-aware load forwarding
- DSE -- Dead Store Elimination complements EarlyCSE's within-block store-to-load forwarding with cross-block dead store detection
SROA (Scalar Replacement of Aggregates)
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
LLVM version note: Based on LLVM 20.0.0's SROA.cpp. Evidence: the preserve-cfg / modify-cfg pipeline parser parameters match LLVM 16+ new PM integration, and the two-pass analysis mode (qword_50055E8) matches the LLVM 17+ pre-analysis path. The core splitting algorithm is stock LLVM with no CUDA-specific modifications detected.
SROA is the single most important early-pipeline optimization for NVIDIA GPU compilation. Every alloca instruction that survives into code generation is lowered to .local memory (NVPTX address space 5) -- physically backed by device DRAM and accessed through the L1/L2 cache hierarchy. A .local access that misses L1 costs 200-400 cycles; a register read costs zero. A single un-promoted alloca in a hot loop can degrade kernel throughput by 10-50x. SROA's job is to decompose aggregate allocas (structs, arrays, unions) into individual scalar SSA values that the register allocator can place in registers, eliminating the memory traffic entirely.
| Property | Value |
|---|---|
| Pass name | "sroa" |
| Pipeline parser params | preserve-cfg, modify-cfg |
| Entry function | sub_2935C30 (runOnAlloca) |
| Core function | sub_2930B90 (splitAlloca) |
| Binary footprint | ~138 KB primary (80 KB + 58 KB), ~200 KB secondary (legacy PM) |
| Binary address range | 0x2910000-0x293FFFF (178 functions) |
| Pipeline positions | Position 4 (early, after NVVMReflect) and post-sinking (late) |
| Disable flag | NVVMPassOptions offset +1400 |
| Size threshold knob | qword_50056C8 (max alloca size in bits) |
| Two-pass flag | qword_50055E8 (enables pre-analysis for new PM) |
| NVIDIA modifications | None to core algorithm |
| Upstream source | llvm/lib/Transforms/Scalar/SROA.cpp |
Why SROA Is Existential on GPU
On a CPU, an alloca that cannot be promoted to a register lives on the stack -- a cached, low-latency memory region with typical access times of 1-4 cycles. On an NVIDIA GPU there is no hardware stack cache: every surviving alloca becomes a .local allocation backed by DRAM with 200-800 cycle latency on cache miss versus zero for a register. See the GPU Execution Model memory hierarchy table for per-tier latencies.
Every alloca that survives SROA becomes a .local allocation. The NVPTX backend emits these as frame objects in the NVPTXFrameLowering::emitPrologue path, and ptxas maps them to per-thread local memory. Because occupancy is bounded by register count per SM, and .local spills effectively consume both registers (for the address) and memory bandwidth, the performance impact compounds.
The pipeline runs SROA twice: once early (position 4, immediately after NVVMReflect) to eliminate allocas before any other transform sees them, and once late (after NVVMCustomSinking2 and BreakCriticalEdges) to catch allocas created or exposed by loop unrolling, inlining, and other mid-pipeline transforms. The early invocation handles the common case (byval parameter copies, local struct variables); the late invocation cleans up whatever the loop optimizer and sinking passes left behind.
The isAllocaPromotable Fast Path
Before performing any splitting, runOnAlloca checks whether the alloca is trivially promotable via sub_B4CE70 (isAllocaPromotable). An alloca is promotable if every use is a simple load or store with no address-taken escape -- the same criterion as mem2reg. When this returns true, SROA marks the alloca for mem2reg and returns without performing any slice analysis or splitting. This fast path avoids the O(n) slice-building cost for the vast majority of CUDA local variables (scalar int, float, simple pointers), which are already simple enough for mem2reg to handle directly.
Algorithm: runOnAlloca (sub_2935C30)
The top-level per-alloca entry point. Validates the alloca as a candidate, builds the partition/slice table, and delegates to splitAlloca for the actual transformation.
Phase 1: Candidate Validation
runOnAlloca(state, alloca):
    if alloca has no users:
        eraseFromParent(alloca)
        return
    if isAllocaPromotable(alloca):
        defer to mem2reg
        return
    type = getAllocatedType(alloca)
    type_byte = getTypeID(type)
    // Accept: integers(3), half(4), bfloat(5), float(6),
    //         pointers(10), vectors(11), arrays(12), structs(15-18, 20)
    // Reject structs/composites unless isVectorType returns true
    if type_byte not in {3,4,5,6,10,11,12,15,16,17,18,20}:
        return
    if type_byte in {15,16,17,18,20} and not isVectorType(type):
        return                          // function types, labels, etc.
    size = getTypeSizeInBits(type)      // sub_BDB740
    if size > qword_50056C8:            // SROA size threshold
        return                          // alloca too large, leave for backend
The size threshold at qword_50056C8 is a global tuning knob, likely controlled by the sroa<preserve-cfg> / sroa<modify-cfg> pipeline parameter. Allocas larger than this threshold are left untouched; the backend will lower them to .local memory. The exact default is not exposed in the binary's constructor initializers, but upstream LLVM uses a default of 128 bytes (1024 bits) for the sroa-threshold flag.
Phase 2: Use Analysis and Slice Building
metadata = buildMetadataTable(alloca)   // sub_D5F1F0
if qword_50055E8:                       // two-pass mode
    buildSlices(state, alloca, 1)       // sub_2927160 — pre-analysis
    slices = buildPartitions(state)     // sub_2924690
else:
    slices = buildPartitions(state)     // single-pass
buildSlices (sub_2927160) walks all users of the alloca, classifying each use as a "slice" -- a byte range [start, end) with associated flags. Each slice is a 24-byte entry:
| Offset | Size | Field |
|---|---|---|
| +0 | 8 | start (byte offset into alloca) |
| +8 | 8 | end (byte offset, exclusive) |
| +16 | 8 | flags -- bit 2 = splittable, bits [63:3] = user instruction metadata pointer |
buildPartitions (sub_2924690) groups non-overlapping slices into partitions. Each partition represents a contiguous byte range that can be replaced by a single sub-alloca. Overlapping slices are merged; slices that cross partition boundaries are marked as "unsplittable."
The two-pass flag (qword_50055E8) enables a pre-analysis pass that runs buildSlices first with a "dry-run" mode to count slices and pre-allocate arrays, then runs the actual partition builder. This is the new PM (PassManager) style -- the legacy PM code path at 0x1A10000 does a single pass.
Phase 3: Contiguous Slice Merging
After building slices, runOnAlloca scans for contiguous ranges that share the same base type and can be merged:
for each group of contiguous slices:
    if all loads/stores in group use the same type:
        if none are volatile (isVolatile check via sub_B46500):
            if all are in-bounds (byte +2, bit 0):
                mergeSlices(group)  // sub_11D2BF0 + sub_11D3120 + sub_11D7E80
This merging step reduces redundant slices before the splitting phase. For example, if a 16-byte struct is read by four contiguous 4-byte i32 loads, the merger can combine them into a single slice covering the full struct, which may then map to a single <4 x i32> vector value rather than four separate scalar registers.
Phase 4: Dead Instruction Processing
for each dead instruction found during analysis:
    for each operand:
        addToWorklist(operand)       // sub_29220F0
    replaceAllUsesWith(undef)        // sub_BD84D0 + sub_ACADE0
    eraseFromParent(instruction)     // sub_BD60C0
Dead instructions identified during slice building (stores to never-loaded ranges, loads of write-only ranges) are removed immediately, before the splitting phase begins.
Phase 5: Recursive Splitting
if slices is non-empty:
    splitAlloca(state, alloca, slices)  // sub_2930B90 — recursive
This is the key: splitAlloca may create new sub-allocas that are themselves candidates for further splitting. The newly created sub-allocas are added to the worklist and processed in stack order (LIFO).
Phase 6-8: Post-Split Processing
After splitting, runOnAlloca processes newly created sub-allocas (56-byte records stored in a SmallVector with 2-element inline buffer), rewrites per-sub-alloca slice lists, and returns a two-byte result: byte 0 = changed flag, byte 1 = re-run needed flag.
Algorithm: splitAlloca (sub_2930B90)
The core splitting function. Given a partitioned alloca and its use-slices, it creates new sub-allocas and rewrites all users.
Phase 1: Pre-Filter Slices
Iterates the 24-byte slice array. For slices whose instruction is a load (opcode 61) or store (opcode 62) of a simple scalar type that fits entirely within the alloca boundary, clears the "splittable" bit (flag & 4). This prevents unnecessary splitting of trivial accesses -- a scalar i32 load from an i32 alloca does not need splitting. If any slices were de-flagged, calls sortSlices (sub_2912200) and compactSlices (sub_2915A90 / sub_2914CE0) to remove the now-redundant entries.
Phase 2: Partition Iteration
buildPartitionTable (sub_2913C40) produces a partition list from the sorted slices. Each partition is a local tuple [start, end, first_slice_ptr, last_slice_ptr]. The main loop advances through partitions via sub_2912870 (advancePartitionIterator).
Phase 3: Find Rewrite Target
For each partition [start, end):
- Get the DataLayout via sub_B43CC0 (getDL).
- If the partition contains only unsplittable slices, call findExistingValue (sub_291A860) to search for an existing SSA value that already covers [start, end). If found, reuse it instead of creating a new alloca.
- Otherwise, scan slices for a single dominating load or store. Dispatch on opcode:
  - 61 (load): extract the loaded type.
  - 62 (store): extract the stored value type from the store's value operand.
  - 85 (intrinsic): memcpy/memset/memmove -- follow the pointer chain to determine the affected type.
- Compare type sizes via getTypeSizeInBits (sub_BDB740).
- If no suitable existing value, create a new alloca via CreateAlloca (sub_BCD420) or CreateBitCast (sub_BCD140).
Phase 4: Size and Alignment Check
alloc_size = getTypeAllocSize(partition_type)  // sub_9208B0
if alloc_size > 0x800000:                      // 8 MB sanity limit
    skip partition
// Verify rewrite target matches partition size (8-byte aligned)
if match:
    checkTypeCompatibility(both_directions)    // sub_29191E0
    validateUnsplittableSlices(partition)      // sub_291A4D0
The 8 MB sanity limit prevents SROA from creating absurdly large sub-allocas from pathological input.
Phase 5: Slice Classification
For each slice in the partition, classifySlice (sub_29280E0) sorts it into one of two lists:
| List | Variable | Contents |
|---|---|---|
| splittable-inside | v446 | Slices fully contained within [start, end) |
| splittable-outside | v452 | Slices that reference bytes outside the partition (integer widening) |
The classification also tracks:
- v413 (sameType flag): whether all slices in the partition use the same LLVM type.
- v415 (common type): the shared type if sameType is true.
- v412 (hasPointerType): whether any slice involves a pointer type.
- Integer types (type byte == 14) are routed to the outside list for special handling (widening/narrowing may be needed).
Then rewritePartition (sub_29197E0) is called twice: first for inside slices with callback sub_2919EF0, then for outside slices if the first call produced nothing.
Phase 6: New Sub-Alloca Creation
// Compute alignment
align_log2 = _BitScanReverse64(alloca_alignment)
abi_align = getABITypeAlignment(type) // sub_AE5020
pref_align = getPrefTypeAlignment(type) // sub_AE5260
// Build name: original_name + ".sroa." + index
name = getName(alloca) + ".sroa." // sub_BD5D20
// Create the new alloca (80-byte AllocaInst object)
new_alloca = AllocaInst::Create(type, size, alignment, name)
// sub_BD2C40 + sub_B4CCA0
// Insert before the original alloca
insertBefore(new_alloca, alloca)
// Copy debug metadata
copyDebugInfo(alloca, new_alloca) // sub_B96E90 + sub_B976B0
Each sub-alloca is an 80-byte AllocaInst object with the .sroa. name prefix. The insertion point is always directly before the original alloca in the entry block, maintaining the invariant that all allocas are grouped at the function entry.
Phase 7: Instruction Rewriting
The visitUse function (sub_292A4F0) rewrites each user of the original alloca to reference the appropriate sub-alloca:
- GEP chains: retargeted to the new sub-alloca with adjusted offsets (sub_29348F0).
- Loads: rewritten with type-casts if the sub-alloca type differs from the original load type (sub_F38250).
- Stores: same treatment as loads (sub_F38250).
- Memcpy/memset: split into smaller operations covering only the sub-alloca's byte range (sub_F38330).
Each rewritten instruction is validated via sub_291F660 (validateRewrite).
Phase 8: Worklist Management
Dead instructions are removed from the pass's open-addressing hash table (at pass state offset +432, mask at +896). New sub-allocas are added to the worklist (sub_2928360) for re-processing. Allocas that cannot be split are recorded via sub_2916C30 (recordNonSplitAlloca).
Phase 9: Result Recording
For each partition that produced a new alloca, the result is stored as a 24-byte entry [new_alloca, bit_offset, bit_size] in the output array. Hash table capacity is computed using the classic 4n/3 + 1 formula (next power of 2), and entries are stored via open-addressing with linear probing (sub_29222D0 handles resizing).
Phase 10: Post-Split Use Rewriting
The most complex phase. For every use of the original alloca:
- getOperandNo (sub_B59530) determines which operand references the alloca.
- getAccessRange (sub_AF47B0) computes the byte range [begin, end) within the alloca that this use touches.
- For each new sub-alloca in the result array, checkSubAllocaOverlap (sub_AF4D30) tests whether the sub-alloca's range overlaps the use's range.
- If overlap: computeRewrittenValue (sub_2916270) produces the replacement value by combining reads from multiple sub-allocas if the original use spans a partition boundary.
- Dead uses are identified by isDeadUse (sub_291D8F0) and erased.
The use-list implementation uses a tagged-pointer scheme: bit 2 indicates "heap-allocated list" vs. "inline single element," bits [63:3] are the actual pointer. Lists are freed via _libc_free after extracting the data pointer.
Phase 11-12: Lifetime and Debug Info
Lifetime markers (llvm.lifetime.start / llvm.lifetime.end) are rewritten via sub_291E540 to cover only the sub-alloca's byte range. Debug declarations (dbg.declare, dbg.value) are similarly rewritten: each debug-info entry pointing to the original alloca is retargeted to the sub-alloca whose byte range covers the relevant fragment, using the debug expression's DW_OP_LLVM_fragment to indicate the piece.
Speculative Loads Through Select
When a load reaches its pointer through a select instruction, SROA hoists the load into both branches:
; Before SROA:
%p = select i1 %cond, ptr %a, ptr %b
%v = load float, ptr %p, align 4
; After SROA:
%vt = load float, ptr %a, align 4 ; .sroa.speculate.load.true
%vf = load float, ptr %b, align 4 ; .sroa.speculate.load.false
%v = select i1 %cond, float %vt, float %vf ; .sroa.speculated
This is significant on GPU for two reasons:
- SIMT execution model. A select on a GPU maps to a predicated move, which executes in a single cycle without divergence. The two speculative loads execute unconditionally and in parallel (both issue to the memory pipeline regardless of the predicate). This is cheaper than a control-dependent load that would require branch divergence handling.
- Alloca elimination. The original pattern requires the select to produce a pointer, which means the alloca must remain in memory (the pointer must be materializable). After speculation, both pointers are consumed directly by loads, and if %a and %b are themselves sub-allocas that can be promoted to registers, the entire chain collapses to register-only operations.
The implementation (Kind 3, lines 1024-1235 of splitAlloca) creates:
- Two BitCastInst with names .sroa.speculate.cast.true and .sroa.speculate.cast.false.
- Two LoadInst with names .sroa.speculate.load.true and .sroa.speculate.load.false, preserving alignment from the original load.
- One SelectInst with name .sroa.speculated via sub_B36550 (SelectInst::Create).
- Metadata copied from the original load via sub_B91FC0 (copyMetadata).
Interaction with .param Space
Function parameters passed by value in CUDA/PTX use the .param address space (NVPTX address space 101). The EDG frontend generates an alloca to hold a copy of each byval parameter, then loads fields from it. Consider:
struct Vec3 { float x, y, z; };
__device__ float sum(Vec3 v) {
return v.x + v.y + v.z;
}
The IR before SROA contains:
define float @sum(%struct.Vec3* byval(%struct.Vec3) align 4 %v) {
%v.addr = alloca %struct.Vec3, align 4 ; byval copy
%x = getelementptr %struct.Vec3, ptr %v.addr, i32 0, i32 0
%0 = load float, ptr %x, align 4
%y = getelementptr %struct.Vec3, ptr %v.addr, i32 0, i32 1
%1 = load float, ptr %y, align 4
%z = getelementptr %struct.Vec3, ptr %v.addr, i32 0, i32 2
%2 = load float, ptr %z, align 4
%add = fadd float %0, %1
%add1 = fadd float %add, %2
ret float %add1
}
SROA splits %v.addr into three scalar allocas (%v.addr.sroa.0, .sroa.1, .sroa.2), each holding a single float. Because each sub-alloca has only simple loads and stores, mem2reg (which runs in the next pipeline iteration) promotes all three to SSA registers. The final IR has no allocas and no memory traffic -- the three float values live entirely in registers.
Without SROA, the byval copy would persist as a .local allocation, and every field access would be a .local load. For a kernel that calls sum() in a tight loop, this difference is the difference between register-speed and DRAM-speed execution.
The NVPTXTargetLowering::LowerCall function (sub_3040BF0) emits DeclareParam (opcode 505) and StoreV1/V2/V4 (opcodes 571-573) for the .param writes on the caller side; SROA's job is to ensure the callee's reads never touch memory.
Auxiliary SROA Functions (Secondary Instance)
The binary contains a second SROA instance at 0x1A10000-0x1A3FFFF (~200 KB), corresponding to the legacy pass manager code path. This instance contains additional rewriting functions not visible in the primary (new PM) instance:
| Function | Size | Role | Key strings |
|---|---|---|---|
sub_1A3B290 | 58 KB | rewritePartition (memcpy/memset) | "memcpy.load.fca", "memcpy.store.fca", "memset.store.fca", ".fca" |
sub_1A2D070 | 35 KB | presplitLoadsAndStores | "select.gep.sroa", "select.sroa", "phi.sroa", "phi.gep.sroa" |
sub_1A2C2F0 | 9 KB | Select speculation | ".sroa.speculate.load.true", ".sroa.speculate.load.false" |
sub_1A2FFA0 | 12 KB | Vector splat handling | "vsplat", ".splatinsert", ".splat" |
sub_1A30D10 | 16 KB | Load rewriting | "copyload", "oldload" |
sub_1A31B60 | 9 KB | Extract/load patterns | "extract", "load.ext", "endian_shift", "load.trunc" |
sub_1A23B30 | 11 KB | Type casting | "sroa_raw_cast", "sroa_raw_idx", "sroa_cast" |
sub_1A3A670 | 13 KB | Speculative load promotion | ".sroa.speculated", ".sroa.speculate.load." |
sub_1A13B30 | 36 KB | Alloca analysis / slice building | -- |
sub_1A15E70 | 34 KB | Partition computation | -- |
sub_1A18770 | 38 KB | Use analysis | -- |
sub_1A3DCD0 | 15 KB | Cleanup | -- |
The .fca suffix stands for "first-class aggregate" -- LLVM's term for structs and arrays passed by value. The presplitLoadsAndStores function handles a special case where loads and stores of aggregates can be split before the main SROA algorithm runs, decomposing load { i32, i32 } into separate load i32 instructions and store { i32, i32 } into separate store i32 instructions. The select.gep.sroa and phi.gep.sroa strings indicate that this pre-split phase also handles GEP chains through PHI nodes and selects, a pattern common in CUDA code after inlining.
Data Structures
Slice Entry (24 bytes)
struct SROASlice {
uint64_t start; // +0: byte offset into alloca (inclusive)
uint64_t end; // +8: byte offset into alloca (exclusive)
uint64_t flags; // +16: bit 2 = splittable, bits [63:3] = user metadata ptr
};
The splittable bit indicates whether the slice can be split across partition boundaries. Loads and stores of simple scalars that fit entirely within the alloca have this bit cleared in Phase 1 of splitAlloca.
Sub-Alloca Record (56 bytes)
struct SubAllocaRecord {
void* alloca_ptr; // +0: pointer to the new AllocaInst
void* slice_list; // +8: pointer to slice list for this sub-alloca
uint64_t slice_list_cap; // +16: capacity of slice list
// ... additional fields through +55
};
Stored in a SmallVector<SubAllocaRecord, 2> -- the inline buffer holds two elements (common case: a struct with two fields), spilling to heap for larger aggregates.
Pass State Hash Table
The SROA pass state object (parameter a1 to both main functions) contains an open-addressing hash table at offsets +432 through +896. It uses LLVM-layer sentinels (-4096 / -8192) with instruction pointer keys. This table tracks which instructions have already been processed or are pending in the worklist. See Hash Table and Collection Infrastructure for the hash function, probing strategy, and growth policy.
Tagged Pointer Scheme
Use-lists and debug-info lists use a tagged-pointer encoding for memory efficiency:
- Bit 2 clear: the "pointer" field directly contains a single element (inline storage for the common case of one use).
- Bit 2 set: bits [63:3] are a heap-allocated pointer to a variable-length list. Freed via _libc_free after masking off the tag bits.
This avoids heap allocation for the overwhelmingly common case where an alloca field has exactly one load or one store.
IR Before/After Example
Consider a CUDA kernel that uses a local struct:
__global__ void kernel(float* out, int n) {
struct { float a; int b; float c; } local;
local.a = 1.0f;
local.b = n;
local.c = 2.0f;
out[0] = local.a + local.c;
out[1] = (float)local.b;
}
Before SROA:
define void @kernel(ptr %out, i32 %n) {
entry:
%local = alloca { float, i32, float }, align 4
%a = getelementptr { float, i32, float }, ptr %local, i32 0, i32 0
store float 1.0, ptr %a, align 4
%b = getelementptr { float, i32, float }, ptr %local, i32 0, i32 1
store i32 %n, ptr %b, align 4
%c = getelementptr { float, i32, float }, ptr %local, i32 0, i32 2
store float 2.0, ptr %c, align 4
%v0 = load float, ptr %a, align 4
%v2 = load float, ptr %c, align 4
%sum = fadd float %v0, %v2
store float %sum, ptr %out, align 4
%v1 = load i32, ptr %b, align 4
%conv = sitofp i32 %v1 to float
%idx = getelementptr float, ptr %out, i64 1
store float %conv, ptr %idx, align 4
ret void
}
After SROA (three sub-allocas, then mem2reg promotes to registers):
define void @kernel(ptr %out, i32 %n) {
entry:
; No allocas remain -- all promoted to SSA values
%sum = fadd float 1.0, 2.0 ; constant-folded later by InstCombine
store float %sum, ptr %out, align 4
%conv = sitofp i32 %n to float
%idx = getelementptr float, ptr %out, i64 1
store float %conv, ptr %idx, align 4
ret void
}
SROA splits %local into %local.sroa.0 (float), %local.sroa.1 (i32), %local.sroa.2 (float). Each sub-alloca has trivial load/store patterns, so mem2reg promotes all three. The stores and loads collapse, GEPs disappear, and the kernel runs entirely from registers.
Name Suffixes Created During Splitting
| Suffix | Purpose |
|---|---|
.sroa. | New sub-alloca name prefix |
.sroa.speculate.cast.true | Bitcast for true branch of select |
.sroa.speculate.cast.false | Bitcast for false branch of select |
.sroa.speculate.load.true | Speculative load from true branch |
.sroa.speculate.load.false | Speculative load from false branch |
.sroa.speculated | Final select combining speculative loads |
.cont | Continuation block (after branch splitting) |
.then | Then-branch block |
.else | Else-branch block |
.val | Value extracted from split load/store |
.fca | First-class aggregate decomposition |
select.gep.sroa | GEP through select, pre-split |
select.sroa | Select pointer, pre-split |
phi.sroa | PHI pointer, pre-split |
phi.gep.sroa | GEP through PHI, pre-split |
sroa_raw_cast | Raw bitcast during type rewriting |
sroa_raw_idx | Raw index computation during rewriting |
sroa_cast | Generic SROA type cast |
vsplat | Vector splat element |
.splatinsert | Splat insert element |
.splat | Splat shuffle |
copyload | Copy of a load during rewriting |
oldload | Original load being replaced |
extract | Extracted sub-value |
load.ext | Load with extension |
endian_shift | Endianness-adjustment shift |
load.trunc | Load with truncation |
memcpy.load.fca | Memcpy load of first-class aggregate |
memcpy.store.fca | Memcpy store of first-class aggregate |
memset.store.fca | Memset store of first-class aggregate |
Differences from Upstream LLVM
The core SROA algorithm in cicc v13.0 is stock LLVM SROA. No CUDA-specific modifications to the splitting logic, slice building, or partition computation were detected. The NVIDIA-specific elements are limited to:
- Pass state object layout. The offsets within the pass state structure (worklist at +432, hash table at +824-+864, sub-alloca records at +1080-+1096) reflect NVIDIA's PassManager integration, not upstream's.
- IR node encoding. Opcode numbers (61 = load, 62 = store, 85 = intrinsic, 55 = phi) and operand layout (32-byte basic blocks, tagged pointers) follow NVIDIA's modified IR format.
- Debug metadata system. The metadata kind for debug info uses MD_dbg = 38 (NVIDIA assignment), queried via sub_B91C10.
- Global threshold knob. The value at qword_50056C8 may have an NVIDIA-specific default different from upstream's 128-byte / 1024-bit default. The knob is likely settable via the pipeline text sroa<preserve-cfg> or sroa<modify-cfg>.
- Pipeline positioning. The early-pipeline placement (position 4, before NVVMLowerArgs and NVVMLowerAlloca) is NVIDIA-specific. Upstream LLVM typically places SROA after InstCombine and SimplifyCFG; cicc places it before those passes to eliminate byval parameter copies as early as possible.
Configuration
| Global | Knob | Description |
|---|---|---|
qword_50056C8 | SROA size threshold | Maximum alloca size (in bits) that SROA will attempt to split. Allocas exceeding this are left for the backend. |
qword_50055E8 | Two-pass analysis flag | When set, enables a pre-analysis pass before slice building (new PM integration). |
NVVMPassOptions offset +1400 | Disable flag | Setting this byte disables SROA entirely. |
Pipeline param preserve-cfg | -- | Runs SROA without modifying the CFG (no block splitting for speculative loads across PHIs). |
Pipeline param modify-cfg | -- | Allows SROA to modify the CFG (enables full speculative load hoisting including PHI/select decomposition). |
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| Primary instance (new PM) | -- | ||
SROAPass::runOnAlloca | sub_2935C30 | 58 KB | -- |
SROAPass::splitAlloca | sub_2930B90 | 80 KB | -- |
buildSlices (use analysis) | sub_2927160 | -- | -- |
buildPartitions (group slices) | sub_2924690 | -- | -- |
buildPartitionTable | sub_2913C40 | -- | -- |
sortSlices | sub_2912200 | -- | -- |
compactSlices (with filter) | sub_2915A90 | -- | -- |
compactSlices (simple) | sub_2914CE0 | -- | -- |
findExistingValue | sub_291A860 | -- | -- |
rewritePartition | sub_29197E0 | -- | -- |
rewriteCallback | sub_2919EF0 | -- | -- |
visitUse (rewrite one use) | sub_292A4F0 | 54 KB | -- |
validateRewrite | sub_291F660 | -- | -- |
analyzeSlice | sub_29150D0 | -- | -- |
addToNewAllocaWorklist | sub_2929FB0 | -- | -- |
addToWorklist | sub_2928360 | -- | -- |
addOperandToWorklist | sub_29220F0 | -- | -- |
clearPendingQueue | sub_2921860 | -- | -- |
classifySlice | sub_29280E0 | -- | -- |
recordNonSplitAlloca | sub_2916C30 | -- | -- |
computeRewrittenValue | sub_2916270 | -- | -- |
advancePartitionIterator | sub_2912870 | -- | -- |
rewriteGEPChain | sub_29348F0 | -- | -- |
replaceAndErase | sub_2914800 | -- | -- |
collectUsesForRewrite (variant) | sub_2914380 | -- | -- |
collectUsesForRewrite (original) | sub_2914550 | -- | -- |
| Hash table resize | sub_29222D0 | -- | -- |
| Alloca rewriting helper | sub_292D810 | 67 KB | -- |
| SROA pass metadata | sub_2912100 | -- | -- |
SROA pass registration ("Scalar Replacement Of Aggregates", "sroa") | sub_2912340 | -- | -- |
| Secondary instance (legacy PM) | -- | ||
SROAPass::runOnAlloca (legacy) | sub_1A33E80 | 61 KB | -- |
SROAPass::splitAlloca (legacy) | sub_1A37040 | 46 KB | -- |
rewritePartition (memcpy/memset) | sub_1A3B290 | 58 KB | -- |
presplitLoadsAndStores | sub_1A2D070 | 35 KB | -- |
| Select speculation | sub_1A2C2F0 | 9 KB | -- |
| Vector splat handling | sub_1A2FFA0 | 12 KB | -- |
| Load rewriting | sub_1A30D10 | 16 KB | -- |
| Extract/load patterns | sub_1A31B60 | 9 KB | -- |
| Type casting | sub_1A23B30 | 11 KB | -- |
| Speculative load promotion | sub_1A3A670 | 13 KB | -- |
| Alloca analysis / slice building | sub_1A13B30 | 36 KB | -- |
| Partition computation | sub_1A15E70 | 34 KB | -- |
| Use analysis | sub_1A18770 | 38 KB | -- |
| Cleanup | sub_1A3DCD0 | 15 KB | -- |
| Shared helpers | -- | -- | -- |
| isAllocaPromotable | sub_B4CE70 | -- | -- |
| getDL (DataLayout) | sub_B43CC0 | -- | -- |
| getTypeSizeInBits | sub_BDB740 | -- | -- |
| getTypeAllocSize | sub_9208B0 | -- | -- |
| getType | sub_BD5C60 | -- | -- |
| getName | sub_BD5D20 | -- | -- |
| AllocaInst::Create | sub_BD2C40 | -- | -- |
| PHINode::Create | sub_BD2DA0 | -- | -- |
| AllocaInst constructor | sub_B4CCA0 | -- | -- |
| CreateBitCast | sub_BCD140 | -- | -- |
| CreateAlloca | sub_BCD420 | -- | -- |
| replaceAllUsesWith | sub_BD84D0 | -- | -- |
| eraseFromParent | sub_B43D60 | -- | -- |
| SelectInst::Create | sub_B36550 | -- | -- |
| UndefValue::get | sub_ACADE0 | -- | -- |
| getABITypeAlignment | sub_AE5020 | -- | -- |
| getPrefTypeAlignment | sub_AE5260 | -- | -- |
| copyMetadata | sub_B91FC0 | -- | -- |
| isVolatile | sub_B46500 | -- | -- |
| isVectorType | sub_BCEBA0 | -- | -- |
| rewriteLoadStoreOfSlice | sub_F38250 | -- | -- |
| rewriteMemTransferOfSlice | sub_F38330 | -- | -- |
| collectAllUses | sub_AE74C0 | -- | -- |
| getAccessRange | sub_AF47B0 | -- | -- |
| checkSubAllocaOverlap | sub_AF4D30 | -- | -- |
| buildMetadataTable | sub_D5F1F0 | -- | -- |
| addToErasedSet | sub_D6B260 | -- | -- |
| Slice optimizer init | sub_11D2BF0 | -- | -- |
| Slice optimizer run | sub_11D3120 | -- | -- |
| Slice optimizer finalize | sub_11D7E80 | -- | -- |
Test This
The following kernel allocates a local struct and accesses its fields. SROA should completely eliminate the alloca, promoting all fields to registers.
struct Particle {
    float x, y, z;
    float vx, vy, vz;
};

__global__ void sroa_test(float* out, int n) {
    Particle p;
    p.x = (float)threadIdx.x;
    p.y = (float)threadIdx.y;
    p.z = 0.0f;
    p.vx = 1.0f;
    p.vy = 2.0f;
    p.vz = 3.0f;
    float energy = 0.5f * (p.vx*p.vx + p.vy*p.vy + p.vz*p.vz);
    out[threadIdx.x] = p.x + p.y + p.z + energy;
}
What to look for in PTX:
- Absence of .local memory declarations. If SROA succeeds, there should be no .local .align directives in the PTX for the Particle struct. All six fields (x, y, z, vx, vy, vz) should live in %f (float) registers.
- No st.local or ld.local instructions. These indicate that the struct survived into .local memory -- a 200-400 cycle penalty per access versus zero cycles for a register.
- The PTX should show direct register arithmetic: mov.f32, fma.rn.f32, add.f32 -- no memory traffic at all for the struct fields.
- To see the failure case, add volatile to the struct declaration (volatile Particle p;). This prevents SROA from promoting the alloca, and ld.local/st.local instructions will appear in the PTX, demonstrating the performance cliff that SROA normally prevents.
- At -O0, SROA still runs (it is correctness-relevant for address space resolution), but with a more conservative threshold. Compare the .local frame size between -O0 and -O2.
Cross-References
- Scalar Passes Hub -- hub page linking SROA, EarlyCSE, and JumpThreading with GPU-context summaries
- Pipeline & Ordering -- pipeline positions 4 and post-sinking
- Register Allocation -- surviving allocas become .local spills, directly increasing register pressure
- Rematerialization -- recomputes cheap values to reduce register pressure; operates downstream of SROA
- StructSplitting -- NVIDIA custom pass that splits struct arguments at the call boundary; complements SROA's intra-procedural splitting
- MemorySpaceOpt -- resolves generic pointers to specific address spaces; runs after SROA
- Hash Infrastructure -- the open-addressing hash table used by the SROA pass state
EarlyCSE (Early Common Subexpression Elimination)
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
LLVM version note: Based on LLVM 20.0.0 EarlyCSE.cpp. Evidence: iterative (non-recursive) dominator-tree walk matches the LLVM 16+ refactoring; MemorySSA-backed variant with early-cse-memssa pipeline parameter matches LLVM 14+. NVIDIA adds four GPU extensions (barrier-aware versioning, AS 7 handling, NVVM call CSE, PHI limit) and a fourth scoped hash table not present in any upstream version.
EarlyCSE is a fast dominator-tree-walk pass that eliminates redundant computations, loads, and calls within a function. Cicc's version is not stock LLVM 20.0.0 -- the binary contains four CUDA-specific extensions that handle GPU memory model semantics: barrier-aware memory versioning with hardcoded NVVM intrinsic ID checks, shared memory address space 7 protection against unsafe store-to-load forwarding, a dedicated NVVM intrinsic call CSE handler with a fast-path for thread-invariant special register reads, and a PHI operand limit of 5 for compile-time control. It also adds a fourth scoped hash table (store-forwarding) that upstream lacks.
Key Facts
| Property | Value |
|---|---|
| Pass name | "early-cse" (standard), "early-cse-memssa" (MemorySSA variant) |
| Pipeline parser params | memssa (selects MemorySSA-backed variant) |
| Entry point (standard) | sub_2778270 |
| Entry point (MemorySSA) | sub_27783D0 |
| Core function | sub_2780B00 (12,350 bytes) |
| NVVM call CSE handler | sub_2780450 (1,142 bytes, ~263 decompiled lines) |
| Pipeline slot | 525, 593 (tier 2); 245, 291, ~370 (tier 3); skipped at tier 1 |
| Disable flag | NVVMPassOptions offset +1440 |
| Pipeline assembler | sub_18E4A00 (MemorySSA variant), sub_196A2B0 (standard) |
| Upstream LLVM file | llvm/lib/Transforms/Scalar/EarlyCSE.cpp |
| NVIDIA modifications | Barrier generation tracking, AS 7 handling, NVVM call CSE, PHI limit, store-fwd table |
Algorithm Overview
The pass performs a stack-driven iterative DFS over the dominator tree. At each basic block it scans instructions linearly, attempting three forms of elimination:
1. Expression CSE -- arithmetic, casts, comparisons, GEPs with identical operands are looked up in a scoped hash table. If a matching canonical instruction exists, the redundant one is replaced via RAUW and erased.
2. Load CSE and store-to-load forwarding -- loads from the same address and type as a prior load (or a prior store) are replaced with the already-available value. This is gated by a CurrentGeneration counter that invalidates stale entries whenever a memory-writing instruction or barrier intrinsic is encountered.
3. Call CSE -- readonly/readnone calls with identical targets and arguments are deduplicated. The NVVM-specific handler sub_2780450 provides a fast-path for thread-invariant NVVM intrinsics (llvm.nvvm.read.ptx.sreg.*).
The dominator tree walk is not recursive. It uses an explicit growable stack (initial capacity 8 entries, 64 bytes) with DomTreeScope nodes that record per-scope hash table insertions. On scope exit all insertions are tombstoned. This matters for deeply-nested GPU kernel CFGs where stack overflow from recursion is a real risk.
function EarlyCSE(ctx):
    root = ctx.Function.DomTree.root
    stack.push(DomTreeScope(root))
    while stack is not empty:
        scope = stack.top()
        ctx.CurrentGeneration = scope.generation_begin
        if not scope.visited:
            for inst in scope.bb.instructions:
                processNode(ctx, inst)      // CSE logic below
            scope.visited = true
            scope.generation_end = ctx.CurrentGeneration
        else:
            if scope has unvisited children:
                child = scope.children.pop_front()
                stack.push(DomTreeScope(child))
                continue
            else:
                unwindScope(ctx, scope)     // tombstone entries, free node
                stack.pop()
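The scope discipline above -- reuse expressions from dominating blocks, undo each scope's insertions on exit so siblings never see each other's entries -- can be modeled in a few dozen lines. This is an illustrative sketch (names like ScopedTable and cse_walk are invented here), not a reconstruction of the binary's data layout:

```python
class ScopedTable:
    """Scoped hash table model: per-scope insertions are recorded and
    undone ("tombstoned") when the scope exits, mirroring the
    DomTreeScope insertion lists."""
    def __init__(self):
        self.table = {}      # expression key -> canonical defining block
        self.scopes = []     # stack of per-scope insertion records

    def enter_scope(self):
        self.scopes.append([])

    def insert(self, key, value):
        self.scopes[-1].append((key, self.table.get(key)))
        self.table[key] = value

    def exit_scope(self):
        for key, shadowed in reversed(self.scopes.pop()):
            if shadowed is None:
                del self.table[key]          # tombstone the entry
            else:
                self.table[key] = shadowed   # restore dominating entry

def cse_walk(children, block_exprs, root):
    """Explicit-stack DFS over a dominator tree: expressions available in
    a dominating block are reused; sibling scopes are isolated."""
    ht, eliminated = ScopedTable(), []

    def process(block):
        for expr in block_exprs.get(block, ()):
            if expr in ht.table:
                eliminated.append((block, expr))  # redundant: RAUW + erase
            else:
                ht.insert(expr, block)

    stack = [(root, iter(children.get(root, ())))]
    ht.enter_scope()
    process(root)
    while stack:
        block, kids = stack[-1]
        child = next(kids, None)
        if child is None:
            ht.exit_scope()
            stack.pop()
        else:
            stack.append((child, iter(children.get(child, ()))))
            ht.enter_scope()
            process(child)
    return eliminated
```

With root A dominating siblings B and C, an expression repeated in A and B is eliminated in B, while an expression shared only by B and C survives in both -- neither block dominates the other.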
DomTreeScope Structure
Each scope node is 160 bytes (0xA0), allocated via sub_22077B0:
| Offset | Type | Field |
|---|---|---|
| +0x00 | u32 | generation_begin -- snapshot of CurrentGeneration at scope entry |
| +0x04 | u32 | generation_end -- value at scope exit (after processing all instructions) |
| +0x08 | BasicBlock* | The basic block for this domtree node |
| +0x10 | DomTreeNode** | children_begin |
| +0x18 | DomTreeNode** | children_end |
| +0x20 | scope link | Expression ScopedHT chain -> ctx+0x78 |
| +0x38 | scope link | Load ScopedHT chain -> ctx+0x108 |
| +0x50 | scope link | Call ScopedHT chain -> ctx+0x198 |
| +0x68 | scope link | Call-values ScopedHT chain -> ctx+0x228 |
| +0x80 | scope link | Store-fwd ScopedHT chain -> ctx+0x250 |
| +0x98 | u8 | visited flag (0 = not yet processed, 1 = instructions scanned) |
Each chain entry is a triplet [link_fwd, link_back, insertion_list_head] occupying 24 bytes. On scope exit, the pass walks each insertion list and tombstones the corresponding hash table entries, then frees the scope node.
Four Scoped Hash Tables
Upstream LLVM EarlyCSE has three scoped hash tables (expression, load, call). Cicc adds a fourth dedicated to store-to-load forwarding.
| Table | Context offset | Hash function | Equality | Key | Value |
|---|---|---|---|---|---|
| Expression | +0xE8 / +0xF8 | sub_277F590 | sub_277AC50 | Opcode + operand value-numbers | Canonical instruction pointer |
| Load | +0x178 / +0x188 | sub_277CF80 | sub_27792F0 | Load address + type | Previously loaded value |
| Call | +0x230 / +0x240 | sub_277CF80 | sub_27792F0 | Call target + arguments | Return value |
| Store-fwd | +0x2C0 / +0x2D0 | sub_277C800 | sub_27781D0 | Store address + type | Stored value |
All four use open-addressing with linear probing. Sentinel values: 0xFFFFFFFFFFFFF000 = empty, 0xFFFFFFFFFFFFE000 = tombstone. Resize triggers at 75% load factor (4 * (count + 1) >= 3 * bucket_count) or when tombstones exceed 12.5% of capacity. Bucket counts are always a power of two.
The store-forwarding table is the NVIDIA addition. Upstream EarlyCSE performs store-to-load forwarding through the load table by inserting the stored value when a store is processed. Cicc separates this into a dedicated table, which enables more aggressive dead-store detection within the early pipeline -- two stores to the same address with no intervening load or barrier can be recognized without polluting the load table's namespace.
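The probing and resize mechanics shared by all four tables can be sketched as follows. This is a Python model of the recovered behavior (class and method names are invented; the real tables store raw 64-bit keys, not Python objects), using the documented sentinels and the 75% / 12.5% resize rule:

```python
EMPTY     = 0xFFFFFFFFFFFFF000   # sentinel key values from the binary
TOMBSTONE = 0xFFFFFFFFFFFFE000

class OpenAddressingTable:
    """Linear-probing table with power-of-two bucket counts."""
    def __init__(self, buckets=8):
        self.keys = [EMPTY] * buckets
        self.vals = [None] * buckets
        self.count = 0
        self.tombstones = 0

    def _needs_grow(self):
        n = len(self.keys)
        # resize at 75% load, or when tombstones exceed 12.5% of capacity
        return 4 * (self.count + 1) >= 3 * n or 8 * self.tombstones > n

    def _grow(self):
        live = [(k, v) for k, v in zip(self.keys, self.vals)
                if k not in (EMPTY, TOMBSTONE)]
        self.keys = [EMPTY] * (2 * len(self.keys))
        self.vals = [None] * len(self.keys)
        self.count = self.tombstones = 0
        for k, v in live:
            self.insert(k, v)

    def insert(self, key, value):
        if self._needs_grow():
            self._grow()
        i = hash(key) & (len(self.keys) - 1)
        while self.keys[i] not in (EMPTY, key):   # skip tombstones too
            i = (i + 1) & (len(self.keys) - 1)
        if self.keys[i] == EMPTY:
            self.count += 1
        self.keys[i] = key
        self.vals[i] = value

    def lookup(self, key):
        i = hash(key) & (len(self.keys) - 1)
        while self.keys[i] != EMPTY:              # probe past tombstones
            if self.keys[i] == key:
                return self.vals[i]
            i = (i + 1) & (len(self.keys) - 1)
        return None

    def remove(self, key):
        i = hash(key) & (len(self.keys) - 1)
        while self.keys[i] != EMPTY:
            if self.keys[i] == key:
                self.keys[i] = TOMBSTONE          # mark, never shift
                self.count -= 1
                self.tombstones += 1
                return
            i = (i + 1) & (len(self.keys) - 1)
```

Tombstoning instead of back-shifting is what makes the scoped-exit cleanup cheap: scope unwinding only has to overwrite keys, never rehash.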
CUDA Extension 1: Barrier-Aware Memory Versioning
The context structure holds a CurrentGeneration counter at offset +0x2E0 (type u32). This counter acts as a memory version number. Every load and call CSE lookup checks whether the cached entry's generation matches the current generation -- a mismatch means an intervening memory-modifying operation invalidated the entry.
Generation is incremented when:
- A trivially dead instruction is skipped (minor bump at 0x2781950)
- sub_B46490 (hasMemoryWriteSideEffects) returns true for a call instruction
- Any of four hardcoded NVVM barrier intrinsic IDs is encountered
The barrier intrinsic checks are explicit cmp dword ptr [rax+24h], IMM instructions at specific addresses in the binary:
| Address | Encoding | Intrinsic ID | Decimal | Identity |
|---|---|---|---|---|
| 0x2781B30 | cmp ..., 9Bh | 0x9B | 155 | llvm.nvvm.barrier0 (__syncthreads) |
| 0x27812AF | cmp ..., CDh | 0xCD | 205 | llvm.nvvm.membar.* (device/system memory barrier) |
| 0x2781F4D | cmp ..., 123h | 0x123 | 291 | llvm.nvvm.bar.sync (named barrier sync) |
| 0x2781F40 | cmp ..., 144h | 0x144 | 324 | NVVM cluster barrier (SM 90+ cluster-scope fence) |
These checks are a safety net on top of the intrinsics' declared memory-effect attributes. Upstream LLVM relies solely on the memory-effect modeling to determine whether a call clobbers memory. Cicc adds the explicit ID checks because the barrier intrinsics' memory effects, as declared in the NVVM tablegen files, may not fully capture the GPU-specific semantics: a bar.sync does not just write memory from the perspective of one thread -- it makes writes from other threads visible. The LLVM memory model has no native concept of inter-thread visibility guarantees at the IR level, so the explicit ID checks are the correctness backstop.
When any of these four intrinsics appears between two memory operations, EarlyCSE refuses to forward the earlier value. This prevents optimizations like:
;; INCORRECT optimization that barriers prevent:
%v1 = load i32, ptr addrspace(3) %p ;; load from shared memory
call void @llvm.nvvm.barrier0() ;; __syncthreads()
%v2 = load i32, ptr addrspace(3) %p ;; CANNOT be replaced with %v1
;; Another thread may have written to %p between the barrier and this load
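The generation discipline that prevents this forwarding is simple to model: cached entries carry the generation at which they were recorded and are invalidated lazily by a mismatch, never eagerly deleted. The sketch below is a hypothetical Python model of that mechanism (class and method names invented), not decompiled code:

```python
BARRIER_IDS = {155, 205, 291, 324}  # barrier0, membar.*, bar.sync, cluster

class MemoryVersioner:
    """Model of the CurrentGeneration counter (ctx+0x2E0)."""
    def __init__(self):
        self.generation = 0
        self.loads = {}  # address -> (value, generation when recorded)

    def visit(self, inst):
        """inst: ('load', addr, value) | ('store', addr) | ('call', id)."""
        kind = inst[0]
        if kind == "load":
            _, addr, value = inst
            hit = self.loads.get(addr)
            if hit is not None and hit[1] == self.generation:
                return hit[0]            # CSE: reuse the earlier value
            self.loads[addr] = (value, self.generation)
        elif kind == "store":
            self.generation += 1         # clobber invalidates all entries
        elif kind == "call" and inst[1] in BARRIER_IDS:
            self.generation += 1         # barrier: other threads' writes land
        return None

mv = MemoryVersioner()
mv.visit(("load", "p", "v1"))
mv.visit(("call", 155))                          # llvm.nvvm.barrier0
assert mv.visit(("load", "p", "v2")) is None     # v1 NOT forwarded past barrier
```

Without the intervening barrier the second load of the same address would return the cached value, which is exactly the forwarding the barrier check must suppress.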
CUDA Extension 2: Shared Memory Address Space 7 Handling
Stores targeting NVPTX address space 7 (the internal representation for __shared__ memory) receive special treatment that prevents unsafe store-to-load forwarding.
At address 0x2781BB6, the pass checks byte [rdx+8] == 7 on the store's pointer operand type. When this matches, the store is routed through sub_B49E20 (isSharedMemoryStore), which calls sub_B43CB0 (getCalledFunction) and sub_B2D610 (hasIntrinsicID) to confirm the target is a shared memory variable (string ID 0x31 = "shared").
The motivation: shared memory is written by one thread and potentially read by a different thread after a barrier. Forwarding a stored value to a subsequent load in the same thread is only safe if no barrier intervenes -- but even then, a reimplementor must be careful because the CUDA memory model permits a thread to read its own store without a barrier, while other threads cannot. The shared-memory path in EarlyCSE conservatively disables forwarding for shared-memory stores to avoid the case where a load is CSE'd to the stored value, but the actual runtime value has been modified by another thread's post-barrier store to the same location.
processStore(ctx, store_inst):
    ptr_type = store_inst.pointer_operand.type
    if ptr_type.address_space == 7:          // NVPTX shared memory
        if isSharedMemoryStore(store_inst):  // sub_B49E20
            ctx.CurrentGeneration++          // invalidate load/call tables
            return                           // do NOT insert into store-fwd table
    // Normal path: insert stored value into store-fwd table for later forwarding
    insertStoreForwarding(ctx, store_inst)
CUDA Extension 3: NVVM Intrinsic Call CSE (sub_2780450)
The dedicated function sub_2780450 (1,142 bytes, ~263 decompiled lines) handles CSE for calls to NVVM builtin intrinsics. It is entered when the main instruction loop detects a single-use-by-call pattern: the instruction's result has exactly one user, that user is a CallInst (opcode 0x1F), and the operand index is 3.
The function provides a fast-path for thread-invariant special register reads. Many NVVM intrinsics return values that are constant for the lifetime of a kernel invocation from a given thread's perspective:
- llvm.nvvm.read.ptx.sreg.tid.x/y/z -- threadIdx.x/y/z
- llvm.nvvm.read.ptx.sreg.ntid.x/y/z -- blockDim.x/y/z
- llvm.nvvm.read.ptx.sreg.ctaid.x/y/z -- blockIdx.x/y/z
- llvm.nvvm.read.ptx.sreg.nctaid.x/y/z -- gridDim.x/y/z
- llvm.nvvm.read.ptx.sreg.warpsize
- llvm.nvvm.read.ptx.sreg.laneid
Upstream LLVM would model these as readnone and CSE them through the generic call table. The NVVM-specific handler recognizes these intrinsic IDs directly via sub_987FE0 (getIntrinsicID), avoiding the overhead of the general readonly-call analysis. For a kernel that references threadIdx.x twenty times, the fast-path eliminates nineteen redundant intrinsic calls in a single pass.
The function also handles two additional NVVM intrinsic IDs:
| ID | Decimal | Identity | CSE behavior |
|---|---|---|---|
| 0xE4 | 228 | NVVM load intrinsic | CSE-able if same address and no intervening clobber |
| 0xE6 | 230 | NVVM store intrinsic | Blocks CSE (generation bump) |
The check at 0x2783890 tests for intrinsic ID 228 and at 0x27839BC for intrinsic ID 230. The store intrinsic (230) triggers a generation bump, while the load intrinsic (228) is treated as a CSE candidate.
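The thread-invariant fast-path amounts to unconditional deduplication: because these special registers cannot change during a kernel invocation, no clobber or generation analysis is required. The following is a rough model (the real handler dispatches on intrinsic IDs via sub_987FE0, not on name strings, and the function names here are invented):

```python
# Thread-invariant special registers: constant for a given thread for the
# whole kernel invocation, so reads can be deduplicated unconditionally.
THREAD_INVARIANT = {
    "llvm.nvvm.read.ptx.sreg.tid.x", "llvm.nvvm.read.ptx.sreg.tid.y",
    "llvm.nvvm.read.ptx.sreg.ntid.x", "llvm.nvvm.read.ptx.sreg.ctaid.x",
    "llvm.nvvm.read.ptx.sreg.nctaid.x", "llvm.nvvm.read.ptx.sreg.warpsize",
    "llvm.nvvm.read.ptx.sreg.laneid",
}

def cse_sreg_reads(calls):
    """Keep the first read of each thread-invariant sreg; later reads are
    redundant even across stores and barriers."""
    seen, kept = set(), []
    for name in calls:
        if name in THREAD_INVARIANT and name in seen:
            continue  # fast-path elimination, no clobber analysis needed
        seen.add(name)
        kept.append(name)
    return kept

# A kernel referencing threadIdx.x twenty times keeps a single read:
reads = ["llvm.nvvm.read.ptx.sreg.tid.x"] * 20
assert len(cse_sreg_reads(reads)) == 1
```

Calls outside the invariant set fall back to the generic readonly-call analysis and are not touched by this path.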
CUDA Extension 4: PHI Operand Limit
At address 0x2781BED, the pass checks:
if PHINode.getNumIncomingValues() > 5:
    skip CSE analysis for this PHI
This is a compile-time heuristic absent from upstream LLVM. GPU kernel code after loop unrolling and predication commonly produces PHI nodes with dozens of operands. Comparing all incoming values for CSE equivalence becomes quadratic in the operand count (each pair of values must be checked for dominance and equivalence), and the benefit for wide PHIs is marginal -- they rarely represent true common subexpressions.
The threshold of 5 is hardcoded with no cl::opt override.
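The compile-time effect of the guard is easy to quantify: pairwise equivalence checking of n incoming values costs n(n-1)/2 comparisons. A small model (function names are invented for illustration):

```python
PHI_CSE_LIMIT = 5  # hardcoded in the binary; no cl::opt override

def phi_cse_comparisons(phis, limit=PHI_CSE_LIMIT):
    """Model of the guard: PHIs wider than the limit are skipped.
    Returns (PHIs analyzed, pairwise equivalence checks performed)."""
    analyzed, checks = [], 0
    for name, num_incoming in phis:
        if num_incoming > limit:
            continue  # compile-time control: skip wide PHIs
        checks += num_incoming * (num_incoming - 1) // 2
        analyzed.append(name)
    return analyzed, checks
```

A 64-way PHI left over from unrolling would alone cost 2016 pairwise checks; the limit caps the per-PHI worst case at C(5,2) = 10.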
Instruction Classification
The inner processing loop at 0x2780EB5--0x2781110 classifies each instruction by its opcode byte at [instr-0x18]:
| Hex | Opcode | Instruction | EarlyCSE action |
|---|---|---|---|
| 0x55 | Store | StoreInst | Store-to-load forwarding path; shared memory check |
| 0x3D | Call | CallInst | Call CSE or generation bump (if memory effects) |
| 0x3E | Invoke | InvokeInst | Same as CallInst |
| 0x3F | Select | SelectInst | Expression CSE with type-size check |
| 0x40 | PHI | PHINode | Expression CSE if operand count <= 5 |
| <= 0x1C | -- | Constants/args | Skip (not instructions) |
| 0x29 | Return | ReturnInst | Skip |
| 0x43--0x4F | Casts | Cast instructions | Expression CSE |
The classification dispatches to these helper predicates:
| Helper | Address | Purpose |
|---|---|---|
| sub_AA54C0 | 0x2780EC6 | isTriviallyDead -- if true, bump generation and skip |
| sub_D222C0 | 0x2780F97 | isSimpleExpression -- arithmetic, casts, comparisons, GEPs |
| sub_F50EE0 | 0x2780F7A | canCSE / doesNotAccessMemory |
| sub_1020E10 | 0x2781967 | getCallCSEValue -- readonly/readnone call check |
| sub_B46420 | 0x2781B95 | isLoadCSECandidate |
| sub_B46490 | 0x2781CC6 | hasMemoryWriteSideEffects -- triggers generation bump |
Load-Store Forwarding Detailed Flow
The most complex code path (0x2781B48--0x2781F32) handles load CSE and store-to-load forwarding:
processLoad(ctx, load_inst):
    key = computeLoadCSEKey(load_inst, ctx.DataLayout)       // sub_2779A20
    if key.status != 0:
        // Cannot form clean key -- check if call/invoke returns equivalent value
        if load_inst is CallInst (0x3D) or InvokeInst (0x3E):
            tryCallValueForwarding(ctx, load_inst)
        return
    // Check for preceding store to same address
    store_entry = lookupStoreTable(ctx, key)
    if store_entry and store_entry.generation == ctx.CurrentGeneration:
        // Forward stored value to this load
        salvageDebugInfo(load_inst, store_entry.value)       // sub_BD84D0
        replaceAllUsesWith(load_inst, store_entry.value)     // sub_11C4E30
        eraseInstruction(load_inst)                          // sub_B43D60
        return CHANGED
    // Check for preceding load from same address
    load_entry = lookupLoadTable(ctx, key)
    if load_entry and load_entry.generation == ctx.CurrentGeneration:
        // Replace with previously loaded value
        replaceAllUsesWith(load_inst, load_entry.value)
        eraseInstruction(load_inst)
        return CHANGED
    // Not found -- insert into load table for future lookups
    insertLoadTable(ctx, key, load_inst, ctx.CurrentGeneration)
For stores, the pass also performs dead-store detection within the same scope: if two stores target the same address with no intervening load or barrier, the earlier store is dead. The barrier check uses the same four intrinsic ID comparisons described above.
Type Compatibility and Bitwidth Handling
At 0x27829C3--0x2782B87, for expression CSE of SelectInst and PHINode:
- sub_AE43F0 computes type size in bits via the DataLayout
- If size <= 64 bits: use a u64 bitmask as the CSE key
- If size > 64 bits: allocate a BitVector via sub_C43690 and use bit-level comparison
At 0x2782F72--0x2782FD5, integer constant range analysis computes leading zeros/ones to determine effective bit-width. If the value fits in fewer bits, EarlyCSE allows CSE across different integer types (e.g., an i32 value zero-extended to i64 against a native i64). This is an NVIDIA extension that upstream LLVM does not perform -- upstream requires exact type matches for expression CSE.
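The width test reduces to comparing the constant's effective bit-width against the narrower of the two types. A minimal model for the non-negative case (leading-ones handling for negative values is omitted; function names are invented):

```python
def effective_width(value):
    """Effective bit-width of a non-negative constant after dropping
    leading zeros (a zero constant still occupies one bit)."""
    return max(value.bit_length(), 1)

def can_cse_across_int_types(value, width_a, width_b):
    """Model of the NVIDIA extension: expressions yielding the same
    constant in different integer types may CSE when the value fits in
    the narrower type. Upstream demands exact type equality."""
    return effective_width(value) <= min(width_a, width_b)

assert can_cse_across_int_types(255, 32, 64)          # fits in 8 bits
assert not can_cse_across_int_types(1 << 40, 32, 64)  # needs 41 bits
```

This is why an i32 computation zero-extended to i64 can share a CSE entry with a native i64 computation of the same small value.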
Context Structure Layout
The EarlyCSEContext structure passed to sub_2780B00 in rdi:
| Offset | Field | Size |
|---|---|---|
| +0x00 | Current instruction pointer | 8 |
| +0x08 | DataLayout* / TargetData* | 8 |
| +0x10 | Function* (-> [+0x60] = DomTree root) | 8 |
| +0x18 | TargetLibraryInfo* | 8 |
| +0x20 | AssumptionCache* | 8 |
| +0x68 | MemDep result tracking | 8 |
| +0x70 | MemDep analysis reference | 8 |
| +0xE8--+0x110 | Expression hash table (buckets, count, ScopedHT, free list, allocator) | 40 |
| +0x170--+0x198 | Load hash table + ScopedHT | 40 |
| +0x200--+0x258 | Call hash table + ScopedHT | 88 |
| +0x2B8--+0x2D8 | Store-fwd hash table + ScopedHT | 32 |
| +0x2E0 | CurrentGeneration (u32) | 4 |
Stack frame: 0x1D0 bytes (sub rsp, 0x1A8 + 5 callee-saved pushes).
Scope Page Management
The scoped hash tables use 512-byte (0x200) scope pages chained together. When a page fills:
- At 0x2781328: fetch previous page via [stack.end - 8], advance by 0x200 to the next chained page.
- At 0x2782260: when reclaiming, free the current page and pop from the page pointer array.
The initial worklist stack is 64 bytes (8 entries of 8 bytes each). The scope-page-pointer array is 8-byte aligned via lea rbx, [rdx*4 - 4]; and rbx, ~7; add rbx, rax.
memssa Pipeline Parameter
The pipeline parser registers "early-cse" at slot 394 with the parameter keyword memssa. When memssa is specified, the pass uses the MemorySSA-backed variant (sub_27783D0, pass name "Early CSE w/ MemorySSA") instead of the standard variant (sub_2778270, pass name "Early CSE"). Both variants call the same core function sub_2780B00; the difference is that the MemorySSA variant receives a pre-built MemorySSA graph in the context structure and uses it for more precise clobber queries, avoiding the O(n^2) scanning that the non-MSSA path falls back to for load CSE.
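Parameter parsing follows the LLVM new-PM convention of `name<param;param>` specs. The sketch below models variant selection only (parser function names are invented; slot numbering is cicc-internal and not reproduced here):

```python
def parse_pass_spec(spec):
    """Parse 'name<p1;p2>' pipeline syntax into (name, [params])."""
    if "<" in spec and spec.endswith(">"):
        name, _, params = spec[:-1].partition("<")
        return name, [p for p in params.split(";") if p]
    return spec, []

def earlycse_variant(spec):
    """Pick the pass variant the way the registration at slot 394 does."""
    name, params = parse_pass_spec(spec)
    if name != "early-cse":
        raise ValueError("not an early-cse spec: " + spec)
    # 'memssa' selects the MemorySSA-backed entry (sub_27783D0);
    # otherwise the standard entry (sub_2778270) runs.
    return "Early CSE w/ MemorySSA" if "memssa" in params else "Early CSE"
```

Both spellings reach the same core (sub_2780B00); only the clobber-query machinery handed to it differs.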
Knobs
| Knob | Default | Description |
|---|---|---|
| enable-earlycse-memoryssa | true | Master switch for MemorySSA integration |
| earlycse-debug-hash | false | Debug: log hash function inputs/outputs |
| earlycse-mssa-optimization-cap | 500 | Max MemorySSA queries per block before falling back to conservative |
| enable-earlycse-imprecision | false | Allow approximate analysis in pathological cases (huge blocks, deep PHI nests) |
No dedicated cl::opt flags exist for any of the four NVIDIA extensions. The PHI operand limit of 5, the four barrier intrinsic IDs, and the shared-memory address space 7 check are all hardcoded in the binary.
Pipeline Positions and Tier Gating
| Tier | Position(s) | Notes |
|---|---|---|
| Tier 1 (O1) | Skipped | sub_12DE8F0 explicitly gates EarlyCSE with tier != 1 |
| Tier 2 (O2) | 525, 593 | Two invocations: early function simplification and post-loop-optimization |
| Tier 3 (O3) | 245, 291, ~370 | Three invocations; additional late-pipeline run |
| Ofcmid | After Sinking2 | Single invocation in the moderate-optimization path |
The pass is independently disableable via NVVMPassOptions at offset +1440. The same offset gates the standard and MemorySSA variants identically.
Key Constants
| Value | Hex | Meaning |
|---|---|---|
| 160 | 0xA0 | DomTreeScope node size |
| 512 | 0x200 | Scope page size |
| 64 | 0x40 | Initial stack capacity (8 entries) |
| 48 | 0x30 | Hash table entry node size |
| 40 | 0x28 | Insertion record size |
| 0xFFFFFFFFFFFFF000 | -- | Hash table EMPTY sentinel |
| 0xFFFFFFFFFFFFE000 | -- | Hash table TOMBSTONE sentinel |
| 155 | 0x9B | llvm.nvvm.barrier0 intrinsic ID |
| 205 | 0xCD | llvm.nvvm.membar.* intrinsic ID |
| 291 | 0x123 | NVVM bar.sync intrinsic ID |
| 324 | 0x144 | NVVM cluster barrier intrinsic ID |
| 228 | 0xE4 | NVVM load intrinsic ID |
| 230 | 0xE6 | NVVM store intrinsic ID |
| 5 | -- | PHI operand limit for CSE |
Differences from Upstream LLVM 20.0.0
| Feature | Upstream | Cicc |
|---|---|---|
| Scoped hash tables | 3 (expression, load, call) | 4 (+ store-forwarding) |
| Barrier intrinsic checks | Relies on memory-effect attributes only | Explicit ID checks for IDs 155, 205, 291, 324 |
| Shared memory handling | No address-space-specific logic | AS 7 stores skip store-fwd insertion, bump generation |
| NVVM intrinsic call CSE | Generic readonly-call path | Dedicated sub_2780450 with fast-path for sreg.* reads |
| PHI operand limit | None | Skip CSE for PHI nodes with >5 incoming values |
| Cross-type expression CSE | Exact type match required | Allows CSE across integer widths when value range fits |
| Dominator tree walk | Recursive in many LLVM builds | Always iterative (explicit stack) |
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| EarlyCSEPass::run (standard variant entry) | sub_2778270 | -- | -- |
| EarlyCSEPass::run (MemorySSA variant entry) | sub_27783D0 | -- | -- |
| Core pass body (domtree walk + instruction processing) | sub_2780B00 | 12,350 | -- |
| handleNVVMCallCSE (NVVM intrinsic call CSE) | sub_2780450 | 1,142 | -- |
| Expression hash function | sub_277F590 | -- | -- |
| Expression equality check | sub_277AC50 | -- | -- |
| Load/call key hash | sub_277CF80 | -- | -- |
| Load/call key equality | sub_27792F0 | -- | -- |
| Store key hash | sub_277C800 | -- | -- |
| Store key equality | sub_27781D0 | -- | -- |
| isSimpleExpression | sub_D222C0 | -- | -- |
| canCSE / doesNotAccessMemory | sub_F50EE0 | -- | -- |
| isSharedMemoryStore (AS 7 check) | sub_B49E20 | -- | -- |
| isSharedMemoryAccess | sub_B49E00 | -- | -- |
| getCallCSEValue (readonly/readnone check) | sub_1020E10 | -- | -- |
| isLoadCSECandidate | sub_B46420 | -- | -- |
| hasMemoryWriteSideEffects | sub_B46490 | -- | -- |
| computeCSEHash / isVolatile | sub_B46500 | -- | -- |
| getIntrinsicID (NVVM intrinsic ID from call) | sub_987FE0 | -- | -- |
| isTriviallyDead | sub_AA54C0 | -- | -- |
| replaceAllUsesWith (RAUW) | sub_11C4E30 | -- | -- |
| salvageDebugInfo | sub_BD84D0 | -- | -- |
| eraseInstruction | sub_B43D60 | -- | -- |
| removeFromParent | sub_27793B0 | -- | -- |
| computeLoadCSEKey | sub_2779A20 | -- | -- |
| insertStoreForwarding | sub_27808D0 | -- | -- |
| insertExprIntoScopedHT | sub_27801B0 | -- | -- |
| lookupScope (find value by generation) | sub_277D510 | -- | -- |
| lookupCallTable | sub_277D3C0 | -- | -- |
| lookupInScopedHT | sub_2778110 | -- | -- |
| shouldInsertIntoTable | sub_27785B0 | -- | -- |
| growTable (double hash table size) | sub_277C980 | -- | -- |
| insertIntoTable (post-grow insert) | sub_277C8A0 | -- | -- |
| cleanupLoadTable (compact after scope exit) | sub_277FFC0 | -- | -- |
| cleanupCallTable (compact after scope exit) | sub_277A110 | -- | -- |
| compareLoadTypes (type compatibility) | sub_277A9A0 | -- | -- |
| TargetData::getTypeSizeInBits | sub_AE43F0 | -- | -- |
| getCalledFunction | sub_B43CB0 | -- | -- |
| hasIntrinsicID | sub_B2D610 | -- | -- |
Common Pitfalls
These are mistakes a reimplementor is likely to make when extending EarlyCSE for a GPU target with barrier semantics.
1. Relying solely on LLVM memory-effect attributes to model barrier semantics. Upstream LLVM models barrier intrinsics as memory-writing calls, which triggers a generation bump through the standard hasMemoryWriteSideEffects path. This is insufficient for GPU barriers: a bar.sync does not just write memory from one thread's perspective -- it makes writes from other threads visible. The LLVM memory model has no native concept of inter-thread visibility guarantees. Cicc adds explicit hardcoded checks for four intrinsic IDs (155, 205, 291, 324) as a safety net. A reimplementation that trusts the declared memory effects alone will forward values across barriers, producing load CSE that reads stale pre-barrier data written by a different thread.
2. Forwarding stores to loads across barriers in shared memory (AS 7). When thread T0 stores to smem[0], a barrier fires, and thread T1 loads from smem[0], the load must see T1's own value (if it wrote) or the value written by whichever thread last stored before the barrier. Forwarding T0's stored value to T0's subsequent load is only safe if no barrier intervenes and no other thread could have written to the same location. Cicc's AS 7 handling conservatively disables store-to-load forwarding for all shared memory stores by bumping the generation counter. A reimplementation that allows shared memory store forwarding without barrier awareness will produce reads that return the local thread's stale value instead of the globally-visible post-barrier value.
3. Missing one or more of the four barrier intrinsic IDs. Cicc checks for IDs 155 (barrier0 / __syncthreads), 205 (membar.*), 291 (bar.sync), and 324 (cluster barrier for SM 90+). A reimplementation that only handles __syncthreads (ID 155) will fail to invalidate the load/call tables when a bar.sync or cluster barrier is encountered. The result: loads before and after a named barrier or cluster-scope fence are incorrectly CSE'd, producing silent data corruption in multi-CTA cooperative kernels.
4. Applying expression CSE to PHI nodes with more than 5 incoming values. Cicc hardcodes a PHI operand limit of 5 for CSE analysis. GPU kernel code after loop unrolling and predication commonly produces PHI nodes with dozens of operands. Comparing all incoming values for CSE equivalence is quadratic in operand count, and the benefit for wide PHIs is negligible -- they rarely represent true common subexpressions. A reimplementation without this threshold will experience severe compile-time regressions on heavily unrolled GPU kernels.
5. Not adding a dedicated store-forwarding hash table. Upstream LLVM uses three scoped hash tables (expression, load, call). Cicc adds a fourth table dedicated to store-to-load forwarding. Without this separation, inserting stored values into the load table pollutes the load namespace, making dead-store detection within the same scope unreliable. Two stores to the same address with no intervening load or barrier should trigger dead-store elimination of the earlier store; mixing stores into the load table obscures this pattern.
Cross-References
- Scalar Passes Hub -- hub page linking SROA, EarlyCSE, and JumpThreading with GPU-context summaries
- MemorySSA Builder for GPU -- the MemorySSA infrastructure consumed by the early-cse-memssa variant
- Hash Infrastructure -- the universal DenseMap mechanics shared by all four hash tables
- Barriers & Sync -- the barrier builtins whose intrinsic IDs trigger generation bumps
- Dead Synchronization Elimination -- the 96KB pass that removes dead barriers; interacts with EarlyCSE's barrier-aware generation tracking
- GVN -- the more expensive redundancy elimination pass that complements EarlyCSE later in the pipeline
- DSE -- Dead Store Elimination, which complements EarlyCSE's within-scope store-to-load forwarding with cross-block analysis
- Pipeline & Ordering -- tier-dependent scheduling and NVVMPassOptions gating
- Alias Analysis & NVVM AA -- address-space-aware alias analysis that feeds into MemorySSA clobber queries
InstCombine
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
Upstream source: llvm/lib/Transforms/InstCombine/InstructionCombining.cpp, llvm/lib/Transforms/InstCombine/InstCombine*.cpp (LLVM 20.0.0). The upstream is split across ~15 files by instruction category; cicc inlines them into a single monolithic visitor.
NVIDIA's InstCombine in CICC v13.0 is approximately twice the size of upstream LLVM's, weighing in at roughly 405 KB for the main visitor alone. The monolithic visitor function at sub_10EE7A0 dispatches across 80 unique opcode cases through a three-level switch structure, handling standard LLVM instructions, NVIDIA-extended vector and FMA operations, and three high-opcode NVVM intrinsic dead-code elimination patterns. A separate 87 KB intrinsic folding function (sub_1169C30) handles NVVM-specific canonicalization, and a 127 KB computeKnownBits implementation (sub_11A7600) provides the dataflow backbone. This page covers the visitor architecture, the per-instruction-type visitors recovered from the binary, and the NVIDIA-specific extensions that distinguish this implementation from upstream.
| Registration | New PM #398, parameterized: no-aggressive-aggregate-splitting;...;max-iterations=N |
| Runtime positions | Tier 0 #28 (via sub_19401A0); Tier 1/2/3 #42 (gated by !opts[1000]); see Pipeline |
| Main visitor | sub_10EE7A0 (0x10EE7A0, ~405 KB, 9,258 lines) |
| Intrinsic folding | sub_1169C30 (0x1169C30, ~87 KB, 2,268 lines) |
| computeKnownBits | sub_11A7600 (0x11A7600, ~127 KB, 4,156 lines) |
| SimplifyDemandedBits | sub_11AE870 / sub_11AE3E0 (wrapper + hash table) |
| Opcode cases | 80 unique case labels across 3 switch blocks |
| NVIDIA extra size | ~200 KB beyond upstream (~87 KB intrinsic fold + ~113 KB expanded cases) |
Visitor Architecture
The main visitor sub_10EE7A0 receives an NVVM IR node pointer (__m128i* a2) and attempts to simplify it. A persistent local v1612 aliases the instruction being visited. The function has four structural regions:
Preamble (lines ~1760--2000) performs pre-dispatch checks: validating call-site attributes (opcode 41 for bitwise-assert), handling ternary FMA instructions (opcodes 238--245), checking for constant-foldable select patterns, canonicalizing operand ordering (constant to RHS), and running SimplifyDemandedBits via sub_11A3F30 on the result type.
Opcode dispatch reads the NVVM opcode via sub_987FE0 (getOpcode) and uses a three-level switch:
| Switch Level | Opcode Range | Description |
|---|---|---|
| Level 1 | 0x99--0x2A5 (main) | Standard LLVM instructions (GEP, select, stores, casts, compares, calls, vectors) |
| Level 2 | 0x01--0x42 (low) | Binary operations, casts, early comparisons |
| Level 3 | > 0x13CF (high) | NVIDIA proprietary intrinsic IDs (9549, 9553, 9567) |
Additional if-else chains handle intermediate ranges: opcodes 0xC7E (3198), 0x2E2 (738), 0x827 (2087), 0x2CC (716), 0xE07--0xE08, 0xE4F--0xE51, 0x13C6--0x13C7, and 0x13CD--0x13CE.
The fallback path at LABEL_95 calls sub_F0C430 for generic simplification. The no-change return path at LABEL_155 is referenced 101 times throughout the function.
Per-Instruction Visitors
Each major instruction type is handled by a dedicated visitor function called from the main dispatch. The following table summarizes the recovered visitors with their sizes and key characteristics.
visitBinaryOperator -- sub_10D8BB0
| Address | 0x10D8BB0 (102 KB, 2,078 lines) |
| Dispatch case | 0x3A in the master dispatcher |
| Sibling cases | 0x39 (NSW/NUW-focused), 0x3B (associative/commutative) |
This is the second-largest visitor. It implements approximately 25 cascading simplification phases for all binary arithmetic (Add, Sub, Mul, Div, Rem, Shl, LShr, AShr, And, Or, Xor, and their floating-point counterparts). The phases execute in a strict try-and-return order:
Phase 0 runs quick exits: pattern-matched constant fold (sub_101E960), SimplifyBinOp (sub_F29CA0), algebraic identities (sub_F0F270), NSW/NUW simplification (sub_F11DB0), and critically the NVIDIA-specific intrinsic handler sub_11AE870 which runs before any standard LLVM folds.
Phases 1--9 handle associative/commutative factoring, cross-operand Mul-of-Add matching, delegated simplification, overflow detection, and multiply-shift strength reduction. Phase 5 detects multiply-by-power-of-2 and converts to shift; sub_10BA120 builds the full strength reduction for patterns like x * (2^n + 1) into (x << n) + x.
Phases 10--25 cover Add-of-Mul factoring, shift chains, linear expression folding, subtraction of multiplied constants, demanded-bits masking, reciprocal elimination, overflow intrinsic decomposition, and division/remainder folding. The division constant folder uses sub_C46BD0 (APInt::udiv), sub_C499A0 (APInt::urem), sub_C45F70 (APInt::sdiv), and sub_C49AB0 (APInt::srem).
Four template-instantiated helpers at sub_10D2680--sub_10D2D70 (2,767 bytes each, identical structure) implement matchBinOpReduction parameterized by NVVM intrinsic ID (329, 330, 365, 366) and acceptable opcode range. These detect NVVM horizontal reduction intrinsics (e.g., horizontal add/mul across vector lanes) and simplify them to scalar binary operations.
visitICmpInst -- sub_1136650 + sub_113CA70
| Comprehensive folder | sub_1136650 (0x1136650, 149 KB, 3,697 lines) |
| Per-opcode dispatch | sub_113CA70 (0x113CA70) -- 12 case labels |
The ICmp folder is the single largest function in InstCombine. It runs before the per-opcode dispatch table and handles 15 major fold categories: all-ones/sign-bit constant folds, Mul-with-constant strength reduction (NUW-gated), nested Mul decomposition, common sub-operand cancellation, NUW/NSW flag-gated predicate conversion, known-nonnegativity folds, ConstantRange intersection, shared sub-operand elimination, Sub sign-bit analysis, min/max pattern recognition, computeKnownBits sign-bit analysis, power-of-2 optimizations, remainder pattern matching, XOR/shift decomposition, and Or/And decomposition with type width folding.
NVVM uses a custom predicate encoding stored at ICmpInst+2 as a 6-bit field (*(_WORD*)(inst+2) & 0x3F):
| Value | Predicate | Value | Predicate |
|---|---|---|---|
| 32 | EQ | 33 | NE |
| 34 | UGT | 35 | UGE |
| 36 | ULT | 37 | ULE |
| 38 | SGT | 39 | SGE |
| 40 | SLT | 41 | SLE |
The per-opcode dispatch at sub_113CA70 routes based on the non-constant operand's opcode tag:
| Tag | Instruction | Handler | Size |
|---|---|---|---|
| * (42) | Mul | sub_1128290 | 1,178 lines |
| , (44) | Add | sub_1119FB0 | 413 lines |
| . (46) | Trunc | sub_1115510 | -- |
| 0 (48) | SExt | sub_11164F0 | -- |
| 1 (49) | ZExt | sub_1122A30 | -- |
| 4 (52) | Select | sub_1115C10 | 428 lines |
| 6 (54) | And | sub_1120680 | 911 lines |
| 7 (55) | Or | sub_1126B10 | 786 lines |
| 8 (56) | Xor | sub_1126B10 | shared with Or |
| 9 (57) | Shl | sub_112C930 | 664 lines |
| : (58) | LShr | sub_1133500 | -- |
| ; (59) | Sub | sub_111CED0 + sub_113BFE0 | 519 lines |
visitCastInst -- sub_110CA10
| Address | 0x110CA10 (93 KB, 2,411 lines) |
| Cast chain helper | sub_110B960 (22 KB, 833 lines) |
Handles all cast simplification: same-type identity elimination, bool-to-float chains, integer-to-integer narrowing/widening, FP-to-int special cases, FP narrowing, cast-through-select/PHI, and the major cast-of-cast chain folding. The helper sub_110B960 implements deep cast chain folding for aggregate types using a worklist with a DenseMap for O(1) deduplication, preventing exponential blowup on diamond-shaped use-def graphs. The function is conservative about side effects: sub_B46500 (isVolatile) is called before every fold.
visitSelectInst -- sub_1012FB0
| Address | 0x1012FB0 (74 KB, 1,801 lines) |
| Local variables | 190 total |
Implements 18 prioritized select simplifications: constant fold, undef arm elimination, both-same identity, PHI-through-select, KnownBits sign analysis, ConstantRange analysis, full-range analysis, KnownBits cross-validation, ICmpInst arm synthesis, ExtractValue decomposition, implied condition, canonicalization (delegated to sub_1015760, 27 KB), min/max pattern detection (smin/smax/umin/umax/abs/nabs via four helpers), select-in-comparison chains, PHI-select worklist scan (DenseMap with hash (ptr >> 9) ^ (ptr >> 4)), ValueTracking classification, pointer-null folding, and load/trunc delegation.
visitPHINode -- sub_1175E90
| Address | 0x1175E90 (~57 KB, ~2,130 lines) |
Implements 16 PHI optimization strategies tried in sequence: SimplifyInstruction constant fold, foldPHIArgOpIntoPHI (binary/cast with one varying operand), foldPHIArgConstantOp, typed opcode dispatch (GEP via sub_1172510, InsertValue, ExtractValue, CmpInst, BinOp/Cast), GEP incoming deduplication with loop back-edge analysis, single-use PHI user check, GEP-of-PHI transform (sub_1174BB0, 1,033 lines), phi-cycle escape detection, trivial PHI elimination (all-same non-PHI value), recursive PHI cycle resolution (sub_116D410), operand reordering canonicalization, identical-PHI-in-block deduplication, pointer-type struct GEP optimization, all-undef incoming check, and dominator-tree GEP index hoisting using two DenseMaps.
visitCallInst -- sub_1162F40
| Address | 0x1162F40 (50 KB, 1,647 lines) |
Processes calls through a 15-step cascade: LibCall simplification (sub_100A740), standard intrinsic folding (sub_F0F270), return attribute analysis (sub_F11DB0), overflow/saturating arithmetic (sub_115C220), inline mul-by-constant folding, generic call combining (sub_115A080), FMA/fneg/fsub canonicalization (the largest block, requiring all of nnan+ninf+nsz+arcp+reassoc on both call and function), constant-argument intrinsic folding, unary intrinsic constant folding, exp/log pair detection (IDs 325 and 63), sqrt/rsqrt folding (IDs 284, 285), min/max folding (IDs 88, 90), nested intrinsic composition, division-to-reciprocal-multiply, and finally the NVIDIA-specific sub_115A4C0 which dispatches to the 87 KB intrinsic folding table.
visitLoadInst -- sub_1152CF0
| Address | 0x1152CF0 (~68 KB, ~1,680 lines) |
| Stack frame | 0x4F0 (1,264) bytes |
Four major paths: constant-address fold (loads from known constant pointers with types <= 64 bits are replaced via symbol table lookup using sub_BCD420), address-space-based elimination (loads from non-AS(32) pointers are replaced with constants, exploiting CUDA's read-only address spaces), the main store-to-load forwarding worklist (BFS over the def-use graph following GEPs, PHIs, and bitcasts, depth-limited by global qword_4F90528), and dominator-based forwarding for non-pointer loads. Alignment is propagated as the maximum of source and destination, with the volatile bit carefully preserved through the *(node+2) 16-bit field (bits [5:0] = log2(alignment), bit [6] = volatile flag).
NVIDIA-Specific Extensions
NVVM Intrinsic Folding -- sub_1169C30
This 87 KB function is the core of NVIDIA's additions to InstCombine. Called from the main visitor when the instruction is an NVIDIA intrinsic, it uses a two-layer dispatch:
Layer 1 (primary switch, entered when the uses-list is empty or the "fast" flag at a1+336 is set) dispatches on the node's byte-tag:
| Tag | Char | Fold Type |
|---|---|---|
| 42 | * | FNeg/negation -- pushes negation through arithmetic via the "Negator" chain |
| 55 | 7 | Vector extract from intrinsic result (full-width extract becomes identity) |
| 56 | 8 | Vector insert into intrinsic result (full-width insert becomes And mask) |
| 59 | ; | Multiply-like symmetric intrinsic (folds when one operand is known non-negative) |
| 68 | D | ZExt of i1 intrinsic result (bypasses intrinsic wrapper) |
| 69 | E | SExt of i1 intrinsic result (bypasses intrinsic wrapper) |
| 85 | U | Call-site fold for llvm.nvvm.* with specific IDs (313, 362) |
| 86 | V | Select-like intrinsic fold (dead select elimination) |
Layer 2 (depth-gated by qword_4F908A8 = instcombine-negator-max-depth) adds aggressive cases:
| Tag | Char | Fold Type |
|---|---|---|
| 46 | . | Dot product fold |
| 54 | 6 | Indexed access / extract with fold |
| 58 | : | Comparison intrinsic fold |
| 67 | C | Type conversion intrinsic fold |
| 84 | T | Tensor / multi-operand intrinsic fold |
| 90 | Z | Zero-extend intrinsic fold |
| 91 | [ | Three-operand fold (e.g., fma) |
| 92 | \ | Four-operand fold (e.g., dp4a) |
| 96 | ` | Unary special intrinsic fold |
The FNeg case (tag 42) is the most complex. It first attempts constant folding: if the operand is all-ones (-1), it creates sub(0, operand) via CSE lookup with opcode 30. When the simple fold fails, it falls through to the Negator chain at LABEL_163: sub_1168D40 collects all negatable sub-expressions, sub_1169800 attempts to fold negation into each operand, and the results are combined with sub_929C50 or sub_929DE0. This pushes negation through chains of arithmetic to find a cheaper representation, depth-gated to prevent exponential blowup. Created replacement instructions carry .neg modifier metadata for PTX emission.
Three High-Opcode NVIDIA Intrinsics
Opcodes 0x254D (9549), 0x2551 (9553), and 0x255F (9567) are NVIDIA-proprietary intrinsic IDs handled directly in the main visitor. All three share the same pattern: extract the commuted-operand index via v1612->m128i_i32[1] & 0x7FFFFFF, verify the other operand has byte-tag 12 or 13 (ConstantInt/ConstantFP), query metadata via sub_10E0080 with mask 0xFFFFFFFFFFFFFFFF, and test specific bit patterns:
| Opcode | Test | Fold Condition |
|---|---|---|
| 0x2551 (9553) | ((result >> 40) & 0x1E) == 0x10 | Fold when bit pattern mismatches |
| 0x255F (9567) | (result & 0x10) != 0 | Fold when bit 4 is clear |
| 0x254D (9549) | (result & 0x200) != 0 | Fold when bit 9 is clear |
When the fold condition holds (that is, when the decompiled test in the table evaluates false), the shared epilogue calls sub_F207A0(v6, v1612->m128i_i64) (eraseInstFromFunction), deleting the instruction entirely. These implement dead-code elimination for NVIDIA intrinsics with constant arguments matching known-safe-to-remove criteria.
Separate Storage Assume Bundles
At lines 6557--6567 of the main visitor, the code iterates over operand bundles on llvm.assume calls (opcode 0x0B). For each bundle with a tag of exactly 16 bytes matching "separate_storage" (verified by memcmp), it calls sub_10EA360 on both bundle operands. This implements NVIDIA's separate_storage alias analysis hint, allowing InstCombine to exploit non-aliasing assumptions for pairs of pointers declared to reside in separate memory spaces.
Expanded GEP Handling
The GEP case (opcode 0x99 = 153) is significantly expanded compared to upstream. The global dword_4F901A8 controls a depth-limited chain walk for nested GEP simplification:
v729 = getOperand(0) of GEP
if (dword_4F901A8) {
v730 = 0;
do {
if (!isConstantGEP(v729)) break;
++v730;
v729 = getOperand(0, v729); // walk up
} while (v730 < dword_4F901A8);
}
if (*(_BYTE*)v729 != 85) // 85 = CallInst
goto LABEL_155; // bail
This walks backward through constant-index GEP chains up to dword_4F901A8 steps, looking for a CallInst base pointer. The knob controls how many GEP levels to look through when simplifying GEP(GEP(GEP(..., call_result))).
Ternary/FMA Support
The preamble handles 3-operand instructions (opcodes 238--245) representing fused multiply-add variants. This includes checking whether the third operand is a zero-constant, converting between FMA opcode variants (238 vs. 242), and handling address space mismatches on FMA operand types -- entirely NVIDIA-specific for CUDA's FMA intrinsics.
computeKnownBits -- sub_11A7600
The 127 KB computeKnownBits implementation dispatches on the first byte of the NVVM IR node (the type tag):
| Tag | Char | Node Type |
|---|---|---|
| 42 | * | Truncation (extracts low bits) |
| 44 | , | GEP (computes known bits through pointer arithmetic) |
| 46 | . | Comparison (known result bits) |
| 48 | 0 | Select (intersection of known bits from both arms) |
| 52 | 4 | Branch-related |
| 54 | 6 | Vector shuffle |
| 55 | 7 | Vector extract |
| 56 | 8 | Vector insert |
| 57 | 9 | PHI node (intersection across incoming values) |
| 58 | : | Comparison variant |
| 59 | ; | Invoke / call |
| 67 | C | Cast chain |
| 68 | D | Binary op path 1 |
| 69 | E | Binary op path 2 |
| 85 | U | CallInst (sub-dispatch: 0x0F=abs, 0x42=ctpop, 0x01=bitreverse) |
| 86 | V | LoadInst |
A debug assertion at lines 2204--2212 fires when computeKnownBits and SimplifyDemandedBits produce inconsistent results, printing both APInt values and calling abort(). This invariant check (known_zero & known_one == 0, plus consistency with the demanded mask) is compiled in for debug/checked builds.
SimplifyDemandedBits -- sub_11AE870
The wrapper sub_11AE870 gets the bit-width via sub_BCB060 (or sub_AE43A0 for non-integer types), allocates two APInts sized to the width, delegates to sub_11AE3E0, and frees any heap-allocated storage. The core implementation at sub_11AE3E0 (235 lines) calls computeKnownBits, then if the instruction was simplified, walks the use-chain and inserts each user into a hash table (open-addressing with quadratic probing, hash = (ptr >> 9) ^ (ptr >> 4)) at offset +2064 from the InstCombiner context. This "seen instructions" set prevents infinite recursion during demanded-bits propagation.
Configuration Knobs
| Global | CLI Flag | Default | Used In |
|---|---|---|---|
| dword_4F901A8 | (GEP chain look-through depth) | unknown | GEP handler (case 0x99) |
| qword_4F908A8 | instcombine-negator-max-depth | -1 | sub_1169C30 (depth gate) |
| qword_4F90988 | instcombine-negator-enabled | 1 | ctor_090 |
| qword_4F8B4C0 | instcombine-split-gep-chain | -- | ctor_068 |
| qword_4F8B340 | instcombine-canonicalize-geps-i8 | -- | ctor_068 |
| qword_4F909E0 | instcombine-max-num-phis | -- | ctor_091 |
| qword_4F90120 | instcombine-guard-widening-window | 3 | ctor_087 |
| qword_4F90528 | (load forwarding search depth) | -- | sub_1152CF0 |
Key Helper Functions
| Address | Recovered Name | Purpose |
|---|---|---|
| sub_987FE0 | getOpcode() | Reads NVVM opcode from IR node |
| sub_B46B10 | getOperand(idx) | Operand access |
| sub_B44E20 | eraseFromParent() | Unlink instruction |
| sub_F207A0 | eraseInstFromFunction() | Delete instruction from worklist |
| sub_F162A0 | replaceInstUsesWith() | RAUW and return replacement |
| sub_F20660 | setOperand(i, val) | Replace operand in-place |
| sub_B33BC0 | CreateBinOp() | IRBuilder binary op creation |
| sub_B504D0 | CreateBinOp(no-flags) | Binary op without flags |
| sub_B51D30 | CreateCast() | Cast instruction creation |
| sub_AD8D80 | ConstantInt::get(type, APInt) | Constant integer factory |
| sub_AD64C0 | ConstantInt::get(type, val, signed) | Constant integer factory (scalar) |
| sub_BCB060 | getScalarSizeInBits() | Type bit-width query |
| sub_10E0080 | getKnownBitsProperty() | Metadata property query |
| sub_B43CB0 | getFunction() | Get parent function |
| sub_B43CA0 | getParent() | Get parent basic block |
| sub_10A0170 | extractFlags() | Read fast-math, exact, etc. |
| sub_B44900 | isCommutative() | Check commutativity |
| sub_C444A0 | APInt::countLeadingZeros() | Bit analysis |
| sub_986760 | APInt::isZero() | Zero test |
| sub_10EA360 | recordSeparateStorageOperand() | Separate storage alias hint |
Diagnostic Strings
Diagnostic strings recovered from the InstCombine binary region. InstCombine uses assertion-style diagnostics rather than optimization remarks; the computeKnownBits consistency check is the primary runtime diagnostic.
| String | Source | Category | Trigger |
|---|---|---|---|
"computeKnownBits(): " | sub_904010 in sub_11A7600 line ~2204 | Assertion | Debug build: computeKnownBits and SimplifyDemandedBits produce inconsistent results (prints both APInt values, then calls abort()) |
"SimplifyDemandedBits(): " | sub_904010 in sub_11A7600 line ~2212 | Assertion | Debug build: paired with computeKnownBits() inconsistency diagnostic above |
"separate_storage" | Main visitor lines 6557--6567 | Bundle tag | Matched via memcmp (16 bytes) on llvm.assume operand bundles; not a user-visible diagnostic |
"instcombine-negator-max-depth" | ctor_090 at 0x4F908A8 | Knob | Knob registration (default -1, unlimited) |
"instcombine-negator-enabled" | ctor_090 at 0x4F90988 | Knob | Knob registration (default 1, enabled) |
"instcombine-split-gep-chain" | ctor_068 at 0x4F8B4C0 | Knob | Knob registration |
"instcombine-canonicalize-geps-i8" | ctor_068 at 0x4F8B340 | Knob | Knob registration |
"instcombine-max-num-phis" | ctor_091 at 0x4F909E0 | Knob | Knob registration |
"instcombine-guard-widening-window" | ctor_087 at 0x4F90120 | Knob | Knob registration (default 3) |
InstCombine does not emit OptimizationRemark diagnostics. The only runtime-visible diagnostic is the debug assertion that fires when computeKnownBits and SimplifyDemandedBits produce inconsistent results (known_zero & known_one != 0, or results disagree with the demanded mask). This check is compiled into debug/checked builds only and calls abort() after printing both APInt values.
Size Contribution Estimate
| Component | Size | Description |
|---|---|---|
| Upstream visitor baseline | ~200 KB | Standard LLVM visiting ~50 instruction types |
| sub_1169C30 intrinsic folding | ~87 KB | NVVM-specific intrinsic canonicalization |
| NVVM GEP/FMA/vector cases | ~40 KB | Expanded GEP chains, ternary FMA, vector width-changing |
| separate_storage + assume | ~10 KB | Operand bundle handling for alias hints |
| High-opcode NVIDIA intrinsics | ~15 KB | DCE for opcodes 0x254D/0x2551/0x255F |
| Expanded comparator/cast | ~50 KB | Extended ICmp, cast chain, select handling |
| NVIDIA total addition | ~200 KB | Roughly doubles upstream InstCombine |
Optimization Level Behavior
| Level | Scheduled | Instances | Notes |
|---|---|---|---|
| O0 | Not run | 0 | No optimization passes |
| Ofcmax | Runs | 1 | Single instance in fast-compile pipeline |
| Ofcmid | Runs | 2 | Early + post-GVN cleanup |
| O1 | Runs | 3-4 | Early, post-SROA, post-GVN, late cleanup |
| O2 | Runs | 4-5 | Same as O1 + additional Tier 2 instance after loop passes |
| O3 | Runs | 5-6 | Same as O2 + Tier 3 instance; benefits from more aggressive inlining/unrolling |
InstCombine is the most frequently scheduled pass in the CICC pipeline. Each instance runs the full 405 KB visitor but benefits from different preceding transformations: the post-SROA instance cleans up cast chains from aggregate decomposition, the post-GVN instance simplifies expressions exposed by redundancy elimination, and the late instance performs final canonicalization before codegen. The instcombine-negator-max-depth and instcombine-negator-enabled knobs apply uniformly across all instances. Even at Ofcmax, at least one InstCombine run is considered essential for basic IR canonicalization. See Optimization Levels for pipeline tier details.
Differences from Upstream LLVM
| Aspect | Upstream LLVM | CICC v13.0 |
|---|---|---|
| Binary size | ~200 KB main visitor | ~405 KB main visitor + 87 KB intrinsic folding (~2x upstream) |
| NVVM intrinsic folding | No NVVM-specific intrinsic canonicalization | Dedicated 87 KB function (sub_1169C30) with two-layer dispatch for negation, vector extract/insert, FMA, tensor, dot product, and 15+ fold types |
| High-opcode DCE | Not present | Three NVIDIA proprietary intrinsic IDs (9549, 9553, 9567) with constant-argument dead-code elimination |
| separate_storage bundles | No separate_storage operand bundle handling | Iterates llvm.assume bundles, extracting "separate_storage" hints for alias-based optimization |
| Ternary FMA opcodes | Standard llvm.fma / llvm.fmuladd folding | Extended preamble handles opcodes 238--245 for CUDA FMA variants with address-space mismatch handling |
| GEP chain look-through | Single-level GEP simplification | Depth-limited chain walk (dword_4F901A8 steps) backward through constant-index GEP chains to find CallInst base pointers |
| Horizontal reduction | Standard intrinsic-based reduction fold | Four template-instantiated matchBinOpReduction helpers for NVVM horizontal reduction intrinsics (IDs 329, 330, 365, 366) |
| KnownBits integration | Separate computeKnownBits in ValueTracking | Fused 127 KB computeKnownBits + SimplifyDemandedBits with GPU special-register range oracle |
GVN (Global Value Numbering)
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
Upstream source:
llvm/lib/Transforms/Scalar/GVN.cpp, llvm/lib/Transforms/Scalar/NewGVN.cpp (LLVM 20.0.0)
CICC v13.0 ships two GVN implementations: the classic GVN pass at 0x1900BB0 (83 KB, ~2,314 decompiled lines) and a NewGVN pass at 0x19F99A0 (68 KB, ~2,460 decompiled lines). Both are derived from upstream LLVM but carry substantial NVIDIA modifications for GPU-specific value numbering, store splitting, and intrinsic-aware CSE. The knob constructor at ctor_201 (0x4E0990) registers eleven tunables that control PRE, store splitting, PHI removal, dominator caching, and recursion depth.
Key Facts
| Property | Value |
|---|---|
| Pass name (pipeline) | gvn (parameterized) |
| Registration | New PM #397, parameterized: no-pre;pre;no-load-pre;load-pre;... |
| Runtime positions | Tier 0 #5 (via sub_1C6E800); also appears at NewGVN/GVNHoist position #6; see Pipeline |
| Classic GVN entry | sub_1900BB0 (83 KB, 2,314 lines) |
| NewGVN entry | sub_19F99A0 (68 KB, 2,460 lines) |
| Knob constructor | ctor_201 at 0x4E0990 |
| Upstream source | llvm/lib/Transforms/Scalar/GVN.cpp, NewGVN.cpp (LLVM 20.0.0) |
Knob Inventory
Knobs are registered in ctor_201 at 0x4E0990. Bool knobs use cl::opt<bool> (vtable 0x49EEC70); int knobs use cl::opt<int> (vtable 0x49EEB70). The store-split limit knobs route through a custom NVIDIA registrar at sub_190BE40 that accepts an int** default initializer.
| Knob | Type | Default | Global Address | Purpose |
|---|---|---|---|---|
| enable-pre | bool | true | 0x4FAEEE0 | Enable Partial Redundancy Elimination |
| enable-load-pre | bool | true | 0x4FAEE00 | Enable load PRE (load sinking across edges) |
| enable-split-backedge-in-load-pre | bool | false | 0x4FAED20 | Allow splitting backedges during load PRE |
| enable-phi-remove | int | 2 | 0x4FAEC40 | PHI removal aggressiveness (0=off, 2=aggressive) |
| dump-phi-remove | int | 0 | 0x4FAEB60 | Dump PHI removal decisions (debug) |
| no-split-stores-below | int | -1 | 0x4FAEA80 | Minimum store width in bits for splitting (-1 = no limit) |
| no-split-stores-above | int | -1 | 0x4FAE9A0 | Maximum store width in bits for splitting (-1 = no limit) |
| split-stores | bool | true | 0x4FAE8C0 | Master enable for store splitting |
| profusegvn | bool | true | 0x4FAE7E0 | Verbose diagnostics via NVIDIA profuse framework |
| gvn-dom-cache | bool | true | 0x4FAE700 | Cache dominator tree query results (cache size 32) |
| max-recurse-depth | int | 1000 | 0x4FAE620 | Maximum recursion depth during simplification |
IR Before/After Example
GVN eliminates redundant computations and forwards store values to loads. The following shows a common GPU pattern: a redundant load eliminated via value numbering, and a store-to-load forward.
Before:
define void @f(ptr addrspace(1) %p, ptr addrspace(1) %q) {
%a = load float, ptr addrspace(1) %p, align 4
%b = fmul float %a, 2.0
%c = load float, ptr addrspace(1) %p, align 4 ; redundant load (same %p, no intervening store)
%d = fadd float %b, %c
store float 42.0, ptr addrspace(1) %q, align 4
%e = load float, ptr addrspace(1) %q, align 4 ; load from location just stored to
ret void
}
After:
define void @f(ptr addrspace(1) %p, ptr addrspace(1) %q) {
%a = load float, ptr addrspace(1) %p, align 4
%b = fmul float %a, 2.0
; %c eliminated -- replaced with %a (same value number)
%d = fadd float %b, %a
store float 42.0, ptr addrspace(1) %q, align 4
; %e eliminated -- forwarded from store (value 42.0)
ret void
}
The second load from %p is eliminated because GVN assigns it the same value number as %a. The load from %q after the store is forwarded directly from the stored constant. On GPU, eliminating memory loads is especially valuable because each avoided ld.global saves hundreds of cycles of memory latency.
Classic GVN Algorithm
The main entry point is GVN::runOnFunction at sub_1900BB0. The pass object is approximately 600 bytes and carries four scoped hash tables plus a dominator tree reference.
Pass Object Layout
| Offset | Field | Purpose |
|---|---|---|
| +0 | vtable | Pass vtable pointer |
| +16 | Function* | Current function being processed |
| +72 | MemoryDependenceResults* | MemDep analysis handle |
| +88 | DominatorTree* | Dominator tree |
| +240 | LeaderTable | Hash: value number to canonical leader |
| +392 | StoreExprTable | Hash: store expressions |
| +544 | LoadExprTable | Hash: load expressions |
| +592 | RPO counter | Current block's RPO number |
Complexity
Let N = number of instructions, B = number of basic blocks, and D = depth of the dominator tree.
- RPO walk: the classic GVN traversal visits every instruction exactly once, O(N); each instruction is hashed and looked up in the leader table in O(1) amortized time via the scoped hash tables.
- Memory dependence: getDependency queries are O(D) per load in the worst case, cached by MemDep to amortize across the function.
- PRE: insertion adds at most O(N) new instructions.
- Store splitting: bounded by the number of stores times the split factor (controlled by no-split-stores-below/above).
- Dominance queries: the gvn-dom-cache (size 32) converts repeated queries from O(D) to O(1).
- PHI removal: replaceAndErase is O(U) per replaced value, where U = number of uses.

Overall: O(N * D) in the worst case due to dominance queries; O(N) in practice with the dominator cache enabled (default). NewGVN's partition-based algorithm is O(N * alpha(N)) amortized, where alpha is the inverse Ackermann function from union-find, though the fixpoint iteration can degrade to O(N^2) on pathological inputs.
Traversal Strategy
The pass walks the dominator tree in reverse post-order using an explicit segmented stack rather than recursion. The initial allocation is an 8-slot array of segment pointers (sub_22077B0(64)), each segment holding 64 pointers (512 bytes). The stack grows by allocating new segments and shrinks by freeing segments when popping past a boundary.
Each dominator tree node is a 136-byte structure (sub_22077B0(136)) containing RPO in/out numbers, basic block pointer, child pointers, scope chain links for all four hash tables, an undo list for backtracking, and a visited flag at offset +128.
Main Processing Loop
For each dominator tree node popped from the stack, the pass:
- Sets the RPO number from the node's RPO_in field.
- Skips already-visited nodes (checked via the byte at offset +128).
- Iterates every instruction in the basic block.
- Attempts SimplifyInstruction (sub_1AE9990) first; if it succeeds, replaces all uses and erases via sub_19003A0.
- Dispatches on the instruction opcode byte at offset +16:
  - Case 4 (call/intrinsic): Classifies purity via bitmask 0x1F133FFE23FFFF, checks volatility through sub_1560260 (flag 36), looks up in the LeaderTable via sub_18FDEE0 (hash) + sub_18FB980 (compare), and inserts new leaders via sub_18FEF10.
  - Case 79 (load): Queries memory dependence, checks four NVIDIA intrinsic IDs for special pointer extraction, then attempts store-to-load forwarding or PRE.
  - Case 114 (store): Inserts into the StoreExprTable using a 5-element hash key (opcode, type, pointer, value, alignment) via sub_18FEB70 / sub_18FFC60.
  - Default: General expression numbering through sub_13E3350, with sub-dispatch for branches (opcode 57), loads (54/55), and call-like instructions (78).
NVIDIA Intrinsic-Aware Value Numbering
The classic GVN recognizes four NVIDIA-specific LLVM intrinsic IDs and extracts their pointer operands with non-standard indices:
| Intrinsic ID | Name | Pointer Operand Index | Semantics |
|---|---|---|---|
| 4057 | llvm.nvvm.ldu | 1 - numOperands | Load from uniform memory; aggressively CSE-able |
| 4085 | llvm.nvvm.ldg | 1 - numOperands | Load via texture/global cache; CSE if same address |
| 4492 | (NVIDIA-specific) | 2 - numOperands | Variant load with 2-operand pointer extraction |
| 4503 | (NVIDIA-specific) | 2 - numOperands | Variant load with 2-operand pointer extraction |
These intrinsics bypass the standard volatility checks and use custom operand extraction, allowing CSE of texture and surface loads that upstream LLVM GVN would not touch.
Scoped Hash Tables
GVN maintains four ScopedHashTable instances, pushed on dominator tree entry and popped on exit. The scope teardown at lines 1858-2101 restores the LoadExprTable via the undo list at offset +120, restores the StoreExprTable via the undo list at offset +72, frees the MemDepTable scope through sub_18FE3A0, and deallocates the 136-byte dom node.
The hash function (sub_18FDEE0, approximately 140 lines) is NVIDIA-modified. For binary ops (opcodes 35-52), it hashes the opcode and operand pointers with canonicalization (smaller pointer first for commutative operations). For comparisons, it includes the predicate. For GEPs (opcodes 86/87), it hashes the entire index sequence via sub_1597510. Hash mixing uses the formula (ptr >> 9) ^ (ptr >> 4) with XOR combining. The 5-element store expression variant (sub_18FEB70) computes:
hash = (v12>>9)^(v12>>4) ^ (v11>>9)^(v11>>4) ^ (v10>>9)^(v10>>4) ^ (37*v13) ^ (v9>>9)^(v9>>4)
Store Splitting
Three knobs control this NVIDIA-specific extension: split-stores (master enable), no-split-stores-below and no-split-stores-above (bit-width bounds, both default -1 meaning unlimited). The custom registrar at sub_190BE40 handles the limit knobs.
When GVN discovers a store that partially overlaps with a load, it attempts to split the store into sub-stores that individually satisfy dependence constraints. This is critical for GPU code where vector stores (float4, int4) partially overlap with subsequent scalar loads, texture/surface stores have alignment constraints, and shared memory bank conflicts may favor different store granularities.
The function sub_18FECC0 classifies store expressions by instruction type: store (54), atomic store (55), shufflevector (58), extractelement (59), and insertelement (82). The shufflevector/extract/insert handling reflects NVIDIA's lowering of vector operations into intermediate forms before GVN runs.
Dominator Cache
The gvn-dom-cache knob (default true, cache size 32) addresses a known performance bottleneck. GVN's dominance queries are O(n * depth) and can become expensive on deeply nested GPU kernels with many divergent branches. The cache stores recent dominates(A, B) results keyed by basic block pointer, converting repeated queries to O(1). The working set size of 32 was chosen empirically: GPU kernels typically have moderate dominator tree depth because shared memory parallelism keeps CFGs relatively flat.
PHI Removal
After GVN identifies equivalent values, some PHI nodes become trivial. The enable-phi-remove knob controls aggressiveness: level 0 disables removal, level 1 removes only trivially redundant PHIs, and level 2 (default) removes PHIs that become trivial after leader substitution.
The core replaceAndErase routine (sub_19003A0, 11 KB) iterates all uses of a replaced value, checks each PHI-node use for trivial foldability using a SmallDenseSet (opcode 23), and employs a 4-way unrolled loop (lines 301-317) for use scanning. This micro-optimization targets the common case of PHIs with many incoming edges after switch lowering or loop unrolling.
NewGVN
The NewGVN implementation at sub_19F99A0 (68 KB) uses congruence classes instead of simple leader tables, following the partition-based algorithm from Karthik Gargi (2002). The pass object stores a congruence class hash table at offset +1400 with count, bucket array, entry count, tombstone count, and bucket count fields.
The algorithm:
- Builds initial partitions from the RPO-ordered instruction list.
- For each worklist instruction, queries the current congruence class and computes the new value expression.
- If the expression maps to a different class, splits the partition.
- Repeats until fixpoint (no more splits).
Hash table growth is handled by sub_19F5120; insert-or-find by sub_19E6B80. Congruence class members are sorted (sub_19F5A00 + sub_19F6B20) for efficient merge operations.
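The partition-refinement loop can be illustrated with a minimal model: start with coarse classes keyed by opcode, then repeatedly split any class whose members disagree on their operands' classes, until nothing changes. This is a generic Gargi-style refinement over toy instructions, not CICC's data structures; all types and names are invented:

```cpp
#include <cassert>
#include <map>
#include <tuple>
#include <vector>

// Toy instruction: an opcode plus operand references. Non-negative
// operands index other instructions; negative values stand for distinct
// leaf inputs (e.g. -1 = %a, -2 = %b).
struct Inst { int opcode; std::vector<int> operands; };

// Returns a congruence-class id per instruction. Including the current
// class in the key makes each round a pure split (never a merge), which
// guarantees termination at a fixpoint.
std::vector<int> congruence_classes(const std::vector<Inst>& insts) {
    std::vector<int> cls(insts.size());
    for (std::size_t i = 0; i < insts.size(); ++i)
        cls[i] = insts[i].opcode;                      // initial partition by opcode
    for (bool changed = true; changed; ) {
        changed = false;
        std::map<std::tuple<int, std::vector<int>>, int> key2cls;
        std::vector<int> next(insts.size());
        for (std::size_t i = 0; i < insts.size(); ++i) {
            std::vector<int> opcls;
            for (int op : insts[i].operands)
                opcls.push_back(op < 0 ? op : cls[op]); // leaves keep their own id
            auto key = std::make_tuple(cls[i], opcls);
            next[i] = key2cls.emplace(key, (int)key2cls.size()).first->second;
        }
        if (next != cls) { cls = next; changed = true; } // repeat until fixpoint
    }
    return cls;
}
```

Two `add %a, %b` instructions land in the same class; an `add %a, %c` does not. The real pass does this sparsely with a worklist rather than recomputing every class each round.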
Memory Dependence Integration
GVN interacts with MemoryDependenceResults at offset +72 through three key functions:
| Function | Address | Role |
|---|---|---|
| getDependency | sub_1422850 | Returns the memory instruction this load depends on |
| getDominatorTree | sub_1423BA0 | Extracts the DomTree from MemDep for dominance queries |
| properlyDominates | sub_1428550 | Tests strict dominance through the MemDep tree |
The replacement safety check (sub_18FBB40) returns true immediately when RPO numbers match, and otherwise chains through getDependency -> getIDom -> dominates().
Profuse Diagnostics
The profusegvn knob (default true) enables verbose output through NVIDIA's custom profuse diagnostic framework, not the standard LLVM OptimizationRemark system. When active, diagnostics are emitted at value replacement decisions, store/load expression matches, and PRE insertion decisions. The framework is likely controlled by environment variables such as CICC_PROFUSE_DIAGNOSTICS.
Key Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| GVN::runOnFunction | 0x1900BB0 | 83 KB | Main classic GVN pass |
| replaceAndErase | 0x19003A0 | 11 KB | Replace uses + erase instruction |
| NewGVN::run | 0x19F99A0 | 68 KB | NewGVN algorithm |
| ctor_201 | 0x4E0990 | 9 KB | GVN knob registration |
| hashExpression | 0x18FDEE0 | ~5 KB | Expression hash function |
| compareExpression | 0x18FB980 | ~2 KB | Expression equality test |
| lookupExpr5 | 0x18FEB70 | ~3 KB | 5-key store expression lookup |
| insertExpr5 | 0x18FFC60 | ~3 KB | 5-key insert with scoped undo |
| insertLeader | 0x18FEF10 | ~5 KB | Leader table insert |
| checkStoreSplit | 0x18FECC0 | ~3 KB | Store expression for splitting |
| canReplace | 0x18FBB40 | <1 KB | Dominance-based replacement check |
| preAvailCheck | 0x18FC460 | ~3 KB | PRE availability analysis |
| performPRE | 0x18FF290 | 10 KB | PRE insertion |
| largeGVNHelper | 0x18F6D00 | 60 KB | PRE / load forwarding helper |
| phiGVNHelper | 0x18FAA90 | 20 KB | PHI-related GVN helper |
| storeSplitHelper | 0x1906720 | 26 KB | Store splitting implementation |
| storeSplitVisit | 0x1905CD0 | 16 KB | Store-split worklist visitor |
| postGVNCleanup | 0x1908A00 | 10 KB | Post-GVN cleanup |
| gvnFinalCleanup | 0x190C3B0 | 8 KB | Final cleanup after GVN |
Expression Classification Bitmask
The bitmask 0x1F133FFE23FFFF classifies opcodes that are safe for value numbering (pure, side-effect-free). It appears eight times in the main function. Bit positions correspond to (opcode - 35), covering standard arithmetic, logical, comparison, and cast operations, plus NVIDIA-specific opcodes in the extended range.
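A minimal sketch of how such a classification mask is consulted, assuming the `(opcode - 35)` bit offset stated above; which concrete opcodes land on which bits is not claimed here, and the helper names are invented:

```cpp
#include <cassert>
#include <cstdint>

// The recovered 64-bit classification mask.
constexpr std::uint64_t kPureOpcodeMask = 0x1F133FFE23FFFFull;

// Membership test as the decompiled code appears to perform it:
// shift by (opcode - 35) and examine bit 0.
constexpr bool safeForValueNumbering(unsigned opcode) {
    if (opcode < 35 || opcode >= 35 + 64) return false;  // outside mask range
    return (kPureOpcodeMask >> (opcode - 35)) & 1u;
}

// Counting set bits gives the number of opcode slots marked pure.
constexpr int popcount64(std::uint64_t v) {
    int n = 0;
    for (; v; v &= v - 1) ++n;   // clear lowest set bit each step
    return n;
}
```

The mask has 40 bits set, i.e. 40 opcode slots in the 35..98 range are treated as side-effect-free and eligible for value numbering.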
Multi-Pass Data Flow: SROA / InstCombine / GVN / DSE
These four passes form the core scalar optimization chain in CICC's mid-pipeline. They execute in sequence (often multiple times through the pipeline), with each pass producing IR transformations that create opportunities for the next. The following diagram traces data flow through a single iteration of the chain, showing what each pass produces and what the next pass consumes.
SROA (Scalar Replacement of Aggregates)
========================================
Input: IR with aggregate alloca instructions (structs, arrays)
Example: %s = alloca %struct.float4 --> lives in .local memory (AS 5)
+--------------------------------------------------------------+
| Phase 1: Slice analysis |
| Walk all uses of each alloca, build byte-range slices |
| Group non-overlapping slices into partitions |
| |
| Phase 2: Partition splitting |
| Replace each partition with a scalar alloca or SSA value |
| Insert extractvalue/insertvalue for partial accesses |
| Defer trivially-promotable allocas to mem2reg |
| |
| Produces: |
| - Scalar SSA values replacing aggregate members |
| - Inserted bitcasts, trunc, zext for type mismatches |
| - Dead aggregate allocas (erased) |
| - GEP chains pointing at sub-fields (now redundant) |
+------------------------------+-------------------------------+
|
| Scalar SSA values with redundant
| casts, dead GEPs, identity ops
v
InstCombine (Instruction Combining)
========================================
Input: Post-SROA IR with redundant instructions
+--------------------------------------------------------------+
| 405KB visitor dispatches across 80 opcode cases: |
| |
| Consumes from SROA: |
| - Redundant bitcasts from type-punned accesses |
| - trunc(zext(x)) chains from width mismatches |
| - Dead GEP arithmetic (base + 0) |
| - Identity selects from conditional stores |
| |
| Canonicalization: |
| - Constant folding (sub_101E960) |
| - Algebraic identities: x+0, x*1, x&-1 (sub_F0F270) |
| - Strength reduction: x*2^n -> x<<n (sub_10BA120) |
| - Cast chain collapse: trunc(zext(x)) -> x or smaller |
| - NVIDIA intrinsic folding (sub_1169C30, 87KB) |
| - computeKnownBits propagation (sub_11A7600, 127KB) |
| |
| Produces: |
| - Canonical instruction forms (const on RHS, etc.) |
| - Simplified expressions (fewer instructions) |
| - Known-bits metadata on values |
| - Opportunities for value numbering (same expression |
| in different blocks now looks identical) |
+------------------------------+-------------------------------+
|
| Canonical IR with duplicate
| expressions across blocks
v
GVN (Global Value Numbering)
========================================
Input: Canonicalized IR from InstCombine
+--------------------------------------------------------------+
| Traverses dominator tree in RPO with scoped hash tables: |
| |
| Consumes from InstCombine: |
| - Canonical expression forms (enables hash-table matching) |
| - Known-bits info (used in SimplifyInstruction) |
| - Folded NVIDIA intrinsics (enables ldu/ldg CSE) |
| |
| Value numbering: |
| - Hash expression: (opcode, type, operands) -> leader |
| - Scoped tables: LeaderTable, StoreExprTable, LoadExprTable|
| - NVIDIA ldu/ldg CSE (intrinsics 4057, 4085, 4492, 4503) |
| |
| Load forwarding: |
| - Query MemoryDependenceResults for store->load forwarding |
| - Store splitting: float4 store -> scalar float load |
| (NVIDIA extension, controlled by split-stores knob) |
| |
| PRE (Partial Redundancy Elimination): |
| - Insert computations at merge points to enable CSE |
| - Load PRE across edges (enable-load-pre) |
| |
| Consumes from alias analysis: |
| - MemoryDependence results (which store feeds which load?) |
| - NVVM AA NoAlias answers for cross-address-space pairs |
| |
| Produces: |
| - Eliminated redundant computations (replaced with leader) |
| - Forwarded loads (replaced with stored value) |
| - Trivial PHIs (from leader substitution) |
| - Dead stores exposed (stored value is never loaded) |
+------------------------------+-------------------------------+
|
| IR with eliminated redundancies,
| forwarded loads, exposed dead stores
v
DSE (Dead Store Elimination)
========================================
Input: Post-GVN IR with dead stores exposed
+--------------------------------------------------------------+
| 91KB across three major functions: |
| |
| Consumes from GVN: |
| - Stores whose values were forwarded to loads (now dead) |
| - Stores to locations that GVN proved are overwritten |
| - Simplified store patterns from PRE insertion |
| |
| Consumes from alias analysis: |
| - MemorySSA graph (which stores are visible to which loads)|
| - NVVM AA NoAlias (cross-space stores never conflict) |
| - TBAA metadata (type-based aliasing for struct fields) |
| |
| Dead store detection: |
| - Complete overwrite: later store covers same location |
| - Partial overwrite: float4 store then float4 store with |
| overlapping range (72-byte hash table tracking) |
| - Store chain decomposition: aggregate stores decomposed |
| via GEP into element-level dead-store checks |
| |
| NVIDIA extensions: |
| - Partial store forwarding with type conversion |
| (float4 -> float via GEP + load extraction) |
| - Cross-store 6-element dependency records |
| - CUDA vector type-aware size computation |
| |
| Produces: |
| - Eliminated dead stores (fewer memory writes) |
| - Replacement loads for partial forwards |
| - Reduced memory traffic (critical for GPU bandwidth) |
+--------------------------------------------------------------+
Cross-pass data dependency table:
| Pass | Consumes from predecessor | Produces for successor |
|---|---|---|
| SROA | Aggregate allocas from frontend/inliner | Scalar SSA values, redundant casts/GEPs |
| InstCombine | Redundant casts, identity ops from SROA | Canonical expressions, known-bits metadata |
| GVN | Canonical forms from InstCombine, MemDep/AA results | Forwarded loads, eliminated redundancies, exposed dead stores |
| DSE | Dead stores exposed by GVN, MemorySSA/AA results | Eliminated stores, reduced memory traffic |
Why this ordering matters for GPU code: SROA is existential because un-promoted allocas become .local memory (200-400 cycle penalty). InstCombine must run before GVN because GVN's hash-table matching requires canonical expression forms -- without InstCombine, (a + 0) and a would hash differently and miss the CSE opportunity. GVN must run before DSE because GVN's load forwarding is what exposes dead stores: once GVN proves that a load reads a value already available as an SSA register, the store that was keeping that value alive becomes dead. DSE then removes it, reducing the memory write traffic that is the primary bandwidth bottleneck on GPU architectures.
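The InstCombine-before-GVN dependency can be made concrete with a toy expression hash. The scheme below is illustrative only (GVN's real hash covers opcode, type, and operands over LLVM IR); it shows why `a + 0` and `a` miss each other in the table until an identity fold canonicalizes them:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Toy expression node: opcode plus operand names.
struct Expr { std::string opcode; std::vector<std::string> operands; };

// Structural hash over (opcode, operands) -- equal structures hash equal.
std::size_t exprHash(const Expr& e) {
    std::size_t h = std::hash<std::string>{}(e.opcode);
    for (const auto& op : e.operands)
        h = h * 1099511628211ull ^ std::hash<std::string>{}(op);
    return h;
}

// InstCombine-style identity fold: x + 0 -> x.
Expr canonicalize(Expr e) {
    if (e.opcode == "add" && e.operands.size() == 2 && e.operands[1] == "0")
        return Expr{"value", {e.operands[0]}};
    return e;
}
```

Before canonicalization, `add a, 0` and the plain value `a` are structurally different and occupy different hash buckets; after the fold they are identical, so GVN's table lookup finds the leader.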
Optimization Level Behavior
| Level | Classic GVN | NewGVN | PRE | Store Splitting |
|---|---|---|---|---|
| O0 | Not run | Not run | N/A | N/A |
| Ofcmax | Not run | Not run | N/A | N/A |
| Ofcmid | Runs (1 instance) | Not run | Enabled (enable-pre=true) | Enabled (split-stores=true) |
| O1 | Runs (1-2 instances in Tier 0/1) | Not run | Enabled | Enabled |
| O2 | Runs (2-3 instances across Tier 0/1/2) | Not run | Enabled | Enabled |
| O3 | Runs (2-3 instances, most aggressive inlining exposes more CSE) | Not run | Enabled | Enabled |
GVN is a core mid-pipeline pass that runs at O1 and above. It appears multiple times in the pipeline -- typically once after CGSCC inlining and once in the late scalar cleanup. Each instance benefits from different preceding transformations (inlining, SROA, InstCombine). NewGVN is compiled into the binary but not scheduled in any standard pipeline tier. The enable-pre and enable-load-pre knobs are both true by default across all levels. See Optimization Levels for the complete tier structure.
Differences from Upstream LLVM
| Aspect | Upstream LLVM | CICC v13.0 |
|---|---|---|
| Store splitting | Not present; GVN handles stores only for forwarding | Three knobs (split-stores, no-split-stores-below, no-split-stores-above) enable splitting wide vector stores into sub-stores matching load granularity |
| NVIDIA intrinsic CSE | No awareness of nvvm.ldu, nvvm.ldg | Four NVIDIA intrinsic IDs (4057, 4085, 4492, 4503) with custom pointer operand extraction, enabling CSE of texture/global cache loads |
| Dominator cache | No caching; dominance queries are O(n * depth) | gvn-dom-cache (default true, size 32) caches recent dominates(A, B) results for O(1) repeated queries |
| PHI removal aggressiveness | Basic trivial PHI cleanup | Three-level enable-phi-remove knob (0=off, 1=trivial, 2=aggressive); 4-way unrolled use-scanning loop for PHI-heavy IR |
| Knob count | ~4 knobs (enable-pre, enable-load-pre, enable-split-backedge-in-load-pre, max-recurse-depth) | 11 knobs including store splitting limits, dominator caching, profuse diagnostics, and PHI removal depth |
| Diagnostic framework | Standard OptimizationRemark system | profusegvn knob (default true) uses NVIDIA's custom profuse diagnostic framework, not LLVM's ORE |
| NewGVN | Standard partition-based NewGVN | Same algorithm, ships alongside classic GVN at separate address; both carry NVIDIA modifications |
Diagnostic Strings
All diagnostic strings recovered from the binary. GVN uses NVIDIA's custom profuse diagnostic framework rather than LLVM's OptimizationRemark system.
| String | Source | Category | Trigger |
|---|---|---|---|
| "profuse for GVN" | 0x4FAE7E0 (profusegvn knob description) | Knob | Knob registration |
| "enable caching of dom tree nodes" | 0x4FAE700 (gvn-dom-cache knob description) | Knob | Knob registration |
| "Max recurse depth (default = 1000)" | 0x4FAE620 (max-recurse-depth knob description) | Knob | Knob registration |
| (profuse GVN diagnostic output) | sub_1909530 (~5 KB) | Debug | profusegvn knob enabled (default true); emits at value replacement, store/load match, and PRE insertion decisions |
| (PHI removal diagnostic output) | sub_19003A0 region | Debug | dump-phi-remove > 0; dumps which PHI nodes are being removed and why |
The profusegvn framework follows the same pattern as profuseinline -- it is a custom NVIDIA diagnostic channel likely controlled by environment variables such as CICC_PROFUSE_DIAGNOSTICS, not the standard LLVM OptimizationRemark / ORE system. The dump-phi-remove knob (default 0) separately enables diagnostic output during PHI removal.
Allocation Strategy
The 136-byte domtree nodes and 48-byte expression entries use sub_145CBF0 (BumpPtrAllocator) and sub_22077B0 (malloc wrapper). This careful memory management addresses the potentially large number of expressions produced by heavily unrolled GPU kernels.
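The BumpPtrAllocator pattern named above can be sketched as follows. Slab size, alignment, and the class shape are illustrative, not recovered from the binary; the point is that fixed-size nodes (136-byte domtree nodes, 48-byte expression entries) are carved out of large slabs with a single pointer bump, so the many expressions of a heavily unrolled kernel pay near-zero per-allocation cost:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical bump allocator: allocations only bump a cursor within
// the current slab; a new slab is appended when the current one fills.
// Assumes individual requests fit in one slab (true for small nodes).
class BumpAllocator {
    static constexpr std::size_t kSlab = 4096;
    std::vector<std::vector<unsigned char>> slabs_;
    std::size_t used_ = kSlab;                       // forces a slab on first alloc
public:
    void* allocate(std::size_t bytes, std::size_t align = 16) {
        used_ = (used_ + align - 1) & ~(align - 1);  // bump cursor to alignment
        if (slabs_.empty() || used_ + bytes > kSlab) {
            slabs_.emplace_back(kSlab);              // old slabs stay live until reset
            used_ = 0;
        }
        void* p = slabs_.back().data() + used_;
        used_ += bytes;
        return p;
    }
    std::size_t slabCount() const { return slabs_.size(); }
};
```

There is no per-object free: everything is released at once when the pass tears down, which matches the lifetime of GVN's expression tables.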
Test This
The following kernel contains redundant loads from the same global address. GVN should eliminate the second load by recognizing it has the same value number as the first.
__global__ void gvn_test(const float* __restrict__ in, float* __restrict__ out, int n) {
int tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid >= n) return;
float a = in[tid]; // first load
float b = a * 2.0f;
float c = in[tid]; // redundant -- same address, no intervening store
float d = c * 3.0f;
out[tid] = b + d;
}
What to look for in PTX:
- Only one `ld.global.f32` instruction for `in[tid]`, not two. GVN assigns the same value number to both loads (same pointer, no intervening aliasing store thanks to `__restrict__`) and replaces the second with the first's result.
- The arithmetic should reduce to something equivalent to `in[tid] * 5.0f`. After GVN eliminates the redundant load, InstCombine or the backend may simplify `a*2 + a*3` into `a*5`.
- Remove `__restrict__` and add an intervening store (`out[tid] = b;` between the two loads). Without `__restrict__`, GVN cannot prove the second load is redundant (the store to `out` might alias `in`), so both `ld.global.f32` instructions survive. This demonstrates how alias analysis feeds GVN.
- For store-to-load forwarding: insert `out[tid] = 42.0f;` followed by `float e = out[tid];`. GVN should replace the load with the constant `42.0f` -- no `ld.global` emitted for `e`.
JumpThreading
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
LLVM version note: Based on LLVM 20.0.0 `JumpThreading.cpp`. Evidence: the DFA JumpThreading variant (`dfa-jump-threading`), present as a separate pass, matches LLVM 14+; the `early-exit-heuristic` knob matches LLVM 16+. Core algorithm is unmodified; NVIDIA changes are configuration-level (adjusted thresholds, three pipeline positions, OCG disable flag).
CICC v13.0 ships LLVM's JumpThreadingPass at sub_2DC4260 (12,932 bytes, address range 0x2DC4260--0x2DC74E4). The pass duplicates basic blocks so that predecessors whose branch conditions can be statically resolved jump directly to the correct successor, eliminating a conditional branch from the critical path. On a GPU, this directly reduces warp divergence: a branch that was previously data-dependent becomes unconditional along each incoming edge, so the warp scheduler never needs to serialize the two paths.
The pass is fundamentally at odds with PTX's requirement for reducible control flow. Block duplication can create multi-entry loops (irreducible cycles) when the duplicated block is a loop header or when the threading target sits inside a loop whose header is not the threading source. CICC addresses this through three layered mechanisms -- loop header protection, conservative duplication thresholds, and a late-pipeline StructurizeCFG safety net -- that collectively keep the CFG reducible without sacrificing the pass's optimization value.
| Property | Value |
|---|---|
| Pass name (pipeline parser) | "jump-threading" |
| Pass class | llvm::JumpThreadingPass |
| Entry function | sub_2DC4260 |
| Binary size | 12,932 bytes |
| Stack frame | 0x748 (1,864) bytes |
| Block duplication helper | sub_2DC22F0 (2,797 bytes) |
| CFG finalization | sub_2DC30A0 (1,094 bytes) |
| Single-instruction threading | sub_2DC37C0 (2,288 bytes) |
| Select unfolding | sub_2DC40B0 (420 bytes) |
| Pipeline positions | Three invocations: ~position 234, ~278, and a late tier-3 position (~239) |
| NVVMPassOptions disable offset | +320 |
| Upstream LLVM source | lib/Transforms/Scalar/JumpThreading.cpp |
Why JumpThreading Matters on GPU
Consider a CUDA kernel containing:
if (threadIdx.x < threshold)
val = computeA();
else
val = computeB();
if (val > 0)
result = pathX(val);
else
result = pathY(val);
The second branch depends on val, which is a PHI of computeA() and computeB(). If JumpThreading can determine that computeA() always returns a positive value, it duplicates the second if block and wires the computeA predecessor directly to pathX. Threads that took the first branch path never execute the second conditional at all.
On a CPU this saves a branch misprediction. On a GPU the payoff is larger: eliminating the second branch prevents a second point of warp divergence. If both branches would diverge on different thread subsets, removing one cuts the total serialization overhead in half. The threads that took computeA proceed straight to pathX without waiting for the computeB threads to rejoin.
Knob Inventory
Six cl::opt globals control the pass, registered in ctor_456 at 0x544220:
| Knob | Default | Global | Description |
|---|---|---|---|
| jump-threading-threshold | 6 | qword_4FFDBA0 | Max instructions in a block eligible for duplication |
| jump-threading-implication-search-threshold | 3 | qword_4FFDAC0 | Max predecessors to search for condition implications |
| jump-threading-phi-threshold | 76 (0x4C) | qword_4FFD9E0 | Max PHI nodes in a block eligible for duplication |
| jump-threading-across-loop-headers | false | qword_4FFD900 | Allow threading across loop headers (testing only) |
| jump-threading-disable-select-unfolding | false | qword_4FFDC80 | Disable unfolding select instructions into branches |
| print-lvi-after-jump-threading | false | -- | Debug: print LazyValueInfo cache after pass completes |
The block-size threshold of 6 matches upstream LLVM. The PHI threshold of 76 is significantly higher than upstream's default (which is typically lower), reflecting GPU kernels' tendency toward wider PHI nodes due to predication and convergence patterns. The implication search depth of 3 is conservative, limiting compile-time cost from predecessor chain analysis in the typically shorter basic-block chains of GPU code.
Two Disable Flags
CICC registers two independent cl::opt flags that suppress jump threading behavior. They live in different subsystems and control different things:
| Flag | Registration | Subsystem | Effect |
|---|---|---|---|
| "disable-JumpThreadingPass" | ctor_637 @ 0x5934A7 | JumpThreading pass itself | Disables the standalone JumpThreadingPass invocations in the pipeline |
| "disable-jump-threading" | ctor_073 @ 0x49A91E (also ctor_243 @ 0x4ED0C0) | SimplifyCFG | Disables jump threading logic within SimplifyCFG -- the per-block branch-through-PHI threading that SimplifyCFG performs as part of its CFG simplification |
The "disable-jump-threading" flag carries the annotation "Disable jump threading for OCG experiments", where OCG is NVIDIA's Optimizing Code Generation research infrastructure. This is a SimplifyCFG option, not a JumpThreadingPass option -- SimplifyCFG has its own internal implementation of branch threading through PHI nodes that is separate from the standalone pass. NVIDIA engineers can disable either or both independently.
The "fold-with-var-cond" flag is registered alongside "disable-jump-threading" in the same SimplifyCFG constructor group, controlling a related NVIDIA-specific extension for folding branches with variance conditions.
Interaction with StructurizeCFG
The fundamental tension: JumpThreading duplicates blocks to bypass conditionals, which can transform a reducible loop into an irreducible cycle. PTX requires all loops to be natural (single-entry, reducible). An irreducible CFG causes StructurizeCFG to emit "UnsupportedIrreducibleCFG" and bail out, leaving the function in a state that ptxas will likely reject.
CICC addresses this through three layered mechanisms:
1. Loop Header Protection via LoopInfo
The jump-threading-across-loop-headers flag defaults to false. Before threading any block, the pass queries LoopInfo through a red-black tree lookup at 0x2DC4781 using dword_501D5A8 as the analysis key. If the target block is a loop header (the LoopInfo query returns a non-null loop containing the block as its header), the pass skips it entirely.
A parallel DominatorTree query at 0x2DC4839 (using dword_501D4C8) verifies loop membership and nesting depth. If the block is found within a loop, a threshold override is loaded from qword_501D628, replacing the standard duplication threshold with a loop-specific one. A second override from qword_501D548 applies to blocks found via the DominatorTree-based lookup.
This double check -- LoopInfo for header identification, DominatorTree for membership -- prevents the most common source of irreducibility: duplicating a loop header creates a second entry into the loop body.
2. Conservative Duplication Thresholds
The three thresholds (6 instructions, 3 predecessors, 76 PHIs) restrict duplication to small, simple blocks where the CFG outcome is highly predictable and the duplication cost is bounded. The limits are conjunctive -- a block must satisfy all three simultaneously: even a 6-instruction block with 4 predecessors would exceed the implication search depth and be rejected, while a 5-instruction block with 100 PHIs would exceed the PHI threshold.
3. StructurizeCFG Safety Net
StructurizeCFG (sub_35CC920) runs late in the pipeline, after all IR-level scalar and loop transforms. Its irreducibility detector (sub_35CA2C0) checks every back-edge: if the target does not dominate the source, the loop has multiple entries and is irreducible. If JumpThreading or any other pass creates an irreducible cycle that slipped past the loop header protection, StructurizeCFG will catch it.
This is defense-in-depth: the threading constraints prevent most irreducible cases, and structurization catches the rest. The design deliberately tolerates a small number of "false acceptances" at the JumpThreading level because the cost of occasionally running StructurizeCFG's rejection path is far lower than the cost of being too conservative and missing profitable threading opportunities.
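The back-edge test described above can be modeled on a toy CFG: a cycle-closing edge (src -> dst) is natural only if dst dominates src. The sketch below computes dominators with the simple iterative bit-set dataflow rather than LLVM's algorithm, and all block ids and the example graphs are invented:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using BitSet = std::uint32_t;   // one bit per block; sketch supports < 32 blocks

// Classic iterative dominator computation: dom(b) = {b} U (intersection
// of dom(p) over predecessors p), seeded with "everything" and refined
// to a fixpoint.
std::vector<BitSet> dominators(const std::vector<std::vector<int>>& preds, int entry) {
    int n = (int)preds.size();
    BitSet all = (1u << n) - 1;
    std::vector<BitSet> dom(n, all);
    dom[entry] = 1u << entry;
    for (bool changed = true; changed; ) {
        changed = false;
        for (int b = 0; b < n; ++b) {
            if (b == entry) continue;
            BitSet meet = all;
            for (int p : preds[b]) meet &= dom[p];
            BitSet nd = meet | (1u << b);
            if (nd != dom[b]) { dom[b] = nd; changed = true; }
        }
    }
    return dom;
}

// The irreducibility check: a cycle-closing edge src -> dst is a natural
// back-edge only if dst dominates src.
bool backEdgeIsNatural(const std::vector<BitSet>& dom, int src, int dst) {
    return (dom[src] >> dst) & 1u;
}
```

On a single-entry loop (0 -> 1 -> 2 -> 1) the edge 2 -> 1 passes the check; give the cycle a second entry (0 -> 2 as well) and the header no longer dominates the latch, which is exactly the shape StructurizeCFG rejects.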
Cost Model
The pass enforces a multi-level cost model that bounds total code growth per function.
Global Budget
At 0x2DC4887, the pass initializes a global instruction budget:
mov ebx, 200h ; 512 instructions total budget
Each block duplication charges the duplicated block's instruction count against this budget. The budget is tracked in var_460 and checked before each duplication. Once exhausted, no further threading occurs in that invocation regardless of how profitable individual candidates might be.
Per-Predecessor Cost Division
When threading involves multiple predecessors, the per-predecessor cost is the block instruction count divided by the number of predecessors being threaded, with ceiling rounding:
cost_per_pred = block_instr_count / num_predecessors
; ceiling via: sbb eax, -1 (adds 1 if remainder was nonzero)
This division at 0x2DC4D78--0x2DC4D8E means a 6-instruction block being threaded for 3 predecessors costs only 2 instructions per predecessor against the global budget. The logic recognizes that multi-predecessor threading amortizes the code growth across more eliminated branches.
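In C terms, the division plus the `sbb` adjustment is plain ceiling division. A sketch with hypothetical names (the binary works in registers, not a named helper):

```cpp
#include <cassert>

// Per-predecessor cost charged against the 512-instruction budget:
// block instruction count divided by threaded predecessors, rounded up
// (the sbb idiom adds 1 exactly when the remainder was nonzero).
unsigned costPerPred(unsigned blockInstrCount, unsigned numPreds) {
    unsigned q = blockInstrCount / numPreds;
    unsigned r = blockInstrCount % numPreds;
    return q + (r != 0);   // ceiling division
}
```

So a 6-instruction block threaded for 3 predecessors charges 2 per predecessor, while a 7-instruction block charges 3.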
Special Cases
- Single-instruction blocks (checked at `0x2DC4D94`): Always eligible, regardless of budget. A block containing only a terminator instruction costs nothing to duplicate.
- Empty blocks (checked at `0x2DC4D70`): Skipped entirely.
- Blocks with <=1 effective instructions (`0x2DC4BF1`): The comparison `cmp edx, 1; jbe` gates a fast path where the pass bypasses the full cost analysis.
LazyValueInfo Integration
The pass accepts a LazyValueInfo pointer as its third parameter (rdx). When non-null (checked at 0x2DC42BD), LVI provides range-based condition evaluation that enables threading even when the branch condition is not a simple constant comparison.
LVI State
The LVI cache occupies approximately 600 bytes (0x258) of local state:
| Field | Offset | Purpose |
|---|---|---|
| Cache structure | var_2F0 through var_98 | LVI range cache local state |
| Valid flag | var_C0 | Set to 1 when LVI is initialized |
| Cached ranges | var_B0 | SmallVector-like structure |
| Initial capacity | var_A8 | 8 entries |
Range-Based Threading
For ICMP_NE conditions (opcode 0xBA = 186), the pass calls sub_11F3070 (LVI::getPredicateAt) with the ICmp operand and a comparison predicate of 2, followed by sub_DFABC0 (evaluateConditionOnEdge) to resolve the branch direction along a specific incoming edge.
For alternate opcode paths (opcode 0x165 = 357), the pass uses sub_988330 (getConstantOnEdge) instead, which returns a concrete constant value if LVI can prove the condition evaluates to a known value along that edge.
The virtual dispatch at 0x2DC67D6 (call qword ptr [rax+78h]) invokes LVI::getPredicateOnEdge. If the vtable matches sub_920130 (the default implementation), a fallback path calls sub_AC4810 (isImpliedCondition) with predicate 0x27 (39), and if that also fails, sub_AA93C0 (SimplifyICmpInst).
Cleanup
On exit, if LVI was used, three cleanup calls occur:
- `sub_FFCE90` -- LVI::eraseBlock (invalidation)
- `sub_FFD870` -- LVI::clear
- `sub_FFBC40` -- LVI::releaseMemory
Main Algorithm
Outer Loop
The pass iterates over the function's basic block list via a linked-list traversal (BB->next chain at [BB+8]):
run(result_ptr, function, lvi_ptr, tli, ...):
if lvi_ptr != null:
initialize_lvi_cache(lvi_ptr)
budget = 512
changed = false
loop:
current_bb = function.entry_block // sub_B2BEC0
end = function + 0x48 // end sentinel
while current_bb != end:
if try_thread_block(current_bb, budget):
changed = true
current_bb = current_bb.next // [current_bb + 8]
if changed:
changed = false
goto loop // restart: threading may expose new opportunities
cleanup_lvi()
return results
The restart-on-change behavior means threading is iterative: eliminating one branch can expose a new statically-determinable branch downstream.
Per-Block Classification
For each basic block, the pass examines the terminator instruction:
- Opcode check (`0x2DC443E`): The instruction opcode byte is compared against `0x55` (85), which is LLVM's `BranchInst` opcode. Only conditional branches are considered.
- Metadata check (`0x2DC4449`--`0x2DC446E`): Two calls to `sub_A73ED0` check for metadata kinds `0x17` (23, `"prof"` branch weights) and `0x04` (debug). Then `sub_B49560` (hasMetadataOtherThanDebugLoc) is called on the branch instruction.
- Condition extraction (`0x2DC45F8`--`0x2DC4636`): `sub_981210` (getBranchCondition) returns a success flag and a condition code. Two condition codes are handled: `0x165` (357), likely `CmpInst::ICMP_EQ` or a switch opcode, and `0x0BA` (186), likely `CmpInst::ICMP_NE`. Other condition codes cause the block to be skipped.
- Operand analysis (`0x2DC465F`--`0x2DC467C`): The operand count is extracted (AND with `0x7FFFFFF` mask -- the use-count field in LLVM's `Value` layout). If the branch condition is an ICmp with a constant operand (type byte `0x11` = 17 = `ConstantInt`), threading is potentially profitable.
Condition-Specific Threading Paths
The pass contains four specialized threading strategies:
Constant-value threading (0x2DC66B7): When a predecessor can determine the branch outcome via a constant PHI incoming value, the simplest path. Creates a direct unconditional branch.
Single-instruction threading (sub_2DC37C0, 2,288 bytes): For blocks containing exactly one instruction (the terminator), called at 0x2DC6704. Creates a direct branch bypass.
Switch threading (0x2DC6A76--0x2DC6B0C): When the terminator is a SwitchInst (opcode byte 0x37 = 55), calls sub_2DC40B0 (tryToUnfoldSelect). This checks for SelectInst (opcode 0x52 = 82) and unfolds the select into explicit branches that can be individually threaded.
Implication-based threading (0x2DC6E71--0x2DC6EB3): For ICmpInst variants (opcode 0x28 = 40), the pass checks whether the predicate implies the branch condition via sub_B532B0, creates the threaded edge via sub_B52EF0, and wires the new block via sub_92B530.
All-Ones Constant Detection
Four sites (0x2DC71B0, 0x2DC71CA, 0x2DC7380, 0x2DC74DA) check for all-ones constants as PHI incoming values:
or rax, -1 ; create all-ones mask
shr rax, cl ; cl = 64 - bitwidth, shift to match width
cmp [rdx+18h], rax ; compare against actual constant value
setz al ; true if constant is all-ones
For an i1 type, all-ones means true. This handles the common pattern where a PHI incoming value from one predecessor is the constant true (all bits set), allowing the pass to resolve the branch direction for that predecessor.
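The assembly idiom corresponds to this C check: build a mask of `bitwidth` ones (the `or rax, -1` / `shr rax, cl` pair with `cl = 64 - bitwidth`) and compare it against the constant's value. The helper name is invented:

```cpp
#include <cassert>
#include <cstdint>

// True iff `value` is the all-ones constant at the given bit width.
// Widths above 64 are rejected, matching the bail-out on wide integers
// described in the PHI operand iteration below.
bool isAllOnes(std::uint64_t value, unsigned bitwidth) {
    if (bitwidth < 1 || bitwidth > 64) return false;
    std::uint64_t mask = ~0ull >> (64 - bitwidth);  // bitwidth low bits set
    return value == mask;
}
```

For `i1`, the all-ones value 1 is the constant `true`, which is exactly the case the four detection sites are looking for.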
PHI Operand Iteration
Two nearly identical loops at 0x2DC7206--0x2DC726E and 0x2DC7456--0x2DC74CD iterate PHI operands to determine if all incoming values from relevant predecessors resolve to the same constant:
for pred_idx in range(phi.num_operands): // var_668
incoming = phi.getIncomingValueForBlock(pred) // sub_AD69F0
type_tag = incoming.type_byte
if type_tag == 0x0D: // ConstantInt::getTrue()
continue
if type_tag == 0x11: // ConstantInt with bitwidth check
if bitwidth <= 64:
if value == all_ones_for_width:
continue // resolves to true
else:
skip // wide integers, bail out
// If any incoming value is non-constant, threading is unprofitable
bail_out()
If every relevant predecessor provides the same constant value, the branch direction is fully determined and threading proceeds.
Created Block Names
When threading occurs, the pass creates new basic blocks with diagnostic names:
| Name | String address | Purpose |
|---|---|---|
| "endblock" | 0x42E9094 | Terminal block of the threaded path; created via sub_F36990 (SplitBlockAndInsertIfThen) |
| "phi.res" | 0x42E90C0 | PHI resolution node for merged values; created via sub_D5C860 (PHINode::Create) |
| "res_block" | 0x42E909D | Result block for the threaded path; allocated as 0x50-byte BasicBlock via sub_22077B0 |
| "loadbb" | 0x42E90B9 | Load basic block for load-bearing threading; created in a loop at 0x2DC4F05--0x2DC4FFB |
| "phi.src1" | 0x42E90A7 | First PHI source block |
| "phi.src2" | 0x42E90B0 | Second PHI source block |
The "loadbb" blocks are created in a dynamic loop for multi-way threading, where each iteration allocates a 0x50-byte (sizeof(BasicBlock)) object and wires it into the CFG via sub_AA4D50 (BasicBlock::insertInto).
Block Duplication Engine: sub_2DC22F0
The 2,797-byte helper performs actual block cloning. Parameters:
| Register | Role |
|---|---|
| rdi | Duplication context structure (at var_490) |
| rsi | Source block's value table |
| rdx | Destination hash table |
| rcx | PHI operand map |
| r8d | Instruction count for the source block |
The cloning process:
- Clone each instruction from the source block
- Insert cloned instructions into use-def chains (0x2DC59A1--0x2DC59E7: linked-list surgery on LLVM's Value use-list)
- Update PHI operands to reference the new predecessor (0x2DC5E1E onward)
- Update branch targets in the predecessor blocks
CFG Finalization: sub_2DC30A0
The 1,094-byte helper, called at 0x2DC5015 and 0x2DC6408 after threading completes for a block, performs:
- Successor edge updates
- Dead block elimination for blocks made unreachable by the threading
- DominatorTree updates if available (via sub_FFB3D0, DominatorTree::changeImmediateDominator)
Pipeline Positions
JumpThreading appears three times in the CICC pipeline, at different stages with different surrounding context:
| Position | Pipeline context | Parameter | Purpose |
|---|---|---|---|
| ~234 | After ADCE, within the main function simplification loop | sub_198DF00(-1) | First opportunity: thread branches exposed by dead code elimination |
| ~278 | After NVVMPeephole2 and optionally GVN, in the NVIDIA-specific tier-2 sequence | sub_198DF00(-1) | Second opportunity: thread branches exposed by value numbering and peephole |
| Late tier-3 | Within the ADCE/MemCpyOpt/DSE sequence | sub_198DF00(t) | Final opportunity: catch any remaining threadable branches before StructurizeCFG |
The sub_198DF00 function is the combined CorrelatedValuePropagation/JumpThreading registration wrapper. The -1 parameter likely selects the default mode; the t parameter in the third position may be an optimization-level-dependent configuration.
All three positions are conditional on NVVMPassOptions offset +320 not being set to disable. Each invocation resets the 512-instruction global budget, so the total code growth across all three invocations can reach up to 1,536 instructions per function.
DFA JumpThreading
A separate DFA-based JumpThreading variant exists at sub_276AF50, registered as "dfa-jump-threading" (llvm::DFAJumpThreadingPass). This pass is controlled by:
| Knob | Registration | Description |
|---|---|---|
| enable-dfa-jump-thread | ctor_445 @ 0x53F5C0 | Enable/disable the DFA variant |
| dfa-jump-view-cfg-before | ctor_445 | Debug: dump the CFG before DFA threading |
| dfa-early-exit-heuristic | ctor_445 | Early-exit heuristic to bound compile time |
DFA JumpThreading handles state-machine patterns (switch statements in loops with predictable transitions between cases) that the standard JumpThreading cannot resolve. It is a separate pass with its own pipeline registration and does not share the budget or thresholds of the standard JumpThreading pass.
Before/After IR Example
Consider a kernel with a two-branch diamond:
Before JumpThreading:
entry:
%cond1 = icmp sgt i32 %x, 0
br i1 %cond1, label %positive, label %negative
positive:
%a = call i32 @computeA()
br label %merge
negative:
%b = call i32 @computeB()
br label %merge
merge:
%val = phi i32 [ %a, %positive ], [ %b, %negative ]
%cond2 = icmp eq i32 %val, 42
br i1 %cond2, label %match, label %nomatch
match:
...
nomatch:
...
If LVI can prove that computeA() always returns 42 (e.g., it is a known constant), JumpThreading duplicates the merge block for the %positive predecessor:
After JumpThreading:
entry:
%cond1 = icmp sgt i32 %x, 0
br i1 %cond1, label %positive, label %negative
positive:
%a = call i32 @computeA()
br label %match ; threaded: skip %merge entirely
negative:
%b = call i32 @computeB()
br label %merge
merge: ; now has only one predecessor
%val = phi i32 [ %b, %negative ]
%cond2 = icmp eq i32 %val, 42
br i1 %cond2, label %match, label %nomatch
match:
...
nomatch:
...
The %positive path no longer passes through merge; the second branch is eliminated for any execution that takes the first branch's true edge.
Differences from Upstream LLVM
| Aspect | CICC v13.0 | Upstream LLVM 20 |
|---|---|---|
| PHI threshold default | 76 | Lower (typically ~32 or similar) |
| disable-jump-threading in SimplifyCFG | Present, annotated for OCG experiments | Present (standard LLVM flag) |
| Annotation | "Disable jump threading for OCG experiments" | No OCG reference |
| Pipeline invocations | Three positions, combined with CVP via sub_198DF00 | Typically two (early and late in the function simplification pipeline) |
| NVVMPassOptions disable | Offset +320 | N/A |
| Loop header override thresholds | qword_501D628, qword_501D548 | Standard LoopInfo check only |
| fold-with-var-cond | NVIDIA-specific SimplifyCFG companion flag | Not present |
The core algorithm is unmodified from upstream. NVIDIA's changes are configuration-level: adjusted thresholds, additional pipeline positions, the OCG disable flag, and integration with the NVVMPassOptions system.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| JumpThreadingPass::run (main pass body) | sub_2DC4260 | 12,932 bytes | -- |
| Block cloning engine (duplicateBlock) | sub_2DC22F0 | 2,797 bytes | -- |
| CFG finalization after threading | sub_2DC30A0 | 1,094 bytes | -- |
| Single-instruction threading | sub_2DC37C0 | 2,288 bytes | -- |
| tryToUnfoldSelect | sub_2DC40B0 | 420 bytes | -- |
| SmallVector append/copy for instruction map | sub_2DC1F40 | 349 bytes | -- |
| LVI::getPredicateAt | sub_11F3070 | -- | -- |
| evaluateConditionOnEdge | sub_DFABC0 | -- | -- |
| getConstantOnEdge | sub_988330 | -- | -- |
| isImpliedCondition | sub_AC4810 | -- | -- |
| SimplifyICmpInst | sub_AA93C0 | -- | -- |
| getBranchCondition | sub_981210 | -- | -- |
| BranchInst::getCondition | sub_B43CB0 | -- | -- |
| BranchInst::Create (conditional) | sub_B4C9A0 | -- | -- |
| BranchInst::Create (unconditional) | sub_B4C8F0 | -- | -- |
| PHINode::addIncoming | sub_B99FD0 | -- | -- |
| PHINode::Create | sub_D5C860 | -- | -- |
| SplitBlockAndInsertIfThen | sub_F36990 | -- | -- |
| BasicBlock::getContext | sub_BD5C60 | -- | -- |
| operator new(0x50) (allocate BasicBlock) | sub_22077B0 | -- | -- |
| BasicBlock::insertInto | sub_AA4D50 | -- | -- |
| Value::replaceAllUsesWith | sub_BD84D0 | -- | -- |
| Instruction::eraseFromParent | sub_B43D60 | -- | -- |
| DominatorTree::changeImmediateDominator | sub_FFB3D0 | -- | -- |
| PHINode::getIncomingValueForBlock | sub_AD69F0 | -- | -- |
| LoopInfo pass lookup | sub_C959E0 | -- | -- |
| Predicate implies branch check | sub_B532B0 | -- | -- |
| ConstantExpr::getICmp or create threaded edge | sub_B52EF0 | -- | -- |
| CloneBasicBlock or wire new block | sub_92B530 | -- | -- |
| CloneBasicBlock (alternate path) | sub_929DE0 | -- | -- |
Cross-References
- StructurizeCFG -- the late-pipeline safety net that catches irreducible CFG created by threading or other passes
- Scalar Passes Hub -- hub page linking SROA, EarlyCSE, and JumpThreading with GPU-context summaries
- GVN -- runs between JumpThreading invocations in the tier-2 sequence; can expose new threadable branches
- Pipeline & Ordering -- tier-dependent scheduling of all three invocations
- Knobs -- master knob inventory including all six JumpThreading knobs
LICM (Loop-Invariant Code Motion)
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
Loop-Invariant Code Motion in cicc v13.0 operates at three distinct levels: an IR-level pass ("licm", backed by MemorySSA), a pre-RA machine pass ("early-machinelicm"), and a post-RA machine pass ("machinelicm"). The IR-level pass runs in two modes within the same pipeline: a hoist invocation early in the optimization sequence that pulls invariant computations and loads out of loops into preheaders, and a sink invocation via LoopSinkPass (or implicit re-processing) later that pushes unprofitable hoists back into cold loop blocks.

On a CPU, hoisting is almost universally profitable because the preheader executes once per loop entry rather than once per iteration. On a GPU, the calculus is different: every value hoisted into the preheader extends its live range across the entire loop body, consuming a register for all iterations. If that extra register pushes the kernel past an occupancy cliff -- the threshold where the SM can fit one fewer warp -- the net effect is a slowdown, not a speedup.

NVIDIA addresses this tension through the interplay of the two invocations, the NVVM alias analysis pipeline that makes cross-address-space loads trivially hoistable, and the downstream rematerialization passes that can undo hoists that turned out to be unprofitable after register allocation.
Key Facts
| Property | Value |
|---|---|
| IR pass name | "licm" (new PM), "LICMPass" (legacy) |
| IR pass factory | sub_195E880(0) -- creates LICM with AllowSpeculation=false |
| IR pass factory (alt) | sub_184CD60() -- creates LICM (also identified as ConstantMerge in some sweeps; identity ambiguous -- see Analysis Notes) |
| Machine pass (pre-RA) | "early-machinelicm" / EarlyMachineLICMPass |
| Machine pass (post-RA) | "machinelicm" / MachineLICMPass |
| Knob registration | ctor_457_0 at 0x544C40 (18,398 bytes -- 11 knobs) |
| MachineLICM knob registration | ctor_305 (4 knobs) |
| Disable flag | -disable-LICMPass via -Xcicc |
| PassOptions disable | -opt "-do-licm=0" (also forced by --emit-optix-ir) |
| NVVMPassOptions slot | opts[1240] (disable), opts[2880] (enable, reversed logic) |
| Upstream LLVM source | llvm/lib/Transforms/Scalar/LICM.cpp, llvm/lib/CodeGen/MachineLICM.cpp |
Pipeline Positions
LICM appears at multiple pipeline positions depending on the optimization tier and compilation mode. The pass uses two distinct factory functions, and because the binary is stripped, it is uncertain in some cases which factory is definitively LICM rather than another pass. The following table lists all confirmed appearances.
IR-Level LICM
| Position | Call site | Factory | Guard condition | Context |
|---|---|---|---|---|
| O1 baseline, position 12 | sub_12DE330 | sub_184CD60() | none | After LoopRotate, before IndVarSimplify. First hoist invocation. |
| Main optimizer, mid-pipeline | sub_12DE8F0 | sub_195E880(0) | !opts[1240] | Guarded by the LICM disable flag. Runs after DCE and before NVVMLowerBarriers. |
| Main optimizer, late | sub_12DE8F0 | sub_195E880(0) | opts[2880] && !opts[1240] | Second invocation, guarded by both enable and disable flags. Runs after ADCE, before LoopUnroll. |
| Extended pipeline | sub_12E54A0 | sub_195E880(0) | opts[2880] && !opts[1240] | After NVVMLowerBarriers, before LoopUnroll. |
| Late pipeline | sub_12E54A0 | sub_195E880(0) | !opts[1240] | After LoopIdiomRecognize and LoopSimplify, before SimplifyCFG. Late cleanup invocation. |
| Aggressive (O3, "mid" path) | sub_12E54A0 | sub_184CD60() | none | Position 1 and position 18 of the aggressive pipeline. Second invocation follows GVN. |
Machine-Level LICM
| Position | Pass | Guard | Context |
|---|---|---|---|
| Pre-RA | early-machinelicm | enable-mlicm | After EarlyTailDuplicate, before MachineCSE. Controlled by the NVPTX target. |
| Post-RA | machinelicm | !disable-postra-machine-licm | After ExpandPostRAPseudos, before post-RA MachineSink. |
Algorithm
IR-Level: Hoist Mode
LICM's hoist mode is the upstream LLVM 20.0.0 algorithm with no visible NVIDIA patches to the core logic. The NVIDIA delta is entirely in the analysis results that LICM consumes (NVVM AA, MemorySSA precision, convergent-call handling) and in the pipeline orchestration (multiple invocations, register-pressure-aware sink mode).
The algorithm processes each loop from innermost to outermost:
for each loop L in post-order (innermost first):
preheader = L.getLoopPreheader()
if preheader is null: skip
// 1. Collect candidates
for each basic block BB in L:
for each instruction I in BB:
if isLoopInvariant(I, L) and isSafeToHoist(I, L):
candidates.push(I)
// 2. Hoist each candidate
for I in candidates:
if I is a load:
// Query MemorySSA walker for clobbering stores
clobber = MSSA.getClobberingMemoryAccess(I)
if clobber is outside L:
hoist(I, preheader)
else if I is a pure computation (no side effects):
hoist(I, preheader)
else if I is a store and hoist-const-stores is enabled:
if store address is loop-invariant and
no other store in L aliases this address:
hoist(I, preheader)
The isLoopInvariant check verifies that all operands of the instruction are either defined outside the loop or are themselves loop-invariant. The isSafeToHoist check queries MemorySSA to determine whether the instruction's memory behavior is loop-invariant -- for loads, this means no store inside the loop may alias the load's address.
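The operand recursion behind isLoopInvariant can be sketched with a toy model. Names and the dict-based representation are ours, not CICC's; `loop_insts` is the set of instructions inside the loop, and `operands` maps each instruction to the instructions defining its operands:

```python
def is_loop_invariant(inst, loop_insts, operands, visiting=None):
    """Doc's rule: invariant iff every operand is defined outside the
    loop or is itself invariant. A cycle through in-loop definitions
    (e.g. a loop-header PHI) is never invariant."""
    visiting = visiting or set()
    if inst in visiting:
        return False                       # hit a cycle: not invariant
    visiting = visiting | {inst}
    return all(
        op not in loop_insts               # defined outside the loop
        or is_loop_invariant(op, loop_insts, operands, visiting)
        for op in operands.get(inst, ())
    )

# %mul = mul %a, %b with %a, %b defined before the loop: invariant.
assert is_loop_invariant("mul", {"mul"}, {"mul": ["a", "b"]})
# A header PHI feeding an add that feeds the PHI: a cycle, not invariant.
assert not is_loop_invariant("phi", {"phi", "add"},
                             {"phi": ["add"], "add": ["phi"]})
```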
MemorySSA walker interaction. When LICM calls getClobberingMemoryAccess(load_in_loop), the MemorySSA walker walks upward from the load's MemoryUse through the MemorySSA graph. If the walk reaches the loop's entry MemoryPhi without encountering a MemoryDef that may-alias the load, the load is hoistable. The walk is bounded by licm-mssa-optimization-cap to prevent compile-time explosion on functions with dense memory SSA graphs.
The licm-mssa-max-acc-promotion knob limits how many MemoryAccesses LICM will attempt to promote (scalar-replace loads from loop-invariant addresses with SSA values held in registers across iterations). This is the LICM variant of store-to-load forwarding within a loop.
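The bounded clobber walk can be illustrated over a toy MemorySSA chain. This is an assumption-laden sketch (dict-based accesses, a `may_alias` callback, a `cap` parameter modeling licm-mssa-optimization-cap), not the recovered implementation:

```python
def get_clobbering_access(load, entry_phi, may_alias, cap=100):
    """Walk upward from the load's defining access toward the loop-entry
    MemoryPhi. Return the first may-aliasing MemoryDef, or entry_phi if
    the walk is clean (the load is hoistable). When the cap is
    exhausted, conservatively report the current access as a clobber."""
    access = load["defining"]
    steps = 0
    while access is not entry_phi:
        if steps >= cap or may_alias(access, load):
            return access
        access = access["defining"]
        steps += 1
    return entry_phi

# One in-loop store between the load and the entry phi:
entry_phi = {"defining": None}
store_def = {"defining": entry_phi}
load = {"defining": store_def}

# NVVM AA says NoAlias (e.g. shared store vs. global load): hoistable.
assert get_clobbering_access(load, entry_phi, lambda a, b: False) is entry_phi
# Same address space, may-alias: the store clobbers, no hoist.
assert get_clobbering_access(load, entry_phi, lambda a, b: True) is store_def
```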
IR-Level: Sink Mode
The LoopSink pass ("loop-sink", registered at pipeline parser entry 271) is the inverse of hoist mode. It runs late in the pipeline and pushes instructions that were hoisted to the preheader back into the loop body, specifically into cold blocks that execute infrequently relative to the loop header.
The decision to sink is driven by block frequency analysis:
for each instruction I in preheader:
if I has uses only in cold blocks of the loop:
coldest_block = argmin(blockFreq(B) for B where I is used in B)
if blockFreq(preheader) / blockFreq(coldest_block) > threshold:
sink(I, coldest_block)
On GPUs, the sink mode is particularly important because:
- Occupancy recovery. A hoist that added one live register at the preheader may have pushed the kernel from 8 to 7 warps per SM. Sinking that value back undoes the damage.
- Divergent control flow. If the hoisted value is only used in a branch taken by some threads (divergent execution), hoisting forces all threads to compute it. Sinking limits the computation to the threads that actually take the branch.
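The frequency comparison in the sink pseudocode above can be made concrete. A sketch with an illustrative threshold (the real default is a cl::opt value not documented here):

```python
def choose_sink_block(preheader_freq, use_block_freqs, threshold=2.0):
    """Return the coldest loop block using the value if sinking there is
    profitable, else None. Sinking pays off when the preheader executes
    `threshold`x more often than the coldest use block -- e.g. a value
    used only on a rarely taken in-loop branch."""
    coldest_block = min(use_block_freqs, key=use_block_freqs.get)
    if preheader_freq / use_block_freqs[coldest_block] > threshold:
        return coldest_block
    return None

# Hot preheader (relative freq 100), value used only in a cold guard
# block (freq 10): sink it into the guard block.
assert choose_sink_block(100, {"guard": 10, "body": 90}) == "guard"
# Value used in the hot body itself: keep the hoist.
assert choose_sink_block(100, {"body": 90}) is None
```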
Machine-Level: MachineLICM
MachineLICM operates on MachineInstr after instruction selection. The pre-RA variant (early-machinelicm) is gated by the enable-mlicm knob, which is controlled by the NVPTX target. The post-RA variant (machinelicm) runs unconditionally unless disable-postra-machine-licm is set.
The machine-level algorithm differs from the IR level in that it has concrete register pressure information:
for each machine loop ML (innermost first):
preheader = ML.getLoopPreheader()
for each MachineInstr MI in ML:
if isLoopInvariant(MI) and isSafeToHoist(MI):
// Compute pressure impact
pressure_delta = estimatePressureIncrease(MI, preheader)
if sink-insts-to-avoid-spills and
pressure_delta would cause spills:
skip MI // Do not hoist
else:
hoist(MI, preheader)
The sink-insts-to-avoid-spills knob (registered at ctor_305) is the critical GPU-specific control: it tells MachineLICM to abandon a hoist when the resulting register pressure in the preheader would exceed the spill threshold. This directly prevents the occupancy-cliff problem at the machine level.
GPU-Specific Considerations
Register Pressure and Occupancy Cliffs
Each SM's register file is shared among all resident warps, creating discrete occupancy cliffs where a single additional register per thread can drop maximum occupancy by an entire warp group.
Hoisting one additional value into the preheader extends its live range across the entire loop body, increasing peak register pressure by one. If that increase crosses an occupancy cliff boundary, the kernel loses an entire warp's worth of parallelism per SM. This is why cicc invokes LICM early (to expose optimization opportunities for GVN, DSE, and InstCombine) and then relies on the downstream rematerialization infrastructure to undo hoists that became unprofitable after the register allocator made its decisions.
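The cliff arithmetic is easy to illustrate. A sketch assuming a 64K-entry register file, 32-thread warps, and a 48-warp hardware limit (common SM parameters; per-warp allocation granularity is simplified away):

```python
def max_warps_per_sm(regs_per_thread, regfile=65536, warp_size=32,
                     hw_max_warps=48):
    """Warps resident on one SM, as limited by the register file."""
    regs_per_warp = regs_per_thread * warp_size
    return min(regfile // regs_per_warp, hw_max_warps)

# One extra live register per thread can cost a whole warp of occupancy:
assert max_warps_per_sm(42) == 48   # at the cliff edge
assert max_warps_per_sm(43) == 47   # one hoisted value past the cliff
```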
NVVM AA and Cross-Address-Space Independence
The single most impactful NVIDIA-specific behavior in LICM is not a patch to LICM itself but the NVVM alias analysis (nvptx-aa) that feeds into MemorySSA. When LICM queries whether a load from addrspace(1) (global memory) is clobbered by a store to addrspace(3) (shared memory), NVVM AA returns NoAlias immediately. This means:
- A load from global memory inside a loop is trivially hoistable past any number of shared memory stores.
- A shared memory load is hoistable past global stores.
- Only stores to the same address space (or to addrspace(0) / generic) prevent hoisting.
This dramatically increases the set of hoistable instructions compared to a flat-memory architecture. Without NVVM AA, a conservative alias analysis would assume any store could clobber any load, making most loads inside GPU kernels non-hoistable.
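The rule reduces to a few lines. A sketch of the pairwise decision (address space numbers follow the NVVM convention quoted above: 0 = generic, 1 = global, 3 = shared; the function name is ours):

```python
GENERIC, GLOBAL, SHARED = 0, 1, 3

def may_alias(as_a: int, as_b: int) -> bool:
    """Two accesses may alias only within the same address space, or
    when either side is generic (a generic pointer can target any
    memory space)."""
    if as_a == GENERIC or as_b == GENERIC:
        return True
    return as_a == as_b

# Global load vs. shared store: NoAlias, so the load is hoistable.
assert not may_alias(GLOBAL, SHARED)
# Generic accesses conservatively alias everything.
assert may_alias(GENERIC, SHARED)
assert may_alias(GLOBAL, GLOBAL)
```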
Barrier-Aware Motion Constraints
CUDA __syncthreads() barriers are lowered to llvm.nvvm.barrier0 intrinsic calls, which are marked convergent and have memory side effects on shared memory. The convergent attribute prevents LICM from hoisting any instruction that depends (directly or transitively through the call graph) on a convergent call. The memory side effect on the barrier prevents hoisting loads across it even when the load does not depend on the barrier's value, because the barrier's MemoryDef in MemorySSA clobbers all shared-memory accesses.
This means LICM correctly refuses to hoist a shared memory load from below a __syncthreads() to above it -- doing so would read a value that the barrier was supposed to synchronize.
The NVVMLowerBarriers pass (sub_1C98160) runs between LICM invocations in the pipeline. Its position matters: barriers are still at the intrinsic level during the first LICM invocation, providing the convergent/memory-effect constraint. After lowering, the barrier semantics are encoded differently, which could affect what a later LICM invocation can move.
Interaction with Downstream Passes
LICM's hoist decisions feed into several downstream passes that can undo or refine them:
- Rematerialization (nvvmrematerialize, nv-remat-block): If hoisting increased register pressure past the target, the rematerialization pass will clone the hoisted instruction back to each use site, effectively undoing the hoist while keeping the optimization benefits at the IR level. See Rematerialization.
- Sinking2 (sub_1CC60B0): NVIDIA's custom sinking pass runs after LICM and can push instructions back toward their uses. The rp-aware-sink and max-uses-for-sinking knobs control whether the sink considers register pressure impact. See Sinking2.
- Base Address Strength Reduction: Hoisted address computations are candidates for strength reduction. The sub_1C51340 function checks whether a base address is loop-invariant, which is trivially true after LICM has hoisted it.
Configuration
IR-Level LICM Knobs (ctor_457_0 at 0x544C40)
These are standard LLVM knobs present in the cicc binary. No NVIDIA-specific knobs were found in the IR-level LICM registration.
| Knob | Type | Default | Effect |
|---|---|---|---|
| disable-licm-promotion | bool | false | Disable scalar promotion of memory locations (store-to-load forwarding within loops). When set, LICM will not replace repeated loads from a loop-invariant address with a register-held value. |
| licm-control-flow-hoisting | bool | false | Enable hoisting of instructions with control-flow-dependent execution. When disabled, only instructions that dominate the loop latch can be hoisted. |
| licm-force-thread-model-single | bool | false | Override the thread model to single-threaded, allowing LICM to hoist atomic operations. Not useful on GPU. |
| licm-max-num-uses-traversed | int | 8 | Maximum number of uses to traverse when checking whether all uses of a hoisted value are inside the loop. Limits compile time on values with many uses. |
| licm-max-num-fp-reassociations | int | (default) | Maximum FP reassociation chains LICM will attempt to hoist as a group. |
| licm-hoist-bo-association-user-limit | int | (default) | User count limit for binary operator association hoisting. |
| licm-skip-unrolled-loops | bool | false | Skip LICM on loops that have been unrolled (identified by metadata). Avoids re-hoisting values that were deliberately placed by the unroller. |
| licm-insn-limit | int | (default) | Maximum number of instructions LICM will process per loop. Compile-time safety valve. |
| licm-max-num-int-reassociations | int | (default) | Maximum integer reassociation chains for group hoisting. |
| licm-mssa-optimization-cap | int | (default) | Maximum number of MemorySSA accesses the walker will visit per query. Prevents pathological compile times on functions with dense memory access patterns. |
| licm-mssa-max-acc-promotion | int | (default) | Maximum number of MemoryAccesses LICM will attempt to promote (scalar-replace) per loop. |
IR-Level LICM Pipeline Parameters
The pass text-pipeline parser accepts two parameters for the "licm" pass:
| Parameter | Effect |
|---|---|
| allowspeculation | Allow speculative execution of hoisted instructions (loads that might trap). |
| conservative-calls | Use conservative call analysis -- treat all calls as potentially clobbering. |
The factory function sub_195E880(0) creates LICM with AllowSpeculation=false, which is the safe default for GPU code where speculative loads from unmapped memory would fault the entire kernel.
Machine-Level MachineLICM Knobs (ctor_305)
| Knob | Type | Default | Effect |
|---|---|---|---|
| avoid-speculation | bool | (default) | Avoid hoisting instructions that could speculatively execute and trap. |
| hoist-cheap-insts | bool | (default) | Hoist instructions with very low cost even when register pressure is high. |
| sink-insts-to-avoid-spills | bool | (default) | Critical GPU knob. When enabled, MachineLICM will sink (not hoist) instructions when hoisting would increase register pressure past the spill threshold. This directly trades code motion for spill avoidance. |
| hoist-const-stores | bool | (default) | Hoist stores of constant values out of loops. Enabled at the NVIDIA sinking/code-motion category level. |
NVPTX Target Gating Knobs
| Knob | Type | Default | Effect |
|---|---|---|---|
| enable-mlicm | bool | opt-level dependent | Master enable for pre-RA EarlyMachineLICM on NVPTX. |
| disable-machine-licm | bool | false | Disable pre-RA MachineLICM (stock LLVM knob). |
| disable-postra-machine-licm | bool | false | Disable post-RA MachineLICM (stock LLVM knob). |
Global Pipeline Controls
| Control | Mechanism | Effect |
|---|---|---|
| do-licm=0 | PassOptions (-opt flag) | Disables IR-level LICM entirely. Automatically set by --emit-optix-ir. |
| disable-LICMPass | -Xcicc flag | Disables IR-level LICM via the pass-disable mechanism. |
| opts[1240] | NVVMPassOptions bit | Per-invocation disable flag for IR LICM. |
| opts[2880] | NVVMPassOptions bit | Per-invocation enable flag for IR LICM (reversed logic). |
Diagnostic Strings
The IR-level LICM pass emits optimization remarks via the standard LLVM remark infrastructure. The following remark identifiers are present in upstream LLVM 20 and apply unchanged in cicc:
| Remark | Condition |
|---|---|
| "hoisted" | Instruction was successfully hoisted to preheader. |
| "sunk" | Instruction was sunk from preheader into a loop block. |
| "promoted" | Memory location was scalar-promoted (repeated load replaced with register). |
| "licm" | General LICM diagnostic (pass name in remark metadata). |
MachineLICM emits its own set:
| String | Condition |
|---|---|
| "Hoisting to BB#%d" | Machine instruction hoisted to the specified preheader block. |
| "Won't hoist cheap instruction" | Instruction deemed too cheap to justify the pressure increase. |
| "Can't hoist due to spill pressure" | sink-insts-to-avoid-spills vetoed the hoist. |
Analysis Notes
Identity Ambiguity: sub_184CD60 and sub_195E880
The pipeline analysis identified two factory functions as LICM candidates:
- sub_195E880(0): Called with explicit LICM disable guards (!opts[1240], opts[2880]). Present in the main optimizer and extended pipeline. This is the higher-confidence identification as the IR-level LICM factory.
- sub_184CD60(): Called in the O1 baseline pipeline at position 12 (after LoopRotate) and in the aggressive pipeline. Some sweeps identify this as ConstantMerge or GlobalDCE. The O1 pipeline context (LoopRotate -> sub_184CD60 -> IndVarSimplify) strongly suggests LICM, as this is the canonical upstream LLVM loop optimization sequence. However, the aggressive pipeline uses it in a position where ConstantMerge would also make sense. Without symbols, the definitive identification relies on structural context.
Both functions likely create the same underlying LICMPass -- the difference may be in the parameters (e.g., AllowSpeculation, ConservativeCalls) or the analysis dependencies they request.
No Visible NVIDIA Patches to IR-Level LICM
Unlike DSE, GVN, and InstCombine, the IR-level LICM code does not appear to contain NVIDIA-specific modifications. The 11 knobs registered at ctor_457_0 are all standard upstream LLVM options. The NVIDIA delta for LICM is architectural:
- Analysis precision: NVVM AA and enhanced MemorySSA provide better aliasing information, making LICM more aggressive without code changes.
- Pipeline orchestration: Multiple invocations at different pipeline stages with different guard conditions.
- Machine-level integration: sink-insts-to-avoid-spills and enable-mlicm provide GPU-specific pressure management.
- Downstream safety net: Rematerialization undoes unprofitable hoists after register allocation.
LICM Disabled for OptiX IR
The --emit-optix-ir mode (triggered by OptiX runtime compilation with device type 0xDEED or 0xABBA) automatically sets do-licm=0, disabling LICM entirely. This suggests that OptiX IR is intended to be consumed by a downstream optimizer (the OptiX JIT compiler) that performs its own code motion decisions, and pre-hoisting at the cicc level would interfere with those decisions.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| LICMPass::create | sub_195E880 | -- | IR-level LICM factory (AllowSpeculation=false) |
| LICMPass::create (alt) | sub_184CD60 | -- | IR-level LICM factory (identity ambiguous; may be ConstantMerge) |
| LICM knob registration | ctor_457_0 (0x544C40) | -- | 11 cl::opt registrations for IR LICM |
| MachineLICM knob registration | ctor_305 | -- | 4 cl::opt registrations for MachineLICM |
| EarlyMachineLICMPass | (in codegen pipeline) | -- | Pre-RA machine-level LICM |
| MachineLICMPass | (in codegen pipeline) | -- | Post-RA machine-level LICM |
| LoopSinkPass | pipeline parser entry 271 | -- | Inverse of LICM hoist -- sinks unprofitable hoists |
| NVVMLowerBarriers | sub_1C98160 | -- | Runs between LICM invocations; lowers barrier intrinsics |
| NVVM AA query | sub_146F1B0 | -- | Address-space-based NoAlias determination used by MemorySSA |
| MemorySSA clobber walk | sub_1A6AFB3 | -- | Walker that LICM uses to determine load hoistability |
| Loop-invariant check | sub_1C51340 | -- | Utility for checking whether a value is loop-invariant |
Differences from Upstream LLVM
| Aspect | Upstream LLVM 20 | cicc v13.0 |
|---|---|---|
| Pipeline invocations | Typically one LICM invocation in the function pipeline, plus LoopSink. | 4-6 invocations at different pipeline stages with conditional guards. |
| Alias analysis precision | BasicAA + TBAA. Cross-address-space aliasing not exploited (all code shares one address space). | NVVM AA returns NoAlias for cross-address-space pairs, dramatically increasing hoistable instruction count. |
| MemorySSA sparsity | Dense graphs on flat-memory architectures. | Sparse graphs due to NVVM AA, reducing walker overhead and improving LICM precision. |
| Register pressure feedback | MachineLICM has sink-insts-to-avoid-spills but no GPU occupancy model. | sink-insts-to-avoid-spills interacts with NVPTX's occupancy-based register targets. enable-mlicm provides target-level gating. |
| Speculative hoisting | Allowed by default on most targets. | Disabled (AllowSpeculation=false) because GPU kernels fault on speculative loads from unmapped memory. |
| OptiX mode | N/A. | LICM entirely disabled for OptiX IR emission. |
| Downstream undo | No systematic mechanism to undo unprofitable hoists. | Rematerialization (nvvmrematerialize, nv-remat-block) systematically undoes hoists that increase pressure past the occupancy target. |
Cross-References
- MemorySSA Builder for GPU -- how MemorySSA exploits NVVM AA for sparse dependency graphs
- Alias Analysis & NVVM AA -- the cross-address-space NoAlias analysis that enables aggressive hoisting
- Rematerialization -- the safety net that undoes unprofitable hoists
- Sinking2 -- NVIDIA's custom sinking pass that complements LICM sink mode
- LLVM Optimizer -- pipeline assembly and two-phase compilation
- Optimization Levels -- per-tier pipeline configuration
- Machine-Level Passes -- MachineLICM pre-RA and post-RA placement
- Loop Passes (Standard) -- LoopRotate, LCSSA, LoopSimplify that canonicalize before LICM
- Loop Unrolling -- runs after LICM in the pipeline; the LoopUnroll pass factory at sub_19B73C0 was previously mislabeled as LICM
LICM (Loop-Invariant Code Motion) -- Redirect
This page previously contained LoopUnroll content due to a sweep misidentification. The LoopUnroll pass factory at sub_19B73C0 was incorrectly labeled as LICM because the two passes are adjacent in the binary. All LoopUnroll content has been merged into the Loop Unrolling page.
For the actual LICM documentation, see: LICM (Loop-Invariant Code Motion)
The LICM page covers:
- IR-level LICM ("licm", backed by MemorySSA) -- hoist and sink modes
- Machine-level LICM ("early-machinelicm", "machinelicm") -- pre-RA and post-RA
- GPU-specific considerations: register pressure, occupancy cliffs, NVVM AA cross-address-space independence
- All pipeline positions, knobs, and diagnostic strings
- Interaction with downstream passes (rematerialization, Sinking2)
DSE (Dead Store Elimination)
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
Upstream source:
llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp(LLVM 20.0.0)
CICC v13.0 contains a heavily modified Dead Store Elimination pass totaling approximately 91 KB of decompiled code across three major functions: the core DSE::runOnFunction at sub_19DA750 (33 KB), the overwrite detection engine at sub_19DDCB0 (28 KB), and the partial overwrite tracking system at sub_19DF5F0 (30 KB). This substantially exceeds the size of upstream LLVM DSE, primarily due to NVIDIA's additions for partial store forwarding with type conversion, cross-store dependency tracking, store-chain decomposition for aggregates, and native CUDA vector type awareness.
IR Before/After Example
DSE removes stores that are overwritten before any load reads them. The NVIDIA extension handles partial overwrites common in CUDA vector code.
Before (dead store followed by overwrite):
define void @f(ptr addrspace(1) %p, float %x, float %y) {
store float %x, ptr addrspace(1) %p, align 4 ; dead: overwritten below before any load
%other = fadd float %x, %y
store float %other, ptr addrspace(1) %p, align 4 ; overwrites the first store completely
ret void
}
After:
define void @f(ptr addrspace(1) %p, float %x, float %y) {
; first store removed -- overwritten by second store, no intervening load
%other = fadd float %x, %y
store float %other, ptr addrspace(1) %p, align 4
ret void
}
NVIDIA's DSE also handles partial overwrite patterns with CUDA vector types. When a float4 store partially overwrites a previous float4 store, the pass decomposes via GEP to determine which elements are dead. This is a key GPU extension that upstream LLVM DSE does not handle.
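The element-level dead-store question reduces to byte-interval overlap. A hedged sketch of the idea (CICC's sub_19DF5F0 tracks far more state; stores here are plain (offset, size) tuples within one object):

```python
def dead_bytes(earlier, later):
    """Byte range of `earlier` that `later` kills. Each store is
    (offset, size_in_bytes); returns (start, end) of the overwritten
    interval, or None when the stores do not overlap."""
    e_off, e_size = earlier
    l_off, l_size = later
    start = max(e_off, l_off)
    end = min(e_off + e_size, l_off + l_size)
    return (start, end) if start < end else None

# A float4 store at offset 0 (16 bytes) partially overwritten by a
# float2 store at offset 8: the .z/.w elements of the first store die.
assert dead_bytes((0, 16), (8, 8)) == (8, 16)
# Complete overwrite (the example above): the whole earlier store dies.
assert dead_bytes((0, 4), (0, 4)) == (0, 4)
# Disjoint stores: nothing is dead.
assert dead_bytes((0, 4), (8, 4)) is None
```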
Analysis Dependencies
DSE requires five analysis passes, resolved through the pass manager at registration time (sub_19DD1D0):
| Analysis | Global Address | Pass ID |
|---|---|---|
| MemorySSA | unk_4F9E06C | Memory SSA graph |
| DominatorTree | unk_4F9A488 | Dominator tree |
| MemoryDependence | unk_4F9B6E8 | Memory dependence queries |
| PostDominatorTree | unk_4F9D764 | Post-dominator tree |
| AliasAnalysis | unk_4F9D3C0 | NVVM-aware alias analysis |
Core Algorithm
The main entry point DSE::runOnFunction (sub_19DA750) processes a function by iterating over store instructions and checking whether each store is dead (fully or partially overwritten by a later store to the same location before any intervening load).
Early Exit and Setup
The pass begins with an early exit check via sub_1636880() to determine whether the function should be skipped entirely. It then retrieves MemoryDependence and AliasAnalysis from the pass manager and calls sub_14A4050 / sub_14A2F00 to verify the function contains stores worth analyzing. If no stores are present, the pass returns immediately.
Store Instruction Identification
Store instructions are identified by checking byte +16 of the instruction structure for value 77. The operand count is read from offset +20 (masked with 0xFFFFFFF), and the "has-operand-list-pointer" flag at byte +23, bit 0x40, indicates indirect operand storage for instructions with many operands.
Type Size Computation
DSE computes store sizes through a type-walker switch on byte +8 of the type structure. This logic is shared between the core pass and the overwrite detector:
| Type Code | Size | Notes |
|---|---|---|
| 1 | 16 bits | Half-precision float |
| 2 | 32 bits | Float / int32 |
| 3, 9 | 64 bits | Double / int64 |
| 4 | 80 bits | x86 long double / PTX f80 |
| 5, 6 | 128 bits | Quad precision / int128 |
| 7 | pointer-sized | Resolved via sub_15A9520 |
| 0xB | immediate | Size from upper bits of type word |
| 0xD | struct | Layout computed by sub_15A9930 |
| 0xE | vector | element_size * num_elements with alignment |
| 0xF | integer | Arbitrary-width integer |
| 0x10 | array | Recurses into element type, multiplies by count |
| 0, 8, A, C | array-like | Follows pointer chain |
The vector type formula (case 0xE) accounts for element alignment: 8 * num_elements * element_alignment * floor((element_alignment + ceil(element_bits/8) - 1) / element_alignment), i.e. each element is padded up to its alignment before multiplying by the lane count. This handles CUDA native vector types (float2, float4, int4).
Overwrite Detection
The overwrite analysis engine at sub_19DDCB0 (28 KB) determines whether one store completely or partially covers another. It receives the instruction, an operand index, alias analysis results, and address-space information.
Alias Queries
The function calls sub_14C2730 to perform alias queries with full parameters: (target_ptr, data_layout, 0, instruction, store_address, alias_analysis). This returns whether two memory locations may alias. The alias analysis already incorporates CUDA address-space separation (shared=3, global=1, local=5, constant=4), so DSE itself does not need explicit address-space checks.
Partial Store Forwarding
When store sizes do not match, NVIDIA's DSE creates truncation or extension casts to extract the relevant portion. This is a critical GPU-specific extension:
- If the source is smaller than the destination: creates an extension (opcode 36 = zext).
- If the source is larger than the destination: creates a truncation (opcode 38 = trunc).
- Alignment requirements are verified through sub_16431D0.
- Complex types use sub_15FDBD0 for cast creation; simple types use sub_15A46C0.
Standard LLVM DSE bails on size mismatches. NVIDIA's version handles the common CUDA pattern of a float4 store followed by a scalar float load by extracting the relevant component via GEP + load.
Store Size Ratio Check
At labels LABEL_25 / LABEL_29 in the core function, DSE performs a ratio check:
- Computes v159 = aligned size of the destination type.
- Computes v48 = aligned size of the source type.
- Calculates v148 = v48 / v159 (how many destination-sized elements fit in the source).
- If v48 % v159 != 0, bails (partial overlap that cannot be forwarded).
- If sizes differ, creates a GEP + load to extract the relevant portion.
Metadata Preservation
After creating a replacement instruction, the pass preserves metadata:
- Debug location via sub_157E9D0.
- Use-chain linkage by updating prev/next pointers at offsets +24/+32.
- Basic block insertion via sub_164B780.
- TBAA metadata propagation through sub_1623A60 / sub_1623210.
- nonnull attribute copying via sub_15FA300 / sub_15FA2E0.
- Use replacement via sub_164B7C0.
Partial Overwrite Tracking
The function-level partial overwrite pass at sub_19DF5F0 (30 KB) maintains a hash table of all stores in a function and tracks which stores partially overwrite each other.
Hash Table Structure
Each hash table entry is 72 bytes:
| Offset | Content |
|---|---|
| +0 | Key (store instruction pointer; -8 = empty, -16 = tombstone) |
| +8 | Operand list pointer |
| +16 | Operand count |
| +24 | Inline storage (when count <= small threshold) |
| +48 | Additional metadata |
The hash function, probing strategy, and growth/compaction thresholds follow the standard DenseMap infrastructure; see Hash Table and Collection Infrastructure. This instance uses NVVM-layer sentinels (-8 / -16) and a minimum table size of 64 entries.
Cross-Store Dependency Records
When a new store aliases an existing entry, DSE records both stores in a 6-element record: {store1, store2, operand1, operand2, ptr1, ptr2}. This enables tracking stores that partially overwrite each other even when the overwritten value has been modified between stores. Reference counting is managed through sub_1649AC0 / sub_1649B30, and per-entry operand lists grow via sub_170B450.
Store-Chain Decomposition
In the LABEL_47 region of the core function, DSE walks store chains through struct/array GEPs and decomposes aggregate stores into element-level dead store checks. sub_19D94E0 handles chain-level elimination, while sub_19D91E0 builds the comparison set for overlap detection.
Address-Space Handling
DSE does not contain explicit CUDA address-space comparisons. Address-space separation is handled entirely by the underlying NVVM alias analysis (unk_4F9D3C0), which knows that different address spaces cannot alias. The alias query function sub_14C2730 receives the full instruction context including address space, so query results already incorporate this constraint.
Store Forwarding to Loads
The function sub_19DBD20 (20 KB) attempts store-to-load forwarding. When sub_19DD7C0 finds a store feeding into a load, it constructs a replacement using sub_12815B0. Sign/zero extension matching uses type byte 15 (float types) and type byte 11 (integer types), with opcodes 45 (float-to-int truncation), 46 (int-to-float), and 47 (generic cast).
Related Passes
Two related passes are registered alongside DSE in the same code region:
- MergedLoadStoreMotion (sub_19DCD20, pass name mldst-motion): Shares the same alias infrastructure and is registered with the same analysis dependencies.
- NaryReassociate (sub_19DD420 / sub_19DD530): N-ary reassociation pass factory, registered at sub_19DD1D0 with its own analysis set.
Key Function Map
| Function | Address | Size | Role |
|---|---|---|---|
DSE::runOnFunction | 0x19DA750 | 33 KB | Main dead store elimination |
DSE::analyzeOverwrite | 0x19DDCB0 | 28 KB | Complete/partial overwrite detection |
DSE::runPartialOverwritePass | 0x19DF5F0 | 30 KB | Function-level partial tracking |
DSE::tryForwardStoresToLoad | 0x19DBD20 | 20 KB | Store-to-load forwarding |
DSE::buildOverwriteRecord | 0x19D8AF0 | -- | Overlap record construction |
DSE::buildComparisonSet | 0x19D91E0 | -- | Set of stores to compare |
DSE::eliminateStoreChain | 0x19D94E0 | -- | Chain-level elimination |
DSE::scanLoopForDeadStores | 0x19DCB70 | -- | Loop-level DSE |
DSE::runOnBasicBlock | 0x19DCC90 | -- | Block-level entry point |
DSE::extractStoreOperands | 0x19DD690 | -- | Get base pointer and stored value |
DSE::lookupDeadStoreCandidate | 0x19DD7C0 | -- | Hash table lookup |
DSE::decomposeGEPStore | 0x19DD950 | -- | GEP-based store decomposition |
DSE::collectPartialOperands | 0x19DEFC0 | -- | Partial overwrite operand collection |
DSE::checkPartialOverwrite | 0x19DEE70 | -- | Individual partial overwrite check |
DSE::tryEliminateStore | 0x19DF200 | -- | Attempt store elimination |
DSE::rehashStoreTable | 0x19DF220 | -- | Hash table resize |
Differences from Upstream LLVM
- Partial store forwarding with type conversion. Standard LLVM DSE bails when store and load sizes differ. NVIDIA's version creates GEP + load sequences to extract relevant portions, handling float4 -> float patterns.
- 72-byte hash table entries with cross-store tracking. Upstream uses simpler data structures. NVIDIA tracks which stores partially overwrite each other through 6-element dependency records.
- Store-chain decomposition. Aggregate stores are decomposed through struct/array GEPs into element-level checks, enabling elimination of stores that are collectively dead.
- Vector type awareness. The type walker includes a dedicated case for CUDA vector types with proper alignment computation.
- Total code size. At ~91 KB across three functions, NVIDIA's DSE is roughly 3x the size of upstream LLVM's equivalent.
Constant Folding: Math & Intrinsics
NVIDIA-modified pass. GPU-specific changes (110+ math name variants, 60+ NVVM intrinsic IDs, exception-safe host evaluation) are documented throughout this page.
Upstream source:
llvm/lib/Analysis/ConstantFolding.cpp (LLVM 20.0.0). The upstream ConstantFoldCall function handles standard llvm.* intrinsics; NVIDIA's extensions (sub_14D90D0 eligibility checker, sub_14D1BC0 evaluator) are layered on top.
LLVM version note: The upstream ConstantFolding.cpp in LLVM 20 handles approximately 30 standard math intrinsics (llvm.sin, llvm.cos, llvm.sqrt, etc.) and a small set of NVPTX-specific intrinsics (ceil, floor, fabs, sqrt in nvvm.* form). CICC extends this to 110+ math name variants (C, glibc __*_finite, C++ mangled _Z*) and 60+ NVVM intrinsic IDs. The upstream disable-fp-call-folding knob (cl::Hidden, default false) is preserved; NVIDIA adds a separate FPFoldDisable CiccOption for independent control.
CICC v13.0 extends LLVM's ConstantFolding analysis with two large custom functions that together enable compile-time evaluation of over 110 distinct math function name variants and 60+ NVVM intrinsic IDs. Upstream LLVM's ConstantFoldCall handles standard llvm.sin, llvm.cos, llvm.sqrt, and a handful of NVPTX-specific intrinsics (ceil, floor, fabs, sqrt in their nvvm.* forms, plus FP-to-integer conversion intrinsics). CICC goes far beyond this: it recognizes every C math library name (sin, sinf), every glibc __*_finite internal variant, every C++ mangled form (_Z3cosf, _Z4acosd), and the full set of NVVM approximate/FTZ math intrinsics -- then evaluates them using the host C math library with an exception-safe wrapper that refuses to produce results when the host FPU signals domain errors, overflow, or underflow.
The system is split into two cooperating functions. The eligibility checker sub_14D90D0 (27 KB, called nvvmIntrinsicConstantFold in the sweep analysis) is a fast predicate that answers "can this call be constant-folded?" without touching operand values. The evaluator sub_14D1BC0 (54 KB, called nvvmConstantFoldLibCall) performs the actual computation when all operands are constant. A third function, the NVVM InstCombine intrinsic folder sub_1169C30 (87 KB), handles algebraic simplification of NVVM intrinsics and is documented separately on the InstCombine page.
| Eligibility checker | sub_14D90D0 (0x14D90D0, 27 KB, 282 basic blocks, 489 edges) |
| Math evaluator | sub_14D1BC0 (0x14D1BC0, 54 KB) |
| Constant extractor | sub_14D1620 (0x14D1620) |
| Safe unary eval wrapper | sub_14D19F0 (0x14D19F0) |
| Safe binary eval wrapper | sub_14D1A80 (0x14D1A80) |
| ConstantFP builder | sub_14D17B0 (0x14D17B0) |
| Custom fabs | sub_14D1280 (0x14D1280) -- SSE2 sign-bit mask |
| Custom floor | sub_14D13B0 (0x14D13B0) -- truncation + sign correction |
| Custom ceil | sub_14D1410 (0x14D1410) -- truncation + sign correction |
| Custom sqrt | sub_14D1470 (0x14D1470) -- thin wrapper around libc sqrt |
| Vector math mapping | sub_149E420 (0x149E420, 26 KB) |
| LLVM knob | disable-fp-call-folding (upstream, cl::Hidden, default false) |
| NVIDIA knob | FPFoldDisable (NVIDIA CiccOption, disables FP constant folding) |
Two-Tier Architecture: Eligibility vs. Evaluation
The constant folding system operates as a two-phase protocol. The caller (from the ConstantFolding pass or InstCombine visitCallInst path) first invokes the eligibility checker to determine whether a call instruction is a candidate, then invokes the evaluator to produce the folded constant. This split exists for performance: the eligibility check is cheap (no operand extraction, no FP computation), while the evaluator is expensive (extracts APFloat values, calls host math library, checks FP exceptions).
Eligibility Checker: sub_14D90D0
The function takes a tagged IR node pointer and a context (intrinsic descriptor). The node pointer carries a 3-bit tag in its low bits; the function masks with ~7 to recover the aligned base. Before examining intrinsic IDs, it performs three attribute pre-filter checks on the callee:
- Speculatable/ReadNone (attribute kind 0x15 = 21): The callee must be safe to speculatively execute. If the direct callee lacks this attribute, the function follows one level of indirection through the resolved function target at [callee + 0x70] and re-checks.
- NoUnwind (attribute kind 5): The callee must not throw. Same indirection chain.
- Convergent gate (attribute kind 0x34 = 52): If the callee is marked convergent, the function returns 0 immediately. This is the critical safety check for GPU code -- convergent intrinsics like __syncthreads(), __ballot_sync(), and warp shuffle operations have warp-synchronous semantics that would be violated by folding them away, even when all arguments happen to be constant.
After attribute filtering, the function reads the intrinsic ID from [context + 0x24] (offset +36, unsigned 32-bit enum) and dispatches through a two-level scheme.
Evaluation: sub_14D1BC0
The evaluator receives the function name string, its length, an opcode/intrinsic-ID enum, a return type descriptor, an array of constant operand IR nodes, the operand count (1, 2, or 3), a flag enabling name-based matching, and a context pointer. It returns a ConstantFP or ConstantInt IR node on success, or null on failure.
The top-level dispatch is on operand count:
- Unary (count = 1): Trigonometric, exponential, logarithmic, rounding, and absolute value functions.
- Binary (count = 2): pow, fmod, atan2, copysign, fmin, fmax.
- Ternary (count = 3): FMA / fused multiply-add (opcodes 99 and 100 only).
Foldable Intrinsics Master Table
Standard LLVM Intrinsic IDs (0--211)
These are dispatched via a jump table at jpt_14D91F0 in the eligibility checker. The evaluator handles them via cascading opcode comparisons.
| ID | Hex | Intrinsic | Category |
|---|---|---|---|
| 5 | 0x05 | llvm.bswap | Bitwise |
| 6 | 0x06 | llvm.ceil | Rounding |
| 8 | 0x08 | llvm.copysign | Sign |
| 11 | 0x0B | llvm.cos | Trig |
| 12 | 0x0C | llvm.ctlz | Bitwise |
| 13 | 0x0D | llvm.ctpop | Bitwise |
| 30 | 0x1E | llvm.exp | Exponential |
| 31 | 0x1F | llvm.exp2 | Exponential |
| 32 | 0x20 | llvm.fabs | Absolute |
| 33 | 0x21 | llvm.floor | Rounding |
| 54 | 0x36 | llvm.fma | Ternary |
| 55 | 0x37 | llvm.fmuladd | Ternary |
| 96 | 0x60 | llvm.log | Logarithmic |
| 97 | 0x61 | llvm.log10 | Logarithmic |
| 99 | 0x63 | llvm.log2 | Logarithmic |
| 100 | 0x64 | llvm.lround | Rounding |
| 115 | 0x73 | llvm.maxnum | MinMax |
| 122 | 0x7A | llvm.minnum | MinMax |
| 123 | 0x7B | llvm.nearbyint | Rounding |
| 124 | 0x7C | llvm.pow | Power |
| 129 | 0x81 | llvm.powi | Power |
| 132 | 0x84 | llvm.rint | Rounding |
| 139 | 0x8B | llvm.round | Rounding |
| 140 | 0x8C | llvm.roundeven | Rounding |
| 146 | 0x92 | llvm.sin | Trig |
| 147 | 0x93 | llvm.tan | Trig |
| 187 | 0xBB | llvm.sqrt | Root |
| 188 | 0xBC | llvm.trunc | Rounding |
| 189--211 | 0xBD--0xD3 | Integer ops (umax, sadd.with.overflow, etc.) | Integer |
NVVM-Specific Intrinsic IDs (>211)
These are dispatched via cascading range checks with bitmask tests in the eligibility checker.
| ID Range | Hex | Intrinsic | Category |
|---|---|---|---|
| 3637--3639 | 0xE35--0xE37 | nvvm.bitcast.* / nvvm.move.* | Bitwise |
| 3660 | 0xE4C | nvvm.ptr.gen.to.* | Pointer |
| 3764--3765 | 0xEB4--0xEB5 | nvvm.ceil.f / nvvm.ceil.d | Rounding |
| 3778--3779 | 0xEC2--0xEC3 | nvvm.ctlz.i / nvvm.ctlz.ll | Bitwise |
| 3787 | 0xECB | nvvm.cos.approx.ftz.f | Trig |
| 3811 | 0xEE3 | nvvm.div.* / nvvm.fabs variant | Arith |
| 3870--3871 | 0xF1E--0xF1F | nvvm.exp2.approx.ftz.f / .d | Exponential |
| 3911--3912 | 0xF47--0xF48 | nvvm.fabs.f / .d | Absolute |
| 3924--3925 | 0xF54--0xF55 | nvvm.floor.f / .d | Rounding |
| 3944 | 0xF68 | nvvm.log.approx.ftz.f | Logarithmic |
| 3946 | 0xF6A | nvvm.log2.approx.ftz.f | Logarithmic |
| 3948 | 0xF6C | nvvm.log10.approx.ftz.f | Logarithmic |
| 3950 | 0xF6E | nvvm.rcp.approx.ftz.d | Reciprocal |
| 3952 | 0xF70 | nvvm.rsqrt.approx.ftz.f | Root |
| 3954 | 0xF72 | nvvm.sqrt.f / .approx.ftz.f | Root |
| 4072--4074 | 0xFE8--0xFEA | nvvm.sin/cos.approx.ftz variants | Trig |
| 4114--4115 | 0x1012--0x1013 | nvvm.max.i / .ui | MinMax |
| 4118--4119 | 0x1016--0x1017 | nvvm.min.i / .ui | MinMax |
| 4167--4168 | 0x1047--0x1048 | nvvm.max.ll / .ull | MinMax |
| 4170--4172 | 0x104A--0x104C | nvvm.min.ll / .ull | MinMax |
| 4230--4231 | 0x1086--0x1087 | nvvm.mul.hi.* | Multiply |
| 4413 | 0x113D | nvvm.sin.approx.ftz.f | Trig |
| 4475, 4478 | 0x117B, 0x117E | nvvm.sqrt.f / .rn.d | Root |
| 4483--4484 | 0x1183--0x1184 | nvvm.sqrt.approx.f / .ftz.f | Root |
| 5293 | 0x14AD | nvvm.f2i / nvvm.d2i | Conversion |
| 5300 | 0x14B4 | nvvm.i2f / nvvm.i2d | Conversion |
| 7297--7298 | 0x1C81--0x1C82 | nvvm.fmax.f / .d | MinMax |
| 7301--7302 | 0x1C85--0x1C86 | nvvm.fmin.f / .d | MinMax |
| 7334--7335 | 0x1CA6--0x1CA7 | nvvm.fmax.ftz.f / .ftz.nan.f | MinMax |
| 7339--7340 | 0x1CAB--0x1CAC | nvvm.fmin.ftz.f / .ftz.nan.f | MinMax |
Name-Based Foldable Functions (Case 0 Fallthrough)
When the intrinsic ID is 0 (unrecognized LLVM intrinsic), both the eligibility checker and the evaluator fall through to string-based matching. The evaluator uses a two-tier name matching system: fast-path intrinsic ID dispatch, then slow-path name comparison when the name-based-matching flag (decompiler argument a7) is set.
Plain C library names (44 entries):
| Category | Functions |
|---|---|
| Trigonometric | sin, sinf, cos, cosf, tan, tanf |
| Inverse trig | acos, acosf, asin, asinf, atan, atanf, atan2, atan2f |
| Hyperbolic | sinh, sinhf, cosh, coshf, tanh, tanhf |
| Exponential | exp, expf, exp2, exp2f |
| Logarithmic | log, logf, log10, log10f |
| Rounding | ceil, ceilf, floor, floorf, round, roundf |
| Absolute / Root | fabs, fabsf, sqrt, sqrtf |
| Binary | pow, powf, fmod, fmodf, atan2, atan2f |
Glibc __*_finite variants (20 entries):
__acos_finite, __acosf_finite, __asin_finite, __asinf_finite, __atan2_finite, __atan2f_finite, __cosh_finite, __coshf_finite, __exp_finite, __expf_finite, __exp2_finite, __exp2f_finite, __log_finite, __logf_finite, __log10_finite, __log10f_finite, __pow_finite, __powf_finite, __sinh_finite, __sinhf_finite
C++ mangled names (~48 entries): _Z3cosf, _Z3cosd, _Z3sinf, _Z3sind, _Z3tanf, _Z3tand, _Z3expf, _Z3expd, _Z3logf, _Z3logd, _Z4acosf, _Z4acosd, _Z4asinf, _Z4asind, _Z4atanf, _Z4atand, _Z4ceilf, _Z4ceild, _Z4coshf, _Z4coshd, _Z4exp2f, _Z4exp2d, _Z4fabsf, _Z4fabsd, _Z4sinhf, _Z4sinhd, _Z4sqrtf, _Z4sqrtd, _Z4tanhf, _Z4tanhd, _Z4fmodff, _Z4fmoddd, _Z5floorf, _Z5floord, _Z5log10f, _Z5log10d, _Z5atan2ff, _Z5atan2dd, _Z5powff, _Z5powdd, _Z5roundf, _Z5roundd
Total across all three name forms: approximately 112 distinct recognized strings.
Name Matching Algorithm
The evaluator's name matching is a hand-tuned trie-like dispatch optimized for the specific set of math function names. It avoids hash tables or sorted arrays in favor of cascading character comparisons:
nameMatch(name, length):
// Strip C++ mangling prefix
if name[0] == '_' and name[1] == 'Z':
dispatch on name[2]: // length digit
'3' -> match 3-char base: cos, sin, tan, exp, log
'4' -> match 4-char base: acos, asin, atan, ceil, cosh, exp2, fabs, sinh, sqrt, tanh, fmod
'5' -> match 5-char base: floor, log10, atan2, pow, round
verify trailing type suffix: 'f' = float, 'd' = double
return FOUND
// Strip glibc __finite prefix
if name[0] == '_' and name[1] == '_':
dispatch on name[2]:
'a' -> __acos_finite, __acosf_finite, __asin_finite, __asinf_finite,
__atan2_finite, __atan2f_finite
'c' -> __cosh_finite, __coshf_finite
'e' -> __exp_finite, __expf_finite, __exp2_finite, __exp2f_finite
'l' -> __log_finite, __logf_finite, __log10_finite, __log10f_finite
'p' -> __pow_finite, __powf_finite
's' -> __sinh_finite, __sinhf_finite
verify with memcmp against string constant
return FOUND
// Plain C library name
dispatch on name[0]:
'a' -> acos, asin, atan + 'f' variants
'c' -> cos, cosf, ceil, ceilf, cosh, coshf
'e' -> exp, expf, exp2, exp2f
'f' -> fabs, fabsf, floor, floorf
'l' -> log, logf, log10, log10f
'p' -> pow, powf
'r' -> round, roundf
's' -> sin, sinf, sinh, sinhf, sqrt, sqrtf
't' -> tan, tanf, tanh, tanhf
// Within each group, dispatch on name length:
length 3: direct 3-byte compare ("sin", "cos", "tan", "exp", "log", "pow")
length 4: DWORD compare (4-byte integer, little-endian):
0x736F6361 = "acos" 0x6E697361 = "asin"
0x6E617461 = "atan" 0x6C696563 = "ceil"
0x68736F63 = "cosh" 0x73626166 = "fabs"
0x66736F63 = "cosf" 0x686E6973 = "sinh"
0x74727173 = "sqrt" 0x686E6174 = "tanh"
0x32707865 = "exp2" 0x66707865 = "expf"
...
length 5+: memcmp against literal string constant
return FOUND or NOT_FOUND
The 4-byte integer comparison trick deserves attention: instead of calling memcmp for 4-character names, the code loads the name as a uint32_t and compares against a pre-computed little-endian constant. For example, *(uint32_t*)name == 0x736F6361 checks for "acos" ('a'=0x61, 'c'=0x63, 'o'=0x6F, 's'=0x73). This micro-optimization eliminates function call overhead for the most common name lengths.
Exception-Safe Host Evaluation
The core safety mechanism is the FP exception wrapper used for all transcendental evaluation. Both the unary wrapper (sub_14D19F0) and binary wrapper (sub_14D1A80) follow the same protocol:
Value* safeMathEval(double (*mathFunc)(double), Type* resultType, double arg) {
feclearexcept(FE_ALL_EXCEPT); // clear all FP exception flags
*__errno_location() = 0; // clear errno
double result = mathFunc(arg); // call host C library
// Check errno for domain/range error
int e = *__errno_location();
if (e == EDOM || e == ERANGE) { // errno 33 or 34
feclearexcept(FE_ALL_EXCEPT);
*__errno_location() = 0;
return nullptr; // refuse to fold
}
// Check FP exception flags (mask = 0x1D = 29)
// FE_INVALID(1) | FE_DIVBYZERO(4) | FE_OVERFLOW(8) | FE_UNDERFLOW(16)
if (fetestexcept(FE_INVALID | FE_DIVBYZERO | FE_OVERFLOW | FE_UNDERFLOW)) {
feclearexcept(FE_ALL_EXCEPT);
*__errno_location() = 0;
return nullptr; // refuse to fold
}
// FE_INEXACT (32) is intentionally NOT checked --
// most transcendentals produce inexact results and that is acceptable.
return createConstantFP(resultType, result);
}
This design means the folder refuses to produce a result whenever the host FPU signals any exceptional condition other than inexact. The implications:
- exp(1e308) overflows on the host -- not folded, left in IR for runtime evaluation.
- log(-1.0) produces a domain error -- not folded.
- sqrt(-0.01) triggers FE_INVALID -- not folded.
- sin(0.5) produces an inexact result (sin(0.5) is irrational) -- folded normally.
Domain Pre-Checks
In addition to the post-evaluation exception check, certain functions have explicit domain guards before calling the host math library:
| Function | Precondition | Rationale |
|---|---|---|
log, logf, log10, log10f | argument > 0.0 | Negative inputs produce NaN |
sqrt, sqrtf | argument >= 0.0 | Negative inputs produce NaN |
acos, asin | no pre-check | Relies on FP exception mechanism |
The asymmetry is deliberate: log/sqrt get explicit checks because their domain violations are common and cheap to detect, while acos/asin rely on the post-evaluation FE_INVALID check.
Host FPU vs. GPU Precision
The constant folder evaluates using the host CPU's math library (j_sin, j_cos, j_exp, etc. -- PLT stubs to glibc). This creates a potential precision mismatch: the folded constant may not be bit-identical to what the GPU hardware would compute. NVIDIA mitigates this through several mechanisms:
- Custom implementations for exact functions. fabs, floor, ceil, and round have custom host-side implementations that match GPU rounding semantics exactly:
  - fabs (sub_14D1280): Pure SSE2 bitwise AND with 0x7FFFFFFFFFFFFFFF (clear sign bit). Bit-exact regardless of platform.
  - floor (sub_14D13B0): Custom truncation: for |x| < 2^52, truncate to integer, subtract 1.0 if truncation rounded toward zero for negative values, preserve sign bit. For |x| >= 2^52, return unchanged (already integral).
  - ceil (sub_14D1410): Mirror of floor: truncate to integer, add 1.0 if truncation rounded toward zero for positive values.
  - round (j__round): Uses libc round() directly (round-half-away-from-zero, matching PTX round.rni).
- Exception rejection for transcendentals. For sin, cos, exp, log, and other transcendentals, CICC accepts the host result because IEEE-754 guarantees these are correctly rounded within 1 ULP on both host and device. The exception wrapper catches cases where host and device behavior might diverge (denormals, overflow boundary).
- exp2(x) folded as pow(2.0, x). Rather than calling exp2() directly (which might differ between host and device implementations), the evaluator computes pow(2.0, x) through the binary wrapper, ensuring consistent behavior.
- No half-precision transcendental folding. The type check at the evaluator's entry rejects type byte 1 (half) for all trig/exp/log functions. Only basic operations (convert, compare) work on fp16. This is safe because half-precision math functions are implemented as promote-to-float, compute, demote-to-half -- by the time the constant folder runs, the promotion has already been inlined.
FTZ and Approximate Intrinsics
NVVM intrinsics like nvvm.exp2.approx.ftz.f and nvvm.sin.approx.ftz.f carry .approx (reduced precision) and .ftz (flush-to-zero for denormals) modifiers. These are present in the foldable ID list, which may seem surprising -- folding an "approximate" intrinsic with exact host math could produce a different value than the hardware.
The rationale: constant folding evaluates the mathematical function, not the hardware instruction. If the input is a normal float and the result is a normal float, the folded value is correct regardless of FTZ or approximation quality. The FTZ modifier only affects denormal inputs (which the exception wrapper would catch via FE_UNDERFLOW), and the .approx modifier only matters for runtime execution speed. For compile-time constants, exact evaluation is strictly better.
Comparison with Upstream LLVM
Upstream LLVM's ConstantFolding.cpp (as of LLVM 19.x) handles NVPTX intrinsics in canConstantFoldCallTo and ConstantFoldCall. The overlap and gaps:
| Capability | Upstream LLVM | CICC v13.0 |
|---|---|---|
llvm.sin, llvm.cos, llvm.exp, llvm.log, etc. | Yes | Yes |
nvvm.ceil.f, nvvm.floor.f, nvvm.fabs, nvvm.sqrt.* | Yes | Yes |
nvvm.fmax.*, nvvm.fmin.* (all variants) | Yes (including .xorsign_abs) | Yes (subset: .f, .d, .ftz, .ftz.nan) |
nvvm.f2i_*, nvvm.d2i_* (FP-to-int with rounding modes) | Yes (all 32 variants) | Partial (IDs 5293, 5300 only) |
Plain C math names (sin, cosf, exp2f, etc.) | Via TargetLibraryInfo | Direct name matching (44 entries) |
Glibc __*_finite variants | No | Yes (20 entries) |
C++ mangled _Z3cosf, _Z4acosd, etc. | No | Yes (~48 entries) |
nvvm.cos.approx.ftz.f, nvvm.exp2.approx.ftz.f, etc. | No | Yes |
nvvm.rcp.approx.ftz.d, nvvm.rsqrt.approx.ftz.f | No | Yes |
nvvm.mul.hi.* | No | Yes |
| Convergent intrinsic rejection | Implicit (no fold path) | Explicit attribute check |
| FMA constant fold | Yes (via APFloat) | Yes (opcodes 99/100, APFloat fma) |
| Integer min/max/ctlz/cttz | Partial | Yes (full NVVM ID coverage) |
The critical CICC-only capabilities are the __*_finite variants (needed when code is compiled with -ffinite-math-only), the C++ mangled names (emitted by device-side C++ math overloads), and the .approx.ftz intrinsic family.
Integer Constant Folding
The evaluator also handles integer-domain operations when operands have type tag 13 (ConstantInt) or when FP operands encode integer comparisons:
Binary integer ops (operand count = 2, both ConstantInt):
- Opcodes 189, 195, 198, 209, 210, 211: APInt binary operations (add, sub, mul, sdiv, udiv, srem) via sub_16A7290 and related APInt helpers.
- Opcodes 0xEC2 / 0xEC3 (3778/3779): ctlz (count leading zeros).
- Opcodes 0x1014/0x1015 and 0x1016/0x1017: signed/unsigned min/max via APInt comparison.
- Opcodes 0x104B/0x104C and 0x1087/0x1088: additional signed/unsigned min/max encodings.
- Opcode 3811: division where the divisor is known zero -- returns UndefValue.
Integer comparison fold (type tag 14 with integer-domain opcodes):
- Opcodes 0xBB (187) and 0x8C (140): icmp eq/ne -- predicate 0.
- Opcode 0x61 (97): icmp slt -- predicate 2.
- Opcode 0xBC (188): icmp sgt -- predicate 4.
- Opcode 0xCE (206): icmp uge -- predicate 3.
- Opcode 0x08 (8): icmp ult -- predicate 1.
These produce ConstantInt 0 or 1 via sub_169EBA0/sub_169D440.
Libdevice Integration
NVIDIA's libdevice (libdevice.10.bc) provides optimized LLVM bitcode implementations of math functions. After linking libdevice, calls like __nv_sinf are typically inlined and disappear before constant folding runs. However, if inlining fails or is disabled, residual __nv_* calls may survive.
The constant folder does not recognize __nv_* prefixed names directly. The __ name-matching path only handles glibc __*_finite patterns, not NVIDIA's __nv_* convention. Un-inlined libdevice residuals are handled upstream by the NVVM InstCombine intrinsic canonicalizer (sub_1169C30), which recognizes __nv_* prefixes and may convert them to standard LLVM intrinsics that the constant folder can then process.
The __nvvm_reflect mechanism (used for __CUDA_ARCH queries) is resolved by a separate earlier pass (NVVMReflect) that replaces __nvvm_reflect("__CUDA_ARCH") with a constant integer based on the target SM. By the time the constant folder runs, all __nvvm_reflect calls have been eliminated.
Configuration Knobs
| Knob | Type | Default | Effect |
|---|---|---|---|
disable-fp-call-folding | cl::opt<bool> | false | Upstream LLVM hidden flag. When true, prevents constant folding of any function returning or accepting floating-point types. Checked in canConstantFoldCallTo. |
FPFoldDisable | NVIDIA CiccOption | false | NVIDIA-specific flag that disables FP constant folding at the NVVM level. |
instcombine-negator-enabled | cl::opt<bool> | true | Controls the negation propagation system in sub_1169C30 (InstCombine intrinsic folder). |
instcombine-negator-max-depth | cl::opt<int> | platform-dependent | Depth limit for the negator chain in InstCombine intrinsic folding. Prevents exponential blowup when pushing negation through deep arithmetic chains. |
The FPFoldDisable knob is significant for debugging precision issues: when a kernel produces different results with -O0 vs -O2, disabling FP folding isolates whether constant-folded values are the source of the discrepancy.
ConstantFP Result Creation
The result builder sub_14D17B0 creates the final LLVM ConstantFP IR node from the evaluated double result. It dispatches on the return type byte at *(type + 8):
| Type byte | Precision | Behavior |
|---|---|---|
| 1 | half | Not reached from math folder (filtered at entry). Infrastructure exists: converts through APFloat semantics. |
| 2 | float | Truncates double to float via C cast, then converts float to APFloat via sub_169D3B0. |
| 3 | double | Stores full double precision via sub_169D3F0 (double to APFloat). |
Both paths finish with sub_159CCF0(*type, &storage) which constructs the ConstantFP node from the APFloat storage. The float path's truncation via C cast means the folded float value matches what (float)host_result produces -- this is IEEE-754 correct because the cast performs round-to-nearest-even.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
nvvmIntrinsicConstantFold | 0x14D90D0 | 27 KB | Eligibility predicate: can this intrinsic be constant-folded? |
nvvmConstantFoldLibCall | 0x14D1BC0 | 54 KB | Math evaluator: compute constant result from constant args |
extractDoubleFromConstantFP | 0x14D1620 | -- | Extract double from ConstantFP IR node |
safeMathEvalUnary | 0x14D19F0 | -- | Exception-safe unary evaluation wrapper |
safeMathEvalBinary | 0x14D1A80 | -- | Exception-safe binary evaluation wrapper |
createConstantFPResult | 0x14D17B0 | -- | Build ConstantFP from evaluated double |
customFabs | 0x14D1280 | -- | SSE2 sign-bit clear |
customFloor | 0x14D13B0 | -- | Truncation + sign correction |
customCeil | 0x14D1410 | -- | Truncation + sign correction |
customSqrt | 0x14D1470 | -- | Thin wrapper around libc sqrt |
fptoui_fptosi_fold | 0x14D1500 | -- | FP-to-integer conversion fold |
apintMoveTransfer | 0x14D15E0 | -- | APInt move/transfer helper |
vectorMathLibMapping | 0x149E420 | 26 KB | Scalar-to-vectorized math mapping table |
platformFuncCanonicalize | 0x149FA60 | 15 KB | Platform-specific name canonicalization |
constantExprFoldSCEV | 0x14D44C0 | 20 KB | ConstantExpr fold / SCEV integration |
constantFoldAggregate | 0x14D5510 | 16 KB | ConstantFold for aggregate types |
constantFoldGEPExtract | 0x14D66F0 | 17 KB | ConstantFold for GEP and extract |
constantExprSCEVBuild | 0x14DBA90 | 22 KB | ConstantExpr + SCEV builder |
AttributeList::hasAttribute | 0x1560260 | -- | Attribute query (used 8 times in eligibility checker) |
Value::getName | 0x1649960 | -- | Name string extraction (case 0 path) |
| NVVM InstCombine intrinsic fold | 0x1169C30 | 87 KB | Algebraic simplification of NVVM intrinsics (see InstCombine) |
Cross-References
- InstCombine -- The NVVM intrinsic canonicalizer (sub_1169C30) handles algebraic simplification, negation propagation, and operand folding for NVVM intrinsics. It calls constant folding as a sub-step.
- Pipeline & Ordering -- Where constant folding sits in the optimization pipeline (runs within InstCombine and as a standalone analysis).
- Builtin Table: Math Functions -- The complete list of CUDA math builtins and their mapping to NVVM intrinsics.
- CLI Flags -- FPFoldDisable and other optimization control flags.
- LLVM Knobs -- The disable-fp-call-folding flag and related InstCombine depth limits.
KnownBits & DemandedBits for GPU
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
NVIDIA's KnownBits and DemandedBits infrastructure in cicc v13.0 diverges from upstream LLVM in three structural ways. First, the two analyses are fused into a single 127 KB function (sub_11A7600) that simultaneously computes known-zero/known-one bitmasks and simplifies instructions whose demanded bits allow constant folding or narrowing -- upstream LLVM separates computeKnownBits (in ValueTracking) from SimplifyDemandedBits (in InstCombine). Second, a dedicated GPU-specific known-bits oracle (sub_F0C4B0) provides range constraints for NVIDIA special registers (%tid, %ntid, %ctaid, %nctaid, %warpsize, %laneid) that have no CPU equivalent. Third, an early NVVM pipeline pass (nvvm-intr-range at sub_216F4B0) attaches !range metadata to every special-register read intrinsic, giving downstream analyses the same bounded-range information that CPU targets only get from profile data or programmer assertions. Together these form the primary dataflow backbone for address calculation optimization, type narrowing, and dead-bit elimination in GPU kernels.
| Merged computeKnownBits + SimplifyDemandedBits | sub_11A7600 (0x11A7600, 127 KB, 4,156 lines) |
| Secondary SimplifyDemandedBits helper | sub_11A1430 (0x11A1430, 6.3 KB, 6 opcodes) |
| Per-operand demand propagation trampoline | sub_11AE940 (0x11AE940) |
| Generic computeKnownBits (reference) | sub_9AC0E0 (fallback for unhandled opcodes) |
| Debug-only reference computeKnownBits | sub_9AC330 (cross-validation oracle) |
| computeKnownBitsFromOperator | sub_11A3F30 (0x11A3F30, 50 KB) |
| computeKnownBitsFromAssume | sub_11A6910 (0x11A6910, 12.5 KB) |
| computeKnownBitsFromRangeMetadata | sub_11A68C0 |
| Post-analysis NVIDIA fixup | sub_99B5E0 (alignment + range refinement) |
| NVIDIA intrinsic known-bits oracle | sub_F0C4B0 (special register ranges) |
| Intrinsic return range analysis | sub_10CA790 + sub_11A1390 |
| NVVMIntrRange pass | sub_216F4B0 (nvvm-intr-range) |
| SelectionDAG computeKnownBits | sub_33D4EF0 (0x33D4EF0, 114 KB, 3,286 lines) |
| Pointer alignment known-bits | sub_BD5420 (getPointerAlignmentBits) |
| Debug cross-validation flag | qword_4F90C28 (enables abort-on-mismatch) |
| Max recursion depth | 6 (checked in sub_11AE940) |
GPU-Specific Known-Bits Sources
The key difference from CPU targets: GPU code has dozens of values with statically knowable ranges that never exist on a CPU. Every CUDA thread reads its identity from special registers whose values are bounded by hardware launch parameters. NVIDIA exploits this in two places: the nvvm-intr-range pass adds !range metadata at the IR level, and the target-specific known-bits oracle sub_F0C4B0 provides bitmask information directly to computeKnownBits.
Special Register Range Table
The following ranges apply to every NVVM intrinsic that reads a PTX special register. The !range metadata attached by nvvm-intr-range (sub_216F4B0) encodes [lo, hi) as an LLVM MDNode. The known-bits column shows which bits are guaranteed zero given the maximum value.
| Register | PTX | NVVM Intrinsic ID Range | Value Range | i32 Known Zero (upper bits) |
|---|---|---|---|---|
%tid.x/y/z | %tid.x | 350--352 | [0, maxntid-1] | bits [ceil(log2(maxntid)), 31] |
%ntid.x/y/z | %ntid.x | 353--355 | [1, 1024] | bits [11, 31] (at most 1024) |
%ctaid.x/y/z | %ctaid.x | 356--358 | [0, gridDim-1] | bits [ceil(log2(gridDim)), 31] |
%nctaid.x/y/z | %nctaid.x | 359--361 | [1, 2^31-1] | bit 31 (always non-negative) |
%warpsize | %WARP_SZ | ~370 | {32} (constant) | bits [0,4] = 00000, bit 5 = 1, bits [6,31] = 0 |
%laneid | %laneid | ~371 | [0, 31] | bits [5, 31] |
%warpid | %warpid | ~372 | [0, maxWarpsPerSM-1] | SM-dependent upper bits |
%smid | %smid | ~375 | [0, numSMs-1] | architecture-dependent |
%nsmid | %nsmid | ~376 | [1, numSMs] | architecture-dependent |
%gridid | %gridid | ~378 | [0, 2^32-1] | none (full range) |
%clock | %clock | ~380 | [0, 2^32-1] | none |
%lanemask_eq/lt/le/gt/ge | %lanemask_* | ~382--386 | [0, 2^32-1] | none |
When __launch_bounds__(maxThreadsPerBlock, minBlocksPerMP) is present on a kernel, nvvm-intr-range tightens the %tid ranges to [0, maxThreadsPerBlock-1] and %ntid to [1, maxThreadsPerBlock]. Similarly, nvvm.reqntid metadata (from __launch_bounds__ with exact dimensions or reqntid pragmas) can constrain each dimension independently to an exact value.
The knob nvvm-intr-range-sm (constructor ctor_359) selects the SM variant used to determine architectural limits for registers like %warpid, %smid, and %nsmid.
Address Space Known Bits
CUDA uses separate address spaces with distinct pointer bit-widths and alignment properties. These feed directly into sub_BD5420 (getPointerAlignmentBits), which ORs known-zero low bits into the KnownBits result for any pointer-typed value:
| Address Space | PTX | Pointer Width | Known Alignment | Known Bits Effect |
|---|---|---|---|---|
| 0 (generic) | default | 64 bits | none guaranteed | pointer alignment only |
| 1 (global) | .global | 64 bits | >= 16 bytes (typical) | low 4 bits often known-zero |
| 3 (shared) | .shared | 32 bits | >= 4 bytes (minimum) | low 2 bits known-zero, bits [32,63] irrelevant |
| 4 (constant) | .const | 64 bits | >= 4 bytes | low 2 bits known-zero |
| 5 (local) | .local | 32 bits | >= 4 bytes (stack) | low 2 bits known-zero, bits [32,63] irrelevant |
The 32-bit address spaces (shared and local) are critical: any value known to be a shared-memory pointer has bits [32, 63] entirely dead. The DemandedBits analysis exploits this to eliminate zero-extensions and truncations around shared-memory address calculations, keeping everything in 32-bit arithmetic.
Launch Parameter Integration
The __launch_bounds__ attribute, __maxnreg__ pragma, and nvvm.reqntid / nvvm.maxntid metadata all flow into the known-bits infrastructure:
- nvvm-intr-range pass (sub_216F4B0): Runs early in the pipeline. Reads kernel metadata (nvvm.reqntid, nvvm.maxntid) via sub_93AE30. Attaches !range metadata to every llvm.nvvm.read.ptx.sreg.* intrinsic call. The metadata format is !{i32 lo, i32 hi} where hi is exclusive.
- computeKnownBitsFromRangeMetadata (sub_11A68C0): Called during standard computeKnownBits traversal. Reads !range metadata from any value and derives known-zero/known-one masks. For a range [0, 1024), this yields knownZero = 0xFFFFFC00 (bits 10--31 known zero).
- Intrinsic return range analysis (sub_10CA790 + sub_11A1390): A separate path used when the merged computeKnownBits + SimplifyDemandedBits processes ZExt/SExt of intrinsic calls. Computes [lo, hi] bounds for the intrinsic's return value and checks whether the extension can be eliminated because the return range fits within the demanded bits.
The Merged Analysis: Algorithm and Pseudocode
Unlike upstream LLVM where InstCombiner::SimplifyDemandedBits calls computeKnownBits as a subroutine, cicc fuses them. The entry point sub_11AE870 wraps sub_11AE3E0, which calls the core sub_11A7600. A hash table at InstCombiner + 2064 tracks visited instructions to prevent infinite recursion.
Core Algorithm
// sub_11A7600 — merged computeKnownBits + SimplifyDemandedBits
// Returns: replacement instruction pointer, or NULL if no simplification
Instruction* computeKnownBitsAndSimplify(
    AnalysisCtx *ctx,    // a1 — holds IR module, pass info
    IRNode *inst,        // a2 — instruction to analyze
    APInt *demanded,     // a3 — which output bits the consumer needs
    KnownBits *result,   // a4 — output {knownZero, knownOne}
    unsigned depth,      // a5 — recursion depth (checked in caller)
    QueryState *state    // a6 — worklist context
) {
  uint8_t opcode = inst->opcode_tag;  // single-byte opcode at offset 0
  unsigned width = demanded->getBitWidth();

  // Stack-allocate 4 APInt accumulators for operand known bits
  APInt kz0(width, 0), ko0(width, 0);  // operand 0
  APInt kz1(width, 0), ko1(width, 0);  // operand 1

  switch (opcode) {
  case '*': {  // Mul — lines 654-1037
    // Pattern: if one operand is known power-of-2 from intrinsic call,
    // replace Mul with Shl (critical for threadIdx * stride)
    if (auto *rhs = matchConstantPow2Call(inst->getOperand(1))) {
      if (inst->getOperand(0)->hasOneUse())
        return createShl(inst->getOperand(0), log2(rhs));
    }
    // Generic: narrow demanded mask by leading zeros, propagate to operands
    unsigned effectiveBits = width - demanded->countLeadingZeros();
    APInt narrowDemand = APInt::getLowBitsSet(width, effectiveBits);
    propagateDemandToOperand(ctx, inst, 0, narrowDemand, &kz0, &ko0, depth+1, state);
    propagateDemandToOperand(ctx, inst, 1, narrowDemand, &kz1, &ko1, depth+1, state);
    KnownBits::computeForMul(result, {kz0,ko0}, {kz1,ko1}, inst->hasNUW(), inst->hasNSW());
    break;
  }
  case '6': {  // ZExt — lines 1677-1919
    // Check if source is intrinsic call with known return range
    if (auto range = getIntrinsicReturnRange(inst->getOperand(0))) {
      if (range.fitsBitWidth(demanded->getActiveBits()))
        return inst->getOperand(0);  // eliminate extension
    }
    // Standard: shift demanded bits down, propagate to source, zext result
    propagateDemandToOperand(ctx, inst, 0, demanded->trunc(srcWidth), ...);
    KnownBits::zext(result, srcWidth);
    break;
  }
  case 'U': {  // NVIDIA Intrinsic — lines 3521-4085
    unsigned intrinsicID = getIntrinsicID(inst);
    switch (intrinsicID) {
    case 0x0F:  handleBFE_BFI(inst, demanded, result); break;
    case 0x42:  handlePopcount(inst, demanded, result); break;
    case 0x01:  handleAbs(inst, demanded, result); break;
    case 0xB4:  handleFSHL(inst, demanded, result); break;
    case 0xB5:  handleFSHR(inst, demanded, result); break;
    case 0x12B: handleBswap(inst, demanded, result); break;
    default:
      // Fall through to NVIDIA intrinsic known-bits oracle
      sub_F0C4B0(inst, result, depth, state);
      break;
    }
    break;
  }
  // ... 13 more opcode cases (Add, Sub, Xor, PHI, Trunc, SExt, etc.)
  default:
    sub_9AC0E0(inst, result, depth, state);  // generic fallback
    break;
  }

  // POST-ANALYSIS REFINEMENT (lines 2134-2281)
  // 1. Pointer alignment: if type is pointer, OR alignment bits into knownZero
  if (inst->getType()->isPointerTy()) {
    unsigned alignBits = getPointerAlignmentBits(inst);  // sub_BD5420
    result->knownZero |= APInt::getLowBitsSet(width, alignBits);
  }

  // 2. Debug cross-validation (when qword_4F90C28 is set)
  if (DEBUG_FLAG) {
    KnownBits reference;
    sub_9AC330(inst, &reference, depth, state);  // independent computation
    if (reference != *result) {
      print("computeKnownBits(): ", reference);
      print("SimplifyDemandedBits(): ", *result);
      abort();
    }
  }

  // 3. Demand-covers-known check: can we replace with constant?
  if (demanded->isSubsetOf(result->knownZero | result->knownOne))
    return ConstantInt::get(inst->getType(), result->knownOne);

  return nullptr;
}
Demand Propagation Per Operand
The trampoline sub_11AE940 is the per-operand demand propagation entry point. It increments depth, checks the depth limit (depth > 6 returns all-unknown), and dispatches between the big handler (sub_11A7600) and the binary-arithmetic-specific helper (sub_11A1430) based on opcode class:
// sub_11AE940 — per-operand demand propagation trampoline
Instruction* propagateDemandToOperand(
    AnalysisCtx *ctx, IRNode *parent, unsigned opIdx,
    APInt *demand, KnownBits *out, unsigned depth, QueryState *state
) {
  if (depth > 6)
    return nullptr;  // MaxAnalysisRecursionDepth reached

  IRNode *operand = parent->getOperand(opIdx);
  uint8_t opcode = operand->opcode_tag;

  // Binary arithmetic subset goes to the helper
  if (opcode == '*' || opcode == '9' || opcode == ':' ||
      opcode == ';' || opcode == ',' || opcode == '8')
    return sub_11A1430(ctx, operand, demand, out, depth, state);

  // Everything else goes to the big merged handler
  return sub_11A7600(ctx, operand, demand, out, depth, state);
}
The secondary helper sub_11A1430 handles Add/Sub/Xor/Mul/BitCast/ExtractElement with a tighter structure: it uses a four-accumulator cascade with three successive isSubsetOf checks per operation, which is more aggressive than upstream LLVM's single post-merge check.
The Four-Accumulator Cascade
For binary operators (Add, Sub, Xor), cicc maintains four APInt accumulators (two per operand) and performs a three-tier check:
// Three-tier demand satisfaction check (sub_11A1430 pattern)
// More aggressive than upstream single-check approach
KnownBits kb0, kb1;
computeKnownBits(op0, &kb0, depth+1, state);
computeKnownBits(op1, &kb1, depth+1, state);
KnownBits merged = mergeForOpcode(kb0, kb1, opcode);
sub_99B5E0(inst, &merged, depth, state);  // NVIDIA post-fixup

// Check 1: merged result covers demand?
if (demanded.isSubsetOf(merged.knownZero | merged.knownOne))
  return ConstantInt::get(merged.knownOne);

// Check 2: union of operand known-bits covers demand?
if (demanded.isSubsetOf((kb0.knownZero | kb1.knownZero) |
                        (kb0.knownOne | kb1.knownOne)))
  return ConstantInt::get(...);

// Check 3: all accumulated zero|one covers demand?
if (demanded.isSubsetOf(allAccumulatedZero | allAccumulatedOne))
  return followUseDef(...);
The post-analysis fixup sub_99B5E0 is NVIDIA-specific and does not exist in upstream LLVM. It applies additional refinements from thread index range constraints, warp-level uniformity, and shared memory alignment guarantees.
DemandedBits for GPU: Narrowing Optimizations
The DemandedBits analysis is the backward complement to KnownBits' forward analysis. When a consumer only needs the low N bits of a value, the producer can be narrowed or eliminated. On GPU, this interaction is dramatically more productive than on CPU because of three factors:
- 32-bit address spaces: Shared memory (AS 3) and local memory (AS 5) use 32-bit pointers. When address calculations are performed in i64 (as the generic address space requires), the upper 32 bits are entirely undemanded for shared/local accesses. DemandedBits proves this and enables truncation to i32.
- Bounded thread indices: threadIdx.x * stride + offset patterns produce values that fit in far fewer bits than i32. If threadIdx.x < 256 (from __launch_bounds__) and stride < 4096, the product fits in 20 bits. DemandedBits propagates this, enabling downstream shifts and masks to operate on narrower types.
- Type demotion to i16/fp16: When DemandedBits proves only the low 16 bits of an i32 computation matter, cicc can demote to 16-bit operations. The function at sub_1185740 (InstCombine's visitTrunc) inserts narrowing truncations. This is particularly valuable for texture coordinate calculations and index arithmetic in tensor core operations.
Dead Bit Elimination
The core optimization check appears approximately 15 times across the analysis functions:
// Inline version (width <= 64):
uint64_t unknown = ~(knownZero | knownOne);
if ((demanded & unknown) == 0) {
  // All demanded bits are determined -> replace with constant
  return ConstantInt::get(type, knownOne);
}

// Wide version (width > 64):
if (demanded.isSubsetOf(knownZero | knownOne)) {
  return ConstantInt::get(type, knownOne);  // sub_AD6220
}
This is the heart of the analysis: backward-propagated demand meets forward-propagated known-bits. When they cover every bit the consumer needs, the entire instruction is dead and can be replaced with a compile-time constant.
GPU Patterns Enabled by Known Bits
The following simplifications are GPU-specific and do not have CPU equivalents:
Mul to Shl for threadIdx arithmetic (lines 714--861): When both operands of a multiply originate from intrinsic calls with known power-of-2 returns (e.g., threadIdx.x * blockDim.x where blockDim is a power-of-2 from __launch_bounds__), the multiply is replaced with a left shift. The pattern matcher checks sub_BCAC40 (hasOneUse) and sub_10A0620 (createShl replacement).
Bswap + BFE fusion (lines 3959--4007): Detects a byte-swap feeding into a bit-field extract and replaces with a direct byte read at the swapped offset. Common in endianness conversion code for shared memory operations.
ZExt/SExt elimination via intrinsic return range (sub_10CA790 path): When a ZExt or SExt extends the result of an NVVM intrinsic call, and the intrinsic's annotated return range fits entirely within the demanded bits, the extension is eliminated. This fires frequently for threadIdx.x reads extended to i64 for address calculations.
BitCast-through-ZExt folding (sub_11A1430 at 0x11A2360): When a BitCast's source is a ZExt and the demanded bits fit within the original narrow type, the bitcast+zext chain collapses to the original value. Common in CUDA address calculations involving zero-extension followed by pointer reinterpretation.
SelectionDAG computeKnownBits
The DAG-level known-bits analysis at sub_33D4EF0 (114 KB, 3,286 lines) mirrors the IR-level analysis but operates on SDNode opcodes. It handles 112 opcode cases organized into 14 groups.
NVPTX Target Node Known Bits
For NVPTX-specific DAG opcodes (above ISD::BUILTIN_OP_END = 499), the function delegates to NVPTXTargetLowering::computeKnownBitsForTargetNode via vtable slot 254 at offset 2032. The key NVPTX-specific cases:
| Opcode Range | NVPTX DAG Node | Known-Bits Behavior |
|---|---|---|
| 0x152--0x161 (338--353) | TEX, SULD, surface ops | Result width known: bits above element size set to zero |
| 0x12A (298) | LoadV2, LoadParam | Extension mode from flags byte bits[2:3]: zext/sext/none |
| 0x16A, 0x16C (362, 364) | StoreParam, StoreRetval | When flags bits[2:3] == 0b11: element type width known |
| 0x175 (373) | ConstantPool | Uses ConstantRange::fromKnownBits intersection |
| 0xCA (202) | INTRINSIC_WO_CHAIN | Boolean-like: bit 0 unknown, bits [1..width] known zero |
| >= 499 | All target-specific | Delegates to vtable[254] computeKnownBitsForTargetNode |
The DAG-level analysis uses the same recursion depth cap of 6 (a6 > 5 returns all-unknown), matching LLVM's MaxAnalysisRecursionDepth.
Texture/Surface Fetch Result Width
Cases 0x152--0x161 encode the known bit-width of texture and surface fetch results. For an 8-bit texture fetch zero-extended to i32, the analysis sets bits [8, 31] as known-zero in the result. This enables downstream shift and mask elimination in texture sampling code.
KnownBits Data Structure Layout
Both the IR-level and DAG-level implementations use the same 32-byte struct:
struct KnownBits {            // 32 bytes total
  union {
    uint64_t  val;            // +0x00: inline storage (width <= 64)
    uint64_t *ptr;            // +0x00: heap pointer (width > 64)
  } knownZero;
  uint32_t knownZero_width;   // +0x08: bit-width
  uint32_t _pad0;             // +0x0C: padding
  union {
    uint64_t  val;            // +0x10: inline storage (width <= 64)
    uint64_t *ptr;            // +0x10: heap pointer (width > 64)
  } knownOne;
  uint32_t knownOne_width;    // +0x18: bit-width
  uint32_t _pad1;             // +0x1C: padding
};
// Invariant: (knownZero & knownOne) == 0 (no bit both 0 and 1)
// Threshold: width > 64 triggers heap allocation via sub_C43690
Roughly 43% of sub_11A1430's binary size consists of APInt destructor sequences (cmp [rbp+var], 0x40; jbe skip; call free) for the width > 64 cleanup paths.
Configuration
| Knob | Source | Default | Effect |
|---|---|---|---|
nvvm-intr-range-sm | ctor_359 | Current target SM | SM variant used to compute special register ranges for nvvm-intr-range pass |
scev-cgp-tid-max-value | ctor_XXX | Architecture limit | Maximum value of thread ID used in SCEV-based CodeGenPrep address calculations |
nv-remat-threshold-for-spec-reg | unk_4FD3860 | 20 | Threshold controlling when special register reads are rematerialized instead of spilled (interacts with known-bits because remat preserves range metadata) |
qword_4F90C28 | internal debug flag | 0 (disabled) | Enables cross-validation abort: runs independent reference computeKnownBits (sub_9AC330) and aborts if results disagree with merged analysis |
| Max recursion depth | hardcoded | 6 | Matches LLVM's MaxAnalysisRecursionDepth; checked in sub_11AE940 |
| APInt inline threshold | hardcoded | 64 bits | Values <= 64 bits use inline uint64 storage; wider values heap-allocate |
Diagnostic Strings
The merged analysis emits the following diagnostics (only in debug/assert builds when qword_4F90C28 is set):
| String | Location | Trigger |
|---|---|---|
"computeKnownBits(): " | sub_11A7600 line ~2204 | Cross-validation mismatch: prints the reference implementation's result |
"SimplifyDemandedBits(): " | sub_11A7600 line ~2208 | Cross-validation mismatch: prints the merged analysis result |
"Mismatched known bits for <inst> in <func>" | sub_11A7600 line ~2200 | Precedes the two values above; followed by abort() |
The nvvm-intr-range pass emits:
| String | Location |
|---|---|
"Add !range metadata to NVVM intrinsics." | sub_216F4B0 (pass registration) |
NVVM IR Node Layout
The KnownBits analysis traverses IR nodes using cicc's internal representation. Each node is 32 bytes:
struct IRNode {           // 32 bytes (0x20)
  uint8_t  opcode;        // +0x00: single-byte opcode tag (ASCII-based)
  uint8_t  flags;         // +0x01: bit 1, bit 2 = nsw/nuw flags
  uint16_t _reserved;     // +0x02
  uint32_t operand_idx;   // +0x04: 27-bit operand index + 5-bit flags
                          //        byte 7 bit 6 (0x40) = use-list vs indexed
  // ... remaining 24 bytes: use-list pointers, type info, metadata
};
// Operand resolution:
//   If byte[7] & 0x40 (use-list flag set):
//     operand = *(node - 8) -> *(ptr + 0x20)
//   If byte[7] & 0x40 == 0 (indexed):
//     idx = (node[4..7] & 0x7FFFFFF)
//     operand = node - (idx << 5)   // 27-bit index * 32 bytes
The 27-bit index allows up to 134 million nodes (4 GB theoretical IR size).
Function Map
IR-Level Known-Bits
| Function | Address | Size |
|---|---|---|
computeKnownBitsAndSimplify -- merged main analysis | sub_11A7600 | 127 KB |
SimplifyDemandedBitsHelper -- binary arithmetic subset | sub_11A1430 | 6.3 KB |
| Per-operand demand propagation trampoline (depth check) | sub_11AE940 | varies |
| SimplifyDemandedBits entry wrapper (allocates APInts) | sub_11AE870 | thin |
| SimplifyDemandedBits result caching (hash table at IC+2064) | sub_11AE3E0 | 235 lines |
computeKnownBitsFromOperator / PHI merge | sub_11A3F30 | 50 KB |
computeKnownBitsFromAssume (processes @llvm.assume) | sub_11A6910 | 12.5 KB |
computeKnownBitsFromRangeMetadata (reads !range) | sub_11A68C0 | varies |
Generic computeKnownBits (fallback, no simplification) | sub_9AC0E0 | varies |
Reference computeKnownBits (debug cross-validation only) | sub_9AC330 | varies |
| NVIDIA post-analysis fixup (alignment + range refinement) | sub_99B5E0 | varies |
| NVIDIA intrinsic known-bits oracle (special registers) | sub_F0C4B0 | varies |
isNVVMFunction check (NVIDIA-specific flag) | sub_F0C3D0 | varies |
Intrinsic return range analysis (computes [lo, hi]) | sub_10CA790 | 11.2 KB |
| Extract return range bounds from range analysis result | sub_11A1390 | varies |
getPointerAlignmentBits (alignment-derived known zeros) | sub_BD5420 | varies |
isDemandedBitsFullyKnown (demand subset-of known) | sub_10024C0 | varies |
NVVMIntrRange pass -- attaches !range metadata | sub_216F4B0 | varies |
SelectionDAG-Level Known-Bits
| Function | Address | Size |
|---|---|---|
SelectionDAG::computeKnownBits (recursive, 112 opcode cases) | sub_33D4EF0 | 114 KB |
Creates all-demanded mask, delegates to sub_33D4EF0 | sub_33DD090 | wrapper |
computeMinLeadingZeros (calls sub_33D25A0 + returns) | sub_33D4D80 | wrapper |
computeNumSignBits (parallel switch structure) | sub_33D25A0 | 49 KB |
computeOverflowForAdd / computeOverflowForSub | sub_33DCF10 | varies |
KnownBits Arithmetic Helpers
| Function | Address |
|---|---|
KnownBits::computeForMul(result, nuw, nsw, kb0, kb1) | sub_C70430 |
KnownBits::add(a, b, nsw, nuw, carry) | sub_C74E10 |
KnownBits::sub(a, b, nsw, nuw) | sub_C75B70 |
KnownBits::computeForAddSub(isSub, nsw, nuw, a, b) | sub_C76560 |
KnownBits::shl(a, shamt) | sub_C73220 |
KnownBits::lshr(a, b) | sub_C738B0 |
KnownBits::ashr(a, b) | sub_C73E40 |
KnownBits::and(a, b, commutative) | sub_C787D0 |
KnownBits::or(a, b) | sub_C78F20 |
KnownBits::xor(a, b) | sub_C790F0 |
KnownBits::mergeForPHI / smax(a, b) | sub_C79480 |
KnownBits::truncate / smulh(a, b) | sub_C7B4D0 |
KnownBits::cttz(a, shift) | sub_C7BCF0 |
KnownBits::ctpop(a) | sub_C7BD50 |
KnownBits::bswap(a) | sub_C7BDB0 |
KnownBits::abs(a, known_shift) | sub_C746C0 |
KnownBits::umin(a, b) | sub_C740A0 |
KnownBits::umax(a, b) | sub_C74180 |
KnownBits::ctlz(a, poisonAtZero) | sub_C778B0 |
APInt Utilities
| Function | Address |
|---|---|
APInt(width, 0) -- zero-init constructor (heap for width > 64) | sub_C43690 |
APInt copy constructor | sub_C43780 |
APInt::operator&= | sub_C43B90 |
APInt::operator\|= | sub_C43BD0
APInt::setBits(lo, hi) | sub_C43C90 |
APInt::flipAllBits | sub_C43D10 |
APInt::trunc(width) | sub_C44740 |
APInt::zext(width) | sub_C449B0 |
APInt::sext(width) | sub_C44830 |
APInt::countTrailingZeros | sub_C44590 |
APInt::countLeadingZeros | sub_C444A0 |
APInt::countPopulation | sub_C44630 |
APInt::isSubsetOf(other) | sub_C446F0 |
APInt::reverseBits / byteSwap | sub_C44AB0 |
ConstantInt::get(type, APInt) -- creates constant replacement | sub_AD6220 |
ConstantInt::get(type, value, isSigned) | sub_AD64C0 |
Differences from Upstream LLVM
| Aspect | Upstream LLVM | CICC v13.0 |
|---|---|---|
| Analysis architecture | Separate computeKnownBits (ValueTracking) and SimplifyDemandedBits (InstCombine) | Fused into single 127 KB function (sub_11A7600) that simultaneously computes bitmasks and simplifies instructions |
| GPU register ranges | No special register concept; all values have full-width range | Dedicated oracle (sub_F0C4B0) provides known-zero bits for %tid, %ntid, %ctaid, %warpsize, %laneid, and 10+ PTX special registers |
| Range metadata injection | No equivalent pass; range info comes from profile data or programmer annotations | nvvm-intr-range pass (sub_216F4B0) attaches !range metadata to every special-register read; tightened by __launch_bounds__ |
| Warp size | Not a concept; no constant is known | %warpsize is statically known to be exactly 32 (known-zero bits [0,4] and [6,31], bit 5 = 1) |
| Cross-validation | No cross-validation in release builds | Debug flag qword_4F90C28 enables abort-on-mismatch between computeKnownBits and SimplifyDemandedBits results |
| SelectionDAG integration | Separate DAG-level computeKnownBits (~60 KB) | Extended DAG-level version at sub_33D4EF0 (114 KB, 3,286 lines) with GPU-specific value tracking |
| Max recursion depth | 6 (configurable) | Same default 6, checked in sub_11AE940 with identical semantics |
Cross-References
- InstCombine -- The primary consumer of KnownBits analysis; sub_11AE870 is called from the binary operator visitor's Phase 0.
- SelectionDAG -- DAG-level known-bits at sub_33D4EF0 feeds into DAGCombine and instruction selection pattern matching.
- Loop Strength Reduction -- LSR interacts with shared-memory known-bits through the lsr-no-ptr-address-space3 knob that disables LSR for 32-bit shared memory pointers.
- GVN -- sub_9AC330 (reference computeKnownBits) is also called from GVN to validate value numbering decisions.
- LICM -- Loop-invariant code motion uses known-bits to prove that hoisted expressions are safe (no integer overflow when known-bits constrain the range).
CodeGenPrepare and SCEV-CGP
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
LLVM version note: Upstream CodeGenPrepare is stock LLVM 20.0.0 CodeGenPrepare.cpp with all 20+ cl::opt knobs unchanged. SCEV-CGP is a fully proprietary NVIDIA pass with no upstream equivalent; it is disabled by default (nv-disable-scev-cgp = true).
cicc v13.0 contains two distinct passes that prepare LLVM IR for the NVPTX backend's instruction selection. The first is upstream LLVM's CodeGenPreparePass, registered as "codegenprepare" in the New PM pipeline (line 216 of sub_2342890), which sinks address computations, creates PHI nodes for sunk values, and splits critical edges. The second is NVIDIA's proprietary SCEV-CGP (Scalar-Evolution-based Code Generation Preparation), a fully custom pass that uses SCEV analysis to rewrite address expressions with GPU thread ID as an induction variable.
Both passes operate at the LLVM IR level, immediately before SelectionDAG construction. They share the goal of making address expressions cheap for the backend to lower, but they work at different abstraction levels: CodeGenPrepare operates syntactically on individual memory instructions; SCEV-CGP operates semantically on entire address expression families using scalar evolution. NVIDIA disables SCEV-CGP by default (nv-disable-scev-cgp defaults to true), relying on upstream CodeGenPrepare plus the downstream Base Address Strength Reduction and Common Base Elimination passes to handle GPU address optimization.
Key Facts
| Property | Value |
|---|---|
| Pass name (upstream) | codegenprepare (New PM) |
| Pass name (NVIDIA) | SCEV-CGP (no formal New PM pass name found in binary) |
| Binary range (v12.x) | 0x1D60000--0x1D7FFFF (helpers + main transforms) |
| Binary range (v13.0) | 0x2D75700--0x2D88660 (Cluster 6 in 0x2D sweep) |
| Address sinking | sub_1D73760 / sub_2D75700 (65--72 KB), string "sunkaddr" |
| PHI sinking | sub_1D706F0 / sub_2D784F0 (64--68 KB), string "sunk_phi" |
| Block splitting | sub_1D7AA30 / sub_2D88660 (54--74 KB), strings ".unlikely", ".cond.split" |
| Main transform | sub_2D80050 (54 KB) -- orchestrates address mode lowering |
| SCEV-CGP knob ctor | ctor_263_0 at 0x4F36F0 (9.9 KB, 44 option strings) |
| CGP knob ctor | ctor_288_0 at 0x4FA950 (8.6 KB, 44 option strings) |
| Master disable | nv-disable-scev-cgp (default: true -- SCEV-CGP is disabled) |
| Upstream source | llvm/lib/CodeGen/CodeGenPrepare.cpp |
| Pipeline position | Late IR, immediately before SelectionDAG ISel |
Upstream CodeGenPrepare
Purpose
CodeGenPrepare is the last IR-level pass before instruction selection. Its job is to transform the IR into a form that the SelectionDAG builder can lower efficiently: address computations should be adjacent to their memory uses (reducing live ranges), complex addressing modes should be materialized as GEP chains that ISel can pattern-match, and unlikely branches should be split into cold blocks so that block placement can isolate them.
On NVPTX this pass is less critical than on x86 because PTX has simpler addressing modes (base + offset, no scaled index), but it still performs three important transforms.
Transform 1: Address Sinking (sunkaddr)
The address sinking logic lives in sub_1D73760 (v12.x) / sub_2D75700 (v13.0). It identifies memory instructions whose address operand is computed in a dominating block, then sinks the computation to the block containing the memory instruction. The sunk address is named "sunkaddr" in the IR, appearing as a GEP, inttoptr, or bitcast chain:
Before:
entry:
%addr = getelementptr float, ptr %base, i64 %idx
br label %loop
loop:
%val = load float, ptr %addr ; addr live across loop
After:
entry:
br label %loop
loop:
%sunkaddr0 = getelementptr float, ptr %base, i64 %idx
%val = load float, ptr %sunkaddr0 ; addr local to use
The naming convention "sunkaddr" with a numeric suffix (20+ occurrences in binary string references) is the standard LLVM naming. Each sunk address gets a unique suffix: sunkaddr0, sunkaddr1, etc.
The sinking decision is controlled by a cache called ValueToSunkAddr (a DenseMap at sub_2CE7CF0 in the v13.0 build). Before sinking a value, the pass checks whether the same address expression has already been sunk into the target block. If so, it reuses the existing sunk copy rather than creating a duplicate.
The core sinking algorithm:
for each basic block BB in function:
for each instruction I in BB:
if I is a memory instruction (load/store/atomic):
addr = I.getPointerOperand()
if addr.getParent() != BB:
// addr defined in a dominating block
addr_mode = matchAddressMode(addr) // sub_2D67BB0
if addr_mode.isFoldable():
sunk = materializeAddrMode(addr_mode, BB) // sub_2D68450
I.setPointerOperand(sunk)
mark changed
Key helpers in the v13.0 build:
| Function | Role |
|---|---|
| sub_2D749D0 | Address mode cache lookup |
| sub_2D67BB0 | Address mode legality test / matching |
| sub_2D6E640 | Address mode cache insert |
| sub_2D68450 | Address mode materialization |
| sub_2CE7CF0 | ValueToSunkAddr DenseMap handling |
Transform 2: PHI Sinking (sunk_phi)
When an address computation has multiple uses in successor blocks of a conditional branch, the pass creates a PHI node in the merge block rather than sinking independent copies into each successor. The resulting PHI is named "sunk_phi":
Before:
entry:
%addr = getelementptr float, ptr %base, i64 %idx
br i1 %cond, label %then, label %else
then:
%v1 = load float, ptr %addr
br label %merge
else:
%v2 = load float, ptr %addr
br label %merge
After (conceptual):
then:
%sunkaddr0 = getelementptr float, ptr %base, i64 %idx
%v1 = load float, ptr %sunkaddr0
br label %merge
else:
%sunkaddr1 = getelementptr float, ptr %base, i64 %idx
%v2 = load float, ptr %sunkaddr1
br label %merge
When the two sunk copies would be identical and the value is needed in the merge block for other uses, the pass instead creates:
merge:
%sunk_phi = phi ptr [ %sunkaddr0, %then ], [ %sunkaddr1, %else ]
The PHI creation calls sub_B44260 (PHI node setup), with naming via sub_BD6B50. The addr-sink-new-phis cl::opt knob (registered at ctor_288_0) controls whether the pass is allowed to create new PHIs during address sinking. The addr-sink-new-select knob similarly controls creation of new select instructions.
Transform 3: Block Splitting
sub_1D7AA30 (v12.x) / sub_2D88660 (v13.0) splits basic blocks to isolate unlikely paths. The pass creates blocks with suffixes ".unlikely" and ".cond.split", allowing MachineBlockPlacement to push cold code away from the hot path. This is driven by branch probability metadata and profile-guided section prefix hints.
On NVPTX, block splitting interacts with StructurizeCFG: the split blocks must still form reducible control flow, otherwise StructurizeCFG will have to insert additional flow blocks to restore structure. The profile-guided-section-prefix knob controls whether section prefix metadata (.hot, .unlikely, .unknown) is attached to split blocks.
Upstream CodeGenPrepare Knobs
All registered at ctor_288_0 (0x4FA950, 8.6 KB, 44 strings). These are standard LLVM cl::opt knobs, unchanged from upstream:
| Knob | Type | Effect |
|---|---|---|
| disable-cgp-branch-opts | bool | Disable CodeGenPrepare branch optimizations |
| disable-cgp-gc-opts | bool | Disable CodeGenPrepare GC optimizations |
| disable-cgp-select2branch | bool | Disable select-to-branch conversion |
| addr-sink-using-gep | bool | Use GEP instructions for address sinking (vs. inttoptr) |
| enable-andcmp-sinking | bool | Sink and/cmp instruction pairs into branches |
| disable-cgp-store-extract | bool | Disable store-extractvalue optimization |
| stress-cgp-store-extract | bool | Stress test store-extractvalue path |
| disable-cgp-ext-ld-promotion | bool | Disable extension-load promotion |
| disable-preheader-prot | bool | Disable loop preheader protection |
| profile-guided-section-prefix | bool | Attach section prefix based on profile data |
| cgp-freq-ratio-to-skip-merge | int | Block frequency ratio threshold to skip block merging |
| force-split-store | bool | Force store splitting |
| cgp-type-promotion-merge | bool | Merge type promotions |
| disable-complex-addr-modes | bool | Disable complex addressing mode optimization |
| addr-sink-new-phis | bool | Allow creating new PHIs during address sinking |
| addr-sink-new-select | bool | Allow creating new select during address sinking |
| addr-sink-combine-base-reg | bool | Combine base register in address sink |
| addr-sink-combine-gv | bool | Combine global value in address sink |
| addr-sink-combine-offs | bool | Combine offset in address sink |
| addr-sink-combine-scaled-reg | bool | Combine scaled register in address sink |
| cgp-split-large-offset-gep | bool | Split GEPs with large offsets |
GPU Relevance of Upstream Knobs
Most of these knobs are effectively no-ops on NVPTX because the target's addressing modes are simple (base + immediate offset, no scaled index register). However, a few matter:
- addr-sink-using-gep: Controls whether sunk addresses use GEP or inttoptr chains. On NVPTX, GEP chains are preferred because they preserve address space information through lowering. The inttoptr path strips address space, forcing the backend to re-derive it.
- cgp-split-large-offset-gep: Relevant for large array accesses where the constant offset exceeds the PTX immediate encoding width (±2^31 for 64-bit addressing). Splitting the GEP allows the backend to use a base register plus a small offset rather than a 64-bit constant.
- addr-sink-new-phis: On GPU, creating new PHIs can increase divergent live ranges. If the condition driving the PHI is thread-divergent, the PHI result will be divergent, potentially requiring a wider (per-lane) register allocation.
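The large-offset case can be made concrete with a sketch of the split arithmetic. This illustrates the idea behind cgp-split-large-offset-gep only; the threshold constant and struct/function names are assumptions, not recovered code:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative sketch: a constant byte offset too large for the immediate
// encoding is split into a base adjustment (materialized once into a
// register) plus a small residual offset foldable into the load/store.
constexpr int64_t IMM_LIMIT = (1LL << 31) - 1;  // signed 32-bit immediate range

struct SplitOffset {
    int64_t base_adjust;  // added to the base register once
    int64_t residual;     // small offset encodable as an immediate
};

inline SplitOffset splitLargeOffset(int64_t offset) {
    if (offset >= -IMM_LIMIT - 1 && offset <= IMM_LIMIT)
        return {0, offset};                  // already encodable as-is
    int64_t residual = offset % IMM_LIMIT;   // keep a small encodable remainder
    return {offset - residual, residual};
}
```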
NVIDIA SCEV-CGP
What Is It?
SCEV-CGP is a fully custom NVIDIA pass that uses LLVM's ScalarEvolution analysis to optimize address mode expressions at the function level, with specific awareness of GPU thread ID as an induction variable. Where upstream CodeGenPrepare operates syntactically (pattern-matching individual instructions), SCEV-CGP operates semantically: it analyzes address expressions as SCEV recurrences, factors out common base computations, and rewrites them to minimize register pressure.
The pass is registered in ctor_263_0 at 0x4F36F0 alongside Base Address Strength Reduction knobs. The 44 strings registered in this single constructor cover both SCEV-CGP and BASR, confirming they are part of the same address optimization subsystem.
Why NVIDIA Disables It By Default
The nv-disable-scev-cgp knob defaults to true (the description string reads "Disable optimize addr mode with SCEV pass", and the raw option data at ctor_609_0 marks it def=on, i.e. SCEV-CGP is disabled). This is a deliberate choice:
- Redundancy with BASR/CBE. NVIDIA has invested heavily in Base Address Strength Reduction (62 KB) and Common Base Elimination (39 KB), which handle the most profitable GPU address optimizations (sharing base computations across array accesses in loop bodies). These passes are simpler, more predictable, and better-tested than the general SCEV-CGP framework.
- Interaction with LSR. Both SCEV-CGP and Loop Strength Reduction operate on SCEV expressions. If both are active, they can fight over the same address expressions: LSR rewrites IVs for loop-carried efficiency, then SCEV-CGP undoes part of that work to optimize address modes. The result can be worse than either pass alone. By disabling SCEV-CGP, NVIDIA lets LSR (with its full GPU-aware formula solver) handle SCEV-based address optimization without interference.
- Compile-time cost. SCEV-CGP with aggressive mode (do-scev-cgp-aggresively [sic]) is expensive. The scev-cgp-inst-limit and scev-cgp-control knobs exist precisely because uncontrolled SCEV-CGP can balloon compile times on large kernels with many address expressions.
- Overflow hazards. The ignore-32-bit-overflow and ignore-signed-32-bit-overflow knobs in ctor_263_0 indicate that SCEV-CGP can produce address arithmetic that overflows 32-bit intermediates. On GPU where 32-bit addressing is common (shared memory, constant memory), this is a correctness risk that NVIDIA mitigates by keeping the pass off by default.
When SCEV-CGP Would Be Beneficial
Despite being disabled by default, the pass has 11 dedicated knobs -- NVIDIA clearly uses it selectively:
- Kernels with complex strided access patterns where thread ID participates in multi-dimensional address calculations (e.g., base + tid.x * stride_x + tid.y * stride_y + tid.z * stride_z). BASR handles the case where multiple accesses share a base, but it does not factor thread ID expressions across dimensions.
- Register-pressure-critical kernels at occupancy cliffs where SCEV-based address strength reduction can save enough registers to cross an occupancy boundary. The scev-cgp-tid-max-value knob lets the pass reason about the bounded range of thread IDs, enabling tighter value range analysis.
- Function-level address optimization (enabled by do-function-scev-cgp) where cross-loop base sharing matters more than per-loop IV optimization.
Thread ID Max Value Knob
The scev-cgp-tid-max-value knob deserves special attention. It provides SCEV analysis with the maximum possible value of a GPU thread ID, which is architecture-dependent:
- threadIdx.x: max 1024 (all architectures sm_70+)
- threadIdx.y: max 1024
- threadIdx.z: max 64
- blockIdx.x: max 2^31 - 1
By telling SCEV that threadIdx.x is bounded by 1024, the analysis can prove that threadIdx.x * element_size fits in 32 bits for element sizes up to ~2 million bytes. This enables 32-bit address arithmetic where the expression would otherwise be widened to 64 bits. The knob links to the Known Bits analysis documented in Known Bits, where the nvvm-intr-range pass provides similar bounded-range information for special registers.
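The bound reasoning can be checked with a small sketch. The helper name is illustrative (not from the binary); it only restates the arithmetic above: with threadIdx.x < 1024, tid * element_size stays in signed 32-bit range whenever the element size is at most INT32_MAX / 1023, roughly 2 MB.

```cpp
#include <cstdint>

// Sketch of the range reasoning enabled by scev-cgp-tid-max-value:
// threadIdx.x is bounded by 1024, so tid is at most 1023.
constexpr int64_t TID_MAX      = 1024;        // scev-cgp-tid-max-value for threadIdx.x
constexpr int64_t I32_MAX_VAL  = 2147483647;  // 2^31 - 1

// True if tid * element_size provably fits in i32 for every tid in [0, TID_MAX).
constexpr bool fits_in_i32(int64_t element_size) {
    return (TID_MAX - 1) * element_size <= I32_MAX_VAL;
}
```

With this bound the widening to 64-bit arithmetic can be skipped for element sizes up to ~2 million bytes, matching the figure quoted above.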
SCEV-CGP Knobs (Complete Reference)
All registered in ctor_263_0 at 0x4F36F0. These are NVVMPassOptions values, stored in the 222-slot pass option registry.
| Knob | Type | Default | Effect |
|---|---|---|---|
do-scev-cgp | bool | true [MEDIUM confidence] | Master enable for SCEV-based CodeGenPrepare transforms. Default inferred from the fact that nv-disable-scev-cgp exists as an override, implying this defaults to enabled. |
do-scev-cgp-aggresively [sic] | bool | false [MEDIUM confidence] | Enable aggressive SCEV-CGP mode with expanded search. Default inferred from naming convention (aggressive modes typically off by default). |
do-function-scev-cgp | bool | false [MEDIUM confidence] | Enable function-level (cross-loop) SCEV-CGP. Default inferred from naming convention. |
nv-disable-scev-cgp | bool | true | Master disable switch in NVPTX backend (overrides do-scev-cgp) |
scev-cgp-control | int | unknown | Limit the total number of SCEV-CGP transformations per function |
scev-cgp-cross-block-limit | int | unknown | Max number of common base expressions from a single block |
scev-cgp-idom-level-limit | int | unknown | Max dominator tree depth for hoisting base computations |
scev-cgp-inst-limit | int | unknown | Max instructions analyzed per parameter expression |
scev-cgp-old-base | bool | unknown | Use old (legacy) base computation method instead of new |
scev-cgp-tid-max-value | int | arch-dependent | Maximum value of thread ID for address range analysis |
scev-cgp-check-latency | int | unknown | Latency threshold for address computation profitability |
scev-cgp-norm | int | unknown | Normalization control for SCEV expression canonicalization |
print-after-scev-cgp | bool | false | Dump function IR after SCEV-CGP completes |
dump-scev-cgp | bool | false | Debug dump during SCEV-CGP execution |
Additional ctor_263_0 Knobs (BASR/CBE Related)
The same constructor also registers these knobs, documented in their respective pages:
| Knob | See |
|---|---|
do-base-address-strength-reduce | Base Address Strength Reduction |
do-base-address-strength-reduce-chain | Base Address Strength Reduction |
base-address-strength-reduce-iv-limit | Base Address Strength Reduction |
base-address-strength-reduce-max-iv | Base Address Strength Reduction |
topo-sort-begin | Topological sort starting point for address expression graph |
ignore-bad-base | Bypass validity checks on base pointer classification |
ignore-32-bit-overflow | Skip 32-bit overflow checks in address arithmetic |
ignore-signed-32-bit-overflow | Skip signed 32-bit overflow checks |
Interaction with LSR
CodeGenPrepare/SCEV-CGP and Loop Strength Reduction both optimize address expressions, but at different pipeline stages and granularities.
| Aspect | LSR | CodeGenPrepare | SCEV-CGP |
|---|---|---|---|
| Pipeline position | Late IR optimization (loop passes) | Pre-ISel (after all IR opts) | Pre-ISel (NVIDIA custom position) |
| Scope | Per-loop IV rewriting | Per-instruction address sinking | Per-function address expression rewriting |
| SCEV usage | Full: formula generation, stride factoring, chain construction | None (syntactic pattern matching) | Full: base decomposition, range analysis |
| Register pressure | Explicit RP tracking with occupancy ceiling | Implicit (sinking reduces live ranges) | Implicit via scev-cgp-cross-block-limit |
| Address space | Full awareness (shared memory protection, 64-bit IV gating) | No special GPU handling | Thread ID aware (scev-cgp-tid-max-value) |
| Default status | Enabled (with GPU-custom formula solver) | Enabled (standard upstream) | Disabled (nv-disable-scev-cgp = true) |
The key insight is the pipeline ordering: LSR runs first during the optimization phase, rewriting IVs across the loop. CodeGenPrepare runs later, sinking the results into individual use sites. If SCEV-CGP were also enabled, it would run between these two, potentially undoing LSR's IV choices to create "better" address modes -- which may conflict with LSR's register-pressure-informed formula selection.
NVIDIA's solution is pragmatic: keep SCEV-CGP off, let LSR handle SCEV-level optimization, let BASR/CBE handle GPU-specific base sharing, and let upstream CodeGenPrepare handle the final address sinking.
Differences from Upstream LLVM
| Area | Upstream LLVM | cicc v13.0 |
|---|---|---|
| CodeGenPrepare pass | Standard, used as-is | Retained unchanged from LLVM 20.0.0 |
| SCEV-CGP | Does not exist | NVIDIA proprietary, disabled by default |
| Address sinking | Always uses TTI::getAddrModeType | Same, but NVPTX TTI returns simple modes (base+offset only) |
| Block splitting | Hot/cold based on PGO | Same, but must preserve reducibility for StructurizeCFG |
| BASR/CBE | Do not exist | NVIDIA proprietary alternatives to SCEV-CGP for GPU |
| Knob count | ~20 cl::opt for CGP | 20 upstream CGP + 14 SCEV-CGP + 8 BASR = 42 total |
Function Map
CodeGenPrepare (v12.x Addresses)
| Function | Address | Size | Role |
|---|---|---|---|
| -- | sub_1D73760 | 65 KB | optimizeMemoryInst -- address sinking, creates "sunkaddr" |
| -- | sub_1D706F0 | 68 KB | PHI optimization, creates "sunk_phi" |
| -- | sub_1D7AA30 | 74 KB | Block splitting, creates ".unlikely", ".cond.split" |
| -- | sub_1D779D0 | 71 KB | IR transform (DAG combine-level, possibly optimizeInst) |
| -- | sub_1D765D0 | 34 KB | Select lowering ("cond.false", "cond.end") |
| -- | sub_1D7F9D0 | 31 KB | Deque-based worklist processor |
CodeGenPrepare (v13.0 Addresses)
| Function | Address | Size | Role |
|---|---|---|---|
| -- | sub_2D75700 | 72 KB | Address sinking with "sunk_phi", ValueToSunkAddr DenseMap |
| -- | sub_2D784F0 | 64 KB | Address mode lowering orchestrator, calls sub_2D75700 |
| -- | sub_2D80050 | 54 KB | Main CodeGenPrepare transform, calls TTI and address mode logic |
| -- | sub_2D82850 | 62 KB | Late lowering/expansion (type widening, custom lowering) |
| -- | sub_2D88660 | 70 KB | Block splitting with branch weights ("hot", "unlikely", "unknown") |
| -- | sub_2D749D0 | -- | Address mode cache lookup |
| -- | sub_2D67BB0 | -- | Address mode legality test |
| -- | sub_2D6E640 | -- | Address mode cache insert |
| -- | sub_2D68450 | -- | Address mode materialization |
| -- | sub_2D6DEE0 | -- | Address mode matching |
| -- | sub_2D69E90 | -- | Cleanup/init |
Helper Range (0x1D60000--0x1D6FFFF)
This 64 KB sub-range contains CodeGenPrepare helper functions. The sweep identifies it as "CodeGenPrepare helpers" but no individual functions are called out with string evidence. These likely include address mode computation utilities, operand analysis, and GEP canonicalization.
SCEV-CGP Option Registration
| Function | Address | Size | Role |
|---|---|---|---|
| -- | ctor_263_0 (0x4F36F0) | 9.9 KB | Registers 44 cl::opt strings for SCEV-CGP + BASR |
| -- | ctor_288_0 (0x4FA950) | 8.6 KB | Registers 44 cl::opt strings for upstream CodeGenPrepare |
| -- | ctor_591 (0x57C1A0) | 9.3 KB | Additional CodeGenPrepare sink/split options |
| -- | ctor_544_0 (0x56C190) | 13.1 KB | CodeGenPrepare options (v13.0 duplicate registration) |
| -- | ctor_609_0 (0x585D30) | 37.3 KB | NVPTX backend mega-block, includes nv-disable-scev-cgp |
Cross-References
- Loop Strength Reduction -- SCEV-based IV rewriting, runs before CGP
- Base Address Strength Reduction -- NVIDIA's preferred GPU address optimization
- Common Base Elimination -- inter-block complement to BASR
- SCEV Analysis -- the scalar evolution infrastructure both LSR and SCEV-CGP depend on
- Known Bits -- thread ID range analysis that scev-cgp-tid-max-value feeds into
- Code Generation Overview -- pipeline position context
- NVPTX Target & TTI -- the nv-disable-scev-cgp registration in ctor_609_0
- Optimizer Pipeline -- do-scev-cgp in the NVVMPassOptions system
ScalarEvolution Overview & Construction
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
Upstream source: llvm/lib/Analysis/ScalarEvolution.cpp, llvm/include/llvm/Analysis/ScalarEvolution.h (LLVM 20.0.0)
LLVM version note: CICC v13.0 is based on LLVM 20.0.0 ScalarEvolution.cpp. Evidence: the non-recursive worklist-based createSCEV driver (sub_DD8130) matches the LLVM 16+ refactoring that replaced the recursive createNodeForValue, and the getSmallConstantTripCount / getSmallConstantMaxTripCount API matches LLVM 17+ signatures. NVIDIA's three extension categories -- simple_mode complexity control, GPU-specific SCEV sources (thread index bounds), and CUDA loop idiom recognition (warp-stride, grid-stride) -- are layered on top of the stock LLVM 20 analysis with no modifications to the core SCEV algebra.
ScalarEvolution (SCEV) is the foundational analysis that models how values change across loop iterations. Every loop optimization in cicc -- vectorization, unrolling, strength reduction, interchange, distribution -- depends on SCEV to answer three questions: "what is the trip count?", "what is the stride?", and "what is the value range?" NVIDIA's cicc v13.0 ships an LLVM 20.0.0-based ScalarEvolution with three categories of proprietary extensions: a complexity control system (simple_mode) that prevents SCEV from spending unbounded time on GPU kernels with hundreds of induction variables, GPU-specific SCEV sources that inject thread index bounds and launch configuration constraints into the analysis, and recognition of CUDA-specific loop idioms (warp-stride and grid-stride patterns) that have no analog in CPU code. This page documents SCEV expression construction -- the core getSCEV / createSCEV / createNodeForInstruction call chain. Range computation and trip count analysis are covered in SCEV Range Analysis & Trip Counts; cache invalidation and delinearization in SCEV Invalidation & Delinearization.
Key Facts
| Property | Value |
|---|---|
| LLVM base version | 20.0.0 ScalarEvolution.cpp |
| Top-level entry | sub_DD8400 (getSCEV) |
| Core builder | sub_DD65B0 (createNodeForInstruction, 1103 lines) |
| Worklist driver | sub_DD8130 (non-recursive worklist createSCEV, 154 lines) |
| Instruction decomposer | sub_D94080 (452 lines) |
| PHI handler | sub_DD92B0 (createNodeForPHI) |
| GEP handler | sub_DD3A70 (getGEPExpr) |
| Cache lookup | sub_D98300 (lookupSCEV) |
| Cache store | sub_DB77A0 (insertSCEV) |
| NVIDIA complexity scorer | sub_DB3670 (expression size estimator) |
| SE object size | >1572 bytes (fields documented through offset +1572) |
| Calling conventions bypassing budget | CC 42, CC 43 (PTX kernel entry points) |
ScalarEvolution Object Layout
The ScalarEvolution context (SE) is a large heap-allocated object. The fields relevant to SCEV construction:
| Offset | Type | Field | Notes |
|---|---|---|---|
| +0 | Module* | LLVM module / context pointer | |
| +8 | TargetLibraryInfo* | TLI | Used for intrinsic recognition |
| +32 | DominatorTree* | Dominator tree | Required for PHI analysis |
| +40 | LoopInfo* | Loop analysis | AddRec construction needs this |
| +48 | void* | Analysis pointer | Used by complexity scorer |
| +320 | SmallDenseSet | PHI visited set | Prevents infinite recursion |
| +976 | void* | Unsigned range cache table | 40-byte entries, open addressing |
| +992 | uint32_t | Unsigned range cache capacity | Power-of-two |
| +1008 | void* | Signed range cache table | Same structure |
| +1024 | uint32_t | Signed range cache capacity | |
| +1560 | uint8_t | simple_mode flag | 0 = normal, 1 = NVIDIA complexity control |
| +1564 | uint32_t | failure_count | Simple mode: bailed instructions |
| +1568 | uint32_t | recursion_count | Normal mode: depth counter |
| +1572 | uint8_t | Complexity config bits | Tuning for the scorer |
The SE object also contains the ValueExprMap (primary SCEV cache mapping Value* to SCEV*), the backedge-taken count cache at offset +648/+656/+672, and the per-exit BTC cache at +1168/+1184. These are documented in the range/BTC page.
The getSCEV Entry Point
sub_DD8400 (getSCEV) is the single entry point for obtaining a SCEV expression for any LLVM Value*. Every consumer -- LoopVectorize, LoopUnroll, LSR, IndVarSimplify, LoopInterchange -- calls this function. The algorithm:
SCEV* getSCEV(SE *se, Value *V) {
// 1. Memo-table check
SCEV *cached = lookupSCEV(se, V); // sub_D98300
if (cached) return cached;
// 2. Dispatch based on mode
if (se->simple_mode == 0) {
// NORMAL PATH
CallingConv cc = V->getParent()->getParent()->getCallingConv();
if (cc == 42 || cc == 43) {
// PTX kernel entry: bypass budget entirely
return createSCEV(se, V);
}
se->recursion_count++;
if (se->recursion_count <= MaxRecursionDepth) {
return createSCEV(se, V);
}
return getUnknown(se, V); // budget exceeded
}
// NVIDIA SIMPLE MODE (complexity control)
if (se->failure_count > MaxExprFailures) {
SCEV *u = getUnknown(se, V);
insertSCEV(se, V, u); // cache the Unknown
return u;
}
uint64_t complexity = computeExprSize(se, V); // sub_DB3670
if (complexity > MaxExprSize) {
se->failure_count++;
SCEV *u = getUnknown(se, V);
insertSCEV(se, V, u);
return u;
}
// Expression is small enough: run normal path with mode toggled off
se->simple_mode = 0;
se->recursion_count = 0;
SCEV *result = createSCEV(se, V);
se->simple_mode = 1;
return result;
}
The PTX kernel bypass (calling conventions 42 and 43) is significant: kernel functions always receive full SCEV analysis regardless of budget. NVIDIA considers kernels important enough that truncating their analysis would lose more performance than the extra compile time costs. Device helper functions, by contrast, are subject to the budget.
NVIDIA Simple Mode (Complexity Control)
Upstream LLVM uses a single recursion counter to bound getSCEV. NVIDIA replaces this with a two-stage gating system called simple_mode (enabled by the scalar-evolution-complexity-control flag, default true). The system is stored entirely in four bytes of the SE object:
| Offset | Type | Field | Role |
|---|---|---|---|
| +1560 | uint8 | simple_mode | 0 = normal (upstream-style), 1 = NVIDIA complexity control |
| +1564 | uint32 | failure_count | Running count of instructions classified as SCEVUnknown by the size gate |
| +1568 | uint32 | recursion_count | Upstream-style depth counter, only active when simple_mode == 0 |
| +1572 | uint8 | complexity_config | Tuning bits read by the expression size scorer |
When scalar-evolution-complexity-control is true (the default), the SE constructor initializes simple_mode to 1. The gating operates in three stages:
Stage 1 -- Failure gate. Before scoring anything, getSCEV checks failure_count > scalar-evolution-max-expr-failures (global qword_4F88348, default 100). If the function has already exceeded the failure budget, the instruction is classified as SCEVUnknown, the result is cached via sub_DB77A0 (insertSCEV), and control returns immediately. This prevents a single pathological function from burning O(N^2) time trying to score thousands of instructions that will all fail.
Stage 2 -- Expression size scoring. The scorer sub_DB3670 (expressionComplexity, 35 KB in the binary, self-recursive) estimates how large the resulting SCEV expression tree would be. It walks the instruction's def-use chain bottom-up, counting nodes and weighting by expression kind:
uint64_t expressionComplexity(SE *se, Value *V) {
// sub_DB3670 -- self-recursive, calls sub_CF4090 for SCEV node size
if (V is Constant) return 1;
if (V is Argument) return 1;
if (!isSCEVable(V)) return 0; // non-integer/pointer: free
// Look up V in the SCEV cache; if already a SCEV node,
// delegate to the node-size estimator
SCEV *cached = lookupSCEV(se, V);
if (cached)
return sub_CF4090(cached); // count nodes in SCEV tree
// Not yet in cache: estimate from instruction structure
Instruction *I = dyn_cast<Instruction>(V);
if (!I) return 1;
uint64_t score = 1; // 1 for this node
uint32_t depth = 0;
Loop *L = LoopInfo->getLoopFor(I);
if (L) {
depth = L->getLoopDepth();
score += depth; // loop nesting multiplier
}
// Walk operands, accumulating recursively
for (unsigned i = 0; i < I->getNumOperands(); i++) {
score += expressionComplexity(se, I->getOperand(i));
}
// Apply configuration scaling from SE+1572
if (se->complexity_config & 0x1)
score = score * 3 / 2; // 50% penalty for aggressive mode
if (se->complexity_config & 0x2)
score += depth * 2; // extra loop nesting weight
return score;
}
The helper sub_CF4090 counts nodes in an existing SCEV expression tree: it returns 1 for SCEVConstant and SCEVUnknown, recurses into operands for SCEVAddExpr/SCEVMulExpr/SCEVAddRecExpr (summing child sizes + 1), and handles casts (Truncate/ZeroExtend/SignExtend) as 1 + child size. The node-size estimate is precise because SCEV expressions are uniqued -- the same sub-expression pointer is never double-counted within a single scoring call.
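The uniqued-pointer property can be sketched as a memoized node count. The node layout and names below are illustrative (sub_CF4090's real input is the SCEV class hierarchy); the point is that a visited set over uniqued pointers guarantees each shared sub-expression contributes once per scoring call:

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

// Illustrative stand-in for a uniqued SCEV expression node:
// leaves (Constant/Unknown) have no operands.
struct Node {
    std::vector<const Node*> ops;
};

// Counts each pointer-distinct node exactly once, mirroring the
// "never double-counted" behavior described above.
static uint64_t countNodes(const Node *n, std::unordered_set<const Node*> &seen) {
    if (!seen.insert(n).second)
        return 0;  // shared sub-expression: already counted
    uint64_t total = 1;
    for (const Node *op : n->ops)
        total += countNodes(op, seen);
    return total;
}
```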
If the total score exceeds scalar-evolution-max-expr-size (global dword_4F88428, default 384), the instruction is classified as SCEVUnknown and failure_count is incremented. The SCEVUnknown result is cached immediately so that later queries from different loop passes return instantly rather than re-running the scorer.
Stage 3 -- Mode toggle. When an instruction passes the size check (score <= 384), simple_mode is temporarily set to 0 and the recursion counter reset to 0 before calling createSCEV:
se->simple_mode = 0; // disable complexity gating
se->recursion_count = 0; // reset upstream counter for this sub-tree
SCEV *result = createSCEV(se, V);
se->simple_mode = 1; // restore
This prevents double budget-checking: the upstream recursion counter inside createSCEV starts from 0 for the sub-expression tree rather than inheriting a parent depth. Each createSCEV call thus gets a fresh budget of scalar-evolution-max-recursion-depth (default 100) for its own sub-tree.
Practical effect: GPU kernels with hundreds of address computations (common in tiled matrix multiply, convolution stencils) hit the complexity wall early for outer variables, but the important inner loop induction variables -- which have simple affine structure -- always get analyzed. The two-stage gate (score first, then depth-limit) avoids the upstream problem where a single deep operand chain exhausts the entire recursion budget for the function.
Why not just raise the upstream recursion limit? The upstream counter is a global depth counter -- raising it means every instruction in the function gets more budget, including ones that will never produce useful SCEV expressions. The NVIDIA approach is per-instruction: each instruction is independently scored, and only instructions with manageable complexity get the full treatment. This keeps total SCEV compile time bounded at O(N * max_expr_size) rather than O(N * max_recursion_depth^2).
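A minimal model of the two-stage gate, using the recovered defaults (failure budget 100, size limit 384). The struct and method names are illustrative stand-ins for the logic described above; the score argument models the result of sub_DB3670, and a false return models classification as SCEVUnknown:

```cpp
#include <cstdint>

// Sketch of the NVIDIA simple_mode gate, per the recovered defaults.
struct SimpleModeGate {
    uint32_t failure_count = 0;
    static constexpr uint32_t MaxExprFailures = 100;  // qword_4F88348 default
    static constexpr uint64_t MaxExprSize     = 384;  // dword_4F88428 default

    // Returns true if the instruction gets full createSCEV analysis,
    // false if it is classified (and cached) as SCEVUnknown.
    bool admit(uint64_t score) {
        if (failure_count > MaxExprFailures)
            return false;          // stage 1: failure gate, no scoring at all
        if (score > MaxExprSize) { // stage 2: expression size gate
            failure_count++;
            return false;
        }
        return true;               // stage 3: toggle mode off, run createSCEV
    }
};
```

Once the failure budget is burned, every later query short-circuits at stage 1, which is what keeps pathological functions from paying the scorer repeatedly.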
Worklist-Driven createSCEV
sub_DD8130 implements a non-recursive worklist to avoid deep stack frames. The iterative design matches the upstream LLVM 16+ createSCEV refactoring noted above; cicc relies on it to handle GPU kernels that can have extremely deep expression trees (deeply nested address computations involving multiple grid dimensions).
The worklist stores Value* pointers with tag bits in the low 3 bits:
| Bit | Meaning |
|---|---|
| Bit 2 (0x4) | First visit: needs full createNodeForInstruction |
| Bits 0-1 clear | Post-processing: operands have been evaluated, collect results |
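The tagging scheme relies on Value* being at least 8-byte aligned, which leaves the low 3 bits of the pointer free for state. A sketch of the encoding (helper names are illustrative, not recovered symbols):

```cpp
#include <cstdint>

// Low-bit pointer tagging as described above: bit 2 marks a first visit.
constexpr uintptr_t TAG_FIRST_VISIT = 0x4;  // needs createNodeForInstruction
constexpr uintptr_t TAG_MASK        = 0x7;  // low 3 bits reserved for tags

inline uintptr_t tag(void *v, uintptr_t t) {
    return reinterpret_cast<uintptr_t>(v) | t;
}
inline void *untag(uintptr_t entry) {
    return reinterpret_cast<void *>(entry & ~TAG_MASK);
}
inline bool isFirstVisit(uintptr_t entry) {
    return (entry & TAG_FIRST_VISIT) != 0;
}
```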
Algorithm:
- Push initial value with bit 2 set.
- Pop top entry.
- If bit 2 set: call sub_DD80F0 (createSCEV wrapper), which checks isSCEVable(V->getType()) via sub_D97040, then delegates to sub_DD65B0 (createNodeForInstruction).
- If the result is immediately available: cache it via sub_DB77A0 and continue.
- If operands are needed: push operands (without bit 2) for deferred processing.
- Repeat until worklist empty.
- Return lookupSCEV(initial_value).
The isSCEVable check (sub_D97040) accepts integer types and pointer types. Floating-point values and aggregate types produce SCEVUnknown.
Instruction Decomposer
Before the main opcode dispatch, sub_D94080 (decomposeIRInstruction) analyzes each instruction and fills a 48-byte decomposition struct:
struct SCEVDecomp { // 48 bytes
uint32_t kind; // +0 decomposition opcode
void *operandL; // +8 left operand (Value*)
void *operandR; // +16 right operand (Value*)
bool hasNUW; // +24 no-unsigned-wrap flag
bool hasNSW; // +25 no-signed-wrap flag
void *extra; // +32 third operand / loop variable
bool valid; // +40 decomposition succeeded
};
The decomposer extracts NUW/NSW flags from inst->byte[1] (bit 2 = NUW, bit 1 = NSW), and these flags are only captured for opcodes matching the bitmask 0x40540000000000 -- covering add, sub, mul, shl, and related flag-bearing arithmetic. The kind field values:
| Kind | Decimal | SCEV Construction |
|---|---|---|
| 0x0D | 13 | Add/Sub -- iterative addend collection |
| 0x0F | 15 | MulRec -- multiply-recurrence (loop-carried) |
| 0x11 | 17 | Multiply -- iterative multiplicand collection |
| 0x13 | 19 | UDiv |
| 0x16 | 22 | UMax select pattern |
| 0x19 | 25 | Shl -- converted to multiply by 2^N |
| 0x1A | 26 | Generic shift/bitop fallback |
| 0x1B | 27 | LShr -- complex truncate+extend chain |
| 0x1C | 28 | AShr -- sign-extend analysis |
| 0x1D | 29 | ICmp / comparison |
| 0x1E | 30 | And (bitwise) -- pointer truncation patterns |
The decomposer includes a GPU-specific PHI detection path (kind 64): when a PHI node's incoming value chain traces through a comparison instruction (byte == 0x55) whose operand is a function-entry value (byte == 0) that resolves to one of the recognized NVIDIA builtins (intrinsic IDs 312, 333, 339, 360, 369, 372), the decomposer creates a specialized recurrence form. This is how threadIdx.x-bounded loop variables become proper AddRec expressions.
createNodeForInstruction: The Core Builder
sub_DD65B0 (1103 lines) is the largest function in the SCEV subsystem. It operates in three phases:
Phase 1: Fast Path (lines 300-312)
Checks the instruction's type byte. Constants (byte 17) go directly to getConstant. Non-instruction values go to getUnknown. Real instructions check loop depth via LoopInfo -- if the instruction's loop nesting exceeds the maximum tracked depth, it bails to getUnknown with a simplified operand from sub_ACADE0.
Phase 2: Decomposition-Based Dispatch (lines 336-933)
After calling the instruction decomposer, dispatches on decomp.kind:
Add/Sub (kind 13): Iteratively collects addends into a SmallVector. For each operand with a non-zero extra field (the loop iteration variable), checks the SCEV cache, and if the operand has a known loop context (from sub_DD86E0 / getLoopForExpr), builds an SCEVAddRecExpr. Otherwise recursively calls getSCEV and optionally negates (for subtraction via getNegativeSCEV). Final result: getAddExpr(collected_operands).
Multiply (kind 17): Same iterative structure as Add but builds getMulExpr. For loop-carried chains, constructs getAddRecExpr(start, step, flags).
Shl (kind 25): Converts shift-left to multiplication by a power of two. When the shift amount is a constant: extracts the shift amount, verifies it fits in the type width (sub_986EE0), then builds getMulExpr(getSCEV(base), getConstant(1 << shamt), flags). Handles nested shl-of-shl by re-decomposing.
LShr (kind 27): When shifting right by a constant amount, builds a chain of getMulExpr + getTruncateExpr + getZeroExtendExpr to represent the bit extraction pattern. Falls back for non-constant shifts.
AShr (kind 28): Complex bit-extraction logic. For constant shifts, analyzes known bits to determine whether the shift extracts only zeros from the sign position. If provable, builds getSignExtendExpr(getTruncateExpr(getSCEV(base), intermediate_type), original_type). For non-constant shifts, tries SMin/SMax pattern matching.
And (kind 30): Handles pointer truncation patterns. When the mask equals (1 << ptr_bits) - 1 (a ptrtoint-then-mask pattern), builds getPtrToIntExpr + getSignExtendExpr. Otherwise bails.
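The kind-25 rewrite is the simplest to verify concretely: shifting left by a constant is multiplication by a power of two, guarded by the width check. A hedged sketch (shlAsMul is an illustrative name, not a recovered symbol; the width check stands in for sub_986EE0):

```cpp
#include <cassert>
#include <cstdint>

// x << shamt  ==  x * (1 << shamt), valid only when shamt < bit width.
// Mirrors getMulExpr(getSCEV(base), getConstant(1 << shamt)) from the text.
bool shlAsMul(uint32_t base, uint32_t shamt, uint32_t &out) {
    if (shamt >= 32) return false;         // would bail to SCEVUnknown
    out = base * (uint32_t(1) << shamt);   // multiply form of the shift
    return true;
}
```

Representing the shift as a multiply matters because SCEVMulExpr composes with AddRec folding and constant canonicalization, while a raw shift would be opaque to them.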
Phase 3: Opcode-Based Dispatch (lines 936-1101)
Handles instructions not captured by the decomposer. The normalized opcode maps raw instruction bytes to semantic categories:
Call/Intrinsic (cases 5, 56): First tries the intrinsic SCEV lookup table (sub_B494D0). For known intrinsics, dispatches on intrinsic ID:
| ID | Hex | SCEV Construction | Likely Intrinsic |
|---|---|---|---|
| 1 | 0x001 | getNotSCEV(op0) | bitwise NOT |
| 7 | 0x007 | getSCEV(op0) (identity) | llvm.assume |
| 292 | 0x124 | getSCEV(op0) (identity) | PTX intrinsic passthrough |
| 329 | 0x149 | getUMinExpr(op0, op1) | llvm.umin |
| 330 | 0x14A | getSMinExpr(op0, op1) | llvm.smin |
| 344 | 0x158 | getSCEV(op0) (identity) | passthrough |
| 359 | 0x167 | getSMinExpr + getUDivExpr + getAddExpr | complex min/div |
| 365 | 0x16D | getSMaxExpr(op0, op1) | llvm.smax |
| 366 | 0x16E | getSMinExpr(op0, op1) | llvm.smin variant |
| 371 | 0x173 | getAddRecExpr(op0, getUDivExpr(op0, op1)) | recurrence with division |
| 493 | 0x1ED | getConstant(inst->qword[1]) | constant from intrinsic metadata |
PHI Node (case 34): Dispatches to sub_DD92B0 (createNodeForPHI). Walks PHI incoming values, checks for loop recurrence. If the PHI forms a recurrence: builds {start, +, step} as an SCEVAddRecExpr. Otherwise returns SCEVUnknown.
GEP (case 47): Calls sub_DD3A70 (getGEPExpr). Computes the SCEV of the base pointer, then adds the SCEV of each index scaled by the element size. If the result is SCEVUnknown, bails.
Casts (cases 38-40): Trunc produces getTruncateExpr. SExt produces getSignExtendExpr. ZExt has a special optimization: if the source decomposes as a multiply-recurrence (kind 15), it builds separate zero-extensions of start and step, then constructs getAddRecExpr(zext(start), zext(step), NUW) -- preserving the recurrence structure across the extension.
BitCast/AddrSpaceCast (case 49): If both source and target types are SCEV-able, returns getSCEV(source) (transparent). Otherwise getUnknown.
Select (cases 20, 23): If condition and true-value are loop-invariant (sub_DBED40), builds getUDivExpr (case 20) or getUMaxExpr (case 23) of the branches.
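The ZExt special case above is sound because of the NUW flag: for a no-unsigned-wrap recurrence {start, +, step}, the value start + i*step never wraps in the narrow type, so zero-extending the whole recurrence equals recombining zero-extensions of start and step. A minimal numeric check in 8-to-16-bit widening (function names are illustrative):

```cpp
#include <cassert>
#include <cstdint>

// zext({start, +, step}) evaluated at iteration i, narrow arithmetic first.
uint16_t zextOfRec(uint8_t start, uint8_t step, unsigned i) {
    return uint16_t(uint8_t(start + i * step));
}
// {zext(start), +, zext(step)} evaluated at iteration i, wide arithmetic.
uint16_t recOfZext(uint8_t start, uint8_t step, unsigned i) {
    return uint16_t(start) + uint16_t(i) * uint16_t(step);
}
```

With start=3, step=5, i=10 the narrow value 53 stays below 256 (the NUW precondition), and both forms agree; without NUW, narrow wraparound would break the equality.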
GPU-Specific SCEV Sources
Thread and Block Index Builtins
When the instruction decomposer encounters a PHI whose incoming value chain traces to one of NVIDIA's special register intrinsics, it recognizes it as a bounded induction variable. The recognized intrinsic IDs and their SCEV significance:
| Intrinsic ID | CUDA Variable | SCEV Range Bound |
|---|---|---|
| 312 | blockDim.x / gridDim.x | Dimension query -- provides trip count upper bound |
| 333 | threadIdx.x | Range: [0, blockDim.x) |
| 339 | threadIdx.y / blockIdx.x | Range: [0, blockDim.y) or [0, gridDim.x) |
| 360 | threadIdx.z / blockIdx.y | Range: [0, blockDim.z) or [0, gridDim.y) |
| 369 | blockIdx.z | Range: [0, gridDim.z) |
| 372 | warpSize / laneid | Range: [0, 32) (constant on all architectures) |
These ranges are injected during SCEV construction, not during range analysis. When a PHI node tests a value against threadIdx.x (for example, a loop for (int i = threadIdx.x; i < N; i += blockDim.x)), the decomposer produces an SCEVAddRecExpr whose start value carries the constraint [0, blockDim.x). This propagates through all downstream SCEV consumers.
The CUDA variable to LLVM intrinsic mapping is:
| CUDA | LLVM Intrinsic | PTX Register |
|---|---|---|
| threadIdx.x | @llvm.nvvm.read.ptx.sreg.tid.x | %tid.x |
| threadIdx.y | @llvm.nvvm.read.ptx.sreg.tid.y | %tid.y |
| threadIdx.z | @llvm.nvvm.read.ptx.sreg.tid.z | %tid.z |
| blockDim.x | @llvm.nvvm.read.ptx.sreg.ntid.x | %ntid.x |
| blockIdx.x | @llvm.nvvm.read.ptx.sreg.ctaid.x | %ctaid.x |
| gridDim.x | @llvm.nvvm.read.ptx.sreg.nctaid.x | %nctaid.x |
PTX Kernel Calling Convention Bypass
Functions with calling convention 42 or 43 (PTX __global__ kernels) bypass the SCEV recursion budget entirely. The rationale: kernels are the units of work the programmer explicitly marked for GPU execution. Spending extra compile time to fully analyze their loop structure always pays off because:
- Kernels are where vectorization decisions have the highest payoff.
- GPU hardware constraints (occupancy, shared memory) demand precise trip count knowledge.
- Kernel functions are few per compilation unit, so the budget bypass does not cause compile-time explosion.
Device functions (__device__, conventions other than 42/43) remain subject to the standard budget.
Warp-Stride and Grid-Stride Loop Patterns
Two CUDA-specific loop idioms produce distinctive SCEV expressions. Neither has an analog in CPU code, and cicc's SCEV subsystem recognizes both at construction time -- not as a post-hoc pattern match.
Warp-Stride Loop
for (int i = threadIdx.x; i < N; i += warpSize) { ... }
The PHI decomposer (sub_D94080) recognizes the increment value as the constant 32 (warpSize is a compile-time constant on all NVIDIA architectures). The resulting SCEV:
{threadIdx.x, +, 32}<nuw><loop>
- Start: SCEVUnknown(@llvm.nvvm.read.ptx.sreg.tid.x), range [0, blockDim.x) (injected from the builtin table, intrinsic ID 333).
- Step: SCEVConstant(32).
- Flags: NUW (no-unsigned-wrap) is set because the start is non-negative and the step is positive. The PHI decomposer sets this flag when the incoming value (intrinsic ID 372 = warpSize) resolves to a constant and the start range has a non-negative lower bound.
- Trip count: the backedge-taken count (sub_DB9E00) computes BTC = udiv(N - threadIdx.x + 31, 32) = udiv(sext(N) - sext(start) + step - 1, step). This is the standard SCEV computeExitCountFromICmpUN path for i < N with stride 32.
The NUW flag is critical: it allows the loop vectorizer to prove that the induction variable never wraps, enabling vectorization without a runtime overflow check. Without the warp-stride recognition, the vectorizer would see SCEVUnknown(threadIdx.x) as an opaque value and conservatively assume wrapping is possible.
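The udiv expression can be sanity-checked against a direct simulation of the loop. This sketch uses illustrative helper names; note that in upstream LLVM terminology the backedge-taken count is one less than the iteration count, while the formula as quoted counts iterations:

```cpp
#include <cassert>
#include <cstdint>

// The quoted count formula: udiv(N - start + step - 1, step).
uint64_t countFormula(uint64_t N, uint64_t start, uint64_t step) {
    return start < N ? (N - start + step - 1) / step : 0;
}

// Ground truth: simulate `for (i = start; i < N; i += step)`.
uint64_t countSim(uint64_t N, uint64_t start, uint64_t step) {
    uint64_t c = 0;
    for (uint64_t i = start; i < N; i += step) ++c;
    return c;
}
```

For a warp-stride loop with N=100, start=threadIdx.x=5, step=32, the formula gives udiv(126, 32) = 3, matching the simulated iterations i = 5, 37, 69.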
Grid-Stride Loop
for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += blockDim.x * gridDim.x) { ... }
The instruction decomposer traces through the PHI's increment chain. The addition blockDim.x * gridDim.x is recognized as two calls to special register intrinsics (IDs 312 for blockDim.x and 312 again for gridDim.x) combined in a multiply. The resulting SCEV:
{blockIdx.x * blockDim.x + threadIdx.x, +, blockDim.x * gridDim.x}<loop>
Decomposition detail:
- Start: SCEVAddExpr(SCEVMulExpr(SCEVUnknown(blockIdx.x), SCEVUnknown(blockDim.x)), SCEVUnknown(threadIdx.x)).
  - blockIdx.x (ID 339): range [0, gridDim.x).
  - blockDim.x (ID 312): range [1, 1024] (hardware limit).
  - threadIdx.x (ID 333): range [0, blockDim.x).
  - The combined start range is [0, gridDim.x * blockDim.x) = [0, total_threads).
- Step: SCEVMulExpr(SCEVUnknown(blockDim.x), SCEVUnknown(gridDim.x)) -- this is the total grid size. Both operands are SCEVUnknown values with ranges from the builtin table.
- Trip count: computeBackedgeTakenCount (sub_DB9E00) produces BTC = udiv(N - start + step - 1, step), where start and step are symbolic. The trip count itself is SCEVUnknown (the exact value depends on runtime launch configuration), but the maximum trip count can be bounded using the range constraints.
Delinearization of Grid-Stride Patterns
The delinearization system (sub_DE9D10, documented in SCEV Invalidation & Delinearization) specifically recognizes the grid-stride pattern. In the ZeroExtend/SignExtend handlers (cases 3 and 4 of the delinearizer), when an AddRecExpr's step matches the delinearization context's step_recurrence field (ctx+0x68):
- The delinearizer checks whether step == blockDim.x * gridDim.x by comparing the step SCEV pointer against ctx[+0x68].
- If matched and the AddRec has exactly 2 operands (start + step), the delinearizer treats this as a dimension boundary -- the step represents the stride of the outer dimension in a multi-dimensional array access.
- The dimension size is extracted and added to the term collector at ctx[+0x58]. The element count is obtained via sub_D33D80 (getElementSize) and sub_DA4270 (getConstant).
- The delinearizer reconstructs the multi-dimensional subscript by applying getZeroExtendExpr (or getSignExtendExpr) to the start and step separately, preserving the recurrence structure across the extension.
This is how cicc recovers the original multi-dimensional array indices from grid-stride loops over flattened arrays -- essential for dependence analysis in LoopVectorize and LoopInterchange.
Block-Stride Loop (Variant)
A less common but recognized pattern:
for (int i = threadIdx.x; i < N; i += blockDim.x) { ... }
Produces: {threadIdx.x, +, blockDim.x}<loop>. The step is SCEVUnknown(blockDim.x) with range [1, 1024]. The trip count is udiv(N - threadIdx.x + blockDim.x - 1, blockDim.x) -- symbolic but bounded. This pattern is common in reduction kernels and shared-memory tiling.
Aggressive Positive Stride Analysis
The NVIDIA-specific knob aggressive-positive-stride-analysis (see nvbug 3972412) enables additional reasoning about stride signs. When enabled, the SCEV range analysis assumes that strides derived from blockDim.x, gridDim.x, and warpSize are always positive (range [1, ...) rather than [0, ...)). This allows the loop vectorizer and LSR to prove monotonic increase of induction variables, eliminating runtime overflow checks. The knob is registered in ctor_131_0 (constructor at 0x4E1CD0 area) and can be disabled via -no-aggressive-positive-stride-analysis.
The special-reassociate-for-threadid knob (description: "Don't move back expressions with threadid") prevents SCEV-based reassociation from hoisting threadIdx.x expressions out of their canonical position. Without this guard, the reassociator might combine threadIdx.x + offset into a form that obscures the warp/grid-stride pattern for downstream consumers.
SCEV Expression Types and the FoldingSet
SCEV expressions are uniqued in a FoldingSet (LLVM's hash-based deduplication container). Each expression type is identified by a uint16 opcode at scev_expr+24:
| Opcode | Type | Operands | Notes |
|---|---|---|---|
| 0 | SCEVConstant | 1 (APInt) | Leaf: integer constant |
| 1 | SCEVUnknown | 1 (Value*) | Leaf: opaque value, possibly with range info |
| 2 | SCEVTruncateExpr | 1 + type | Truncation cast |
| 3 | SCEVZeroExtendExpr | 1 + type | Zero extension |
| 4 | SCEVSignExtendExpr | 1 + type | Sign extension |
| 5 | SCEVAddExpr | N-ary | Commutative sum |
| 6 | SCEVMulExpr | N-ary | Commutative product |
| 7 | SCEVUDivExpr | 2 | Unsigned division |
| 8 | SCEVAddRecExpr | 2+ (start, step, ...) | {start, +, step}<loop> recurrence |
| 9 | SCEVSMaxExpr | N-ary | Signed maximum |
| 10 | SCEVUMaxExpr | N-ary | Unsigned maximum |
| 11 | SCEVSMinExpr | N-ary | Signed minimum |
| 12 | SCEVUMinExpr | N-ary | Unsigned minimum |
| 13 | (variant min/max) | N-ary | Additional min/max form |
| 14 | SCEVCouldNotCompute | 0 | Sentinel: analysis failed |
| 15 | SCEVSequentialUMinExpr | N-ary | Short-circuit unsigned min |
The expression node layout:
| Offset | Size | Field |
|---|---|---|
| +0 | 8 | Vtable / tag |
| +24 | 2 | Opcode (SCEV kind) |
| +28 | 2 | Flags: NUW=0x2, NSW=0x4 |
| +32 | 8 | Operand array pointer or first operand |
| +40 | varies | Operand count (for N-ary) or second operand |
Pointer comparisons suffice for SCEV equality because of the uniquing: two SCEV* values are equal if and only if they point to the same node.
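The reason pointer equality suffices is hash-consing: every expression is interned through a single table keyed on (opcode, operands), so two structurally equal expressions always resolve to the same node. A minimal sketch under stated assumptions (std::map stands in for LLVM's FoldingSet; Expr and Interner are illustrative names):

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <utility>
#include <vector>

struct Expr { int opcode; std::vector<const Expr*> ops; };

class Interner {
    // Keyed on (opcode, operand pointers): structural identity of the node.
    std::map<std::pair<int, std::vector<const Expr*>>,
             std::unique_ptr<Expr>> table;
public:
    const Expr *get(int opcode, std::vector<const Expr*> ops) {
        auto key = std::make_pair(opcode, ops);
        auto &slot = table[key];
        if (!slot)  // first request for this shape: allocate the unique node
            slot = std::unique_ptr<Expr>(new Expr{opcode, std::move(ops)});
        return slot.get();
    }
};
```

Because operands are themselves interned pointers, structural equality of whole trees reduces to one pointer comparison at the root, which is what makes the 75%-load hash caches described below viable.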
SCEV Constructor Functions
Each expression type has a dedicated constructor that canonicalizes and deduplicates:
| Address | Function | Signature |
|---|---|---|
| sub_DC8BD0 | getAddExpr | (SmallVector &operands, flags, depth) |
| sub_DC7ED0 | getAddExpr | (SCEV *a, SCEV *b, flags, depth) |
| sub_DCA690 | getMulExpr | (SCEV *a, SCEV *b, flags, depth) |
| sub_DCC810 | getAddRecExpr | (SCEV *start, SCEV *step, flags, depth) |
| sub_DCB270 | getUDivExpr | (SCEV *lhs, SCEV *rhs) |
| sub_DCFA50 | getUMaxExpr | (SCEV *a, SCEV *b) |
| sub_DCEE80 | getSMinExpr | (SCEV *a, SCEV *b) |
| sub_DCE050 | getSMaxExpr | (SCEV *a, SCEV *b) |
| sub_DCDFA0 | getUMinExpr | (SCEV *a, SCEV *b) |
| sub_DC5200 | getTruncateExpr | (SCEV *op, Type *ty, depth) |
| sub_DC5000 | getZeroExtendExpr | (SCEV *op, Type *ty, depth) |
| sub_DC2B70 | getSignExtendExpr | (SCEV *op, Type *ty, depth) |
| sub_DD1D00 | getPtrToIntExpr | (SCEV *ptr) |
| sub_DA26C0 | getConstant | (APInt val) |
| sub_DA3860 | getUnknown | (Value *V) |
| sub_DCAF50 | getNegativeSCEV | (SCEV *expr, flags) |
| sub_DCE000 | getNotSCEV | (SCEV *expr, bool isNSW) -- -1 - x |
The N-ary constructors (getAddExpr, getMulExpr, min/max) canonicalize operand order and fold constants. For example, getAddExpr({5, x, 3}) folds to getAddExpr({8, x}) and orders the constant first.
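The fold-constants-and-order-first behavior can be sketched directly. Operands are modeled as (isConst, value) pairs; the zero-dropping rule and the ordering key for non-constant operands are assumptions for illustration, not recovered details:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Op { bool isConst; long v; };

// Canonicalize an n-ary add: merge all constant addends into one,
// place it first, and sort the remaining operands by a stable key.
std::vector<Op> canonAdd(std::vector<Op> ops) {
    long k = 0;
    std::vector<Op> rest;
    for (const Op &o : ops) {
        if (o.isConst) k += o.v;          // constant folding
        else rest.push_back(o);
    }
    std::sort(rest.begin(), rest.end(),
              [](const Op &a, const Op &b) { return a.v < b.v; });
    std::vector<Op> out;
    if (k != 0 || rest.empty())           // assumed: a zero constant is dropped
        out.push_back({true, k});         // constant ordered first
    out.insert(out.end(), rest.begin(), rest.end());
    return out;
}
```

Applied to the text's example {5, x, 3}, this yields {8, x}. Canonical operand order is what lets the FoldingSet dedupe commutative sums written in different source orders.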
The SCEV Cache
The primary SCEV cache (ValueExprMap) maps Value* to SCEV* using an open-addressed hash table with the standard hash function used throughout cicc's SCEV subsystem:
slot = ((uint32_t)key >> 9) ^ ((uint32_t)key >> 4)
slot &= (capacity - 1)
Sentinels: EMPTY = 0xFFFFFFFFFFFFF000 (-4096), TOMBSTONE = 0xFFFFFFFFFFFFE000 (-8192). Capacity is always a power of two. Growth occurs at 75% load factor (doubling), and in-place rehashing (tombstone cleanup) triggers when fewer than 1/8 of slots are truly empty.
Cache lookup (sub_D98300) is called at the top of every getSCEV invocation. Cache store (sub_DB77A0) is called after every successful SCEV construction, and also when the complexity control bails to SCEVUnknown (caching the Unknown result prevents re-scoring the same instruction).
The simple mode's failure caching is critical for performance: once an instruction is classified as SCEVUnknown, the result is cached so that subsequent queries (from different loop analysis passes) return instantly rather than re-running the complexity scorer.
How SCEV Feeds Loop Optimizations
SCEV is consumed by every loop optimization in cicc. The key interfaces:
LoopVectorize (sub_DFAE00 and callers): Calls getBackedgeTakenCount (sub_DCF980) to determine whether the loop has a computable trip count. If not, vectorization is abandoned. Uses getSmallBestKnownTC (sub_2AA7EC0) for the trip count upper bound, which is compared against -vectorizer-min-trip-count. SCEV range analysis (sub_DBB9F0) proves that the epilogue trip count is sufficient for the minimum vector factor. Runtime SCEV overflow checks generate scev.check basic blocks.
LoopUnroll (sub_19B6690): The unroll factor selection function extracts MaxTripCount from SCEV. Runtime trip counts below flat-loop-tripcount-threshold (default 5) mark the loop as "flat" and skip unrolling. Partial unrolling requires BackedgeCount % UnrollCount computation. After unrolling, sub_2A13F00 reconciles SCEV and LoopInfo for the modified loop.
Loop Strength Reduction (sub_19A87A0): The NVIDIA custom LSR reads SCEV expressions for each loop use (base SCEV at +0, stride SCEV at +8, loop bounds at +712/+720). The formula solver generates alternatives by factoring common strides out of SCEV expressions. SCEV normalization (sub_199D980) provides canonical forms for hash-table keying.
IndVarSimplify (sub_1945A50): Uses SCEV to compute exit values, rewrite loop exit conditions, and perform LFTR (Linear Function Test Replace). NVIDIA adds two guards:
- Disable-unknown-trip-iv (registered in ctor_203 at 0x4E1CD0, global qword_4FAF520): when set, the pass is skipped entirely for loops whose trip count is SCEVCouldNotCompute. The check in the run() wrapper (sub_19489B0, lines 119-122) calls sub_1CED350 (trip count query) and sub_1CED620 (trip count for header). This protects GPU-specific loops with divergent control flow from incorrect IV transforms.
- iv-loop-level (default 1, global qword_4FAF440): limits IndVarSimplify to loops at nesting depth <= the configured level. sub_193DD90 (getLoopDepth) returns 1 for outermost loops. The default restricts IV simplification to outermost loops only, avoiding compile-time explosion on deeply-nested GPU kernels (stencil, tensor code).
Loop Strength Reduction guards: NVIDIA adds disable-unknown-trip-lsr to skip LSR entirely for unknown-trip-count loops, plus lsr-check-rp / lsr-rp-limit to gate LSR on register pressure.
LoopInterchange (sub_E05-loop-interchange): Uses SCEV stride analysis to determine which loops carry memory strides. If a subscript has stride in both inner and outer loops, it is marked "ambiguous" and interchange is blocked. For grid-stride loops, the step blockDim.x * gridDim.x is recognized as an outer-loop stride, allowing interchange when the array subscript depends on a single loop dimension.
Configuration: All SCEV Knobs
NVIDIA-Specific Knobs
| Knob | Default | Effect |
|---|---|---|
| scalar-evolution-complexity-control | true | Enables the simple_mode system |
| scalar-evolution-max-expr-size | 384 | Max SCEV expression complexity score before bailing to Unknown |
| scalar-evolution-max-expr-failures | 100 | Max bailed instructions before giving up on entire function |
| scalar-evolution-max-add-items | 500 | Max addends in a single SCEVAddExpr |
| do-sign-ext-expand | false | Expand sign-extensions during SCEV construction |
| do-sign-ext-simplify | (bool) | Simplify SCEV on sign-extend expressions |
| track-trip-count-more | true | More aggressive trip count tracking |
| common-factor-with-mr265 | true | SCEV common factor optimization (internal MR reference) |
| scalar-evolution-classify-expressions | true | Enable SCEV expression classification |
| aggressive-positive-stride-analysis | (bool) | Aggressive stride sign reasoning for blockDim/gridDim/warpSize (see nvbug 3972412) |
| special-reassociate-for-threadid | (bool) | Prevent hoisting threadIdx expressions out of canonical position |
| Disable-unknown-trip-iv | (bool) | Skip IndVarSimplify for loops with SCEVCouldNotCompute trip count |
| disable-unknown-trip-lsr | (bool) | Skip Loop Strength Reduction for unknown-trip-count loops |
| iv-loop-level | 1 | Max loop nesting depth for IndVarSimplify (1 = outermost only) |
| scev-cgp-tid-max-value | (int) | Max value of thread ID for SCEV-CGP address mode optimization |
Upstream LLVM Knobs (Preserved in cicc)
| Knob | Default | Effect |
|---|---|---|
| scalar-evolution-max-recursion-depth | 100 | Hard counter for getSCEV depth in normal mode |
| scalar-evolution-max-iterations | 100 | Max iterations for constant evolution |
| scalar-evolution-max-arith-depth | 32 | Max arithmetic simplification depth |
| scalar-evolution-max-cast-depth | 8 | Max cast folding depth |
| scalar-evolution-max-ext-depth | 8 | Max extension analysis depth |
| scalar-evolution-max-constant-evolving-depth | 32 | Max depth for constant evolving analysis |
| scalar-evolution-max-scev-compare-depth | 32 | Max depth for SCEV comparison |
| scalar-evolution-max-scev-operations-implication-depth | 2 | Max depth for implication reasoning |
| scalar-evolution-max-value-compare-depth | 2 | Max depth for value comparison |
| scev-mulops-inline-threshold | 32 | Max multiply operands before outline |
| scev-addops-inline-threshold | 500 | Max add operands before outline |
| verify-scev | false | Enable SCEV verification |
| verify-scev-strict | false | Stricter SCEV verification |
| verify-scev-maps | false | Verify SCEV map consistency |
SCEV Global Variables (Binary Addresses)
| Global | Knob String | Default | Used By |
|---|---|---|---|
| dword_4F88268 | scalar-evolution-max-recursion-depth | 100 | getSCEV normal mode depth counter |
| qword_4F88348 | scalar-evolution-max-expr-failures | 100 | getSCEV simple mode failure gate |
| dword_4F88428 | scalar-evolution-max-expr-size | 384 | expressionComplexity size threshold |
| qword_4F88DC8 | (loop iteration bound) | -- | Exit analysis iteration limit |
| qword_4F88EA8 | (range recursion limit) | -- | getRangeRef max recursion depth |
SCEV-CGP Knobs (Address Mode Optimization)
| Knob | Effect |
|---|---|
| do-scev-cgp | Enable SCEV-based CodeGenPrepare |
| do-scev-cgp-aggresively | Aggressive mode (sic -- typo preserved in binary) |
| do-function-scev-cgp | Function-level SCEV-CGP |
| nv-disable-scev-cgp | Disable the SCEV-CGP pass entirely |
| scev-cgp-control | Control number of transformations |
| scev-cgp-cross-block-limit | Max common bases from a single block |
| scev-cgp-idom-level-limit | Limit IDOM traversal level |
| scev-cgp-inst-limit | Max instructions considered per parameter |
| scev-cgp-old-base | Use old base computation method |
| scev-cgp-tid-max-value | Max thread ID value for address mode analysis |
| print-after-scev-cgp | Print function IR after SCEV-CGP |
Differences from Upstream LLVM
The cicc v13.0 SCEV subsystem diverges from upstream LLVM 20.0.0 ScalarEvolution.cpp in the following ways:
| Feature | Upstream LLVM | cicc v13.0 |
|---|---|---|
| Budget system | Single recursion_count depth counter | Two-stage: expression size scoring (sub_DB3670) + failure counting, toggled via simple_mode flag |
| Kernel bypass | No concept of calling convention bypass | CC 42/43 (PTX __global__) bypass all SCEV budgets |
| createSCEV | Recursive | Non-recursive worklist (sub_DD8130) to handle deep GPU expression trees |
| GPU builtin ranges | No thread/block index knowledge | Intrinsic IDs 312/333/339/360/369/372 inject ranges at SCEV construction time |
| PHI decomposition | Standard recurrence detection | GPU-specific path (kind 64) traces PHI chains through NVIDIA special register intrinsics |
| Delinearization | Standard dimension recovery | Polymorphic predicate collector recognizes grid-stride patterns; step_recurrence field enables GPU memory coalescing analysis |
| Trip count tracking | Standard | track-trip-count-more (default true) enables more aggressive BTC computation |
| Stride sign reasoning | Standard | aggressive-positive-stride-analysis assumes blockDim/gridDim/warpSize are always positive |
| Expression canonicalization | Standard | special-reassociate-for-threadid prevents moving threadIdx expressions |
| SCEV-CGP | Not present | Complete NVIDIA SCEV-based CodeGenPrepare pass with 11 dedicated knobs |
| Knob count | ~15 standard knobs | 15 upstream + 15 NVIDIA-specific + 11 SCEV-CGP = ~41 total SCEV knobs |
The most consequential divergence is the simple_mode system: it changes the compile-time complexity class of SCEV analysis from O(N * D^2) (where D is recursion depth) to O(N * S) (where S is the per-instruction size limit), making SCEV analysis tractable on large GPU kernels without sacrificing accuracy on the important inner-loop induction variables.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| getSCEV | sub_DD8400 | -- | Top-level entry; cache + mode dispatch |
| Worklist createSCEV | sub_DD8130 | -- | Non-recursive worklist driver |
| createSCEV wrapper | sub_DD80F0 | -- | Type check + delegate |
| createNodeForInstruction | sub_DD65B0 | -- | Core 3-phase opcode dispatch |
| decomposeIRInstruction | sub_D94080 | -- | Instruction to decomposition struct |
| createNodeForPHI | sub_DD92B0 | -- | PHI to AddRec conversion |
| createNodeForSelectOrPHI | sub_DD99C0 | -- | Select/PHI combined handler |
| getExistingExpr | sub_DD6410 | -- | Fast path for PHI recurrence |
| getGEPExpr | sub_DD3A70 | -- | GEP to SCEV conversion |
| getLoopForExpr | sub_DD86E0 | -- | Determine loop context for expression |
| lookupSCEV | sub_D98300 | -- | Cache lookup (ValueExprMap) |
| insertSCEV | sub_DB77A0 | -- | Cache store |
| expressionComplexity | sub_DB3670 | -- | NVIDIA expression size scorer; self-recursive, uses sub_CF4090 |
| SCEV node size counter | sub_CF4090 | -- | Counts nodes in existing SCEV tree for complexity scoring |
| getSmallConstantTripCount | sub_DB04E0 | -- | Extract small constant trip count |
| classifyExpressions / print | sub_1495EB0 | -- | Debug: "Classifying expressions for: " |
| isSCEVable | sub_D97040 | -- | Type is integer or pointer |
| isUnknown / isFailedSCEV | sub_D96A50 | -- | Check SCEVUnknown |
| getSCEVType | sub_D95540 | -- | Extract LLVM Type from SCEV expr |
| getTypeBitWidth | sub_D97050 | -- | Bit width of a type |
| lookupIntrinsicSCEV | sub_B494D0 | -- | Intrinsic fast-path table |
| isIntrinsicCall | sub_988010 | -- | Intrinsic detection |
| isLoopInvariant | sub_DBED40 | -- | Loop invariance check |
| isIntegerTy | sub_BCAC40 | -- | Integer type check |
| getRangeRef | sub_DBB9F0 | -- | ConstantRange evaluator (see range page) |
| computeBackedgeTakenCount | sub_DB9E00 | -- | BTC computation (see range page) |
| forgetLoop | sub_DE2750 | -- | Cache invalidation (see invalidation page) |
| delinearize | sub_DE9D10 | -- | Array delinearization (see invalidation page) |
Cross-References
- LoopVectorize & VPlan -- primary consumer of trip counts and SCEV ranges
- Loop Unrolling -- uses SCEV for unroll factor selection and trip count analysis
- Loop Strength Reduction (NVIDIA) -- uses SCEV expressions for formula generation
- SCEV Range Analysis & Trip Counts -- ConstantRange computation and backedge-taken count
- SCEV Invalidation & Delinearization -- cache eviction and multi-dimensional array recovery
- Builtin Table Structure -- intrinsic ID assignments for threadIdx/blockIdx/etc.
- IndVarSimplify -- SCEV-dependent IV transforms with the Disable-unknown-trip-iv guard
- SCEV-CGP (CodeGenPrepare) -- NVIDIA SCEV-based address mode optimization
- LLVM Knobs (1,689) -- full knob catalog including all SCEV knobs
- GPU Execution Model -- why GPU kernels need special SCEV treatment
SCEV Range Analysis & Backedge-Taken Counts
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
Every loop optimization in cicc ultimately depends on two questions: "what values can this expression take?" and "how many times does this loop iterate?" The SCEV range analysis (sub_DBB9F0, corresponding to ScalarEvolution::getRangeRef) answers the first by propagating ConstantRange intervals through SCEV expression trees. The backedge-taken count (BTC) machinery (sub_DB9E00 / sub_DB9040, corresponding to computeBackedgeTakenCount / computeExitCountForBranch) answers the second by solving loop exit conditions algebraically. The two systems feed each other: range analysis uses trip counts to bound AddRec expressions, and trip count computation uses ranges to prove overflow behavior. On GPU targets, these analyses gain additional precision from NVIDIA-specific range sources -- thread indices are bounded by block dimensions, warpSize is the constant 32, and __launch_bounds__ metadata constrains block dimensions -- all of which flow into tighter ranges and more computable trip counts.
Key Facts
| Property | Value |
|---|---|
| Range evaluator | sub_DBB9F0 (0xDBB9F0), 31 KB |
| BTC dispatcher | sub_DCF3A0 (0xDCF3A0), mode 0=exact, 1=constant-max, 2=symbolic-max |
| BTC cache builder | sub_DB9E00 (0xDB9E00), 2,265 bytes |
| Exit count engine | sub_DB9040 (0xDB9040), 18 KB |
| howFarToZero | sub_DBA850 (0xDBA850), 8 KB |
| howManyLessThans | sub_DCE310 (0xDCE310), 317 lines |
| Range cache (unsigned) | scev_ctx+976, 40-byte entries, open-addressing |
| Range cache (signed) | scev_ctx+1008, 40-byte entries, open-addressing |
| BTC cache | scev_ctx+656, 168-byte entries, open-addressing |
| Per-exit BTC cache | scev_ctx+1168, 56-byte entries |
| Max range recursion depth | qword_4F88EA8 (global, configurable) |
| Extended exit analysis flag | qword_4F88C08 (global, enables Phase D) |
| NVIDIA knobs | track-trip-count-more, aggressive-positive-stride-analysis, do-sign-ext-simplify, do-sign-ext-expand |
ConstantRange Propagation Algorithm
The range evaluator sub_DBB9F0 takes a SCEV expression, a signedness flag (is_signed: 0=unsigned, 1=signed), and a recursion depth counter. It returns a pointer to a cached 32-byte ConstantRange representing the half-open interval [lower, upper) with wrapping semantics. The algorithm is a recursive descent over the SCEV expression tree with aggressive caching.
Cache Structure
Two separate hash tables store signed and unsigned ranges:
if (is_signed) {
table = scev_ctx[+1008]; // signed range cache
capacity = scev_ctx[+1024];
} else {
table = scev_ctx[+976]; // unsigned range cache
capacity = scev_ctx[+992];
}
Each entry is 40 bytes: an 8-byte key (SCEV pointer, with 0xFFFFFFFFFFFFF000 as the empty sentinel) followed by a 32-byte ConstantRange value. The hash function is:
slot = ((uint32_t)scev_ptr >> 9) ^ ((uint32_t)scev_ptr >> 4);
slot &= (capacity - 1); // capacity is always a power of two
Linear probing resolves collisions. On a cache hit, the function returns immediately without recomputation.
Dispatch by SCEV Kind
After a cache miss, the evaluator dispatches on the SCEV opcode at scev_expr+24 (uint16):
| Opcode | Kind | Range Computation |
|---|---|---|
| 0 | SCEVConstant | Single-value range from the constant's APInt |
| 1 | SCEVUnknown | sub_988CD0: range from ValueTracking / instruction semantics |
| 2 | SCEVTruncate | Recurse on operand, apply ConstantRange::truncate |
| 3 | SCEVZeroExtend | Recurse on operand, apply ConstantRange::zeroExtend |
| 4 | SCEVSignExtend | Recurse on operand, apply ConstantRange::signExtend |
| 5 | SCEVAddExpr | Fold operand ranges with addWithNoWrap, respecting NUW/NSW |
| 6 | SCEVMulExpr | Fold operand ranges with ConstantRange::multiply |
| 7 | SCEVUDivExpr | ConstantRange::udiv of LHS and RHS ranges |
| 8 | SCEVAddRecExpr | Multi-phase analysis (see below) |
| 9-13 | SMax/UMax/SMin/UMin | Fold via lookup table dword_3F74E60[opcode-9] + sub_ABD750 |
| 14 | SCEVCouldNotCompute | Passthrough (identity range) |
| 15 | SCEVSequentialUMin | Complex instruction-level analysis (PHI, intrinsics, metadata) |
Every computed range is intersected with an initial range derived from the type's bit width and any known-bits / sign-bits information before being stored in the cache. This intersection can only narrow the range, never widen it.
Initial Range Narrowing
Before the SCEV-kind dispatch, the evaluator computes an initial range from type information:
- Unsigned mode: calls sub_DB5510 (getKnownBits) to extract known high zero bits, constructs a range [0, 2^(bitwidth - leading_zeros)), and intersects it with the full-set range.
- Signed mode: calls sub_DB55F0 (getNumSignBits) and constructs a symmetric signed range from the sign-bit count, e.g., if 3 sign bits are known, the range is [-2^(bw-3), 2^(bw-3)).
This pre-narrowing ensures that even when the SCEV-kind dispatch returns a full-set (e.g., for complex expressions at the depth limit), the result still reflects type-level constraints.
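The two bounds above can be sketched directly; a minimal illustration assuming 32-bit values, with helper names standing in for sub_DB5510 / sub_DB55F0:

```cpp
#include <cstdint>

// Unsigned pre-narrowing: with `lz` known leading zero bits, every value
// fits in [0, 2^(32 - lz)).
uint64_t unsigned_upper_bound(unsigned known_leading_zeros) {
    return 1ULL << (32 - known_leading_zeros);
}

// Signed pre-narrowing: with `sb` known sign bits, every value fits in
// the symmetric range [-2^(32 - sb), 2^(32 - sb)).
int64_t signed_magnitude_bound(unsigned num_sign_bits) {
    return 1LL << (32 - num_sign_bits);
}
```

For example, 24 known leading zeros narrow an i32 to [0, 256), and 3 known sign bits narrow it to [-2^29, 2^29), matching the text above.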
AddRec Range Analysis (The Core)
The SCEVAddRecExpr case (opcode 8) is the most complex, executing up to five phases that progressively narrow the range of a loop induction variable {start, +, step}:
Phase A -- NoWrap Start Refinement. If the AddRec has NUW or NSW flags (bits at scev_expr+28), the unsigned range of the start value is computed and intersected. This ensures that the IV's initial value constrains the overall range even before considering the step.
Phase B -- Step Monotonicity. If the NSW flag (bit 2, value 0x4) is set:
- sub_DBED40 checks if all step operands are non-negative (monotone up). If so, the signed minimum of start becomes the lower bound: range [smin(start), SMAX].
- sub_DBEC80 checks if all steps are non-positive (monotone down). If so, the signed maximum of start becomes the upper bound: range [SMIN, smax(start)+1].
Phase C -- Trip Count Refinement. For simple two-operand recurrences ({start, +, step} with operand count == 2):
- Call sub_DCF3A0(ctx, loop, 1) to get the max backedge-taken count.
- If the trip count is computable, compute range(start + step * [0, trip_count]) for both unsigned (sub_DBEFC0) and signed (sub_DBF480) domains.
- Intersect both results into the accumulated range.
This is where range analysis and BTC computation form their feedback loop: the BTC is used to bound the AddRec's range.
Phase D -- Exit Value Analysis (NVIDIA-gated). Enabled only when global qword_4F88C08 is set. Gets the exact backedge-taken count (mode=2 via sub_DCF3A0), and if the trip count bit width fits within the AddRec's bit width and NSW is set, calls sub_DE4FD0 to compute the exit value range. This provides the tightest possible bound but is more expensive.
Phase E -- Cache and Return. The final accumulated range (from all intersections) is stored in the cache.
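For the easiest Phase C case, the fold reduces to interval arithmetic. A sketch under simplifying assumptions (constant positive step, closed start interval, no unsigned wrap -- the real sub_DBEFC0/sub_DBF480 also handle wrapping and symbolic steps):

```cpp
#include <cstdint>
#include <utility>

// Phase C sketch: for {start, +, step} with start in [start_lo, start_hi],
// a constant step > 0, and trip_count iterations without wrap, the IV stays
// within [start_lo, start_hi + step * trip_count] over the whole loop.
std::pair<uint64_t, uint64_t> addrec_range(uint64_t start_lo, uint64_t start_hi,
                                           uint64_t step, uint64_t trip_count) {
    return {start_lo, start_hi + step * trip_count};
}
```

E.g., start in [0, 3], step 4, at most 10 backedges gives the closed interval [0, 43], which the evaluator would then intersect with the accumulated range.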
SCEVUnknown and Instruction-Level Analysis
For SCEVUnknown (opcode 1) and the complex instruction-level path (opcode 15), the range evaluator performs several specialized analyses:
- !range metadata: if the underlying instruction carries !range metadata (kind=4), sub_B91C10 extracts it and sub_ABEA30 builds a ConstantRange directly.
- Predecessor merging: sub_DBB110 computes ranges by analyzing incoming values from predecessor basic blocks, intersecting the results.
- PHI node analysis: for PHI nodes (instruction opcode 84), the evaluator iterates all incoming values, computes each one's SCEV range, and unions them. A visited-PHI set at scev_ctx+320 prevents infinite recursion through cyclic PHIs.
- Intrinsic ranges: sub_988010 identifies specific intrinsics (e.g., ctpop, ctlz, cttz) and constrains their ranges to non-negative values via sub_ABB6C0.
- Stride alignment: sub_BD4FF0 computes stride/alignment information for loads and stores, narrowing the range to multiples of the known stride.
Signed/Unsigned Cross-Pollination
A critical detail: the AddRec analysis explicitly recurses with the opposite signedness flag in certain sub-analyses. Phase A always computes the start in unsigned mode (is_signed=0), while Phase B always uses signed mode (is_signed=1). This cross-referencing allows information from one domain to constrain the other, producing tighter bounds than either domain alone.
GPU-Specific Range Sources
Three categories of NVIDIA-specific range information feed into SCEV range analysis, all derived from the CUDA execution model:
Thread and Block Index Ranges
The intrinsics @llvm.nvvm.read.ptx.sreg.tid.{x,y,z} (threadIdx) produce values in [0, blockDim-1]. The intrinsics @llvm.nvvm.read.ptx.sreg.ctaid.{x,y,z} (blockIdx) produce values in [0, gridDim-1]. When these intrinsics appear as SCEVUnknown nodes, the range evaluator propagates their constrained ranges through the expression tree.
The block dimension intrinsics @llvm.nvvm.read.ptx.sreg.ntid.{x,y,z} are bounded by __launch_bounds__ metadata when present. Specifically, nvvm.maxntid (from __launch_bounds__(maxThreads)) provides an upper bound on ntid.x * ntid.y * ntid.z, and nvvm.reqntid provides an exact value. These bounds are read by sub_CE8D40 (NvvmMeta_getMaxNTID) and sub_CE8DF0 (NvvmMeta_getReqNTID).
warpSize (@llvm.nvvm.read.ptx.sreg.warpsize) is the constant 32 on all architectures from sm_70 onward, producing the singleton range [32, 33).
Grid-Stride Loop Patterns
SCEV delinearization (sub_DE9D10) specifically recognizes the grid-stride pattern:
// CUDA: for (int i = tid + bid * bdim; i < N; i += bdim * gdim)
// SCEV: {threadIdx.x + blockIdx.x * blockDim.x, +, blockDim.x * gridDim.x}
The step blockDim.x * gridDim.x inherits known-positive range from both operands, enabling the monotonicity analysis in Phase B to prove the IV is non-decreasing. Combined with the bounded start value (tid.x + bid.x * bdim.x is non-negative), the range of the entire AddRec is [0, N) rather than full-set.
KnownBits and DemandedBits Integration
The sub_99B5E0 post-analysis in SimplifyDemandedBits applies NVIDIA-specific refinements including thread index range constraints (threadIdx.x < blockDim.x) and warp-level uniformity assumptions. These propagate through SCEV's getKnownBits (sub_DB5510) to tighten the initial unsigned range of expressions involving GPU special registers.
Backedge-Taken Count Computation
The BTC machinery computes how many times a loop's backedge executes before any exit is taken. The result has three variants:
- Exact count: the precise number of iterations, or SCEVCouldNotCompute if unknown.
- Constant max: a constant upper bound on the iteration count.
- Symbolic max: a SCEV expression bounding the iteration count (may involve loop-invariant values).
BTC Cache Layout
The primary BTC cache at scev_ctx+656 uses 168-byte entries:
| Offset | Size | Field |
|---|---|---|
| +0x00 | 8 | Key: SCEV pointer (sentinels: empty=-4096, tombstone=-8192) |
| +0x08 | 128 | Per-exit count data (SmallVector of {BasicBlock*, SCEV* count, flags}) |
| +0x88 | 8 | Exact backedge-taken count (SCEV pointer or null) |
| +0x90 | 1 | Flag: exact count is valid |
| +0x98 | 8 | Max backedge-taken count (SCEV pointer or null) |
| +0xA0 | 1 | Flag: max count is valid |
The hash function is identical to the range cache: ((key >> 9) ^ (key >> 4)) & (capacity - 1). Load factor threshold is 75% for capacity doubling (via sub_DB6980) and 87.5% (only capacity/8 truly empty slots remaining) for in-place rehash to reclaim tombstones.
A secondary per-exit table at scev_ctx+1168 stores 56-byte entries indexing individual exit block trip counts, avoiding linear scans through the main entry's embedded exit array.
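The two maintenance thresholds can be sketched as simple occupancy tests. A minimal illustration assuming the exact cutoffs stated above (grow at 75% occupancy via sub_DB6980, compact in place when fewer than capacity/8 slots remain truly empty); the boundary conditions (< vs. <=) are assumptions:

```cpp
#include <cstdint>

// Grow (double capacity) once live entries plus tombstones reach 75%.
bool should_grow(uint32_t entries, uint32_t tombstones, uint32_t capacity) {
    return (entries + tombstones) * 4 >= capacity * 3;
}

// Rehash in place (reclaiming tombstones) once fewer than capacity/8
// slots are truly empty, i.e. occupancy passes 87.5%.
bool should_rehash_in_place(uint32_t entries, uint32_t tombstones, uint32_t capacity) {
    uint32_t empty = capacity - entries - tombstones;
    return empty < capacity / 8;
}
```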
Exit Count Computation Pipeline
sub_DB9040 (computeExitCountForBranch) is the heavy lifter. For each exiting block, it:
- Extracts the branch condition's ICmp instruction.
- Identifies the comparison operands as SCEV expressions.
- Classifies the exit condition into one of the standard shapes.
- Dispatches to the appropriate solver.
The two primary solvers are:
howFarToZero (sub_DBA850, 8 KB) -- handles x != 0 exit conditions. The exit condition is normalized to V = LHS - RHS, so the loop exits when V == 0. For affine AddRec {Start, +, Step}:
// The loop exits when: Start + Step * N = 0 (mod 2^BW)
// Solving: N = -Start / Step (mod 2^BW)
// For positive step (counting up to overflow): N = -Start / Step
// For negative step (counting down to zero): N = Start / (-Step)
For quadratic AddRec {L, +, M, +, N}, it solves the quadratic equation via SolveQuadraticAddRecExact. If the expression is not affine or quadratic, it returns CouldNotCompute.
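The modular division in the affine case can be made concrete for an odd Step, which is invertible mod 2^BW. A sketch at 64 bits (not the recovered implementation, which operates on APInt and also handles even strides by factoring out powers of two):

```cpp
#include <cstdint>

// Newton-Hensel iteration for the inverse of an odd value mod 2^64:
// inv = step is already correct to 3 bits (odd^2 == 1 mod 8), and each
// round doubles the number of correct low bits (3->6->12->24->48->96).
uint64_t modular_inverse(uint64_t step /* odd */) {
    uint64_t inv = step;
    for (int i = 0; i < 5; ++i)
        inv *= 2 - step * inv;
    return inv;
}

// Solve Start + Step * N == 0 (mod 2^64): N = -Start / Step.
uint64_t how_far_to_zero(uint64_t start, uint64_t step /* odd */) {
    return (0 - start) * modular_inverse(step);
}
```

For a count-down IV {100, +, -1}, this yields N = 100, matching the "counting down to zero" comment above.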
howManyLessThans (sub_DCE310, 317 lines) -- handles x < bound (signed or unsigned) exit conditions. For affine IV = {Start, +, Step} with loop-invariant Bound:
// Unsigned: count = ceil_div(max(Bound, Start) - Start, Step)
// Signed: count = ceil_div(max_signed(Bound, Start) - Start, Step)
// With overflow checks based on NUW/NSW flags
This function also contains special logic for zero-extended IVs: if the comparison involves zext(IV) < Bound, it can infer NUW on the inner AddRec by proving that the bound is small enough that unsigned overflow cannot occur before the exit.
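The unsigned formula above reduces to a ceiling division once NUW rules out wrap. A sketch of that easy case (the real solver also handles the signed variant and the overflow guards):

```cpp
#include <cstdint>

// howManyLessThans sketch for unsigned {Start, +, Step} with NUW and a
// loop-invariant Bound: the body executes ceil((Bound - Start) / Step)
// times when Start < Bound, else zero.
uint64_t how_many_less_thans(uint64_t start, uint64_t bound, uint64_t step) {
    if (start >= bound) return 0;
    return (bound - start + step - 1) / step; // ceiling division
}
```

E.g., for (i = 0; i < 10; i += 3) executes at i = 0, 3, 6, 9: four iterations, as the formula gives.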
Loop Shape Handling
The BTC computation handles several loop shapes through the exit condition classification:
- Countable (for-style): for (i = 0; i < N; i++) produces {0, +, 1} < N, solved by howManyLessThans as N - 0 = N iterations.
- While-do: the exit test precedes the body. Trip count equals the number of backedge traversals, which is one less than the number of condition evaluations.
- Do-while: the exit test follows the body. The backedge is taken at least once if the loop is entered. Trip count comes directly from the exit condition solver.
- Multiple exits: computeBackedgeTakenCount (sub_DB9E00) iterates all exiting blocks, computes per-exit counts, and takes the minimum. If any exit is not computable, the exact count is CouldNotCompute but the max count may still be known from the computable exits.
- Exhaustive evaluation: sub_DCFD50 (computeExitCountExhaustively) brute-force iterates small constant-evolving loops (up to scalar-evolution-max-iterations = 100 iterations) to find exit counts that algebraic methods cannot handle.
Overflow Handling and NoWrap Flags
Trip count precision depends critically on the NoWrap flags (NUW = bit 1, NSW = bit 2) stored at scev_expr+28:
- NUW (No Unsigned Wrap): if an AddRec {Start, +, Step} has NUW, unsigned arithmetic cannot wrap, so Start + Step * N is monotonically increasing in the unsigned domain. This allows howManyLessThans to compute an exact count without overflow guards.
- NSW (No Signed Wrap): similarly for signed arithmetic. Enables signed comparison trip counts and the Phase B monotonicity analysis in range computation.
- Neither flag: the solver must account for wrapping. howFarToZero solves modular arithmetic; howManyLessThans may fall back to constant-max estimates or CouldNotCompute.
The NVIDIA-specific knob aggressive-positive-stride-analysis (documented as "See nvbug 3972412") enables more aggressive inference of NUW flags on AddRec expressions with positive strides, particularly for GPU loop patterns where the step is a known-positive grid dimension.
How BTC Feeds Loop Optimizations
Loop Unrolling
The unroll decision engine (sub_19BB5C0) queries getSmallBestKnownTC (sub_2AA7EC0) which calls the BTC machinery. The result determines the unroll strategy:
- Exact trip count known and small: enables full unrolling -- the loop body is replicated exactly N times with no remainder loop. This is the most profitable case for GPU code since it eliminates all loop overhead.
- Exact trip count known but large: enables partial unrolling with an exact remainder. The unroll factor is chosen to divide the trip count, avoiding a remainder loop entirely.
- Only max trip count known: enables partial unrolling with a runtime remainder check. The unroll factor is bounded by the max trip count.
- Trip count unknown: unrolling is gated by the NVIDIA knob Disable-unknown-trip-iv -- when set, IndVarSimplify (sub_19489B0) skips loops entirely if the trip count is not computable.
Loop Vectorization
The vectorizer (sub_2AE3460) uses BTC in two ways:
- Minimum trip count threshold: getSmallBestKnownTC is compared against dword_500EAE8 (-vectorizer-min-trip-count). If the known trip count is below this threshold, vectorization bails with "LowTripCount" (note the preserved typo in the binary's message: "The trip count is below the minial threshold value.").
- Divisibility for epilogue: when the exact trip count is known, the vectorizer checks if it is divisible by the vectorization factor. If so, no scalar epilogue is needed. If not, it generates an epilogue loop. The exact trip count from SCEV enables eliminating the runtime divisibility check.
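Both checks are simple predicates over the BTC result. A sketch with illustrative names (`tc` is the known exact trip count, `vf` the vectorization factor):

```cpp
#include <cstdint>

// "LowTripCount" bailout: too few iterations to amortize vector setup.
bool bails_low_trip_count(uint64_t tc, uint64_t min_trip_count) {
    return tc < min_trip_count;
}

// Epilogue decision: a remainder after dividing by the vectorization
// factor forces a scalar (or narrower-vector) epilogue loop.
bool needs_scalar_epilogue(uint64_t tc, uint64_t vf) {
    return tc % vf != 0;
}
```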
IRCE (Inductive Range Check Elimination)
IRCE (sub_194D450) uses SCEV ranges to split loops into pre-loop / main-loop / post-loop regions. The BTC determines the main loop's iteration space, and the range checks within the loop body define the boundaries for the pre/post loops. Tighter SCEV ranges mean tighter pre/post loops (fewer wasted iterations), which is significant for GPU kernels where every wasted iteration occupies a warp lane.
IndVarSimplify
IndVarSimplify (sub_1945A50) uses the exact BTC for Linear Function Test Replacement (LFTR): replacing the original loop exit test with a comparison against the trip count. This is gated by three NVIDIA knobs: disable-lftr, Disable-unknown-trip-iv, and iv-loop-level (default 1, restricting IV simplification to outermost loops only to limit compile-time on deeply nested GPU kernels).
GPU-Specific Trip Count Patterns
Grid-Stride Loops
for (int i = threadIdx.x + blockIdx.x * blockDim.x;
i < N;
i += blockDim.x * gridDim.x)
SCEV representation: {tid.x + ctaid.x * ntid.x, +, ntid.x * nctaid.x}. The start is bounded by [0, ntid.x * nctaid.x) and the step is provably positive (product of two positive values). Trip count: ceil((N - start) / step). With __launch_bounds__, the step's range can be computed precisely, enabling exact trip count computation when N is loop-invariant.
Warp-Stride Loops
for (int i = threadIdx.x % 32; i < N; i += 32)
SCEV representation: {tid.x urem 32, +, 32}. The start is [0, 31] (since warpSize=32), and the step is the constant 32. Trip count: ceil((N - (tid.x % 32)) / 32). This is always computable when N is loop-invariant.
Block-Bounded Loops
for (int i = 0; i < blockDim.x; i++)
When nvvm.reqntid metadata is present, blockDim.x has a known constant value, and the loop has a compile-time-known trip count. This enables full unrolling -- common for shared memory initialization and reduction loops.
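The trip-count formulas for these patterns can be sketched directly, assuming N is loop-invariant and the arithmetic does not wrap (the actual solver works on SCEV expressions, not concrete values):

```cpp
#include <cstdint>

// Grid-stride loop {tid + ctaid*ntid, +, ntid*nctaid}:
// iterations = ceil((N - start) / step), or 0 if the thread starts past N.
uint64_t grid_stride_trips(uint64_t tid, uint64_t ctaid, uint64_t ntid,
                           uint64_t nctaid, uint64_t n) {
    uint64_t start = tid + ctaid * ntid;
    uint64_t step = ntid * nctaid; // provably positive
    return start >= n ? 0 : (n - start + step - 1) / step;
}

// Warp-stride loop {tid % 32, +, 32}: start in [0, 31], constant step 32.
uint64_t warp_stride_trips(uint64_t tid, uint64_t n) {
    uint64_t start = tid % 32;
    return start >= n ? 0 : (n - start + 31) / 32;
}
```

For instance, a warp-stride loop with tid = 37 and N = 100 starts at i = 5 and executes at 5, 37, 69: three iterations.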
Configuration Knobs
| Knob | Default | Effect |
|---|---|---|
| scalar-evolution-max-iterations | 100 | Max iterations for exhaustive BTC evaluation |
| scalar-evolution-max-scev-compare-depth | 32 | Recursion limit for SCEV comparison |
| scalar-evolution-max-arith-depth | 32 | Recursion limit for arithmetic simplification |
| scalar-evolution-max-cast-depth | 8 | Recursion limit for ext/trunc handling |
| scalar-evolution-max-ext-depth | 8 | Recursion limit for extension expressions |
| scalar-evolution-max-constant-evolving-depth | 32 | Depth limit for constant evolution |
| scalar-evolution-max-expr-size | 384 | Expression complexity budget (NVIDIA simple mode) |
| scalar-evolution-max-expr-failures | 100 | Max failures before all expressions bail to Unknown |
| scev-addops-inline-threshold | 500 | Max add operands before bailing |
| scev-mulops-inline-threshold | 32 | Max mul operands before bailing |
| scev-cheap-expansion-budget | (default) | Cost budget for SCEVExpander materialization |
| track-trip-count-more | false | "Track loop trip count more aggressively" (NVIDIA-specific) |
| aggressive-positive-stride-analysis | true | More aggressive NUW inference for positive strides (nvbug 3972412) |
| do-sign-ext-simplify | (default) | Simplify sign-extension SCEV expressions |
| do-sign-ext-expand | (default) | Expand sign-extensions during SCEV construction |
| qword_4F88EA8 | (global) | Max recursion depth for range computation |
| qword_4F88C08 | (global) | Enable extended exit-value analysis (Phase D) |
The NVIDIA-specific knobs are particularly important. track-trip-count-more enables additional effort in BTC computation that upstream LLVM does not attempt -- the exact mechanism is not fully reversed, but the typo in its description string ("aggresively") matches the binary. aggressive-positive-stride-analysis is tied to a specific NVIDIA bug (nvbug 3972412) and enables proving NUW on AddRec expressions whose step is known positive from range analysis, creating a positive feedback loop between range computation and NoWrap inference.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| ScalarEvolution::getRangeRef() -- core range evaluator | sub_DBB9F0 | -- | -- |
| getRangeForAffineARViaRange() -- predecessor-based range | sub_DBB110 | -- | -- |
| computeUnsignedRangeFromAddRecTripCount() | sub_DBEFC0 | -- | -- |
| computeSignedRangeFromAddRecTripCount() | sub_DBF480 | -- | -- |
| computeExitValueRange() -- Phase D exit value analysis | sub_DE4FD0 | -- | -- |
| getFullRangeFallback() -- depth-exceeded fallback | sub_DDFBD0 | -- | -- |
| cacheRange() -- insert range into hash table | sub_DB0AC0 | -- | -- |
| getKnownBits() for SCEV (unsigned known bits) | sub_DB5510 | -- | -- |
| getNumSignBits() for SCEV (signed known bits) | sub_DB55F0 | -- | -- |
| isKnownNonNegative(step) | sub_DBED40 | -- | -- |
| isKnownNonPositive(step) | sub_DBEC80 | -- | -- |
| getBackedgeTakenCount(loop, mode) -- BTC dispatcher | sub_DCF3A0 | -- | -- |
| computeBackedgeTakenCount() -- per-loop BTC with caching | sub_DB9E00 | -- | -- |
| computeExitCountForBranch() -- exit condition analysis | sub_DB9040 | -- | -- |
| howFarToZero() -- "reaches zero" trip count | sub_DBA850 | -- | -- |
| howManyLessThans() -- "less than" trip count | sub_DCE310 | -- | -- |
| computeExitCountExhaustively() -- brute-force small loops | sub_DCFD50 | -- | -- |
| computeExitLimit() -- exit limit from condition | sub_DCB270 | -- | -- |
| getSmallConstantTripCount() | sub_DB04E0 | -- | -- |
| getSmallConstantMaxTripCount() | sub_DB06C0 | -- | -- |
| BTC hash table growth / rehash | sub_DB6980 | -- | -- |
| BTC hash table rehash-in-place (tombstone cleanup) | sub_DE0180 | -- | -- |
| getRangeFromUnknownSCEV() -- range for SCEVUnknown | sub_988CD0 | -- | -- |
| ConstantRange::intersectWith() | sub_AB2160 | -- | -- |
| ConstantRange::unionWith() | sub_AB3510 | -- | -- |
| ConstantRange::addWithNoWrap() | sub_ABA0E0 | -- | -- |
| ConstantRange::multiply() | sub_AB5480 | -- | -- |
| ConstantRange::udiv() | sub_AB6A50 | -- | -- |
| ConstantRange::minmax_combine() | sub_ABD750 | -- | -- |
| ConstantRange from !range metadata | sub_ABEA30 | -- | -- |
| ConstantRange from KnownBits | sub_C4B490 | -- | -- |
Differences from Upstream LLVM
| Aspect | Upstream LLVM | CICC v13.0 |
|---|---|---|
| Range sources | Profile data, __builtin_assume, !range metadata from user annotations | Additional GPU-specific sources: nvvm-intr-range pass injects !range on all special register reads; __launch_bounds__ constrains %tid/%ntid ranges; warpSize = 32 constant |
| Thread index bounds | No concept of bounded thread indices | %tid.x/y/z bounded by [0, maxntid-1], %ntid.x/y/z by [1, 1024], %laneid by [0, 31]; these tighten trip count computation for thread-indexed loops |
| Trip count precision | Depends on programmer-visible range annotations | Substantially higher precision on GPU due to statically known hardware launch bounds; most CUDA loops have computable trip counts |
| Range feedback loop | Range analysis and BTC computation feed each other | Same mutual feeding, but GPU-specific ranges make the feedback loop converge faster and more precisely |
| Warp-stride loops | No concept; stride analysis treats all strides equally | NVIDIA SCEV recognizes warp-stride patterns (stride = warpSize or stride = blockDim.x), enabling specialized BTC computation for cooperative thread loops |
| Overflow analysis | Standard NSW/NUW flag analysis | Same flags, plus GPU-specific insight: 32-bit IVs with %tid or %ctaid bases are often provably non-wrapping given launch dimension bounds |
Cross-References
- SCEV Overview & Construction -- expression creation, caching, simple mode
- Loop Unrolling -- how trip counts drive unroll factor selection
- LoopVectorize & VPlan -- min trip count threshold, epilogue generation
- Loop Strength Reduction -- IV manipulation driven by SCEV ranges
- KnownBits & DemandedBits -- GPU-specific known-bits feeding into range analysis
- LLVM Knobs -- all SCEV-related knob values
SCEV Invalidation & Delinearization
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
SCEV analysis results are expensive to compute and are cached aggressively. When the IR mutates -- a loop is unrolled, a value is replaced, a block is deleted -- cached SCEV expressions, range information, and backedge-taken counts can become stale. The invalidation subsystem (forgetLoop, forgetValue, forgetAllLoops) determines exactly which cache entries must be discarded after each transformation. Get it wrong in either direction and the compiler either produces incorrect code (stale data) or wastes time recomputing everything (over-invalidation).
Delinearization is the complementary recovery problem: given a flat pointer expression like base + i*N*M + j*M + k, recover the original multi-dimensional subscripts [i][j][k]. This is critical for GPU code because memory coalescing analysis needs to know whether adjacent threads in a warp are accessing adjacent addresses -- a question that can only be answered by examining per-dimension subscripts against the thread index structure.
In cicc v13.0, both subsystems carry NVIDIA-specific modifications. The invalidation engine has an extended exit-analysis depth threshold and an early-out for simple two-operand AddRec expressions common in GPU loops. The delinearization engine has a polymorphic predicate collector that supports GPU-aware strategies for shared memory bank conflict detection and coalescing analysis, plus at least 9 configuration knobs not present in upstream LLVM.
| Property | Value |
|---|---|
| forgetLoop address | sub_DE2750 (0xDE2750) |
| forgetLoop size | 10,051 bytes (~2,271 asm lines) |
| forgetValue address | sub_D9EE30 (0xD9EE30) |
| forgetValue size | ~9 KB |
| forgetAllLoops address | sub_D9D700 (0xD9D700) |
| forgetAllLoops size | ~8 KB |
| delinearize address | sub_DE9D10 (0xDE9D10) |
| delinearize size | 3,614 bytes (~849 asm lines) |
| collectParametricTerms address | sub_DE8D20 (0xDE8D20) |
| Hash function | ((key >> 9) ^ (key >> 4)) & (capacity - 1) |
| Empty sentinel | 0xFFFFFFFFFFFFF000 |
| Tombstone sentinel | 0xFFFFFFFFFFFFE000 |
Cache Invalidation
The Seven Caches
SCEV maintains seven distinct cache tables that must be kept consistent. Each has its own eviction path inside forgetLoop:
| # | Cache | Entry size | Key | Value | Context offset |
|---|---|---|---|---|---|
| 1 | ValueExprMap (primary) | 16 bytes | Value* | SCEV* | main SE object |
| 2 | Unsigned range cache | 40 bytes | SCEV* | ConstantRange | +976 |
| 3 | Signed range cache | 40 bytes | SCEV* | ConstantRange | +1008 |
| 4 | BTC cache | 168 bytes (0xA8) | loop SCEV* | BackedgeTakenInfo | +0x290 |
| 5 | Per-exit BTC cache | 16 bytes | exit SCEV* | exit count | +0x490 |
| 6 | AddRec folding cache | per-expression | AddRec pair | folded form | per-expression |
| 7 | Predicated BTC cache | 16 bytes | loop SCEV* | predicated count | secondary table |
All hash tables use the standard DenseMap infrastructure with LLVM-layer sentinels (-4096 / -8192). See Hash Table and Collection Infrastructure for the hash function, probing strategy, and growth/compaction thresholds.
forgetLoop: The 8-Phase Algorithm
sub_DE2750 is the largest invalidation function -- 10 KB of machine code organized into eight sequential phases. It is called after every loop transformation that might invalidate SCEV data.
Signature:
void forgetLoop(
ScalarEvolution *SE, // rdi -- the SE context
Loop *L, // rsi -- loop being forgotten
BasicBlock *Header, // rdx -- loop header block
ExitInfo *Exits, // rcx -- exit block info (nullable)
int DepthFlag, // r8d -- 0=shallow, 1=deep, >1=bounded
int ExtraFlag, // r9d -- controls AddRec early-out
SmallDenseSet *Visited // stack -- prevents cycles in nested loops
);
Phase 1 -- Block value collection (0xDE27C9). Iterates the loop's basic blocks and collects all Values that have cached SCEV entries. The block array is at loop[+0x20] -> [+0x10] (pointer) / [+0x18] (count), stored as 32-byte entries. For each Value, a dominance check (sub_B19D00) confirms it belongs to the loop, then the SCEV index is extracted from a 27-bit field at value[+4] & 0x7FFFFFF. Collected pointers are stored in a SmallVector (inline capacity 6) with bit 2 set as a tag.
Phase 1B -- Scope chain collection (0xDE28A7). Walks a scope chain obtained via sub_B6AC80(SE[0][+0x28], 0x99), where 0x99 is the SCEV scope identifier. Filters to SCEVUnknown entries (type byte 0x55) with specific flag conditions (byte [+0x21] & 0x20), verifying loop membership and dominance. This captures values not directly in the loop's blocks but semantically part of its analysis scope.
Phase 2 -- Exit block processing (0xDE29D9). Enumerates exit blocks via sub_AE6EC0 and processes their AddRec chains. For each exit, reads the chain at [exit+0x30] & ~7 (stripping tag bits), checks expression kind byte (range 0x1E--0x28), and extracts operands. For the common case of simple {start, +, step} two-operand recurrences, an early-out stops after processing 2 operands when ExtraFlag != 0. If the loop has exactly 2 exits and ExtraFlag >= qword_4F88DC8 (a global threshold for maximum exit analysis depth), deep exit analysis is skipped entirely.
Phase 3 -- Expression dependency analysis (0xDE2BC5). The core invalidation loop. Iterates the collected values in reverse order and builds a transitive closure of all dependent SCEV expressions. Uses a stack-based worklist (SmallVector, inline capacity 8) and a SmallDenseSet for visited tracking. The dependency walk dispatches on expression type:
Type 0x52 ('R' = AddRec): Follow Start and Step operands via getSCEV,
compare ranges with getRangeRef
Type 0x56 ('V' = variant): Check function pointer equality at [-0x60]
and [+0x08], follow if simple recurrence
Type 0x39/0x3A (flagged): Check bit 6 of flags byte, follow base pointer
or compute canonical form from 27-bit index
General: Follow underlying object, check for pointer
types (0x11/0x12), verify integer type
Phase 4 -- Primary cache eviction (0xDE2DFF). For each expression identified by Phase 3, looks it up in the ValueExprMap, computes both unsigned and signed ranges via getRangeRef (sub_DBB9F0), compares old and new ranges via ConstantRange::contains (sub_AB1BB0), and clears validity bits in the range cache ([entry+0x20] for unsigned, [entry+0x21] for signed). Wide APInt buffers (>64 bits) are freed through __libc_free.
Phase 5 -- BTC eviction (0xDE3D2F). For each collected value, looks it up in the BTC hash table. On hit: writes TOMBSTONE, decrements entry count, increments tombstone count, then calls forgetMemoizedResults (sub_DE2690) to recursively invalidate any expressions that depended on this backedge-taken count. Also evicts the corresponding predicated BTC entry from the secondary table.
Phase 6 -- AddRec folding cache cleanup (0xDE3230). For AddRec expressions (type 0x52), invalidates pre-computed folding results. Extracts the 6-bit opcode from [expr+2] & 0x3F and dispatches:
- Opcode 0x20 (shift/power-of-two multiply): checks via countPopulation whether the step is a power of two, then calls tryFoldAddRecWithStep (sub_DCFD50)
- Opcodes 0x22--0x29 (binary operations): constructs the appropriate folded expression per operation type and marks it for invalidation
- Opcode 0x24 with pointer type (0x0E): skips pointer-integer cast invalidation
Phase 7 -- Predicate and assumption cleanup (0xDE3856). Processes the predicate hash table via the loop object's fields. Performs range intersection (sub_AB0910), union (sub_AB0A00), and emptiness/fullness checks (sub_AAFBB0, sub_AAF760). If the resulting range is neither empty nor full, stores the updated BTC in the loop's entry.
Phase 8 -- Final output (0xDE3CCD). Writes 0x0101 to loop->flags[+0x20], marking the loop as SCEV-forgotten (bit 0 = primary cache invalidated, bit 8 = secondary cache invalidated). Frees heap-allocated collection and output buffers.
forgetValue and forgetAllLoops
forgetValue (sub_D9EE30, ~9 KB) performs single-value eviction. It removes the value's entry from the ValueExprMap, then walks all expressions that transitively depend on it and evicts those as well. Used when a single instruction is replaced (RAUW) or deleted.
forgetAllLoops (sub_D9D700, ~8 KB) iterates every loop in the function's LoopInfo and calls forgetLoop for each one. Used when the entire function's loop structure changes (e.g., after inlining or full function cloning).
Which Passes Trigger Invalidation
forgetLoop is called after these loop transformations:
| Pass | Why invalidation is needed |
|---|---|
| LoopUnroll | Trip counts change; unrolled body has different IVs |
| LoopVectorize | Widened types; vector IVs replace scalar ones |
| LoopPeeling | Peeled iterations change the start value of recurrences |
| LoopUnswitching | Exit conditions change; control flow restructured |
| LICM | Hoisted values have new SCEV forms outside the loop |
| LoopSimplify | Preheader/exit block insertion changes loop structure |
| LoopRotate | Header/latch swap requires BTC recomputation |
| LoopDistribute | Original loop split into multiple loops |
| LoopIdiomRecognize | Pattern replacement changes loop body |
| LoopIndexSplit (NVIDIA) | IV range split into subranges |
| MemorySpaceOpt (NVIDIA) | Address space changes invalidate pointer SCEVs |
The DepthFlag parameter controls the aggressiveness of invalidation: 0 does shallow invalidation (only direct loop values), 1 follows all dependency chains, and values >1 impose a bounded depth useful for performance in deeply nested loops. The Visited parameter (a SmallDenseSet*) prevents infinite cycles when nested loops have mutual SCEV dependencies.
The forget-scev-loop-unroll knob (boolean) controls whether SCEV cache is invalidated after unrolling -- disabling it is unsound but can be used for compile-time experimentation.
Delinearization
The Problem
CUDA kernels routinely access multi-dimensional arrays:
float val = A[blockIdx.x * BLOCK_H + threadIdx.y][blockIdx.y * BLOCK_W + threadIdx.x];
By the time this reaches LLVM IR, the address computation has been flattened:
%addr = getelementptr float, ptr %A, i64 %flat_idx
; where %flat_idx = (blockIdx.x * BLOCK_H + threadIdx.y) * N + (blockIdx.y * BLOCK_W + threadIdx.x)
SCEV sees this as a single polynomial. Delinearization recovers the per-dimension subscripts, which are essential for:
- Coalescing analysis: determining whether adjacent threads (threadIdx.x, threadIdx.x+1, ...) access adjacent memory addresses (coalesced) or strided addresses (uncoalesced). This requires isolating the dimension where threadIdx.x appears.
- Shared memory bank conflict detection: 32 banks, 4-byte stride. Knowing whether the innermost subscript is threadIdx.x (conflict-free) vs. threadIdx.x * stride (potential conflicts) requires dimensional decomposition.
- Dependence analysis: per-dimension dependence tests (Banerjee, GCD, MIV) are more precise than whole-expression tests. Delinearized subscripts feed DependenceInfo for vectorization legality.
Delinearization Context
The delinearizer (sub_DE9D10) operates on a context object:
| Offset | Type | Field | Purpose |
|---|---|---|---|
+0x00 | ScalarEvolution* | SE | Parent SCEV context |
+0x08 | SCEV* | ElementSize | Innermost element size |
+0x10 | uint8_t | Flags | Bit 0: inline cache mode |
+0x18 | 64 bytes | InlineCache | 4-slot direct-mapped table (inline mode) |
+0x20 | uint32_t | Capacity | Heap table capacity (heap mode) |
+0x58 | SCEV* | TargetArrayPtr | Array being delinearized |
+0x60 | void* | PredicateCollector | Nullable; collects validity predicates |
+0x68 | SCEV* | StepRecurrence | AddRec step for innermost dimension |
The inline cache (4 slots of 16 bytes at +0x18) is a small-buffer optimization sized for the overwhelmingly common GPU case of 1D or 2D array accesses. Cache entries use the same (key >> 9) ^ (key >> 4) hash as all other SCEV tables.
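A sketch of how the recovered hash could drive a 4-slot direct-mapped table (the `InlineCache` layout here is a simplified model, not the exact 16-byte slot format; keys are SCEV pointers, so 0 can mark an empty slot):

```cpp
#include <cassert>
#include <cstdint>

// The recurring SCEV table hash recovered from the binary: the key
// (a pointer value) XOR-mixed with two shifted copies of itself.
inline uint64_t scevHash(uint64_t key) {
    return (key >> 9) ^ (key >> 4);
}

// Simplified 4-slot direct-mapped inline cache: the low hash bits pick
// one of four slots, and a colliding insert simply overwrites.
struct InlineCache {
    uint64_t keys[4] = {0, 0, 0, 0};   // 0 = empty (keys are non-null pointers)
    uint64_t vals[4] = {0, 0, 0, 0};
    void put(uint64_t k, uint64_t v) {
        unsigned slot = scevHash(k) & 3;   // 4 slots -> low 2 hash bits
        keys[slot] = k;
        vals[slot] = v;
    }
    bool get(uint64_t k, uint64_t &v) const {
        unsigned slot = scevHash(k) & 3;
        if (keys[slot] != k)
            return false;
        v = vals[slot];
        return true;
    }
};
```

A direct-mapped table with no probing is the right shape for a small-buffer optimization: lookups are two loads and a compare, and the 1D/2D-access common case rarely collides.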
The Recursive Delinearization Algorithm
sub_DE9D10 is a recursive function dispatching on 17 SCEV expression kinds via a jump table:
| Kind | Expression type | Handling |
|---|---|---|
| 0, 1, 16 | Constant, TruncateExpr (ident), Unknown | Leaf -- return unchanged |
| 2 | TruncateExpr | Recurse into inner, rebuild with getTruncateExpr |
| 3 | SignExtendExpr | Recurse; dimension discovery on AddRec step match |
| 4 | ZeroExtendExpr | Recurse; dimension discovery on AddRec step match |
| 5 | AddExpr | N-ary: delinearize each operand, rebuild with getAddExpr |
| 6 | MulExpr | N-ary: delinearize each factor, rebuild with getMulExpr |
| 7 | UDivExpr | Delinearize both operands, rebuild with getUDivExpr |
| 8 | AddRecExpr | N-ary with wrap flag preservation; critical path |
| 9--13 | SMax/UMax/SMin/UMin/SeqUMin | N-ary: delinearize operands, rebuild |
| 14 | PtrToIntExpr | Recurse into pointer, rebuild |
| 15 | GEP | Primary dimension discovery entry point |
The N-ary pattern. Cases 5, 6, 8--13 share a common template:
SmallVector<const SCEV*, 2> NewOps; // inline capacity 2
bool Changed = false;
for (auto *Op : Expr->operands()) {
const SCEV *NewOp = delinearize(Ctx, Op); // recursive
NewOps.push_back(NewOp);
if (NewOp != Op) Changed = true;
}
if (!Changed) return Expr; // pointer identity optimization
return rebuildExpr(SE, Kind, NewOps);
The "changed" flag enables pointer identity short-circuiting: if no operand was modified during recursion, the original expression pointer is returned without allocation.
AddRecExpr (case 8) is the most critical case for GPU code. Multi-dimensional array accesses manifest as nested AddRec expressions: {A[0][0], +, dim1}<outer_loop> wrapping {init, +, 1}<inner_loop>. The delinearizer preserves wrap flags (NSW/NUW/NW from bits [+0x1C] & 7) and the step value ([+0x30]) when reconstructing via getAddRecExpr (sub_DBFF60).
ZeroExtend/SignExtend (cases 3, 4) are secondary dimension discovery points. When the inner operand is an AddRec whose step matches Ctx->StepRecurrence (+0x68) and the AddRec has exactly 2 operands (the common {start, +, step} form), the handler extracts dimension information: it calls getElementSize (sub_D33D80) and getConstant (sub_DA4270) to compute the element count, then pushes a new term into the term collector at Ctx[+0x58]. This identifies a dimension boundary -- the extend operation wrapping a matching-step AddRec indicates the point where one array dimension ends and another begins.
GEP (case 15) is the primary entry for actual dimension discovery. It first checks the predicate collector (Ctx[+0x60]). If present, it searches the collector's table for a matching GEP index entry (type field == 1, matching scev_expr, operation == 0x20). If no predicate collector or no match, it falls back to structural delinearization via sub_DE97B0, which analyzes the GEP's index computation structure, iterates discovered terms, and classifies them by dimension type. Terms matching Ctx->StepRecurrence go to the direct collector; others go through the predicate collector's virtual dispatch (vtable[+0x10]).
Fixed-Point Iteration
The function itself is a single recursive pass, but its callers implement a fixed-point loop:
1. Initialize the context with an initial guess for dimension sizes
2. Call sub_DE9D10 to delinearize using those dimensions
3. During recursion, the GEP and extend handlers collect new dimension information into Ctx[+0x58] (term collector) and Ctx[+0x60] (predicate collector)
4. If collected dimensions differ from the initial guess, update and repeat from step 2
5. Terminate when dimensions stabilize or a maximum iteration count is exceeded
The memoization cache ensures unchanged sub-expressions are not recomputed across iterations.
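The caller-side fixed-point loop can be sketched as follows; `refineOnce` is a stand-in for one call into the delinearizer (here it just discovers divisors of known terms), and all names are illustrative:

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <utility>

using Dims = std::set<int64_t>;

// Stand-in for one delinearization pass: each call may discover new
// dimension terms (modeled here as divisors of already-known terms).
Dims refineOnce(const Dims &in) {
    Dims out = in;
    for (int64_t t : in)
        for (int64_t d = 2; d < t; ++d)
            if (t % d == 0)
                out.insert(d);
    return out;
}

// The fixed-point driver: repeat until the dimension set stabilizes or
// the iteration budget runs out.
Dims delinearizeFixedPoint(Dims dims, int maxIter = 8) {
    for (int i = 0; i < maxIter; ++i) {
        Dims next = refineOnce(dims);
        if (next == dims)
            break;                  // dimensions stabilized
        dims = std::move(next);
    }
    return dims;
}
```

The explicit iteration cap mirrors step 5 above: without it, pathological inputs where each pass keeps discovering terms would never converge.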
Parametric vs Fixed-Size Arrays
Upstream LLVM has the delinearize-use-fixed-size-array-heuristic knob (default: false). When the standard parametric delinearization fails -- typically because dimension sizes are runtime values with no SCEV relationship -- the fixed-size heuristic uses compile-time-known array dimensions from type metadata to guide decomposition.
cicc extends this with an alternative delinearization entry point at sub_147EE30 (25 KB), which applies additional heuristics controlled by at least 3 of the delinearization config globals (dword_4F9AB60, dword_4F9AE00, dword_4F9B340). This second path is likely NVIDIA-enhanced for cases common in GPU code, such as dynamically-allocated shared memory with dimensions derived from kernel launch parameters.
The dependence analysis subsystem has its own entry points into delinearization (sub_146F1B0 at 40 KB for delinearizeAccess, sub_146B5E0 at 18 KB for tryDelinearize) that combine delinearization with per-dimension dependence testing in a single pass.
GPU-Specific Delinearization Patterns
Thread grid indexing. The canonical GPU pattern threadIdx.x + blockIdx.x * blockDim.x produces an AddRec with step = blockDim.x (grid stride). The delinearizer recognizes this by matching the step recurrence against Ctx[+0x68]. When the step corresponds to a grid dimension, the subscript identifies which dimension of a multi-dimensional array is parallelized across the thread grid.
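A sketch of the coalescing test this enables (helper names are illustrative): once delinearization isolates the innermost subscript's coefficient on threadIdx.x, the check is whether adjacent lanes touch adjacent 4-byte elements.

```cpp
#include <cassert>
#include <cstdint>

// Byte address touched by lane `tid` for an access A[tid * elemStride],
// with 4-byte elements; elemStride is the threadIdx.x coefficient the
// delinearizer isolated in the innermost subscript.
inline int64_t addrOf(int tid, int64_t elemStride) {
    return int64_t(tid) * elemStride * 4;
}

// Coalesced iff every adjacent pair of lanes in the warp is exactly one
// element (4 bytes) apart.
bool isCoalesced(int64_t elemStride) {
    for (int lane = 0; lane + 1 < 32; ++lane)
        if (addrOf(lane + 1, elemStride) - addrOf(lane, elemStride) != 4)
            return false;
    return true;
}
```

This is exactly why the flat polynomial is useless on its own: the same byte addresses arise from many subscript decompositions, and only the per-dimension view exposes the threadIdx.x coefficient.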
Shared memory bank conflicts. For shared memory arrays, the delinearizer feeds into bank conflict analysis. Shared memory has 32 banks with 4-byte interleaving. If delinearization reveals A[threadIdx.y][threadIdx.x] with row stride 32 (or any multiple of 32), every thread in a warp hits the same bank -- a 32-way conflict. If the stride is relatively prime to 32, accesses are conflict-free. This analysis requires knowing per-dimension subscripts, which only delinearization can provide from the flat pointer arithmetic.
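The bank conflict rule above can be made concrete with the 32-bank, 4-byte-interleaved model (function names are illustrative):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Shared memory bank model: 32 banks, 4-byte interleaving.
inline int bankOf(int64_t byteAddr) {
    return static_cast<int>((byteAddr / 4) % 32);
}

// Worst-case conflict degree across one warp for A[threadIdx.x * stride]
// (stride in 4-byte elements): 1 = conflict-free, 32 = fully serialized.
int conflictDegree(int64_t stride) {
    int hits[32] = {0};
    int worst = 0;
    for (int lane = 0; lane < 32; ++lane) {
        int b = bankOf(int64_t(lane) * stride * 4);
        worst = std::max(worst, ++hits[b]);
    }
    return worst;
}
```

A row stride of 32 puts every lane in bank 0 (32-way conflict), while 33 is relatively prime to 32 and is conflict-free, which is why padding a shared array from `[32]` to `[33]` columns is a classic CUDA fix.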
Predicate collector polymorphism. The PredicateCollector at Ctx[+0x60] uses virtual dispatch (vtable[+0x10]), allowing different delinearization strategies to be plugged in:
- Standard delinearization for host code
- GPU-aware delinearization that considers shared memory bank geometry
- Coalescing-aware delinearization that checks whether the innermost subscript varies with
threadIdx.x
High-dimensional tensors. The term collector at Ctx[+0x58] is a growable SmallVector, supporting arrays with arbitrary dimensionality. This matters for tensor operations in CUDA (e.g., CUTLASS library patterns, which cicc special-cases elsewhere -- see the cutlass substring check in the dependence analysis region).
SCEV Term Collection
Before delinearization runs, collectParametricTerms (sub_DE8D20) walks the SCEV expression tree to extract candidate terms:
- SCEVAddRecExpr operands yield stride candidates (the step of each AddRec)
- SCEVUnknown and SCEVMulExpr nodes yield dimension-size candidates
- SCEVSignExtendExpr nodes are also collected (they often wrap dimension-related terms)
These candidates are passed to findArrayDimensions (sub_147B0D0) which uses product decomposition to determine which terms correspond to array dimensions. The resulting dimension list seeds the delinearization context before sub_DE9D10 is invoked.
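A simplified model of the product decomposition (the function name and stride encoding are illustrative): sorting the collected stride terms and taking ratios of adjacent strides yields the inner dimension sizes. Note the outermost extent never appears in any stride, so it is unrecoverable from the access alone.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Given collected stride terms (in elements), infer dimension sizes by
// product decomposition: each size is the ratio of adjacent strides,
// with an implicit innermost element stride of 1.
std::vector<int64_t> dimsFromStrides(std::vector<int64_t> strides) {
    std::sort(strides.begin(), strides.end(), std::greater<int64_t>());
    strides.push_back(1);                    // innermost element stride
    std::vector<int64_t> dims;
    for (size_t i = 0; i + 1 < strides.size(); ++i) {
        if (strides[i] % strides[i + 1] != 0)
            return {};                       // not a clean product: bail out
        dims.push_back(strides[i] / strides[i + 1]);
    }
    return dims;
}
```

For an access into a hypothetical `A[?][8][16]` the collected strides would be {128, 16}, and the decomposition recovers the inner extents 8 and 16 while the outermost `?` stays unbounded.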
Configuration
SCEV Invalidation Knobs
| Knob | Default | Effect |
|---|---|---|
forget-scev-loop-unroll | true | Enable SCEV invalidation after loop unrolling |
verify-scev | false | Verify SCEV consistency after transformations |
verify-scev-strict | false | Stricter verification (compare old/new trip counts) |
verify-scev-maps | false | Verify SCEV map consistency |
qword_4F88DC8 (max exit analysis depth) | unknown | Threshold beyond which deep exit analysis is skipped |
SCEV Analysis Depth Limits (shared with invalidation)
| Knob | Default | Effect |
|---|---|---|
scalar-evolution-max-iterations | 100 | Maximum loop iterations for constant evaluation |
scalar-evolution-max-scev-compare-depth | 32 | Maximum SCEV comparison recursion depth |
scalar-evolution-max-arith-depth | 32 | Maximum SCEV arithmetic simplification depth |
scalar-evolution-max-ext-depth | 8 | Maximum sign/zero-extend nesting depth |
scalar-evolution-max-cast-depth | 8 | Maximum cast chain depth |
scalar-evolution-max-constant-evolving-depth | 32 | Maximum constant evolution depth |
scalar-evolution-max-expr-size | 384 | Maximum expression node count |
scalar-evolution-max-expr-failures | 100 | Maximum SCEV creation failures before bailout |
scalar-evolution-max-scev-operations-implication-depth | 2 | Maximum depth for implications |
scalar-evolution-max-value-compare-depth | 2 | Maximum value comparison depth |
NVIDIA-Specific SCEV Knobs
| Knob | Effect |
|---|---|
aggressive-positive-stride-analysis | More aggressive positive-stride IV analysis (nvbug 3972412) |
do-sign-ext-simplify | Simplify SCEV sign-extend expressions |
do-sign-ext-expand | Expand sign-extends during SCEV construction |
track-trip-count-more | Track loop trip counts more aggressively |
scev-mulops-inline-threshold (32) | Max MulExpr operands before out-of-line |
scev-addops-inline-threshold (500) | Max AddExpr operands before out-of-line |
Delinearization Knobs
| Global | Likely identity | Notes |
|---|---|---|
byte_4F9A8C0 | Delinearization enable flag | Master enable for the delinearization subsystem |
dword_4F9A620 | Config 1 | Referenced by combined delinearize-and-test |
dword_4F9A700 | Config 2 | Referenced by delinearizeAccess core |
dword_4F9A7E0 | Config 3 | Referenced by delinearizeAccess core |
dword_4F9AB60 | Config 4 | Referenced by alternative delinearization v2 |
dword_4F9AC40 | Config 5 | Referenced by dependence distance with delinearization |
dword_4F9AE00 | Config 6 (shared) | Referenced by both combined-test and v2 paths |
dword_4F9B260 | Config 7 | Referenced by combined delinearize-and-test |
dword_4F9B340 | Config 8 | Referenced by alternative delinearization v2 |
da-delinearize | Try to delinearize array references | DependenceAnalysis pass knob (upstream LLVM) |
da-miv-max-level-threshold | MIV test depth limit | DependenceAnalysis pass knob (upstream LLVM) |
Function Map
Invalidation Functions
| Function | Address | Size | Role |
|---|---|---|---|
ScalarEvolution::forgetLoop | sub_DE2750 | 10,051 B | 8-phase loop invalidation |
ScalarEvolution::forgetValue | sub_D9EE30 | ~9 KB | Single-value eviction |
ScalarEvolution::forgetAllLoops | sub_D9D700 | ~8 KB | Invalidate all loops |
forgetMemoizedResults | sub_DE2690 | small | Recursive BTC invalidation helper |
ScalarEvolution::verify | sub_DE5FA0 | ~52 KB | Debug verification (old/new trip count comparison) |
| Loop invalidation helper | sub_DE5640 | ~178 lines | Helper for forgetLoop |
| SCEV expression invalidator | sub_DCE1C0 | small | Callback for AddRec folding cleanup |
Delinearization Functions
| Function | Address | Size | Role |
|---|---|---|---|
ScalarEvolution::delinearize | sub_DE9D10 | 3,614 B | Recursive delinearizer (17-case switch) |
collectParametricTerms | sub_DE8D20 | ~521 lines | Term extraction before delinearization |
| Structural GEP delinearization | sub_DE97B0 | small | Sub-analysis called from GEP case |
canonicalizeExpr | sub_D9ABD0 | small | SCEV normalization |
computeAccessFunctions | sub_D94080 | ~12 KB | Access function computation |
SCEV_delinearize (dependence region) | sub_CF5550 | 6,276 B | Alternate copy in dependence analysis |
Dependence Analysis Delinearization
| Function | Address | Size | Role |
|---|---|---|---|
delinearizeAccess | sub_146F1B0 | 40 KB | Core delinearization for dependence analysis |
tryDelinearize | sub_146B5E0 | 18 KB | Delinearization attempt with fallback |
| Delinearize subscript | sub_1472640 | 10 KB | Per-subscript extraction |
| Array dimension inference | sub_1473850 | 12 KB | Infers dimensions from access patterns |
collectSubscripts | sub_1476060 | 22 KB | Multi-dimensional GEP subscript collection |
| Dependence distance with delinearization | sub_14747F0 | 15 KB | Computes dependence vectors using delinearized subscripts |
findArrayDimensions | sub_147B0D0 | 11 KB | Dimension sizes from SCEV product decomposition |
| Combined delinearize-and-test | sub_147C070 | 34 KB | Delinearize + per-dimension dependence test |
| Alternative delinearization v2 | sub_147EE30 | 25 KB | NVIDIA-enhanced heuristics |
| Partial result combiner | sub_147DF40 | 11 KB | Combines partial delinearization results |
Key SCEV Callees (shared by both subsystems)
| Function | Address |
|---|---|
getRangeRef -- range computation | sub_DBB9F0 |
ConstantRange::contains | sub_AB1BB0 |
ConstantRange::intersectWith | sub_AB0910 |
ConstantRange::unionWith | sub_AB0A00 |
ConstantRange::isEmptySet | sub_AAFBB0 |
ConstantRange::isFullSet | sub_AAF760 |
getSCEV -- expression resolution | sub_DD8400 |
tryFoldAddRecWithStep | sub_DCFD50 |
getAddExpr (N-ary) | sub_DC7EB0 |
getMulExpr (N-ary) | sub_DC8BD0 |
getAddRecExpr | sub_DBFF60 |
getUDivExpr | sub_DCB270 |
getZeroExtendExpr | sub_DC5000 |
getSignExtendExpr | sub_DC2B70 |
getTruncateExpr | sub_DC5200 |
getPtrToIntExpr | sub_DD3A70 |
DominatorTree::dominates | sub_B19D00 |
SmallDenseSet::insert | sub_C8CC70 |
| Cache insert (delinearization result memoization) | sub_DB11F0 |
Differences from Upstream LLVM
| Aspect | Upstream LLVM | CICC v13.0 |
|---|---|---|
| Delinearization purpose | Optimize for cache locality; multi-dimensional subscript recovery for polyhedral analysis | Optimize for memory coalescing: recover subscripts to determine whether adjacent warp threads access adjacent addresses |
| Invalidation triggers | Standard loop transformations (unroll, vectorize, simplify) | Additional triggers from NVIDIA-specific passes: MemorySpaceOpt (address space transformations), IV Demotion (narrowing changes SCEV types), NVLoopStrengthReduce |
| Delinearization result caching | No explicit memoization in upstream | Memoization cache via sub_DB11F0 prevents redundant delinearization of the same GEP across multiple consumers |
| Thread index awareness | No concept of thread-index-based access patterns | Delinearized subscripts are analyzed against threadIdx dimensions to determine coalescing quality; feeds into vectorization and LSR decisions |
forget-scev-loop-unroll knob | Present in upstream LLVM | Same knob, but more critical on GPU because over-invalidation forces expensive SCEV recomputation on deeply nested kernel loops |
| Range source diversity | Profile data, programmer assertions (__builtin_assume) | Additional sources: !range metadata from nvvm-intr-range, __launch_bounds__, warpSize constant, special register bounded ranges |
Cross-References
- ScalarEvolution Overview & Construction -- SCEV expression creation, the ValueExprMap, and the expression DAG structure that invalidation walks
- SCEV Range Analysis & Trip Counts -- range caches and BTC caches that invalidation must clear; the getRangeRef and BTC computation functions called during eviction
- LoopVectorize & VPlan -- primary consumer of delinearization results for vectorization legality; calls forgetLoop after vectorizing
- Loop Unrolling -- calls forgetLoop after unrolling; the forget-scev-loop-unroll knob controls this
- Loop Strength Reduction (NVIDIA) -- uses SCEV for IV analysis; its transformations trigger forgetValue calls
- MemorySpaceOpt -- NVIDIA-specific pass that triggers SCEV invalidation after address space transformations
- Alias Analysis & NVVM AA -- delinearization results feed into alias analysis for disambiguating multi-dimensional array accesses
Loop Optimization Passes
Loop optimization is the single most performance-sensitive area of the cicc pipeline. On an NVIDIA GPU, the constraints are fundamentally different from CPU: register pressure dominates (every additional register per thread reduces SM occupancy), memory coalescing replaces cache locality as the primary memory optimization target, and warp divergence caused by loop-carried control flow destroys SIMT efficiency. NVIDIA's cicc v13.0 addresses these constraints by shipping a mix of stock LLVM loop passes, LLVM passes with GPU-specific threshold overrides, and fully proprietary loop transformations -- all orchestrated through a carefully ordered pipeline where the position of each pass reflects hard-won engineering tradeoffs between register pressure, instruction count, and memory access patterns.
This page provides the big-picture view of loop optimization in cicc: what passes exist, how they are ordered, what analyses they share, and why the ordering matters for GPU targets. Each pass links to a dedicated sub-page with full algorithmic detail.
Why Loop Optimization Is Different on GPU
Four properties of the GPU execution model distinguish GPU loop optimization from the CPU case that upstream LLVM targets:
Register pressure is the primary constraint. Every loop transformation that increases live values (unrolling, vectorization, LICM hoisting) must be evaluated against the SM's register budget and its discrete occupancy cliffs -- adding one register can drop occupancy by a full warp group. CPU compilers never face this tradeoff.
Memory coalescing replaces cache line optimization. Loop transformations that improve stride-1 access patterns (interchange, vectorization) improve coalescing; transformations that increase the number of live pointers (unrolling, distribution) may degrade it by interleaving access streams.
No out-of-order execution. Warps execute instructions in program order; the only latency-hiding mechanism is warp-level multithreading. Unrolling creates ILP within a single warp by exposing independent instructions that the ptxas backend can interleave, but the benefit is bounded by the register pressure cost.
Address space semantics. GPU memory is partitioned into address spaces with different pointer widths, hardware addressing modes, and performance characteristics. Loop passes that rewrite address computations (LSR, IndVarSimplify) must respect these distinctions -- strength-reducing a 32-bit shared memory pointer into 64-bit generic form defeats the backend's ability to emit efficient .shared:: instructions.
Pipeline Ordering
The loop passes execute within the main optimization pipeline assembled by sub_12E54A0. The ordering below reflects the Tier 1/2/3 optimization path (the normal path for -O1 and above). Passes marked with (N) are NVIDIA-specific or have significant NVIDIA modifications; unmarked passes are stock LLVM with at most threshold overrides.
LoopSimplify + LCSSA (canonicalization)
|
v
LoopRotate (do-while canonical form)
|
v
LICM (hoist) (move invariants out)
|
v
LoopIndexSplit **(N)** (split index-dependent branches)
|
v
IndVarSimplify **(N)** (canonicalize IVs, LFTR)
|
v
LoopIdiomRecognize (memcpy/memset/mismatch idioms)
|
v
LoopDistribute (fission for vectorization)
|
v
LoopVectorize **(N)** (widen scalar loops to v2/v4)
|
v
LoopUnroll **(N)** (replicate body, GPU-tuned)
|
v
LoopInterchange (swap nest levels for coalescing)
|
v
IRCE (range check elimination)
|
v
NVLoopStrengthReduce **(N)** (NVIDIA custom LSR solver)
|
v
LoopDeletion (remove dead loops)
|
v
LoopSink / LICM (sink) (demote unprofitable hoists)
Several passes appear more than once. LICM runs in both hoist and sink mode. LoopUnroll has an early invocation in the main pipeline and a late invocation gated by opts[1360] (nv-disable-loop-unrolling). IndVarSimplify runs before vectorization to canonicalize induction variables, then again after unrolling to clean up newly exposed IVs. LoopSimplify and LCSSA are implicit -- they run as required analyses whenever any loop pass requests them, ensuring loops remain in canonical form throughout.
The ordering reflects a deliberate strategy: canonicalize first (LoopSimplify, LoopRotate, IndVarSimplify), transform for parallelism (LoopDistribute, LoopVectorize, LoopInterchange), replicate for ILP (LoopUnroll), and clean up addressing (LSR, LoopDeletion, LoopSink). Reordering these passes produces measurably different code: running LSR before LoopVectorize would pollute the cost model with strength-reduced IVs that confuse SCEV; running LoopUnroll before LoopVectorize would prevent vectorization of unrolled-but-still-vectorizable loops.
LoopPassManager Structure
cicc uses the LLVM New Pass Manager's LoopPassManager infrastructure. Loop passes are grouped inside a FunctionPassManager that contains a LoopToFunctionPassAdaptor wrapping the LoopPassManager. The adaptor iterates over all loops in the function in reverse post-order of the loop forest (innermost first), running the full sequence of loop passes on each loop before moving to the next.
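The innermost-first visit order can be modeled with a recursive walk over a toy loop forest (the `Loop` struct here is a stand-in, not LLVM's `Loop` class): children are fully processed before their parent.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Minimal model of the adaptor's visit order over a loop forest:
// inner loops are processed before the loops that contain them.
struct Loop {
    std::string name;
    std::vector<Loop> subLoops;
};

void visitInnermostFirst(const Loop &L, std::vector<std::string> &order) {
    for (const Loop &Sub : L.subLoops)
        visitInnermostFirst(Sub, order);
    order.push_back(L.name);        // parent only after all of its children
}
```

Innermost-first ordering matters because transforming an inner loop (e.g. vectorizing it) changes the trip counts and structure that passes on the enclosing loop will observe.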
The LoopStandardAnalysisResults struct is threaded through all loop passes, providing shared access to:
| Analysis | Typical Accessor | Purpose |
|---|---|---|
ScalarEvolution | AR.SE | Trip counts, strides, value ranges |
LoopInfo | AR.LI | Loop structure, nesting depth |
DominatorTree | AR.DT | Dominance queries for code motion |
AssumptionCache | AR.AC | __builtin_assume facts |
TargetTransformInfo | AR.TTI | Cost model, addressing modes |
MemorySSA | AR.MSSA | Memory alias queries for LICM/DSE |
AAResults | AR.AA | Alias analysis chain |
Passes that structurally modify loops (LoopUnroll, LoopDistribute, IRCE) call LPMUpdater::markLoopAsDeleted() or LPMUpdater::addSiblingLoops() to inform the pass manager of changes. SCEV is invalidated per-loop via SE.forgetLoop() after any transformation that changes the loop's backedge-taken count.
Complete Pass Inventory
The table below lists every loop pass present in cicc v13.0 with its pipeline position, NVIDIA modification status, and primary function address.
| Pass Name | Pipeline Position | NVIDIA Modified | Entry Address | Status |
|---|---|---|---|---|
loop-simplify | Infrastructure (on demand) | No | stock LLVM | Canonicalizes loop form |
lcssa | Infrastructure (on demand) | No | stock LLVM | Ensures loop-closed SSA |
loop-rotate | Early, before LICM | No | stock LLVM | Converts to do-while form |
licm | Early (hoist) + Late (sink) | Threshold only | stock LLVM | Invariant code motion |
loop-index-split | After LICM, before IndVars | Yes (proprietary) | sub_2CBEC60 (New PM) | Splits index-dependent branches |
indvars | Before vectorize | Yes (3 knobs) | sub_19489B0 | IV canonicalization + LFTR |
loop-idiom | Before distribute | No | stock LLVM | Memcpy/memset/mismatch recognition |
loop-distribute | Before vectorize | Threshold only | sub_1A8CD80 | Loop fission for vectorization |
loop-vectorize | Main loop slot | Yes (cost model) | sub_2AF1970 | Vectorize inner loops to v2/v4 |
loop-unroll | After vectorize (x2) | Yes (decision engine) | sub_19BE360 | Replicate loop body |
loop-interchange | After unroll | Threshold only | sub_1979A90 | Swap loop nest levels |
irce | After interchange | No | sub_194D450 | Range check elimination |
loop-reduce | Late, after unroll | Yes (complete rewrite) | sub_19CE990 (NV wrapper) | Strength reduction for GPU |
loop-deletion | Late | No | stock LLVM | Remove dead/empty loops |
loop-sink | Late | No | stock LLVM | Sink invariants back into loops |
loop-instsimplify | Utility | No | stock LLVM | Simplify instructions in loops |
loop-flatten | Utility | No | stock LLVM | Flatten nested counted loops |
loop-guard-widening | Utility | No | stock LLVM | Widen loop guards |
loop-predication | Utility | No | stock LLVM | Predicate unswitched loops |
loop-reroll | Utility | No | stock LLVM | Reverse unrolling (rarely used) |
Passes marked "Utility" are registered in the pipeline infrastructure but are not part of the default optimization sequence -- they are available for explicit pipeline specification via -mllvm -passes=....
Pass Descriptions and Sub-Page Links
Canonicalization Passes
LoopSimplify and LCSSA run on demand before any loop transformation pass executes. LoopSimplify ensures each loop has a single preheader, a single backedge (latch), and dedicated exit blocks. LCSSA (Loop-Closed SSA) ensures that values defined inside a loop and used outside it pass through PHI nodes at loop exit blocks. These are stock LLVM utilities with no NVIDIA modifications. Together they establish the invariants that all subsequent loop passes depend on.
LoopRotate converts a loop from while-form (while (cond) { body }) to do-while form (do { body } while (cond)). This creates a single-entry loop body and moves the exit test to the latch, which is the canonical form expected by SCEV, LoopVectorize, and LoopUnroll. Stock LLVM, no NVIDIA modifications.
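The rotation is easiest to see in source form; this sketch shows the two equivalent shapes, with the guard LoopRotate inserts to protect the first iteration:

```cpp
#include <cassert>

// While-form: the exit test sits at the top of the loop.
int sumWhile(int n) {
    int i = 0, s = 0;
    while (i < n) { s += i; ++i; }
    return s;
}

// Rotated do-while form: a guard (hoisted copy of the exit test)
// protects the first iteration, and the test moves to the latch.
int sumRotated(int n) {
    int i = 0, s = 0;
    if (i < n) {
        do { s += i; ++i; } while (i < n);
    }
    return s;
}
```

With the test at the latch, the loop body becomes single-entry and the backedge condition is directly analyzable, which is what SCEV and the downstream loop passes expect.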
NVIDIA-Custom Loop Passes
Loop Index Split is a revived and heavily reworked version of a pass removed from upstream LLVM 3.0. It splits loops when the loop body contains a condition that depends on the induction variable (e.g., if (i == K)), producing two or three loops where each has a uniform body. On GPU, this eliminates warp divergence caused by index-dependent branches. The pass implements three transformation modes: all-but-one peel (for i == K), only-one collapse (for nearly-empty special iterations), and full range split (for i < K vs i >= K). Proprietary, no upstream equivalent.
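A source-level sketch of the all-but-one-peel mode (the bodies and constants are illustrative): the index-dependent branch is replaced by two branch-free loops plus one peeled iteration.

```cpp
#include <cassert>

// Divergent form: every iteration carries an index-dependent branch,
// which on GPU costs warp divergence on each trip.
int divergent(int n, int K) {
    int acc = 0;
    for (int i = 0; i < n; ++i)
        acc += (i == K) ? 100 : 1;
    return acc;
}

// After an "all-but-one peel" style split: two uniform loops plus one
// peeled special iteration -- no branch left inside either loop body.
int split(int n, int K) {
    int acc = 0;
    for (int i = 0; i < K && i < n; ++i)
        acc += 1;                              // uniform body, i < K
    if (K >= 0 && K < n)
        acc += 100;                            // peeled iteration i == K
    for (int i = (K < 0 ? 0 : K + 1); i < n; ++i)
        acc += 1;                              // uniform body, i > K
    return acc;
}
```

The guards on the peeled iteration and the second loop's start point are what make the split safe when K falls outside [0, n).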
IndVarSimplify (NVIDIA) is upstream LLVM's induction variable canonicalization pass with three NVIDIA-specific extensions: Disable-unknown-trip-iv (bool, qword_4FAF520) -- bypasses the pass entirely when SCEV cannot compute the trip count, preventing aggressive IV transforms on warp-divergent loops; iv-loop-level (int, default 1, qword_4FAF440) -- restricts the pass to loops at a maximum nesting depth to control compile time on deeply nested stencil kernels; and disable-lftr (bool, byte_4FAF6A0) -- disables Linear Function Test Replace when the IV canonicalization would increase register pressure.
LoopVectorize (GPU-Adapted) is the largest single pass in the cicc loop pipeline (88 KB). On GPU, vectorization means generating ld.v2/ld.v4 wide loads rather than filling SIMD lanes. The pass builds VPlans, selects VF through a GPU-aware cost model that penalizes register pressure, and caps VF at 4 for most GPU targets. Scalable vectors are always disabled. The pass includes an outer-loop vectorization path (rarely triggered on GPU) and an inner-loop path (the main code path).
Loop Unrolling (GPU-Tuned) ships a substantially reworked computeUnrollCount decision engine with GPU heuristics: a local-array threshold multiplier that aggressively unrolls loops over __shared__ arrays, power-of-two factor enforcement, a pragma threshold 200x larger than stock LLVM, and a register-pressure-aware cost model. The transformation engine is lightly modified upstream UnrollLoop. The pass runs twice: once in the main pipeline, once as a late cleanup.
NVLoopStrengthReduce (NVIDIA Custom) is the most GPU-specific LLVM pass in cicc. NVIDIA ships a complete replacement formula solver (160 KB, 2688 lines) with 11 custom knobs controlling register pressure checking, address-space-aware formula selection, sign-extension optimization, and 64-bit IV handling. The stock LLVM LSR remains in the binary but the NVIDIA overlay replaces the formula generation and selection phases.
Standard Loop Passes (Threshold Overrides Only)
LICM (Loop-Invariant Code Motion) hoists loop-invariant computations above the loop and sinks them below it. On GPU, LICM's hoist mode must be conservative: hoisting increases register pressure in the loop preheader, which may push past occupancy cliffs. The sink mode (running later) undoes unprofitable hoists. Stock LLVM with NVIDIA-tuned thresholds.
LoopInterchange swaps the nesting order of a perfectly-nested loop pair when doing so improves memory access locality. In cicc, the threshold loop-interchange-threshold (dword_4FB07E0) defaults to 0, meaning interchange is only performed when the net locality benefit is non-negative AND parallelism improves. The pass has a 100-pair dependence limit (0x960 bytes) as a compile-time safety valve. There is no visible CUDA-specific memory space awareness -- the standard LLVM stride-1 locality model applies uniformly. See the standard loop passes page for details.
IRCE (Inductive Range Check Elimination) splits a loop into preloop/mainloop/postloop regions, eliminating range checks from the mainloop where the induction variable is provably within bounds. The implementation is stock LLVM with no visible NVIDIA modifications. Configuration globals include a block count threshold (dword_4FB0000), a debug flag (byte_4FAFE40), and a "constrained" relaxation mode (byte_4FAFBA0) that handles slightly non-canonical range checks common in GPU thread-coarsened loops.
LoopDistribute (loop fission) splits a single loop into multiple loops to separate unsafe memory dependences from safe ones, enabling LoopVectorize to vectorize the safe partition. Stock LLVM algorithm. The SCEV runtime check threshold (qword_4FB5480) is likely GPU-tuned. The pass runs before LoopVectorize in the pipeline.
LoopIdiomRecognize detects loops that implement common patterns (byte-by-byte copy, memset, mismatch search, string search) and replaces them with optimized multi-block IR or library calls. The expansion routines generate vectorized mismatch detection (sub_2AA00B0, 48 KB) and vectorized first-occurrence string search (sub_2AA3190, 40 KB), both with page-boundary-safe masked vector loads. Stock LLVM pass; the expansion quality benefits GPU targets where wide loads are profitable.
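The simplest of these idioms in source form (a sketch of the transformation's effect, not the pass's actual emission): a stride-1 store of a loop-invariant byte collapses to a guarded memset.

```cpp
#include <cassert>
#include <cstring>

// The canonical fill loop the pass detects: a stride-1 store of a
// loop-invariant byte value.
void fillLoop(unsigned char *p, unsigned char v, int n) {
    for (int i = 0; i < n; ++i)
        p[i] = v;
}

// Its idiom replacement: one memset, guarded by the trip count so the
// zero-trip case stays a no-op.
void fillIdiom(unsigned char *p, unsigned char v, int n) {
    if (n > 0)
        std::memset(p, v, static_cast<std::size_t>(n));
}
```

The trip-count guard is essential: the original loop executes zero stores when n <= 0, and the replacement must preserve that.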
LoopDeletion removes loops proven dead (no observable side effects). Stock LLVM. LoopSink moves loop-invariant operations that were hoisted by LICM back into the loop body when doing so reduces register pressure -- particularly valuable on GPU where the register pressure tradeoff is acute.
Loop Analysis Infrastructure
All loop passes share three core analysis frameworks.
ScalarEvolution (SCEV)
SCEV models how values evolve across loop iterations. Every loop pass depends on it for trip count computation, stride analysis, and value range queries. cicc ships an LLVM 20.0.0-based SCEV with three NVIDIA extensions: a complexity control system (simple_mode) that prevents unbounded analysis time, GPU-specific SCEV sources that inject thread index bounds, and recognition of CUDA loop idioms (warp-stride, grid-stride). See ScalarEvolution Overview, Range Analysis & Trip Counts, and Invalidation & Delinearization.
LoopInfo
LoopInfo provides the loop forest structure: which basic blocks belong to which loops, nesting depth, header/latch/exit identification. It is the primary structural query interface for all loop passes. Stock LLVM, no NVIDIA modifications.
DependenceInfo
DependenceInfo computes memory dependence direction vectors between instruction pairs across loop iterations. LoopInterchange and LoopDistribute are its primary consumers. The analysis uses SCEV to classify dependences as forward (<), backward (>), equal (=), scalar (S), independent (I), or unknown (*). Direction vectors drive the legality checks for loop interchange (no reversed backward-carried dependences after swap) and loop distribution (which instructions must stay in the same partition).
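A simplified sketch of how distances map to direction symbols and how a direction vector gates interchange (the legality rule shown -- leftmost non-'=' must be '<' -- is the standard textbook condition, condensed here):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Per-dimension direction from the dependence distance: the sink reads
// what the source wrote `dist` iterations earlier at that loop level.
char directionOf(long long dist) {
    if (dist > 0) return '<';   // forward: carried to a later iteration
    if (dist < 0) return '>';   // backward
    return '=';                 // loop-independent at this level
}

// One symbol per loop level, outermost first.
std::string directionVector(const std::vector<long long> &dists) {
    std::string dv;
    for (long long d : dists)
        dv += directionOf(d);
    return dv;
}

// A dependence is legal iff the leftmost non-'=' direction is '<'.
// Interchange legality: the swapped vector must still be legal.
bool legalVector(const std::string &dv) {
    for (char c : dv) {
        if (c == '<') return true;
        if (c == '>') return false;
    }
    return true;   // all '=': loop-independent
}
```

For example, a dependence with distances (1, -1) has direction vector "<>"; swapping the two levels yields "><", which is illegal, so interchange is rejected.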
Loop-Related Knobs Summary
The following table consolidates all loop-pass-specific configuration knobs discovered in cicc v13.0. These are controllable via -mllvm -<knob>=<value>.
| Knob | Pass | Type | Default | Effect |
|---|---|---|---|---|
| Disable-unknown-trip-iv | IndVarSimplify | bool | false | Skip IV canonicalization for unknown-trip loops |
| iv-loop-level | IndVarSimplify | int | 1 | Max nesting depth for IV simplification |
| disable-lftr | IndVarSimplify | bool | false | Disable Linear Function Test Replace |
| replexitval | IndVarSimplify | enum | 1 (cheap) | Exit value replacement strategy: 0=never, 1=cheap, 2=always |
| indvars-widen-indvars | IndVarSimplify | bool | true | Allow IV widening to eliminate sign/zero extension |
| loop-interchange-threshold | LoopInterchange | int | 0 | Minimum net locality improvement for interchange |
| vectorize-loops | LoopVectorize | bool | true | Master vectorization enable |
| enable-early-exit-vectorization | LoopVectorize | bool | false | Allow vectorization of early-exit loops |
| force-vector-width-outer | LoopVectorize | bool | false | Force VF=4 for outer loops |
| nv-disable-loop-unrolling | LoopUnroll | bool | false | Disable the late unroll invocation |
| disable-unknown-trip-lsr | NV LSR | bool | false | Skip LSR for unknown-trip loops |
| lsr-check-rp | NV LSR | bool | true | Enable register pressure checking in LSR |
| lsr-rp-limit | NV LSR | int | ~32-64 | Register pressure ceiling for LSR |
| filter-bad-formula | NV LSR | bool | true | NVIDIA custom formula filtering |
| do-lsr-64-bit | NV LSR | bool | arch-dep | Enable LSR for 64-bit IVs (false on sm_3x-5x) |
| count-sxt-opt-for-reg-pressure | NV LSR | bool | true | Credit sign-ext savings in cost model |
| lsr-sxtopt | NV LSR | bool | true | Fold sign-extensions into IV expressions |
| lsr-loop-level | NV LSR | int | 0 (all) | Restrict LSR to specific loop nesting depth |
| lsr-skip-outer-loop | NV LSR | bool | false | Skip outer loop IVs in nested loops |
| disable-lsr-for-sharedmem32-ptr | NV LSR | bool | false | Disable LSR for addrspace(3) pointers |
| disable-lsr-complexity-discount | NV LSR | bool | false | Disable complexity discount in cost model |
| irce-block-threshold | IRCE | int | varies | Max basic blocks before IRCE bails |
| enable-loop-distribute | LoopDistribute | bool | false | Force-enable distribution |
| loop-distribute-scev-check-threshold | LoopDistribute | int | varies | Max SCEV runtime checks allowed |
Cross-References
- Pipeline context: LLVM Optimizer -- two-phase compilation, tier dispatch, NVVMPassOptions
- Pipeline ordering: Pipeline & Pass Ordering -- complete pass registration table
- Vectorization: LoopVectorize & VPlan -- GPU-adapted vectorizer with full cost model
- Unrolling: Loop Unrolling -- decision cascade with GPU-specific heuristics
- Strength reduction: Loop Strength Reduction (NVIDIA) -- the most GPU-specific pass in cicc
- NVIDIA custom passes: Loop Index Split, NVVM Peephole
- SCEV infrastructure: ScalarEvolution Overview, Range Analysis & Trip Counts, SCEV Invalidation
- Standard loop passes: Standard Loop Passes -- IndVarSimplify, LoopInterchange, IRCE, LoopDistribute, LoopIdiomRecognize details
Standard Loop Passes
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
CICC v13.0 includes a full complement of LLVM loop transformation passes beyond the major ones (LoopVectorize, LoopUnroll, LICM, LSR) that have their own pages. This page covers the remaining loop passes: LoopInterchange, IRCE, IndVarSimplify, LoopDistribute, LoopIdiom, LoopRotate, LoopSimplify, and LCSSA. Most are stock LLVM with default thresholds, but IndVarSimplify carries three NVIDIA-specific knobs that materially change behavior on GPU code. LoopRotate appears multiple times in the pipeline as a canonicalization prerequisite for LICM and unrolling. The canonicalization trio -- LoopSimplify, LCSSA, and LoopRotate -- runs so frequently that it constitutes the backbone of loop pass infrastructure in cicc.
Barrier awareness. None of these 8 passes have explicit barrier (__syncthreads()) awareness. Barrier handling in cicc occurs through dedicated NVIDIA passes: Dead Barrier Elimination (sub_2C83D20) and convergence control token verification (sub_E35A10). The structural passes (LoopRotate, LoopSimplify, LCSSA) do not move instructions across basic blocks in ways that could reorder barriers. LoopInterchange and LoopDistribute could theoretically reorder barriers, but barriers in CUDA kernels typically occur outside perfectly-nested loop bodies (interchange) or create non-distributable loop bodies (distribution).
Occupancy interaction. None of the 8 passes interact with occupancy or register pressure directly. Occupancy-aware loop optimization occurs in LSR (register pressure tracking at a1+32128 with occupancy ceiling), LoopUnroll (TTI-based register pressure estimation), and register allocation. These 8 passes are IR-level transforms that run before register allocation.
Address space awareness. None of the 8 passes distinguish between addrspace(0) (generic), addrspace(1) (global), addrspace(3) (shared), or addrspace(5) (local). Only LSR has address space awareness via the disable-lsr-for-sharedmem32-ptr knob. This is a notable gap: LoopInterchange's cost model should ideally weight global memory coalescing higher than shared memory locality, and LoopDistribute could benefit from knowing that shared-memory and global-memory partitions have different cost characteristics.
LoopInterchange
Swaps the iteration order of a perfectly-nested loop pair to improve memory access locality. On GPUs, interchange can convert non-coalesced global memory accesses (strided across warps) into coalesced ones (consecutive addresses per warp), which is often the single largest performance lever for memory-bound kernels.
| Property | Value |
|---|---|
| Entry point | sub_1979A90 (69 KB) -- processLoopList |
| Legality checker | sub_1975210 (45 KB) |
| Dependence helper | sub_1978000 (37 KB) |
| Pass name | "loop-interchange" |
| Knob | loop-interchange-threshold at dword_4FB07E0, default 0 |
| Knob constructor | ctor_208 at 0x4E39E0 |
| NVIDIA delta | None -- stock LLVM algorithm and threshold |
Required analyses (from sub_19743F0): ScalarEvolution (unk_4F9A488), LoopInfoWrapperPass (unk_4F96DB4), DominatorTreeWrapperPass (unk_4F9E06C), AAResultsWrapperPass (unk_4F9920C), DependenceAnalysisWrapperPass (unk_4F98D2D), OptimizationRemarkEmitter (unk_4FB66D8), TargetTransformInfoWrapperPass (unk_4FB65F4), LoopAccessLegacyAnalysis (unk_4F99CB0). The pass preserves both DominatorTree and LoopInfo.
Algorithm. The pass collects the loop nest as a SmallVector by walking the single-subloop chain (enforcing the "perfectly nested" constraint -- each loop must have exactly one child). For nests with fewer than two levels, it returns immediately. It then builds direction vectors for every memory-dependence pair via DependenceInfo (sub_13B1040), encoding each dimension as one of < (forward), > (backward), = (equal), S (scalar), I (independent), or * (unknown). A hard bail-out fires if the number of dependence pairs exceeds 100 (0x960 bytes at 24 bytes per entry) -- a compile-time safety valve.
For each candidate pair from outermost inward, the decision pipeline runs five checks in sequence:
- Dependence safety -- any * or backward-carried dependence that would be reversed by interchange bails with remark "Dependence". The safety check uses two bitmasks: 0x803003 for valid direction combinations and 0x400801 for the "all equal-like before inner" pattern. A special case allows an inner > when all preceding levels are = or S (zero distance in those dimensions).
- Call instructions -- calls in the inner body that are not provably readonly intrinsics bail with "CallInst". The intrinsic check calls sub_1560260(callee, -1, 36) and sub_1560260(callee, -1, 57) for two classes of safe intrinsics.
- Tight nesting -- extra computation between the loops (non-PHI, non-terminator instructions) bails with "NotTightlyNested". Checks sub_15F3040 (extra computation), sub_15F3330 (volatile/atomic operations), and sub_15F2ED0 (calls with side effects).
- Exit PHI validation -- complex PHI nodes at the loop exit bail with "UnsupportedExitPHI". For each exit PHI, the pass walks the use chain checking operand count via (v287 & 0xFFFFFFF), verifying that each operand references the latch block and that sub_157F120 (hasLoopInvariantOperands) returns true.
- Cost model -- counts memory subscripts with stride in the inner vs. outer loop. Net cost = benefit - penalty. Interchange proceeds only if cost >= -threshold (default: >= 0) AND all direction vectors show a parallelism improvement (outer dimension becomes scalar/independent while inner becomes equal).
Cost model details. For each memory instruction (opcode byte 0x38 at offset -8), the pass extracts the subscript count via (*(_DWORD*)(instr-4) & 0xFFFFFFF) and calls sub_146F1B0(ScalarEvolution, operand) to get the SCEV expression. Strides are classified per-loop. Subscripts with stride in both loops are counted as penalties (ambiguous). The net cost is locality_benefit - locality_penalty. The parallelism override requires ALL direction vectors to have the outer dimension as S (83) or I (73) and the inner dimension as = (61) -- even a non-negative cost is rejected if this pattern fails, with remark "InterchangeNotProfitable".
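The profitability decision can be modeled as below. This is a hedged sketch under assumptions: the `'inner'`/`'both'` stride tags and the assignment of benefit vs. penalty are illustrative stand-ins for the SCEV stride classification; only the decision shape (net cost plus the all-vectors parallelism override) follows the text.

```python
# Hypothetical model of the interchange decision: net locality cost
# plus the parallelism override described above. Not decompiled logic.
def interchange_profitable(stride_tags, dir_vectors, threshold=0):
    benefit = stride_tags.count('inner')  # subscripts improved by the swap (assumed)
    penalty = stride_tags.count('both')   # ambiguous: stride in both loops
    cost = benefit - penalty
    # Override: every direction vector must have the outer dimension as
    # S or I and the inner dimension as '=' -- otherwise reject even a
    # non-negative cost ("InterchangeNotProfitable").
    parallel = all(v[0] in 'SI' and v[-1] == '=' for v in dir_vectors)
    return cost >= -threshold and parallel
```

With the default threshold of 0, any non-negative net cost passes the first gate, so the parallelism pattern is usually the binding constraint.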
Post-interchange bookkeeping. After transformation, the pass: (a) calls sub_1AF8F90 to update LCSSA form for inner loop first, then outer; (b) reruns legality check via sub_1975210 as a safety recheck after LCSSA updates; (c) swaps direction-vector columns and loop-list positions; (d) decrements indices to try the next pair inward. The TTI availability boolean at a1+192 (checked via sub_1636850) is passed to the LCSSA updater as its 4th argument, controlling rewrite aggressiveness.
GPU considerations. The cost model counts memory accesses generically via SCEV stride analysis. There is no visible special handling for address spaces (shared vs. global vs. texture). The standard "stride-1 is good" locality model applies uniformly. For a reimplementation targeting GPUs, you would want to weight global-memory accesses (addrspace 1) far more heavily than shared-memory accesses (addrspace 3), since shared memory has no coalescing requirement. The 100-pair dependence limit prevents the pass from even being considered for CUDA kernels with massive shared-memory access patterns (e.g., tiled matrix multiplication). The pass does not check for barriers -- perfectly-nested loops with __syncthreads() in the inner body would be blocked by the call-instruction check unless the barrier is lowered to an intrinsic classified as safe (which it is not).
IRCE (Inductive Range Check Elimination)
Splits a loop into pre/main/post regions so that inductive range checks (bounds checks on the induction variable) can be eliminated from the main loop body, which executes the vast majority of iterations.
| Property | Value |
|---|---|
| Entry point | sub_194D450 (71 KB) -- InductiveRangeCheckElimination::run |
| Pass name | "irce" |
| Block threshold | dword_4FB0000 -- max basic blocks before bail-out |
| Debug flag | byte_4FAFE40 -- prints "irce: looking at loop" |
| Constrained mode | byte_4FAFBA0 -- relaxes canonical-form requirements |
| SCEV verify | byte_4FAFC80 -- post-transform range verification |
| Metadata flag | byte_4FAFF20 -- propagate "irce.loop.clone" metadata |
| NVIDIA delta | Minimal -- stock algorithm, "constrained" mode may help GPU strided patterns |
Stack frame and signature. The function allocates ~0x960 bytes (2400 bytes) of local state. Signature: sub_194D450(void *this_pass, void *Loop, void *LoopAnalysisManager, void *LoopStandardAnalysisResults, void *LPMUpdater). Returns PreservedAnalyses by value.
Algorithm (8 phases).
Phase 1 -- Early validation. Extracts ScalarEvolution, DominatorTree, LoopInfo, and BranchProbabilityInfo from LoopStandardAnalysisResults. Loads block count threshold from dword_4FB0000 and bails if the loop exceeds it. Checks simplify form (single latch, single exit, proper preheader).
Phase 2 -- Range check discovery. IRCE scans conditional branches in the loop body for ICmp instructions comparing the induction variable against loop-invariant bounds. The ICmp predicate dispatch table:
| Predicate value | LLVM predicate | Range check kind |
|---|---|---|
| 0x20 (32) | SLT (signed less-than) | UPPER |
| 0x22 (34) | SGT (signed greater-than) | LOWER (swapped operands) |
| 0x24 (36) | SGE (signed greater-equal) | LOWER |
| 0x26 (38) | UGE (unsigned greater-equal) | LOWER |
| 0x28 (40) | ULT (unsigned less-than) | UPPER |
Each candidate is classified into one of four kinds:
- RANGE_CHECK_UNKNOWN = 0 (skip)
- RANGE_CHECK_LOWER = 1 (indvar >= lower_bound)
- RANGE_CHECK_UPPER = 2 (indvar < upper_bound)
- RANGE_CHECK_BOTH = 3 (lower <= indvar < upper)
The InductiveRangeCheck structure is 40 bytes (0x28), iterated with stride 0x28: Begin (SCEV, +0x00), Step (SCEV, +0x08), End (SCEV, +0x10), CheckUse (Use*, +0x18), Operand (Value*, +0x20), Kind (uint32, +0x24).
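The predicate dispatch can be sketched directly from the table above. The predicate byte values come from the decompilation; the Python mapping itself is an illustrative model, not cicc's code:

```python
# Sketch of phase 2's ICmp predicate -> range-check-kind dispatch.
UNKNOWN, LOWER, UPPER, BOTH = 0, 1, 2, 3   # Kind field values

PRED_TO_KIND = {
    0x20: UPPER,   # SLT: indvar <  bound
    0x22: LOWER,   # SGT: lower bound, swapped operands
    0x24: LOWER,   # SGE: indvar >= bound
    0x26: LOWER,   # UGE
    0x28: UPPER,   # ULT
}

def classify(pred_byte):
    """Return the range-check kind; unrecognized predicates are skipped."""
    return PRED_TO_KIND.get(pred_byte, UNKNOWN)
```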
Phase 3 -- Filtering and validation. Calls sub_1949EA0 (classifyRangeCheckICmp) to validate each candidate. A bitvector (allocated at [rbp+var_460]) tracks valid checks. The "constrained" relaxation flag (byte_4FAFBA0) routes to sub_1949670 (canHandleRangeCheckExtended), allowing range checks where the induction variable relationship is slightly non-canonical -- useful for GPU thread-coarsened loops with strided access patterns. Validation requires: constant step (+1 or -1), loop-invariant bounds, simplify form, and SCEV-computable trip count.
Phase 4 -- SCEV-based bound computation. For each valid check, computes the safe iteration range [safe_begin, safe_end) using SCEV. Calls sub_145CF80 (SCEV getConstant), sub_147DD40 (SCEV getAddRecExpr / max/min), and sub_3870CB0 (isSafeToExpandAt). If expansion safety fails, the check is abandoned.
Phase 5 -- Preloop creation. Calls sub_194C320 (createPreLoop, ~1200 bytes) to clone the loop for iterations [0, safe_begin). Creates basic blocks named "preloop" and "exit.preloop.at". The clone remaps instructions and PHI nodes, creates the branch from preloop exit to mainloop entry, and updates dominator tree and loop info.
Phase 6 -- Postloop creation. Calls sub_194AE30 (createPostLoop, ~1300 bytes) for iterations [safe_end, trip_count). Calls sub_1949270 (adjustSCEVAfterCloning) to refresh SCEV expressions invalidated by cloning.
Phase 7 -- Two-path splitting for BOTH checks. When kind=3, IRCE creates TWO separate cloning operations, producing three loop clones total. Both sub_194C320 and a second call produce pre/main/post regions with BOTH range checks eliminated from the center.
Phase 8 -- Cleanup. Cleans up InductiveRangeCheck entries (stride 0x40 after alignment). If metadata flag byte_4FAFF20 is set, propagates "irce.loop.clone" metadata to cloned loops via red-black tree manipulation. Releases SCEV expression references via sub_1649B30.
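The net effect of phases 4-7 on the iteration space can be summarized in a small model. This is a sketch of the assumed semantics (the clamping of the safe range to the trip count is an assumption for the sketch, and the helper name is hypothetical):

```python
# Illustrative model of IRCE's split: iterations [0, n) are partitioned
# around the SCEV-computed safe range; only the main region may drop
# its range checks.
def irce_partition(trip_count, safe_begin, safe_end):
    b = max(0, min(safe_begin, trip_count))
    e = max(b, min(safe_end, trip_count))
    pre = (0, b)              # preloop: checks kept
    main = (b, e)             # main loop: checks eliminated
    post = (e, trip_count)    # postloop: checks kept
    return pre, main, post
```

For a hot loop, `main` covers the vast majority of iterations, which is where the eliminated checks pay off.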
GPU considerations. The block count threshold (dword_4FB0000) protects against pathologically large GPU kernel loops from unrolled or tiled computations. The constrained relaxation mode helps with range checks in GPU kernels where induction variables use non-canonical strides (common after thread coarsening). IRCE has no barrier awareness -- if a loop body contains __syncthreads(), the loop cloning would duplicate the barrier into all three clones (pre/main/post), which is correct but increases code size and instruction cache pressure. The pass does not check for convergent calls, so it could clone a loop containing warp-level primitives; this is safe because all three clones execute the same iterations as the original (just partitioned differently).
Pipeline position. IRCE runs after LoopSimplify and before LoopUnroll. It consumes canonicalized induction variables produced by IndVarSimplify and feeds into vectorization by removing bounds checks that would otherwise prevent LoopVectorize.
IndVarSimplify
Canonicalizes induction variables: simplifies IV users, performs Linear Function Test Replace (LFTR), replaces exit values with closed-form SCEV expressions, and sinks dead IV computations. This is the pass with the most significant NVIDIA modifications in this group.
| Property | Value |
|---|---|
| Core function | sub_1945A50 (65 KB) -- IndVarSimplify::run |
| NewPM wrapper | sub_19489B0 -- applies NVIDIA guards before core |
| Pass name | "indvars" |
| NVIDIA knob 1 | Disable-unknown-trip-iv at qword_4FAF520 -- skip pass for unknown-trip loops |
| NVIDIA knob 2 | iv-loop-level at qword_4FAF440, default 1 -- max nesting depth |
| NVIDIA knob 3 | disable-lftr at byte_4FAF6A0 -- disable LFTR entirely |
| Upstream knob | replexitval at dword_4FAF860 -- {never=0, cheap=1, always=2} |
| All knobs registered | ctor_203 at 0x4E1CD0 |
| NVIDIA delta | Significant -- two custom guard knobs plus depth limiter |
NVIDIA guards. Before the core algorithm runs, sub_19489B0 checks two NVIDIA-specific conditions:
- Loop depth gate (iv-loop-level): if sub_193DD90(loop) > qword_4FAF440[20], the pass is skipped entirely. sub_193DD90 is a recursive getLoopDepth() returning 1 for outermost loops. The default of 1 means only outermost loops receive IV simplification. This controls compile time on deeply nested stencil and tensor kernels that commonly have 3-5 nested loops.
- Unknown trip count gate (Disable-unknown-trip-iv): if LOBYTE(qword_4FAF520[20]) is set AND (sub_1CED350(loop) <= 1 OR !sub_1CED620(loop, header)), the pass is skipped. sub_1CED350 returns the SCEV-computed trip count; values <= 1 indicate unknown or trivial loops. This protects GPU kernels with divergent or dynamic bounds (where the trip count depends on threadIdx or blockIdx) from aggressive IV transforms that can cause correctness issues with warp-level scheduling assumptions.
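The two gates compose as a simple predicate. A minimal sketch, assuming boolean/integer inputs stand in for the decompiled globals and helpers named above (the function and parameter names are illustrative):

```python
# Paraphrase of the NVIDIA guard logic in sub_19489B0 (sketch).
def should_run_indvars(loop_depth, scev_trip_count,
                       iv_loop_level=1, disable_unknown_trip_iv=False):
    if loop_depth > iv_loop_level:        # depth gate: default = outermost only
        return False
    if disable_unknown_trip_iv and scev_trip_count <= 1:
        return False                      # unknown/trivial trip count gate
    return True
```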
Core algorithm (five phases):
- Header PHI collection -- walks the loop header's instruction list via **(a2+32)+48, collecting all PHI nodes (opcode 77) as candidate induction variables into worklist v342.
- Per-IV rewriting -- for each PHI, calls sub_1B649E0 (SimplifyIndVar::simplifyIVUsers, via vtable at off_49F3848) to fold truncs/sexts/zexts, fold comparisons with known ranges, and eliminate redundant increment chains. Sets the changed flag at a1+448. Then calls sub_1943460 (rewriteLoopExitValues) to replace uses of the IV outside the loop with closed-form SCEV expressions. New PHIs discovered during rewriting are pushed back to the worklist for fixpoint iteration.
- LFTR (Linear Function Test Replace) -- gated by four conditions: dword_4FAF860 != 0 (replexitval not "never") AND trip count not constant (!sub_14562D0), !byte_4FAF6A0 (disable-lftr not set), hasCongruousExitingBlock (sub_193E1A0), and exitValueSafeToExpand (sub_193F280). Selects the best IV via sub_193E640 (isBetterIV), preferring non-sign-extending, wider IVs with higher SCEV complexity (sub_1456C90). Computes a wide trip count via sub_1940670 (computeWideTripCount). Three rewriting strategies:
  - Strategy A: integer IV with matching types -- computes the exact exit value via APInt arithmetic and materializes it as a constant.
  - Strategy B: type mismatch -- expands the SCEV expression via sub_14835F0 (SCEVExpander::expandCodeFor), creates a "wide.trip.count" instruction using ZExt (opcode 37) or SExt (opcode 38).
  - Strategy C: direction check failure -- creates "lftr.wideiv" as a truncation (opcode 36, Trunc) down to the exit condition type.
  - Finally creates the "exitcond" ICmp instruction (opcode 51) with computed predicate v309 = 32 - depth_in_loop_set.
- Exit value replacement -- materializes closed-form exit values via SCEVExpander. The "cheap" mode (replexitval=1) adds a cost gate at sub_1941790 where dword_4FAF860 == 1 && !v136 && v31[24] skips expensive expansions (v136 = simple loop flag, v31[24] = per-candidate "expensive" flag from sub_3872990, the SCEV expansion cost model).
- Cleanup -- dead instruction removal (drains the worklist at a1+48..a1+56, using an opcode check: type <= 0x17 = LLVM scalar type), IV computation sinking (walks the latch block backwards, tracks the live set in a red-black tree via sub_220EF30/sub_220EF80/sub_220F040, sinks dead IVs past the loop exit via sub_15F2240), PHI predecessor fixup (handles Switch opcode 27 and Branch opcode 26 terminators), and sub_1AA7010 (deleteDeadPhis) on the loop header.
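What exit-value replacement materializes for an affine IV is just the closed form of the add-rec. A minimal sketch of Strategy A's constant case (the helper name is illustrative; the formula is standard SCEV algebra, not decompiled code):

```python
# Closed form of the affine SCEV add-rec {start,+,step} after n
# backedge-taken iterations -- the value exit-value replacement
# substitutes for out-of-loop uses of the IV.
def addrec_exit_value(start, step, backedge_taken_count):
    return start + step * backedge_taken_count
```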
Additional upstream knobs present: indvars-post-increment-ranges (bool, default true), indvars-predicate-loops (bool, default true), indvars-widen-indvars (bool, default true), verify-indvars (bool, default false).
Pass state object layout:
| Offset | Type | Content |
|---|---|---|
| +0 | ptr | TargetTransformInfo |
| +8 | ptr | DataLayout / Module |
| +16 | ptr | DominatorTree |
| +24 | ptr | LoopInfo |
| +32 | ptr | DeadInstVector |
| +40 | ptr | ScalarEvolution |
| +48 | ptr | DeadInstWorklist array |
| +56 | u32 | DeadInstWorklist count |
| +60 | u32 | DeadInstWorklist capacity |
| +448 | byte | Changed flag |
GPU relevance. The depth limiter is important because CUDA stencil codes often have 3-5 nested loops, and running IndVarSimplify on inner loops can blow up compile time without meaningful benefit (inner loops typically have simple IVs already). The unknown-trip guard prevents miscompiles on kernels where the trip count depends on threadIdx or blockIdx. The interaction with IV Demotion (sub_1CD74B0) is notable: IndVarSimplify runs first and may widen IVs to 64-bit, then IV Demotion (a separate NVIDIA pass) narrows them back to 32-bit where the value range permits, reducing register pressure -- a critical factor for GPU occupancy.
LoopDistribute
Splits a single loop into multiple loops (loop fission), each containing a subset of the original instructions. The primary motivation is separating memory accesses with unsafe dependences from safe ones, enabling LoopVectorize to vectorize the safe partition.
| Property | Value |
|---|---|
| Entry point | sub_1A8CD80 (63 KB) -- LoopDistributePass::run |
| Pass name | "loop-distribute" |
| Force flag | byte_4FB5360 -- force distribution ignoring metadata |
| SCEV check threshold | qword_4FB5480 -- max runtime checks before bail-out |
| Secondary limit | qword_4FB53A0 -- max dependence checks per partition |
| Verify flag | byte_4FB56E0 -- post-distribution verification |
| NVIDIA delta | None -- stock LLVM algorithm |
Stack frame. ~0x780 bytes (1920 bytes). Signature: sub_1A8CD80(void *this_pass, void *Function, void *FunctionAnalysisManager).
Algorithm. The pass runs a gauntlet of six bail-out conditions per loop:
"NotLoopSimplifyForm"--sub_157F0D0(Loop::isLoopSimplifyForm) fails."MultipleExitBlocks"--sub_157F0B0(Loop::getUniqueExitBlock) returns null.- Metadata
"llvm.loop.distribute.enable"disabled (checked viasub_15E0530MDNode lookup).byte_4FB5360(force flag) overrides this. "NoUnsafeDeps"-- LAI flag at+0xDAh(HasUnsafeDependences) is zero."MemOpsCanBeVectorized"-- all memory operations already vectorizable."TooManySCEVRuntimeChecks"-- SCEV check count at LAI+0x118exceedsqword_4FB5480.
LoopAccessInfo (LAI) structure (0x130 = 304 bytes):
| Offset | Content |
|---|---|
| +0x00 | Loop* TheLoop |
| +0x08 | PredicatedScalarEvolution* PSE |
| +0x10 | RuntimeCheckingPtrGroup* PtrRtChecks |
| +0x90 | SmallVector buffer (16-byte aligned) |
| +0xDA | bool HasUnsafeDependences |
| +0xE0 | MemoryDepChecker::Dependence* DepArray |
| +0xE8 | uint32 NumDependences |
| +0x108 | SCEVUnionPredicate* Predicates |
| +0x110 | SCEVCheck* SCEVChecks |
| +0x118 | uint32 NumSCEVChecks |
Dependence entry (0x40 = 64 bytes per entry): source instruction (+0x00), destination instruction (+0x08), dep type info (+0x10), SCEV distance (+0x18), DependenceType byte (+0x28). Stride confirmed at shl rax, 6 (0x1A8E6B9).
If validation passes, the core phase builds a partition graph. Each instruction starts in its own partition. The partition hash set uses 16-byte slots with NVVM-layer sentinels (-8 / -16) and an additional -2 value for "unassigned" partitions. See Hash Table and Collection Infrastructure for the hash function, probing, and growth policy.
For each unsafe memory dependence pair, the pass either merges source and destination partitions (if the dependence cannot be broken) or marks it as cross-partition. A union-find structure tracks merged partitions. After merging, if at least two distinct partitions remain, sub_1B1E040 (distributeLoopBody, ~2000 bytes) clones the loop body once per partition, removes instructions not belonging to each partition, and wires the clones in dependence order. Optional runtime dependence checks (loop versioning) are added. Post-distribution: sub_1B1DC30 updates the dominator tree, sub_197E390 registers new loops, sub_143AA50 (ScalarEvolution::forgetLoop) invalidates SCEV cache. Metadata "distributed loop" (16 chars) is attached to prevent future re-distribution.
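The partition-merging step can be modeled with a small union-find. This is an illustrative sketch only -- the real pass uses its own hash-set and union-find machinery described above, and the class/function names here are hypothetical:

```python
# Union-find model of LoopDistribute's partition merging: instructions
# start in singleton partitions; each unbreakable dependence merges the
# partitions of its source and destination.
class Partitions:
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def merge(self, a, b):
        self.parent[self.find(a)] = self.find(b)

    def count(self):
        return len({self.find(i) for i in range(len(self.parent))})

def can_distribute(num_insts, unbreakable_deps):
    """Distribution only proceeds if >= 2 distinct partitions remain."""
    p = Partitions(num_insts)
    for src, dst in unbreakable_deps:
        p.merge(src, dst)
    return p.count() >= 2
```

If every instruction ends up in one partition, fission has nothing to separate and the pass bails.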
GPU relevance. Distribution is valuable for CUDA kernels that mix shared-memory and global-memory accesses in the same loop -- the shared-memory partition can often be vectorized independently. The "llvm.loop.distribute.enable" metadata is controllable via #pragma clang loop distribute(enable). The SCEV runtime check threshold (qword_4FB5480) balances runtime check overhead against distribution benefit -- GPU kernels often have simple loop structures but complex pointer arithmetic from tiled access patterns.
LoopIdiom
Recognizes loop patterns that correspond to standard library calls (memset, memcpy, memcmp, strstr) and replaces them with optimized implementations. CICC includes both the standard LoopIdiomRecognize pass and the newer LoopIdiomVectorize pass.
| Property | Value |
|---|---|
| Recognizer core | sub_196FF90 (51 KB) -- LoopIdiomRecognize::run |
| Memset detection | sub_196B740 (10 KB) -- detects memset_pattern16 |
| Memcpy/memmove | sub_196E000 (43 KB) |
| Mismatch expansion | sub_2AA00B0 (48 KB) -- expandMemCmpMismatch |
| String search expansion | sub_2AA3190 (40 KB) -- expandFindFirst |
| Pass name | "loop-idiom" (recognizer), "loop-idiom-vectorize" (vectorizer) |
| Vectorize knobs | disable-loop-idiom-vectorize-all, loop-idiom-vectorize-style (masked/predicated), loop-idiom-vectorize-bytecmp-vf, etc. |
| NVIDIA delta | None visible -- stock LLVM |
Standard idioms. The recognizer scans loops for store patterns that correspond to memset (constant value stored on every iteration) and memcpy/memmove (load-store pairs with matching strides). It also detects trip-count-decrement patterns ("tcphi", "tcdec") used in hand-written copy loops. Recognized patterns are lowered to @llvm.memset / @llvm.memcpy / @llvm.memmove intrinsics.
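The shape of the memset/memcpy checks can be sketched with a toy model. This is an assumption-laden illustration (a loop summarized as per-iteration store/load facts; the real recognizer works on SCEV strides and the helper name is hypothetical):

```python
# Toy idiom classifier: one store per iteration, stride equal to the
# access width (consecutive memory), constant value -> memset; a
# matching-stride load feeding the store -> memcpy.
def classify_idiom(stores, loads):
    """stores/loads: list of (stride_bytes, access_bytes, source) tuples."""
    if len(stores) != 1:
        return None
    s_stride, s_size, s_src = stores[0]
    if s_stride != s_size:
        return None                      # non-consecutive: no idiom
    if s_src == 'const':
        return 'memset'                  # same constant stored each iteration
    if len(loads) == 1 and loads[0][:2] == (s_stride, s_size):
        return 'memcpy'                  # load-store pair with matching strides
    return None
```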
Vectorized idiom expansion -- MemCmpMismatch (sub_2AA00B0). The expansion generates a two-tier multi-block IR structure:
- LoopIdiomExpansionState structure (80+ bytes): idiom type at +0 (0=byte, 1=word), loop info at +8, DataLayout at +16, alloc context at +24, target info at +32, output blocks at +48 through +80.
- 11 basic blocks created in sequence: "mismatch_end", "mismatch_min_it_check", "mismatch_mem_check", "mismatch_vec_loop_preheader", "mismatch_vec_loop", "mismatch_vec_loop_inc", "mismatch_vec_loop_found", "mismatch_loop_pre", "mismatch_loop", "mismatch_loop_inc", "byte.compare".
- Page-boundary safety protocol (shared with the string search expansion): PtrToInt -> LShr by log2(pagesize) (from sub_DFB4D0 via DataLayout) -> ICmpNE of start/end page numbers. If both pointers stay within a single page, wider-than-element vector loads are safe; otherwise @llvm.masked.load provides the fallback. The page size is retrieved via sub_DFB4D0(*a1[32]) from the target DataLayout.
- Vector loop body: dispatches to sub_2A9D690 (byte granularity) or sub_2A9EC20 (word granularity) based on the *a1 idiom type. Generates vector load + compare + cttz (count trailing zeros via sub_B34870).
- Scalar fallback: byte-by-byte comparison with a "mismatch_index" PHI node, induction variable add (sub_929C50), and ICmpULT (sub_92B530(0x20)) loop bound check.
- LCSSA verification: explicit assertion "Loops must remain in LCSSA form!" via sub_D48E00. SE/LI/DT are invalidated/recalculated on exit (sub_FFCE90, sub_FFD870, sub_FFBC40).
Vectorized idiom expansion -- FindFirst (sub_2AA3190). Implements vectorized first-occurrence search (strstr-like):
- 7 basic blocks: "scalar_preheader", "mem_check", "find_first_vec_header", "match_check_vec", "calculate_match", "needle_check_vec", "search_check_vec".
- Needle splatting: needle[0] is extracted via ExtractElement (sub_B4DE80) with index 0, frozen via sub_B37620, then splatted across all vector lanes via ShuffleVector (sub_B36550). The splat enables parallel comparison of the haystack against the needle's first character.
- Masked loads: @llvm.masked.load (sub_B34C20) provides page-boundary-safe vectorized reads, using the same page-boundary protocol as the mismatch expansion.
- Two nested loops: the outer scans the haystack, the inner verifies a full needle match at each candidate position. PHI nodes: "psearch" (haystack), "pneedle" (needle position), "match_start", "match_vec".
GPU considerations. LoopIdiom is present in cicc but its value on GPU code is limited. GPU memset/memcpy are typically handled by device runtime calls or specialized PTX instructions (st.global, ld.global with vectorized widths) rather than loop-based patterns. The vectorized mismatch/search expansions target CPU-style byte-level operations that are rare in GPU kernels. The page-boundary safety protocol is irrelevant on GPU (virtual memory page faults work differently -- GPU global memory is always accessible within the allocation). The pass runs but likely fires infrequently. When it does fire, the generated @llvm.memset/@llvm.memcpy intrinsics are later lowered to PTX-specific sequences by the NVPTX backend.
LoopRotate
Transforms loops so that the latch block (back-edge source) becomes the exiting block (where the exit condition is tested). This converts "while" loops into "do-while" form, which is a prerequisite for LICM (the loop body is guaranteed to execute at least once, enabling unconditional hoisting) and simplifies trip count computation for SCEV.
| Property | Value |
|---|---|
| Entry point (legacy) | sub_18A3090 -- called directly in O1/O2/O3 pipeline |
| Entry point (new PM) | sub_28448D0 -- LoopRotatePass with "header-duplication;" param |
| Core implementation | sub_2A0CFD0 (65 KB) -- LoopRotation::runOnLoop |
| String markers | ".lr.ph" (preheader), "h.rot", "pre.rot" |
| Pass name | "loop-rotate" |
| Params | no-header-duplication / header-duplication |
| Pipeline knob | enable-loop-header-duplication (bool) -- controls default param |
| NVIDIA delta | None -- stock LLVM, but appears multiple times in pipeline |
Pipeline placement. LoopRotate appears at least four times in the cicc pipeline across different tiers:
- Full O1+ pipeline, position 11: sub_18A3090() in sub_12DE330 -- runs before LICM (sub_184CD60) and IndVarSimplify.
- Tier 1 passes: appears alongside SimplifyCFG and InstCombine as part of the canonicalization loop.
- Tier 2 passes: appears again in the LoopRotate+LICM pair.
- Pipeline assembler: sub_195E880 appears 4 times (labeled "LICM/LoopRotate"), conditional on opts[1240] and opts[2880].
This multiple invocation is standard LLVM practice -- rotation may be needed again after other transforms invalidate the rotated form. In the Ofcmid fast-compile pipeline, LoopRotate does not appear as a standalone pass; LICM (which internally depends on rotation) handles it.
Algorithm. The pass duplicates the loop header into the preheader (creating a "rotated" header named "h.rot" or "pre.rot"), then rewires the CFG so the original header becomes the latch. The header-duplication parameter controls whether the header is actually duplicated (which increases code size) or only the branch is restructured. After rotation, SCEV's backedge-taken count computation becomes straightforward because the exit test is at the latch.
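The rotation is easiest to see as a source-level equivalence. A behavioral sketch (Python stand-ins, not IR): both forms compute the same thing, but in the rotated form the exit test sits at the latch and the guard plays the role of the duplicated header test:

```python
def while_form(n):
    i, visits = 0, []
    while i < n:              # exit test in the header
        visits.append(i)
        i += 1
    return visits

def rotated_form(n):
    i, visits = 0, []
    if i < n:                 # guard: the duplicated header test
        while True:
            visits.append(i)  # body executes at least once per entry
            i += 1
            if not (i < n):   # exit test now at the latch
                break
    return visits
```

The "at least once per entry" guarantee inside the rotated loop is what lets LICM hoist unconditionally into the preheader.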
SCEV interaction. LoopRotate requires BTC (backedge-taken count) recomputation after the header/latch swap. This is handled by ScalarEvolution::forgetLoop being called by downstream passes that depend on fresh SCEV data.
GPU considerations. LoopRotate is purely a structural transformation that does not examine instruction semantics. It has no barrier awareness -- if a barrier (__syncthreads()) is in the loop header, it will be duplicated into the preheader during rotation. In practice, barriers in CUDA kernels are rarely in loop headers (they are typically in loop bodies or between loops). The header duplication can increase code size, which affects instruction cache utilization on GPU -- SM instruction caches (L0/L1 I-cache) are small (typically 12-48 KB per SM depending on architecture), so excessive duplication of large loop headers across many loops in a kernel could cause I-cache pressure. The pass does not have a size threshold to prevent this.
LoopSimplify
Enforces LLVM's canonical loop form: single preheader, single latch, single dedicated exit block, and no abnormal edges. Nearly every loop optimization pass requires simplify form as a precondition.
| Property | Value |
|---|---|
| Canonicalization core | sub_1A5B3D0 (62 KB) |
| DomTree update helper | sub_1A593E0 (47 KB) |
| Preheader insertion | sub_1A5E350 (25 KB) |
| Exit block normalization | sub_1A5F590 (42 KB) |
| Pass name | "loop-simplify" |
| String markers | ".backedge", "llvm.loop" |
| Pipeline wrapper (standalone) | sub_1832270(n) where n = verify flag |
| Pipeline wrapper (bundled) | sub_1841180() -- LoopSimplify + LCSSA combined |
| NVIDIA delta | None -- stock LLVM |
Pipeline placement. LoopSimplify is the most frequently invoked loop pass in the cicc pipeline:
| Context | Call site | Position |
|---|---|---|
| Full O1+ pipeline | sub_1841180() | Position 40 (bundled with LCSSA) |
| Ofcmid pipeline | sub_1832270(1) | Position 11 (standalone) |
| Ofcmid pipeline | sub_1841180() | Position 15 (bundled with LCSSA) |
| Post-tier insertion | sub_1841180() | Tier 2/3 additional invocations |
| As precondition | sub_157F0D0 (check) | Called by LoopInterchange, LoopDistribute, IRCE, LoopVectorize |
The pass appears at least 5 times across different pipeline tiers. It also runs as a utility called by other loop passes -- LoopInterchange, LoopDistribute, IRCE, and LoopVectorize all check isLoopSimplifyForm() (sub_157F0D0) and bail out if it fails.
What it does. If a loop lacks a single preheader, LoopSimplify creates one by inserting a new basic block on the entry edge (named with .lr.ph suffix via sub_1A5E350). If multiple latch blocks exist, it merges them into one (inserting .backedge blocks). If exit blocks are shared with other loops, it creates dedicated exit blocks via sub_1A5F590 (42 KB normalization function). After transformation, loop metadata ("llvm.loop" nodes) is preserved on the new latch terminator.
GPU considerations. LoopSimplify is purely structural and has no GPU-specific implications. However, it is worth noting that StructurizeCFG (which runs after all loop optimizations, during NVPTX code generation) re-canonicalizes the CFG for GPU divergence handling. Loop structures created by LoopSimplify may be further modified by StructurizeCFG when the loop contains divergent branches. The two passes do not interfere because they run in different pipeline phases (IR optimization vs. code generation).
LCSSA (Loop-Closed SSA)
Ensures that every value defined inside a loop and used outside it passes through a PHI node at the loop exit. This invariant simplifies SSA-based transformations: passes can modify loop internals without worrying about breaking uses outside the loop.
| Property | Value |
|---|---|
| Formation pass | sub_1AE2630 (49 KB) |
| Lightweight form | sub_1961B00 (13 KB) -- creates .lcssa PHI nodes |
| LCSSA updater | sub_1AF8F90 -- used by LoopInterchange post-transformation |
| Pass name | "lcssa" |
| Verify knob | verify-loop-lcssa registered at ctor_094 (~0x4A2491) |
| String markers | ".lcssa" suffix on PHI node names |
| NVIDIA delta | None -- stock LLVM |
Pipeline placement. LCSSA runs bundled with LoopSimplify via sub_1841180() at position 40 in the full pipeline. In the Ofcmid fast-compile pipeline, it appears at position 15 via the same bundled wrapper. It is also maintained incrementally by every pass that modifies loop structure:
- LoopInterchange calls `sub_1AF8F90` to update LCSSA form for both inner and outer loops after transformation. The inner loop is updated first. The TTI availability boolean from `a1+192` is passed as the 4th argument to the updater.
- LoopUnroll checks LCSSA form via `sub_D49210` and generates `.unr-lcssa` blocks for unrolled iterations.
- LoopIdiom expansions (`sub_2AA00B0`, `sub_2AA3190`) end with an explicit `verifyLoopLCSSA` assertion ("Loops must remain in LCSSA form!").
What it does. For each instruction defined inside the loop, LCSSA checks all uses outside the loop's exit blocks. For each such use, it inserts a PHI node in the exit block with the defined value as the incoming value from the latch. The PHI node is named with a .lcssa suffix. After LCSSA formation, all external uses of loop-internal values go through these PHI nodes, and loop transforms only need to update the PHI nodes rather than chasing all external uses.
GPU considerations. LCSSA is purely structural and has no GPU-specific behavior. However, LCSSA PHI nodes interact with the NVPTX backend's divergence analysis: when a loop exit depends on a divergent condition (different threads take different exit iterations), the .lcssa PHI node at the exit carries a divergent value. The divergence analysis pass (NVVMDivergenceLowering, sub_1C76260) must handle these PHIs correctly to avoid generating incorrect predication. This is not an issue with LCSSA itself but with downstream consumers.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| IndVarSimplify::run (core) | sub_1945A50 | 65 KB | -- |
| IndVarSimplifyPass::run (NewPM wrapper with NVIDIA guards) | sub_19489B0 | -- | -- |
| rewriteLoopExitValues | sub_1943460 | -- | -- |
| replaceExitValuesWithCompute (LFTR commit) | sub_1941790 | -- | -- |
| computeWideTripCount | sub_1940670 | -- | -- |
| hasCongruousExitingBlock | sub_193E1A0 | -- | -- |
| getLoopDepth (recursive, 1 for outermost) | sub_193DD90 | -- | -- |
| isBetterIV (candidate comparison for LFTR) | sub_193E640 | -- | -- |
| exitValueSafeToExpand (SCEV expandability check) | sub_193F280 | -- | -- |
| findFinalIVValue (trace IV to exit value) | sub_193F190 | -- | -- |
| hasSafeExitBlock (exit block LFTR safety) | sub_193F750 | -- | -- |
| initPassState (initialize pass-level state) | sub_1940CE0 | -- | -- |
| clearPassState (cleanup per-iteration state) | sub_1940B30 | -- | -- |
| SimplifyIndVar::simplifyIVUsers | sub_1B649E0 | -- | -- |
| LoopInterchange::processLoopList | sub_1979A90 | 69 KB | -- |
| LoopInterchange legality checker | sub_1975210 | 45 KB | -- |
| LoopInterchange dependence analysis helper | sub_1978000 | 37 KB | -- |
| LoopInterchange::getAnalysisUsage | sub_19743F0 | -- | -- |
| SmallVector copy helper (dep vector / loop list) | sub_19742B0 | -- | -- |
| vector<DepVector> push_back | sub_1974CB0 | -- | -- |
| Swap loop bounds / trip count metadata | sub_1973F90 | -- | -- |
| InductiveRangeCheckElimination::run | sub_194D450 | 71 KB | -- |
| createPreLoop / cloneLoopForRange (~1200 bytes) | sub_194C320 | -- | -- |
| createPostLoop / wirePostLoop (~1300 bytes) | sub_194AE30 | -- | -- |
| classifyRangeCheckICmp (~800 bytes) | sub_1949EA0 | -- | -- |
| canHandleRangeCheck (~400 bytes) | sub_1949540 | -- | -- |
| canHandleRangeCheckExtended (~300 bytes, constrained mode) | sub_1949670 | -- | -- |
| buildInductiveRangeCheck (~500 bytes) | sub_1949C30 | -- | -- |
| adjustSCEVAfterCloning | sub_1949270 | -- | -- |
| simplifyLoopAfterCloning (~200 bytes) | sub_1948FD0 | -- | -- |
| verifyLoopStructure (~200 bytes) | sub_1948D70 | -- | -- |
| LoopDistributePass::run | sub_1A8CD80 | 63 KB | -- |
| distributeLoopBody (core fission engine, ~2000 bytes) | sub_1B1E040 | -- | -- |
| updateDominatorTree (post-distribution, ~400 bytes) | sub_1B1DC30 | -- | -- |
| updateLoopInfo (post-distribution, ~300 bytes) | sub_1B1DDA0 | -- | -- |
| cleanupPartitions (~400 bytes) | sub_1B1F0F0 | -- | -- |
| verifyDistribution (~300 bytes) | sub_1B216C0 | -- | -- |
| cleanupAfterDistribution (~200 bytes) | sub_1A8C510 | -- | -- |
| lookupPartitionForInstruction (hash table lookup) | sub_3860240 | -- | -- |
| hasDirectDependence(partA, partB) | sub_385DBB0 | -- | -- |
| alreadyMerged(partA, partB) | sub_385DB90 | -- | -- |
| isSafeToDistribute (final safety check) | sub_1452CB0 | -- | -- |
| LoopIdiomRecognize::run | sub_196FF90 | 51 KB | -- |
| LoopIdiom memset pattern detection | sub_196B740 | 10 KB | -- |
| LoopIdiom memcpy/memmove patterns | sub_196E000 | 43 KB | -- |
| expandMemCmpMismatch | sub_2AA00B0 | 48 KB | -- |
| expandFindFirst (string search vectorization) | sub_2AA3190 | 40 KB | -- |
| expandByteMismatchLoopBody (type 0) | sub_2A9D690 | -- | -- |
| expandWordMismatchLoopBody (type 1) | sub_2A9EC20 | -- | -- |
| replaceUsesOfPhiInSuccessors (LCSSA fixup) | sub_2A9D330 | -- | -- |
| LoopRotation::runOnLoop | sub_2A0CFD0 | 65 KB | -- |
| LoopRotatePass (NewPM, "header-duplication;") | sub_28448D0 | -- | -- |
| LoopRotate (legacy pipeline call) | sub_18A3090 | -- | -- |
| LoopSimplify canonical form enforcement | sub_1A5B3D0 | 62 KB | -- |
| LoopSimplify DomTree update helper | sub_1A593E0 | 47 KB | -- |
| LoopSimplify preheader insertion | sub_1A5E350 | 25 KB | -- |
| LoopSimplify exit block normalization | sub_1A5F590 | 42 KB | -- |
| LoopSimplify pipeline wrapper (with verify flag) | sub_1832270 | -- | -- |
| LoopSimplify + LCSSA bundled pass | sub_1841180 | -- | -- |
| LCSSA formation pass | sub_1AE2630 | 49 KB | -- |
| LCSSA lightweight .lcssa PHI insertion | sub_1961B00 | 13 KB | -- |
| LCSSA form updater (used post-interchange) | sub_1AF8F90 | -- | -- |
| verifyLoopLCSSA (assertion: "Loops must remain in LCSSA form!") | sub_D48E00 | -- | -- |
Differences from Upstream LLVM
| Aspect | Upstream LLVM | CICC v13.0 |
|---|---|---|
| IndVarSimplify knobs | Stock LLVM defaults; no GPU-specific configuration | Three NVIDIA-specific knobs that change IV widening/narrowing behavior for GPU register pressure management |
| Barrier awareness | No concept of GPU barriers or synchronization primitives | None of the 8 standard passes have explicit barrier awareness; barrier handling deferred to dedicated NVIDIA passes (Dead Barrier Elimination, convergence token verification) |
| LoopRotate frequency | Runs once or twice in pipeline | Appears multiple times as canonicalization prerequisite for LICM and unrolling; forms the backbone of loop pass infrastructure |
| LoopIdiom patterns | memset, memcpy recognition for CPU targets | Same patterns; GPU-specific expansion handled downstream by MemmoveUnroll pass |
| IRCE | Range check elimination for deoptimization-safe targets | Present but effectiveness limited on GPU: no deoptimization support, relies on SCEV range analysis for bound proofs |
| LoopInterchange | Cost model driven by cache locality | Same legality checks; profitability analysis implicitly favors stride-1 access (coalescing) over cache line optimization |
| IV Demotion | Not present | Downstream NVIDIA pass (IV Demotion) narrows IVs widened by IndVarSimplify back to 32-bit where GPU value ranges permit |
Cross-References
- LoopVectorize & VPlan -- LoopDistribute feeds vectorization; IRCE removes bounds checks that block it.
- Loop Unrolling -- Runs after IndVarSimplify canonicalizes IVs; requires LoopSimplify form. The `unroll-runtime-convergent` knob forces epilogue mode when convergent calls (warp-level primitives) are present -- an interaction with GPU barrier semantics that these 8 standard passes do not handle.
- LICM -- Requires LoopRotate and LoopSimplify as prerequisites.
- ScalarEvolution -- IndVarSimplify and IRCE are among the heaviest SCEV consumers; LoopInterchange uses SCEV for stride analysis. LoopRotate and LoopDistribute call `ScalarEvolution::forgetLoop` after transformation.
- SCEV Invalidation -- LoopRotate requires BTC recomputation after the header/latch swap; LoopDistribute calls forgetLoop after fission.
- Loop Strength Reduction -- Runs after IndVarSimplify; consumes the canonicalized IV forms it produces. LSR has address-space-aware chain construction for shared memory (addrspace 3) that these 8 passes lack.
- IV Demotion -- NVIDIA's custom pass that narrows IVs widened by IndVarSimplify back to 32-bit where value ranges permit, reducing register pressure for GPU occupancy.
- Dead Barrier Elimination -- Handles barrier optimization that these standard loop passes do not address.
- Pipeline & Ordering -- LoopRotate at position 11, LoopSimplify/LCSSA at position 40 in the full O1+ pipeline.
- NVVMDivergenceLowering -- Handles divergent LCSSA PHI nodes at loop exits when different threads take different exit iterations.
Loop Unrolling
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
Upstream source: `llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp` (decision engine), `llvm/lib/Transforms/Utils/LoopUnroll.cpp` (transformation engine), `llvm/lib/Transforms/Utils/LoopUnrollRuntime.cpp` (runtime unrolling) (LLVM 20.0.0)
Loop unrolling in cicc is one of the most heavily tuned transformations in the entire pipeline. On a GPU, unrolling directly trades register pressure against instruction-level parallelism: every additional copy of the loop body increases live register count, which reduces SM occupancy and the number of concurrent warps available to hide memory latency. Conversely, too little unrolling leaves performance on the table by failing to expose independent instructions that the hardware scheduler can overlap. NVIDIA's unroller resolves this tension through a priority-based decision cascade with GPU-specific heuristics that have no upstream equivalent -- most notably a local-array threshold multiplier, power-of-two factor enforcement, and a pragma threshold twice that of stock LLVM. The transformation engine itself is a lightly modified version of upstream llvm::UnrollLoop, but the decision engine (computeUnrollCount) is substantially reworked.
The pass appears twice in the cicc pipeline. The first invocation (sub_197E720) runs early, interleaved with loop vectorization in the main optimization sequence. The second invocation (sub_19C1680) runs later as a cleanup pass, gated by opts[1360] (the nv-disable-loop-unrolling flag). Both share the same decision engine; the second invocation operates on loops that were created or exposed by intervening passes (InstCombine, SROA, EarlyCSE).
| Property | Value |
|---|---|
| Decision engine | sub_19BB5C0 / computeUnrollCount (50 KB, ~1681 lines) |
| Transformation engine | sub_2A15A20 / UnrollLoop (85 KB, ~2434 lines) |
| Top-level driver | sub_19BE360 / tryToUnrollLoop |
| Runtime-check unroller | sub_2A25260 / UnrollLoopWithRuntimeChecks (91 KB) |
| Pipeline slot (early) | sub_197E720 -- runs once in main opt pipeline |
| Pipeline slot (late) | sub_19C1680 -- conditional on !opts[1360] |
| Disable knob | -Xcicc "-disable-LoopUnrollPass" or opts[1360] |
| LLVM base | LoopUnrollPass from LLVM 20.0.0 |
Why Unrolling Matters More on GPU
On a CPU, the primary benefit of unrolling is reducing branch overhead and enabling wider SIMD scheduling. On a GPU, the calculus is different in three ways that all trace back to the GPU execution model:
First, unrolling increases register pressure, and register pressure determines occupancy. If unrolling pushes a kernel from 64 to 96 registers per thread, the SM drops from 32 to 21 resident warps -- a 34% reduction. Fewer warps means less latency hiding, so the unroll factor selection must be conservative in ways that a CPU unroller never needs to be.
Second, there is no out-of-order execution within a warp; the hardware issues instructions in program order. Unrolling creates independent instructions that the compiler (ptxas) can interleave, particularly independent loads that can overlap with arithmetic. This is the ILP benefit, and it is the primary argument for aggressive unrolling.
Third, GPU loops often access shared memory (__shared__) or local memory arrays indexed by threadIdx. Unrolling these loops enables the backend to promote array elements to registers and to rearrange memory accesses to avoid bank conflicts. NVIDIA's local-array heuristic (see below) exists specifically to exploit this opportunity.
The unroller's job is to find the sweet spot: enough copies to saturate the instruction pipeline, few enough to keep register pressure within occupancy targets.
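The occupancy arithmetic behind the first point can be sketched directly. This is a simplified model (an assumption, not cicc code): it considers only the register file, taking a 65,536-register file per SM and 32 threads per warp, which are typical of recent architectures; real occupancy also depends on shared memory, block size, and the per-SM warp cap.

```python
# Simplified register-limited occupancy model (illustrative assumption,
# not cicc's actual calculation). Registers are allocated per thread;
# a warp of 32 threads consumes regs_per_thread * 32 registers.

REGFILE_PER_SM = 65536   # assumed 64K 32-bit registers per SM
WARP_SIZE = 32

def resident_warps(regs_per_thread, max_warps=32):
    by_registers = REGFILE_PER_SM // (regs_per_thread * WARP_SIZE)
    return min(by_registers, max_warps)

# 64 regs/thread -> 32 warps; 96 regs/thread -> 21 warps,
# the ~34% reduction cited above.
print(resident_warps(64), resident_warps(96))
```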
The Decision Engine: computeUnrollCount
The decision engine at sub_19BB5C0 implements a strict six-level priority cascade. Each level is tried in order; the first level that produces a valid unroll factor wins. Every decision is logged through optimization remarks, making the logic traceable from -Rpass-analysis=loop-unroll.
UnrollParams Struct Layout
The decision communicates its result through a struct passed by pointer (a12 / v14):
| Offset | Field | Type | Description |
|---|---|---|---|
| +0 | Threshold | u32 | Cost budget for full unroll |
| +4 | MaxPercentThresholdBoost | u32 | Max boost percentage (default 400) |
| +12 | PartialThreshold | u32 | Cost budget for partial unroll |
| +20 | Count | u32 | Chosen unroll factor (primary output) |
| +24 | PeelCount | u32 | Loop peel iteration count |
| +28 | DefaultUnrollCount | u32 | Fallback count when no factor found |
| +32 | MaxCount | u32 | Hard cap on unroll factor |
| +36 | FullUnrollMaxCount | u32 | Max trip count for full unroll |
| +40 | FixedCost | u32 | Non-scaling cost (IV increments, branches) |
| +44 | AllowPartial | u8 | Partial unrolling permitted |
| +45 | AllowRemainder | u8 | Remainder loop generation permitted |
| +46 | UserProvidedCount | u8 | True when pragma supplies count |
| +48 | (reserved) | u8 | -- |
| +49 | AllowUpperBound | u8 | Use max-trip-count when exact unknown |
The Cost Model
Every decision in the cascade uses the same linear cost model to estimate unrolled loop size:
estimated_size = FixedCost + Count * (LoopBodySize - FixedCost)
LoopBodySize is the instruction cost of one iteration (parameter a11, computed by LLVM's CodeMetrics). FixedCost captures instructions that do not replicate with unrolling -- induction variable increments, the backedge branch, loop overhead. The difference (LoopBodySize - FixedCost) is the per-copy marginal cost.
For full unrolls, an additional dynamic cost simulation (sub_19B9A90) constant-folds through the unrolled body. If the loop contains iteration-dependent simplifications (constant array indices, strength-reduced expressions), the simulation reports a cost lower than worst-case. The effective budget for this check is boosted:
dynamic_budget = Threshold * MaxPercentThresholdBoost / 100
With the default boost of 400%, this means a loop whose body simplifies substantially after unrolling gets 4x the normal cost budget.
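The two formulas above can be written out as a small sketch (field names mirror the UnrollParams struct; the numbers in the example are hypothetical):

```python
# Linear size model and boosted dynamic budget, as described above.

def estimated_size(fixed_cost, body_size, count):
    # estimated_size = FixedCost + Count * (LoopBodySize - FixedCost)
    return fixed_cost + count * (body_size - fixed_cost)

def dynamic_budget(threshold, max_percent_threshold_boost=400):
    # dynamic_budget = Threshold * MaxPercentThresholdBoost / 100
    return threshold * max_percent_threshold_boost // 100

# Hypothetical loop: body costs 40 units, 10 of which are fixed overhead
# (IV increment, backedge). Fully unrolling 8x costs 10 + 8*30 = 250 units.
# With the default 400% boost, a Threshold of 300 yields a 1200-unit
# budget for the constant-folding simulation.
print(estimated_size(10, 40, 8), dynamic_budget(300))
```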
Priority Cascade (Pseudocode)
int computeUnrollCount(Loop *L, SE, TTI, TripCount, MaxTripCount,
BodySize, UnrollParams *UP, bool *AllowRuntime) {
// PRIORITY 1: Local array threshold multiplier (NVIDIA-specific)
int localSize = computeLocalArraySize(L); // scans for AS5 allocas
int multiplier = min(max(localSize, 1), 6);
int effectiveThreshold = multiplier * UP->Threshold;
// PRIORITY 2: #pragma unroll N
int pragmaCount = getMetadataCount(L, "llvm.loop.unroll.count");
if (pragmaCount != 0) {
if (pragmaCount == 1) {
UP->Count = 1; // disable unrolling
return UNROLL_DISABLED;
}
UP->Count = pragmaCount;
int estSize = UP->FixedCost + pragmaCount * (BodySize - UP->FixedCost);
if (estSize > multiplier * PragmaUnrollThreshold) {
// too large -- try to find smaller factor
searchSmallerDivisibleFactor(UP, TripCount);
}
if (TripMultiple % pragmaCount != 0)
emitRemark("remainder loops not allowed");
return UNROLL_PRAGMA;
}
// PRIORITY 3: #pragma unroll (full, no count)
if (hasMetadata(L, "llvm.loop.unroll.full")) {
if (TripCount > 0 && TripCount <= UP->FullUnrollMaxCount) {
int estSize = UP->FixedCost + TripCount * (BodySize - UP->FixedCost);
if (estSize <= effectiveThreshold) {
if (simulateLoopBody(L, TripCount, dynamicBudget))
{ UP->Count = TripCount; return FULL_UNROLL; }
}
}
// fallthrough to lower priorities
}
// PRIORITY 4: Loop peeling
int peelCount = computePeelCount(L, SE, UP);
if (peelCount > 0) {
UP->PeelCount = peelCount;
UP->Count = 1;
return PEEL;
}
// PRIORITY 5: Static partial unrolling (known trip count)
if (TripCount > 0 && (UP->AllowPartial || pragmaOversize) && isInnermost(L)) {
int count = UP->Count ? UP->Count : UP->DefaultUnrollCount;
// Size clamp
if (UP->PartialThreshold < UP->FixedCost + count * (BodySize - UP->FixedCost))
count = (UP->PartialThreshold - UP->FixedCost) / (BodySize - UP->FixedCost);
count = min(count, UP->MaxCount);
// Power-of-two + trip-divisible search
while (count > 0) {
if (TripCount % count == 0 && isPowerOfTwo(count))
break;
count--;
}
// Fallback: halve DefaultUnrollCount until it fits
if (count == 0 && UP->UserProvidedCount) {
count = UP->DefaultUnrollCount;
while (UP->PartialThreshold <
UP->FixedCost + count * (BodySize - UP->FixedCost))
count >>= 1;
}
if (count > 1) { UP->Count = count; return PARTIAL_UNROLL; }
}
// PRIORITY 6: Runtime unrolling (unknown trip count)
if (!hasMetadata(L, "llvm.loop.unroll.runtime.disable")
&& RuntimeUnrollThreshold >= BodySize
&& isInnermost(L)) {
int rtTripCount = computeRuntimeTripCount(L, SE);
if (rtTripCount < FlatLoopTripCountThreshold) return NO_UNROLL;
int count = UP->Count ? UP->Count : UP->DefaultUnrollCount;
// same halving + threshold logic as Priority 5
while (UP->PartialThreshold <
UP->FixedCost + count * (BodySize - UP->FixedCost))
count >>= 1;
count = min(count, UP->MaxCount);
if (count > 1) {
UP->Count = count;
*AllowRuntime = true;
return RUNTIME_UNROLL;
}
}
// Small-function override (tiny kernels get aggressive unrolling)
if (functionInstructionCount < SmallFunctionThreshold)
return handleSmallFunction(L, UP, BodySize);
return NO_UNROLL;
}
Local Array Heuristic
The function sub_19B5DD0 (computeLocalArraySize) is entirely NVIDIA-specific. It scans every basic block in the loop for load/store instructions that access address space 5 (GPU local memory). For each such access, it traces back to the underlying alloca, determines the array type, and computes the product of array dimensions. If any dimension is unknown at compile time, it substitutes the unroll-assumed-size knob (default 4). The returned value is the maximum local-array size found across all accesses.
This value becomes a threshold multiplier, capped at 6:
int computeLocalArraySize(Loop *L) {
int maxSize = 0;
for (BasicBlock *BB : L->blocks()) {
for (Instruction &I : *BB) {
if (!isLoadOrStore(I) || getAddressSpace(I) != 5) continue;
Value *base = getUnderlyingAlloca(I);
if (!base || !isArrayType(base->getType())) continue;
int size = 1;
for (int dim : getArrayDimensions(base))
size *= (dim > 0) ? dim : UnrollAssumedSize; // default 4
maxSize = max(maxSize, size);
}
}
return maxSize;
}
The rationale: GPU kernels frequently use __shared__ or local arrays indexed by threadIdx. Unrolling such loops by a factor proportional to the array size enables register promotion of individual array elements and eliminates bank-conflict-prone access patterns. The cap at 6 prevents pathological explosion when arrays are large.
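A worked example of the multiplier clamp (the kernel and array shapes here are hypothetical):

```python
# Worked example of the local-array threshold multiplier described above.
# A thread-local float buf[8][4] gives a dimension product of 32, which
# the clamp reduces to the cap of 6; an unknown dimension substitutes
# unroll-assumed-size (default 4).

UNROLL_ASSUMED_SIZE = 4

def local_array_multiplier(dims):
    size = 1
    for d in dims:
        size *= d if d > 0 else UNROLL_ASSUMED_SIZE
    return min(max(size, 1), 6)

print(local_array_multiplier([8, 4]))   # large array, capped at 6
print(local_array_multiplier([2]))      # small array -> multiplier 2
print(local_array_multiplier([-1]))     # unknown dim -> assumed size 4
# With a base Threshold of 300, the 6x multiplier yields an effective
# full-unroll budget of 1800 cost units.
```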
Power-of-Two Factor Enforcement
The partial-unroll factor search at Priority 5 requires the chosen count to satisfy two constraints simultaneously: it must evenly divide the trip count and must be a power of two. The implementation uses the classic bitmask test:
while (count > 0) {
if (tripCount % count == 0 && (count & (count - 1)) == 0)
break;
count--;
}
This is a GPU-specific requirement. Warp size is 32 (a power of two), and many GPU memory access patterns, shared-memory bank calculations, and reduction operations assume power-of-two alignment. An unroll factor of, say, 6 would create asymmetric loop bodies that interact poorly with warp-level execution.
Pragma Handling
The frontend (sub_9305A0 / emitUnrollPragma) translates CUDA pragmas to LLVM metadata during codegen:
| CUDA Source | LLVM Metadata |
|---|---|
| #pragma unroll (bare) | !{!"llvm.loop.unroll.full"} |
| #pragma unroll N (N > 1) | !{!"llvm.loop.unroll.count", i32 N} |
| #pragma unroll 1 | Disables unrolling at Priority 2 |
The metadata is attached to the backedge branch as a self-referential !llvm.loop node. A guard flag (dword_4D046B4) skips pragma processing entirely in fast-codegen mode.
The pragma threshold is 32768 (0x8000), compared to upstream LLVM's 16384 (0x4000). This means #pragma unroll succeeds on loop bodies up to approximately 32K cost units -- covering virtually any realistic GPU kernel loop. When even this generous budget is exceeded, the decision engine falls through to lower priorities and attempts partial unrolling.
The __launch_bounds__ attribute does not directly feed the unroll decision. Instead, it constrains register allocation downstream, which indirectly limits the benefit of aggressive unrolling. There is no feedback loop from register pressure estimation back into the unroll factor at this stage of the pipeline; that coordination happens implicitly through the PartialThreshold provided by TTI.
Runtime Unrolling
Runtime unrolling (Priority 6) handles loops whose trip count is unknown at compile time. cicc enables it by default (unroll-runtime = true), with several GPU-specific twists:
Convergent instruction support. The knob unroll-runtime-convergent (default true, NVIDIA-specific) allows unrolling loops that contain convergent operations like warp-level primitives (__shfl_sync, __ballot_sync). Upstream LLVM refuses to unroll such loops because it cannot guarantee all threads in the warp execute the same iterations. cicc overrides this, relying on the waterfall-epilogue mechanism to preserve convergence.
Epilog vs. prolog remainder. The choice is controlled by a cascade:
- If `waterfall-unrolling-force-epilogue` is `true` (default, NVIDIA-specific) and the loop has a runtime trip count: epilog mode is selected.
- If the loop body contains function calls (`hasCallInLoop` / `sub_2A10B40` checks for opcode 17): epilog mode is forced. This preserves the property that all threads in a warp participate in calls, which matters for convergent operations.
- Otherwise, `unroll-runtime-epilog` (default `false`) determines the mode.
In practice, GPU loops almost always use epilog-style remainders.
Flat-loop exclusion. If the estimated runtime trip count is below flat-loop-tripcount-threshold (default 5), runtime unrolling is skipped. The overhead of generating the modulo check and epilog loop is not worth it for loops that iterate fewer than 5 times.
Body size gate. Runtime unrolling only proceeds if runtime-unroll-threshold (default 95) is greater than or equal to the loop body size. This is more conservative than the static partial-unroll threshold, preventing code explosion for large loop bodies when the trip count is unknown.
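The two gates combine into a simple predicate (a sketch of the documented behavior, using the stated defaults):

```python
# The two runtime-unrolling gates described above, with cicc's defaults.

RUNTIME_UNROLL_THRESHOLD = 95       # runtime-unroll-threshold
FLAT_LOOP_TRIPCOUNT_THRESHOLD = 5   # flat-loop-tripcount-threshold

def runtime_unroll_allowed(body_size, est_trip_count):
    if body_size > RUNTIME_UNROLL_THRESHOLD:
        return False   # body size gate: too large for an unknown trip count
    if est_trip_count < FLAT_LOOP_TRIPCOUNT_THRESHOLD:
        return False   # flat-loop exclusion: modulo check not worth it
    return True

print(runtime_unroll_allowed(40, 100))   # passes both gates
print(runtime_unroll_allowed(120, 100))  # rejected by the body size gate
print(runtime_unroll_allowed(40, 3))     # rejected as a flat loop
```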
Thresholds: NVIDIA vs. Upstream LLVM
| Parameter | Upstream LLVM (O3) | Upstream LLVM (NVPTX TTI) | cicc v13.0 |
|---|---|---|---|
| Threshold | 300 | 300 | From TTI (300), then multiplied by local-array factor (1-6x) |
| PartialThreshold | 150 | 75 (Threshold / 4) | From TTI (75), plus local-array scaling |
| MaxPercentThresholdBoost | 400% | 400% | 400% (same) |
| PragmaUnrollThreshold | 16384 | 16384 | 32768 |
| RuntimeUnrollThreshold | -- | -- | 95 (NVIDIA addition) |
| FlatLoopTripCountThreshold | 5 | 5 | 5 (same) |
| MaxUpperBound | 8 | 8 | 8 (same) |
| MaxPragmaUpperBound | -- | -- | 64 (NVIDIA addition) |
| DefaultUnrollRuntimeCount | 8 | 8 | From TTI |
| AllowPartial | false | true | true (from TTI) |
| Runtime | false | true | true (from TTI) |
| AllowRemainder | true | true | true |
| MaxIterationsCountToAnalyze | 10 | 10 | 10 (same) |
| UnrollAssumedSize | -- | -- | 4 (NVIDIA addition) |
The critical differences: cicc doubles the pragma threshold, introduces a body-size gate for runtime unrolling (95), adds the local-array multiplier (up to 6x on base thresholds), and enforces power-of-two partial factors. The upstream NVPTX TTI enables partial and runtime unrolling but leaves thresholds at modest CPU-oriented values; cicc's decision engine applies substantial additional logic on top.
Interaction with Loop Vectorization
In the cicc pipeline, loop vectorization (LoopVectorizePass) runs before the first unroll invocation. Specifically, sub_197E720 combines both vectorization and unrolling decisions in the early pipeline slot. The vectorizer decides the vector width first (VF), and if it applies a transformation, the resulting loop (possibly with a scalar epilog) is then presented to the unroller.
This means vectorization and unrolling do not "coordinate" in the planning sense -- the vectorizer runs to completion before the unroller sees the loop. However, the vectorizer's interleave count (IC) serves a similar role to unrolling: it replicates the vectorized loop body to increase ILP. When the vectorizer chooses IC > 1, the subsequent unroller typically finds the loop body too large to unroll further, producing a de facto coordination through cost thresholds.
The second unroll invocation (sub_19C1680) runs much later, after InstCombine, SROA, and EarlyCSE have had a chance to simplify the vectorized code. Loops that were too large to unroll earlier may become eligible after dead code elimination within the unrolled-and-vectorized body.
The Transformation Engine: UnrollLoop
The transformation at sub_2A15A20 takes a loop and an unroll factor and physically duplicates the loop body. It is structurally close to upstream llvm::UnrollLoop with the following entry guards:
- Loop must have a preheader (`sub_D4B130`)
- Loop must have a single latch (`sub_D47930`)
- Loop must be in LCSSA form (`sub_D49210`)
- Header flags must be clean (no special bits set)
The duplication proceeds by iterating Count - 1 times, each iteration cloning every basic block in the loop body, remapping instructions through a value map, and rewiring PHI nodes so that iteration i's latch feeds iteration i+1's header. After all copies, the backedge of the last copy is reconnected to the first copy's header (for partial unroll) or removed entirely (for full unroll).
For partial unrolls where TripCount % Count != 0, a remainder loop is generated by sub_2A23640. If remainder generation fails (e.g., multi-exit loops), the engine delegates to sub_2A25260 which generates the runtime-check variant with prologue/epilogue.
The return value encodes the result: 0 = no change, 1 = partial unroll, 2 = full unroll.
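The semantics of a partial unroll with an epilog remainder can be sketched as follows (an illustrative equivalence, not the cicc cloning code): the main loop executes trip_count // Count unrolled bodies and the remainder loop picks up the leftover trip_count % Count iterations, visiting every index exactly once.

```python
# Illustrative 4x partial unroll with an epilog remainder loop.

def unrolled_sum(values):
    count = 4                       # unroll factor
    total, i, n = 0, 0, len(values)
    main_trip = n - n % count       # iterations handled by the main loop
    while i < main_trip:            # unrolled main loop: 4 body copies
        total += values[i]
        total += values[i + 1]
        total += values[i + 2]
        total += values[i + 3]
        i += count
    while i < n:                    # epilog remainder loop
        total += values[i]
        i += 1
    return total

data = list(range(10))  # trip count 10 = 2 unrolled trips + 2 remainder
assert unrolled_sum(data) == sum(data)
```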
Configuration Knobs
Standard LLVM Knobs (with NVIDIA defaults)
| Knob | Default | Global | Effect |
|---|---|---|---|
| unroll-threshold | From TTI | sub_19B7760 struct | Base cost budget for full unroll |
| unroll-partial-threshold | From TTI | 0x4FB3140 area | Cost budget for partial unroll |
| unroll-max-percent-threshold-boost | 400 | dword_4FB3100 | Max dynamic cost boost (%) |
| unroll-max-iteration-count-to-analyze | 10 | dword_4FB3020 | Max iterations for cost simulation |
| unroll-count | Unset | dword_4FB2EA8 | Force specific unroll factor |
| unroll-max-count | Unset | sub_19B7760 struct | Hard cap on unroll factor |
| unroll-full-max-count | Unset | 0x4FB2CE0 area | Max trip count for full unroll |
| unroll-peel-count | Unset | 0x4FB2C00 area | Force specific peel count |
| unroll-allow-partial | false | 0x4FB2B20 area | Enable partial unrolling override |
| unroll-allow-remainder | false | 0x4FB2A40 area | Enable remainder loop generation |
| unroll-runtime | true | 0x4FB2960 area | Enable runtime (dynamic TC) unrolling |
| unroll-max-upperbound | 8 | dword_4FB2920 | Max trip count for upper-bound unroll |
| pragma-unroll-threshold | 32768 | dword_4FB2760 | Cost budget for pragma-directed unrolls |
| flat-loop-tripcount-threshold | 5 | 0x4FB2680 area | Min estimated TC for runtime unroll |
| runtime-unroll-threshold | 95 | dword_4FB3560 | Max body size for runtime unroll |
| max-pragma-upperbound-unroll | 64 | dword_4FB2840 | Max upper-bound factor for pragma |
| unroll-assumed-size | 4 | dword_4FB33A0 | Assumed array size for unknown dims |
NVIDIA-Specific Knobs
| Knob | Default | Global | Effect |
|---|---|---|---|
| unroll-runtime-convergent | true | 0x500A440 area | Allow unrolling loops with convergent ops |
| unroll-runtime-epilog | false | qword_500A3E8 | Force epilog-style remainder (override) |
| waterfall-unrolling-force-epilogue | true | qword_500A148 | Force epilog for waterfall patterns |
Knobs are registered in two constructors: standard LLVM knobs in ctor_216_0 at 0x4E5C30, NVIDIA-specific knobs in ctor_501 at 0x559890.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| emitUnrollPragma | 0x09305A0 | -- | Frontend: #pragma unroll to metadata |
| parseUnrollMetadata | 0x19B4C50 | -- | Reads llvm.loop.unroll.* metadata |
| computeLocalArraySize | 0x19B5DD0 | -- | NVIDIA: local array threshold heuristic |
| handleSmallFunction | 0x19B6500 | -- | Special aggressive unroll for tiny kernels |
| selectUnrollFactor | 0x19B6690 | -- | Trip count analysis helper |
| emitRemainderNotAllowedRemark | 0x19B78B0 | -- | Diagnostic emission |
| simulateLoopBody | 0x19B9A90 | -- | Dynamic cost simulation with constant folding |
| computeUnrollCount | 0x19BB5C0 | -- | Main decision engine |
| tryToUnrollLoop | 0x19BE360 | -- | Top-level driver |
| computePeelCount | 0x1B0B080 | -- | Loop peeling logic |
| computeRuntimeTripCount | 0x1B18810 | -- | Runtime trip count estimation |
| hasCallInLoop | 0x2A10B40 | -- | Checks for call/invoke in loop body |
| createSideExitPHI | 0x2A10DD0 | -- | PHI nodes for side-exit unrolled loops |
| cloneInstructionsInBlock | 0x2A12AD0 | -- | Instruction-level cloning |
| reconcileLoopAfterUnroll | 0x2A13F00 | -- | Post-unroll SCEV/LoopInfo fixup |
| UnrollLoop | 0x2A15A20 | -- | Main transformation engine |
| unrollCostModel | 0x2A1AA10 | -- | Cost estimation helper |
| UnrollAndJamLoop | 0x2A1CF00 | -- | Unroll-and-jam variant |
| generateRemainderLoop | 0x2A23640 | -- | Remainder loop construction |
| UnrollLoopWithRuntimeChecks | 0x2A25260 | -- | Prologue/epilogue generation |
Pass Factory and Object Layout
The following section documents the LoopUnroll pass factory at sub_19B73C0, which was originally misidentified as LICM in the P2C.3 sweep due to binary adjacency with the actual LICM pass. The pass ID unk_4FB224C, the 7-parameter constructor signature, and diagnostic function strings all confirm LoopUnroll identity.
The pass factory at sub_19B73C0 allocates a 184-byte pass object and accepts seven parameters that control unroll behavior. When a parameter is -1, the pass uses its compiled-in default.
Constructor Parameters
| Parameter | Offset | Enable Flag | Semantics |
|---|---|---|---|
| a1 (optimization level) | +156 | -- | 2 = standard, 3 = aggressive |
| a2 (unroll threshold) | +168 | +172 | Trip count threshold; -1 = use default |
| a3 (unroll count) | +160 | +164 | Explicit unroll factor; -1 = use default |
| a4 (allow partial) | +176 | +177 | 0 = disable partial unroll, 1 = enable |
| a5 (runtime unroll) | +178 | +179 | 0 = disable runtime unroll, 1 = enable |
| a6 (upper bound) | +180 | +181 | 0 = disable upper-bound unroll, 1 = enable |
| a7 (profile-based) | +182 | +183 | 0 = disable profile-guided unroll, 1 = enable |
Object Construction
The factory allocates 184 bytes via sub_22077B0, sets the vtable to off_49F45F0 (loop-unroll pass vtable), stores pass ID unk_4FB224C at offset +16, initializes self-referential linked-list pointers at offsets +80/+88 and +128/+136, sets pass type 2 (FunctionPass) at offset +24, and calls sub_163A1D0 / sub_19B71A0 for pass registration.
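The -1 sentinel convention and the (value, enable-flag) field pairs can be modeled as writes into a plain byte buffer. This is a sketch using the offsets from the constructor-parameter table; the struct, helper names, and zero-initialization are my own labels, not recovered symbols:

```c
#include <assert.h>
#include <string.h>

/* Byte-buffer model of the recovered 184-byte LoopUnroll pass object. */
typedef struct { unsigned char raw[184]; } UnrollPassObj;

/* 4-byte value plus 1-byte enable flag; -1 leaves the default in force. */
static void set_int_knob(UnrollPassObj *p, int val_off, int flag_off, int v) {
    if (v == -1) return;
    memcpy(&p->raw[val_off], &v, sizeof v);
    p->raw[flag_off] = 1;
}

/* 1-byte value plus adjacent 1-byte enable flag. */
static void set_bool_knob(UnrollPassObj *p, int val_off, int flag_off, int v) {
    if (v == -1) return;
    p->raw[val_off] = (unsigned char)v;
    p->raw[flag_off] = 1;
}

static UnrollPassObj make_unroll_pass(int opt, int thresh, int count,
                                      int partial, int runtime,
                                      int upper, int profile) {
    UnrollPassObj p;
    memset(&p, 0, sizeof p);
    memcpy(&p.raw[156], &opt, sizeof opt); /* +156: optimization level */
    set_int_knob(&p, 168, 172, thresh);    /* +168/+172: threshold */
    set_int_knob(&p, 160, 164, count);     /* +160/+164: unroll count */
    set_bool_knob(&p, 176, 177, partial);  /* +176/+177: allow partial */
    set_bool_knob(&p, 178, 179, runtime);  /* +178/+179: runtime unroll */
    set_bool_knob(&p, 180, 181, upper);    /* +180/+181: upper bound */
    set_bool_knob(&p, 182, 183, profile);  /* +182/+183: profile-based */
    return p;
}
```

Under this model, passing -1 leaves both the value slot and the enable flag untouched, so the decision engine later falls back to its compiled-in defaults; passing an explicit 0 sets the enable flag with a zero value, which is a different state.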
Pipeline Invocation Configurations
CICC invokes LoopUnroll with six distinct configurations at different pipeline stages, reflecting NVIDIA's careful tuning of unroll aggressiveness per compilation phase. These are the factory-level parameter sets passed to sub_19B73C0; see also the decision engine's per-invocation behavior in The Decision Engine above.
Configuration A: Standard Pipeline (O1/O2)
Call site: sub_12DE330
LoopUnroll(2, -1, -1, -1, -1, -1, -1)
All parameters at defaults. Standard unrolling with default thresholds at optimization level 2.
Configuration B: Code-Size Mode
Call site: sub_12DE8F0, when *(a3+4480) < 0 (NVIDIA code-size flag set)
LoopUnroll(a2, -1, -1, 0, 0, 0, 0)
All unrolling features disabled: partial, runtime, upper-bound, and profile-based are all zeroed. The pass only unrolls when the trip count is statically known and the benefit is certain. This reflects the constraint that GPU register pressure makes speculative unrolling expensive when code size matters.
Configuration C: Normal Optimizer
Call site: sub_12DE8F0, when *(a3+4480) >= 0 (normal mode)
LoopUnroll(a2, -1, -1, -1, -1, -1, -1)
Fully aggressive unrolling with all defaults. The optimization level is passed through from the caller.
Configuration D: Late Pipeline (Conservative)
Call site: sub_12DE8F0, late pipeline position
LoopUnroll(a2, -1, -1, 0, 0, -1, -1)
Partial and runtime unrolling disabled, but upper-bound and profile-based unrolling retain their defaults. This conservative late-pipeline configuration avoids creating new runtime overhead in code that has already been substantially optimized.
Configuration E: Aggressive Pipeline (O3)
Call site: sub_12E54A0
LoopUnroll(3, -1, -1, 0, 0, -1, 0)
Optimization level 3 with aggressive thresholds, but partial, runtime, and profile-based unrolling are disabled. Only upper-bound unrolling retains its default. The rationale is that at O3, the higher thresholds already capture most profitable unrolling opportunities without needing speculative runtime checks.
Configuration F: User-Configured
Call site: sub_12EA3A0
LoopUnroll(a1[4], a1[5], a1[6], a1[7], a1[8], a1[9], a1[10])
All seven parameters are read from a stored configuration object, enabling user-specified unroll behavior via command-line flags or pragmas.
Threshold Initialization (Pass-Level)
The function sub_19B6690 (17 KB) configures unroll thresholds based on optimization level and LLVM knobs at pass construction time. These values feed into the UnrollParams struct consumed by the decision engine.
Default Threshold Values
| Offset | Field | Default (O2+) | Default (O1) |
|---|---|---|---|
| +0 | OptThreshold | 405 | 150 |
| +4 | Threshold | 400 | 400 |
| +12 | SmallTripCountThreshold | 150 | 150 |
| +56 | MaxIterationsCountToAnalyze | 60 | 60 |
Function-Attribute-Aware Override
The threshold initializer queries function attributes via sub_1560180:
- Attribute ID 34 (minsize): reduces OptThreshold to SmallTripCountThreshold (150).
- Attribute ID 17 (optsize): same reduction.
This means kernels annotated with size constraints get conservative unroll thresholds regardless of the global optimization level.
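A minimal sketch of that clamping rule, using the default values from the table above (the function name and boolean parameters are hypothetical, not recovered symbols):

```c
#include <assert.h>
#include <stdbool.h>

/* Attribute IDs as recovered from the sub_1560180 queries. */
enum { ATTR_OPTSIZE = 17, ATTR_MINSIZE = 34 };

/* minsize or optsize clamps OptThreshold down to
   SmallTripCountThreshold, regardless of optimization level. */
static int effective_opt_threshold(int opt_level, bool has_minsize,
                                   bool has_optsize) {
    int opt_threshold = (opt_level >= 2) ? 405 : 150; /* table defaults */
    const int small_tc_threshold = 150;
    if (has_minsize || has_optsize)
        opt_threshold = small_tc_threshold;
    return opt_threshold;
}
```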
Per-Function Knob Override via BST
The function queries the LLVM option registry (dword_4FA0208 BST) ten times, each time looking up a different knob address. For each knob, it searches the BST rooted at dword_4FA0208[2], compares the current function hash (sub_16D5D50) against node ranges, and applies the override if the knob value meets the threshold. The knob-to-field mapping:
| Knob Address | Override Address | Field |
|---|---|---|
| dword_4FB3228 | dword_4FB32C0 | OptThreshold (+0) |
| dword_4FB3148 | dword_4FB31E0 | SmallTripCountThreshold (+12) |
| dword_4FB3068 | dword_4FB3100 | Threshold (+4) |
| dword_4FB2DC8 | dword_4FB2E60 | field +32 |
| dword_4FB2CE8 | dword_4FB2D80 | field +36 |
| dword_4FB2C08 | dword_4FB2CA0 | field +24 |
| dword_4FB2B28 | (next value) | field +40 |
The per-function BST lookup keyed by function hash enables fine-grained tuning of unroll behavior per kernel, a capability not present in upstream LLVM.
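The range query can be modeled with a sorted interval array and binary search; the real code walks a BST rooted at dword_4FA0208[2], but the lookup semantics are the same. All names and sample data here are hypothetical:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* One BST node: a function-hash range and the override value it carries. */
typedef struct { uint64_t lo, hi; int value; } OverrideNode;

/* Binary search over non-overlapping, sorted ranges; returns the
   per-function override if fn_hash falls inside a node's range,
   otherwise the global default. */
static int lookup_override(const OverrideNode *nodes, size_t n,
                           uint64_t fn_hash, int fallback) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (fn_hash < nodes[mid].lo)      hi = mid;
        else if (fn_hash > nodes[mid].hi) lo = mid + 1;
        else return nodes[mid].value;     /* hash in node's range */
    }
    return fallback;                      /* no per-function override */
}
```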
Diagnostic Functions
Three diagnostic emission functions produce optimization remarks:
| Function | Address | Diagnostic |
|---|---|---|
| emitPragmaCountDiag | sub_19B78B0 | Reports when pragma unroll count conflicts with trip multiple |
| emitThresholdDiag | sub_19B7B10 | Reports when unrolled size exceeds threshold |
| emitLoopSizeDiag | sub_19B7D80 | Reports when loop body is too large to unroll |
Main Loop Processing and Hash Infrastructure
The primary analysis function sub_19B7FA0 (11 KB) analyzes each candidate loop. The pass uses hash table infrastructure shared with other CICC LLVM passes:
| Function | Address | Size | Role |
|---|---|---|---|
| rehashSmallTable | sub_19B60B0 | 5 KB | Small hash table resize |
| rehashTable | sub_19B8820 | 4 KB | Key-value hash table resize |
| rehashSet | sub_19B89E0 | 7 KB | Set hash table resize |
| insertIntoSet | sub_19B8DA0 | -- | Set insert with growth |
All hash tables use the same (value >> 9) ^ (value >> 4) hash function and linear probing strategy found throughout CICC's LLVM passes. See Hash Infrastructure for the common implementation.
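A self-contained model of that shared scheme (power-of-two table size, linear probing, and zero used here as the empty sentinel; only the hash function itself is recovered from the binary):

```c
#include <assert.h>
#include <stdint.h>

#define TBL_SIZE 64 /* must be a power of two for the mask to work */

/* The (value >> 9) ^ (value >> 4) hash seen across CICC's passes,
   masked down to a table index. */
static unsigned cicc_hash(uint64_t v) {
    return (unsigned)(((v >> 9) ^ (v >> 4)) & (TBL_SIZE - 1));
}

/* Linear-probe insert; returns 1 if inserted, 0 if already present.
   Caller must keep the load factor below 1 (the real tables rehash). */
static int probe_insert(uint64_t *slots, uint64_t key) {
    unsigned i = cicc_hash(key);
    while (slots[i] != 0) {               /* 0 = empty sentinel */
        if (slots[i] == key) return 0;    /* duplicate */
        i = (i + 1) & (TBL_SIZE - 1);     /* linear probing step */
    }
    slots[i] = key;
    return 1;
}
```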
Differences from Upstream LLVM
| Aspect | Upstream LLVM | CICC v13.0 |
|---|---|---|
| Pragma threshold | UnrollThreshold default 150; pragma multiplier ~8x | Pragma threshold ~200x larger than stock (pragma-unroll-threshold = 32768); enables aggressive pragma-directed unrolling for GPU kernels |
| Power-of-two enforcement | No power-of-two requirement; any profitable factor accepted | Enforces power-of-two unroll factors; non-power-of-two factors are rounded down to avoid irregular loop tails |
| Local array multiplier | No concept of local array bonus | Dedicated local-array threshold multiplier boosts unroll budget when loop body accesses alloca/.local arrays indexed by IV, enabling register promotion |
| Decision engine | ~20 KB computeUnrollCount | Substantially reworked 50 KB computeUnrollCount (sub_19BB5C0) with 6-level priority cascade and GPU-specific occupancy heuristics |
| Register pressure model | Generic TTI-based unroll cost; no occupancy concept | Occupancy-aware cost model considers register pressure cliffs where one additional register per thread drops warp occupancy |
| Pipeline invocations | Single invocation in optimization pipeline | Two invocations: early (interleaved with vectorization) and late (cleanup, gated by opts[1360] / nv-disable-loop-unrolling) |
| Transformation engine | Stock llvm::UnrollLoop | Lightly modified UnrollLoop (sub_2A15A20, 85 KB); decision engine is where the changes concentrate |
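The power-of-two enforcement row above amounts to a bit_floor on the candidate factor. A minimal sketch of the rounding (my own helper, not recovered code):

```c
#include <assert.h>

/* Round a candidate unroll factor down to the nearest power of two,
   avoiding the irregular loop tails a factor like 6 would create. */
static unsigned pow2_floor(unsigned x) {
    unsigned p = 1;
    while (p * 2 != 0 && p * 2 <= x) /* guard against unsigned overflow */
        p *= 2;
    return x ? p : 0;
}
```

With a trip count of 6 and enough budget for factor 6, this model unrolls by 4 instead, matching the behavior the "Test This" section below probes.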
Test This
The following kernel contains a simple counted loop that is a prime candidate for full unrolling. Compile and compare PTX output with and without #pragma unroll.
__global__ void unroll_test(float* out, const float* in) {
int tid = blockIdx.x * blockDim.x + threadIdx.x;
float sum = 0.0f;
#pragma unroll
for (int i = 0; i < 8; i++) {
sum += in[tid + i * 128];
}
out[tid] = sum;
}
What to look for in PTX:
- With #pragma unroll: the loop should be fully unrolled into 8 sequential ld.global.f32 + add.f32 sequences with no backedge branch. Look for the absence of bra instructions targeting a loop header and the presence of 8 distinct ld.global.f32 instructions with addresses offset by 128*sizeof(float).
- Without #pragma unroll (remove the pragma): the compiler may still unroll if the trip count (8) times body size fits within the threshold (default 300). Check whether the PTX has a loop or is fully unrolled -- this exercises the automatic decision engine.
- With #pragma unroll 1: the loop must remain as a counted loop with a backedge branch. This tests that pragma disabling works.
- Compare register usage across the three variants (the %f<N>/%r<N> virtual-register declaration bounds in the PTX, or the register counts reported by ptxas -v). Full unrolling increases register pressure (8 loads live simultaneously); the partial or no-unroll variant uses fewer registers at the cost of loop overhead.
- The power-of-two enforcement is visible when the trip count is not a power of two: change the loop bound to 6 and check whether the compiler partially unrolls by 4 (the highest power of two within the body-size budget) rather than 6.
Cross-References
- Loop Optimization Passes -- pipeline context and pass ordering
- LICM -- runs before second unroll invocation, feeds hoisted invariants
- Loop Strength Reduction -- runs after unrolling, reduces IV expressions
- Register Allocation -- occupancy-driven allocation consumes what unrolling produces
- StructurizeCFG -- runs after all loop transforms, restructures divergent control flow
- InstCombine -- simplifies unrolled loop bodies between invocations
LoopVectorize and VPlan (GPU-Adapted)
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
Upstream source: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp, llvm/lib/Transforms/Vectorize/VPlan*.cpp (LLVM 20.0.0). VPlan infrastructure lives in llvm/lib/Transforms/Vectorize/VPlan.cpp, VPlanRecipes.cpp, VPlanTransforms.cpp, and related files.
LLVM version note: CICC v13.0 is based on LLVM 20.0.0 trunk. Evidence includes histogram-pattern support (merged in LLVM 19), early-exit vectorization (an LLVM 20 experimental feature, gated by byte_500CDA8), and the VPlan-native path. The VPlan object size (656 bytes) is consistent with LLVM 17/18+ layout. Scalable vectors are always disabled for NVPTX.
NVIDIA's cicc ships a heavily modified copy of LLVM's LoopVectorizePass, the single largest pass in the vectorization pipeline at 88 KB of decompiled output (2,612 lines in sub_2AF1970). The modifications do not change the pass's fundamental architecture -- it still builds VPlans, selects a vectorization factor (VF) through cost modeling, and transforms IR through VPlan execution -- but the cost model, VF selection heuristics, interleave count logic, and legality checker are all tuned for a target where "vectorization" means something fundamentally different than on a CPU. On a CPU, loop vectorization fills SIMD lanes: a VF of 4 on SSE processes four float elements per vector instruction. On an NVIDIA GPU, there are no SIMD lanes in the CPU sense -- each thread already executes scalar code, and the warp executes 32 threads in lockstep.
The reasons to vectorize on GPU are: (1) memory coalescing -- adjacent threads issuing adjacent loads produce 128-byte cache line transactions, and vectorizing a per-thread loop body with VF=2 or VF=4 produces ld.v2/ld.v4 wide loads that maximize bytes-per-transaction; (2) reducing instruction count -- a single ld.global.v4.f32 replaces four ld.global.f32 instructions, saving fetch/decode/issue bandwidth; (3) register-to-memory width matching -- PTX supports 32-, 64-, and 128-bit load/store widths, and vectorization widens narrow scalar accesses to fill these naturally.
Key Facts
| Property | Value |
|---|---|
| Registration | New PM #400, parameterized: no-interleave-forced-only;... |
| Runtime positions | Not in Tier 0/1/2/3 tables; invoked via LLVM standard sub-pipeline sub_1A62BF0 when vectorization is enabled (see Pipeline) |
| Main entry point | sub_2AF1970 (0x2AF1970) -- LoopVectorizePass::processLoop() |
| Binary size | 88 KB decompiled, 2,612 lines |
| VPlan builder | sub_2AEE460 (0x2AEE460) -- tryToBuildVPlanWithVPRecipes(), 56 KB |
| VPlan object size | 656 bytes (0x290), consistent with LLVM 17/18 layout |
| LLVM base | LLVM 20 trunk (evidence: histogram-pattern support, early-exit vectorization, VPlan-native path) |
| Scalable vectors | Always disabled -- sub_DFE610 returns false for NVPTX |
| Register bit width (TTI) | 32 bits fixed (TypeSize::getFixed(32) in upstream NVPTXTTIImpl) |
| Pass name string | "vectorize-loops" at 0x439F095 |
| Address cluster | 0x2AA0000--0x2C20000 (loop vectorizer + VPlan infrastructure) |
Why Vectorize on GPU
GPU vectorization is not about filling SIMD lanes -- the SIMT model already replicates scalar code across 32 threads. Vectorization targets three orthogonal benefits related to memory coalescing and instruction throughput:
Memory coalescing width. The GPU memory subsystem services requests in 128-byte transactions. If a single thread's inner loop accesses 4 consecutive floats in sequence, those 4 accesses become 4 separate scalar loads issued over 4 iterations. Vectorizing with VF=4 converts them into one ld.global.v4.f32, which the memory subsystem can service in a single wider transaction per thread. Across the warp, this multiplies the effective memory bandwidth.
Instruction count reduction. PTX's ld.v2 and ld.v4 instructions load 2 or 4 elements with a single instruction. The instruction issue pipeline has finite throughput (typically 1-2 instructions per clock per scheduler), so halving instruction count directly improves throughput-bound kernels.
Register-width matching. PTX has 32-bit typed registers. A 128-bit ld.v4.f32 loads directly into four consecutive registers via a single instruction, which is strictly better than four separate 32-bit loads (each requiring its own address computation).
These benefits are bounded by register pressure -- the primary constraint that does not exist on CPU. On a GPU, every additional register per thread can cross an occupancy cliff, potentially losing an entire warp group. A VF=4 vectorization that quadruples the live register count may halve occupancy and lose net throughput.
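The occupancy cliff can be illustrated with back-of-the-envelope arithmetic (not recovered code; the numbers assume a 64K-entry per-SM register file, 32 threads per warp, and a 32-resident-warp cap, as on a Turing-class SM):

```c
#include <assert.h>

/* Resident warps per SM as a function of registers per thread.
   Illustrative occupancy arithmetic only. */
static int warps_per_sm(int regs_per_thread) {
    const int regfile = 65536;  /* 32-bit registers per SM (assumed) */
    const int warp = 32;        /* threads per warp */
    int warps = regfile / (regs_per_thread * warp);
    if (warps > 32) warps = 32; /* hardware cap on resident warps */
    return warps;
}
```

At 64 registers per thread the SM still holds 32 warps; one more register drops it to 31, and a VF=4 vectorization that pushes a kernel from 32 to 128 registers halves-and-halves again the resident warp count.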
The 8-Phase Pipeline
The main function sub_2AF1970 implements eight phases, closely following upstream LLVM's structure but with GPU-specific decision points at each stage.
Phase 1: Legality Pre-Check
sub_31A4FD0(legalityCtx, Loop, Function, ORE, SE) // init legality scratch
TTI = *(**(Loop+32) + 72) // Loop->getHeader()->getParent()->getTTI()
if (!sub_31A91F0(legalityCtx, TTI, Loop, LoopInfo)) // canVectorize() quick check
return false
sub_31AF060(costCtx, ForceVectorization) // canVectorize() full check
// ForceVectorization = qword_500D340[17] ("vectorize-loops" knob)
The legality checker (sub_31AF060) performs standard LLVM legality analysis: loop simplify form, single exit, computable backedge-taken count, no irreducible control flow. The NVIDIA-specific addition is early-exit loop handling:
if (hasUncountableEarlyExit && !byte_500CDA8) // -enable-early-exit-vectorization
emit "UncountableEarlyExitLoopsDisabled"
return false
This knob (byte_500CDA8) gates an LLVM 20 feature that NVIDIA includes but disables by default. Early-exit vectorization requires predicated execution, which on GPU means divergent warps -- typically unprofitable.
Phase 2: Outer vs Inner Loop Dispatch
if (Loop->getSubLoops().size() > 0)
goto outerLoopPath // PATH A (rarely taken on GPU)
else
goto innerLoopPath // PATH B (the main path)
Outer loop vectorization is controlled by byte_500D208 (-force-vector-width-outer). When enabled and TTI-based VF selection returns VF <= 1, the pass forces VF=4 -- a hardcoded NVIDIA override for kernel patterns where outer-loop vectorization benefits warp-level memory access patterns. In practice, inner loop vectorization (Path B) handles the vast majority of GPU kernels.
Phase 3: Trip Count and Safety Checks
tripCount = getSmallBestKnownTC(PSE, Loop) // sub_2AA7EC0
if (tripCount < VectorizerMinTripCount // dword_500EAE8
&& !isForceVectorize(legalCtx)
&& !(exactTC >= userVF))
emit "LowTripCount"
reduce hint to interleave-only
if (hasAttribute(TTI, NoImplicitFloat)) // attribute 30
bail "NoImplicitFloat"
if (hasUnsafeFPOps && !canReorderFP(override))
bail "UnsafeFP" / "CantReorderFPOps"
The FP reorder safety check has an override mechanism: dword_500D508 selects whether the override is active, and byte_500D588 provides the override value. This lets NVIDIA force-allow FP reordering for specific compilation modes (e.g., -ffast-math propagated from nvcc).
Phase 4: VF Selection
This is where NVIDIA diverges most from upstream. The upstream algorithm queries TTI::getRegisterBitWidth() which returns the vector register width (256 for AVX2, 512 for AVX-512), then computes VF = registerWidth / elementSize. On NVPTX, getRegisterBitWidth() returns 32 -- a single scalar register width. This means the upstream formula would always produce VF=1 for 32-bit types.
NVIDIA's VF selection (sub_2AB8AC0 for outer loops, sub_2AE08E0 for inner loops via VPlan cost) works differently:
// sub_2AB8AC0 — outer loop VF selection (simplified)
elementBits = getWideningElementSize(CostModel) // sub_2AB4370: top 32 bits
regWidth = TTI.getRegisterBitWidth(Vector) // sub_DFE640: returns 32
VF = regWidth / (elementBits / 8)
if (!isScalable && VF <= 1 && forceOuterMode) // byte_500D208
VF = 4 // NVIDIA hardcoded override
For inner loops (the common path), VF selection goes through the full VPlan cost model:
// sub_2AE08E0 — selectBestVF() from VPlan candidates
bestCost = INT64_MAX
for each VPlan in candidatePlans:
for each VF in VPlan.VFRange:
cost = computeCostForVF(VPlan, VF) // sub_2AE0750
if isBetterThan(cost, bestCost): // sub_2AB3FE0
bestVF = VF
bestCost = cost
return {bestVF, isScalable, bestIC}
The cost accumulation uses saturating arithmetic -- __OFADD__ overflow detection clamping to INT64_MAX/INT64_MIN -- preventing wrap-around in cost comparisons. This is defensive engineering for GPU kernels with very large loop bodies where naive summation could overflow.
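The clamping pattern can be reproduced with the GCC/Clang overflow builtins; a sketch of the same saturating behavior (helper name is mine):

```c
#include <assert.h>
#include <stdint.h>

/* Saturating signed add, mirroring the __OFADD__ pattern in the
   decompiled cost loop: clamp on overflow instead of wrapping. */
static int64_t sat_add(int64_t a, int64_t b) {
    int64_t r;
    if (__builtin_add_overflow(a, b, &r))
        return (b > 0) ? INT64_MAX : INT64_MIN;
    return r;
}
```

Because comparisons like isBetterThan() run on these sums, clamping at INT64_MAX keeps an overflowing candidate "infinitely expensive" rather than wrapping around to look cheap.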
Phase 5: Cost Model Construction
The cost model object (sub_2AB2780, 16 parameters) assembles all analysis results into a single context:
CostModel = {
Loop*, DominatorTree*, LoopBlocksRPO*, ScalarEvolution*,
TargetLibraryInfo*, AssumptionCache*, PredicatedScalarEvolution*,
ValuesToIgnore=0, ORE*,
/* additional context fields */
}
The VPlan planner (sub_2AF13F0) generates VPlans for all candidate VFs, then sub_2AE08E0 selects the best one. Each VPlan recipe provides its own cost through the virtual getVPCost(VF, CostCtx) method, which delegates to NVPTXTargetTransformInfo for GPU-specific instruction costs.
Phase 6: Profitability Decision and Interleave Selection
After VF selection, the pass evaluates a decision matrix:
| Condition | Result |
|---|---|
| VF=1, not scalable | VectorizationNotBeneficial -- bail |
| IC=1 but user wanted more | InterleavingNotBeneficial |
| IC>1 but user disabled | InterleavingBeneficialButDisabled |
| Histogram loop + scalar interleave | HistogramPreventsScalarInterleaving -- bail |
| VF=1, IC>1 | Interleave-only path: executeVPlan(VF=1, IC) |
| VF>1 | Full vectorization path |
The histogram diagnostic (HistogramPreventsScalarInterleaving) is an NVIDIA addition not present in upstream LLVM. It blocks scalar interleaving of histogram-pattern loops where reduction ordering constraints make interleaving incorrect without vectorization.
Interleave count selection (sub_2AED330) is register-pressure-bounded on GPU:
// sub_2AED330 — selectInterleaveCount() (simplified)
maxIC = TTI.getMaxInterleaveFactor(VF) // sub_DFB120(TTI+448)
// Override knobs:
if (VF.isScalar() && ForceTargetMaxScalarInterleave) // dword_500E148
maxIC = ForceTargetMaxScalarInterleave
if (VF.isVector() && ForceTargetMaxVectorInterleave) // dword_500E068
maxIC = ForceTargetMaxVectorInterleave
tripCount = getSmallBestKnownTC(PSE, Loop)
IC = bit_floor(tripCount / (VF * 2)) // conservative: vector loop runs >= 2x
IC = min(IC, maxIC)
// Small loop boost
if (loopCost < SmallLoopCost) // qword_500DC88
smallIC = min(IC, bit_floor(SmallLoopCost / loopCost))
IC = max(IC, smallIC)
// Scheduling-based cap (NVIDIA-specific TTI path)
issueWidth = *(TTI + 56 + 32) // scheduling info at TTI+88
latency = *(TTI + 56 + 36) // scheduling info at TTI+92
IC = IC / max(issueWidth, latency) // cap by scheduling model
// Aggressive interleave mode
if (byte_500D908) // AggressiveInterleave
IC = maxIC // bypass all heuristics
IC = clamp(IC, 1, maxIC)
return powerOf2Floor(IC)
On CPU, the interleave count is bounded by vector register count (e.g., 16 YMM registers / registers per iteration). On GPU, it is bounded by register pressure impact on occupancy -- the TTI scheduling info encodes this constraint. The AggressiveInterleave knob (byte_500D908) bypasses all heuristics and sets IC to the maximum, useful for benchmarking or known-good kernels.
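The selectInterleaveCount() pseudocode above, folded into one self-contained function. Knob globals (SmallLoopCost, AggressiveInterleave, the TTI scheduling fields) become parameters here, and the helper names are mine:

```c
#include <assert.h>

/* bit_floor: largest power of two <= x (returns 1 for x == 0 so the
   final clamp-to-[1, maxIC] behavior is preserved). */
static unsigned ic_pow2_floor(unsigned x) {
    unsigned p = 1;
    while (p <= x / 2) p *= 2;
    return x ? p : 1;
}

static unsigned select_ic(unsigned trip_count, unsigned vf, unsigned max_ic,
                          unsigned loop_cost, unsigned small_loop_cost,
                          unsigned issue_width, unsigned latency,
                          int aggressive) {
    /* conservative: the vector loop should run at least twice */
    unsigned ic = ic_pow2_floor(trip_count / (vf * 2));
    if (ic > max_ic) ic = max_ic;
    /* small-loop boost */
    if (loop_cost && loop_cost < small_loop_cost) {
        unsigned boost = ic_pow2_floor(small_loop_cost / loop_cost);
        if (boost > max_ic) boost = max_ic;
        if (boost > ic) ic = boost;
    }
    /* scheduling-model cap from the TTI issue-width/latency fields */
    unsigned cap = issue_width > latency ? issue_width : latency;
    if (cap > 1) ic /= cap;
    /* AggressiveInterleave bypasses every heuristic */
    if (aggressive) ic = max_ic;
    if (ic < 1) ic = 1;
    if (ic > max_ic) ic = max_ic;
    return ic_pow2_floor(ic);
}
```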
Phase 7: VPlan Execution and Epilogue Vectorization
mainVPlan = getBestPlanFor(bestVF) // sub_2BF1320
executeVPlan(mainVPlan, bestVF, IC) // sub_2AE3460
// Epilogue vectorization (when byte_500ED88 is set)
epilogueVF = selectEpilogueVectorizationFactor() // sub_2ABBD40
if (epilogueVF > 1):
clonedPlan = cloneVPlan(mainVPlan) // sub_2BF7CB0
epiloguePlan = getBestPlanFor(epilogueVF)
mergeVPlans(clonedPlan, epiloguePlan) // sub_2AB0350
// Remap operands between main and epilogue plans:
// recipe types 29 (load/store), 36 (phi), 17 (GEP)
// types 19-20 (inttoptr/ptrtoint casts)
executeVPlan(merged, epilogueVF, epilogueIC, isEpilogue=true)
Epilogue vectorization is particularly relevant on GPU: the scalar remainder loop after vectorization forces warp divergence (some threads in the warp execute the epilogue while others are masked off), which is expensive. A vectorized epilogue with a smaller VF reduces the scalar remainder to fewer iterations, minimizing divergence overhead.
The epilogue VF selection (sub_2ABBD40) can be forced via qword_500ECA8 (-epilogue-vectorization-force-VF). When not forced, it uses SCEV range analysis (sub_DC3A60, sub_DBB9F0) to prove the epilogue trip count is sufficient for the candidate VF.
Phase 8: Post-Vectorization Metadata
The pass applies follow-up loop metadata (llvm.loop.vectorize.followup_all, llvm.loop.vectorize.followup_epilogue) and emits optimization remarks through sub_2AC2B40. Generated basic blocks use naming conventions vec.epilog.middle.block and vec.epilog.vector.body.
VPlan Construction (sub_2AEE460)
The VPlan builder allocates a 656-byte VPlan object and iterates over candidate VFs in powers of 2 (VF *= 2 each iteration, visible as add r15d, r15d in the binary). For each VF, it calls sub_2AA9E60 (tryToBuildRecipesForVF).
Recipe type tags observed in the binary:
| Tag | Recipe Type |
|---|---|
| 0x04 | VPWidenMemoryInstructionRecipe |
| 0x0F | VPWidenRecipe |
| 0x1D | VPReplicateRecipe |
| 0x21 | VPWidenSelectRecipe |
| 0x43 | VPWidenCallRecipe |
Interleave group recipes are built from LoopAccessInfo at [Planner+0x28]+0x150. The builder removes individual load/store recipes and replaces them with interleave group recipes via sub_2AB9570 (replaceAllUsesWith), using a hash map with the pointer-hash function (ptr >> 4) ^ (ptr >> 9) & mask -- identical to LLVM's DenseMap hash.
Cost annotation happens in Phase 6 of VPlan construction via sub_2C2E3C0, which walks all recipes and annotates them with TTI-derived costs. This is where NVPTXTargetTransformInfo shapes the cost model: it prices ld.v4 cheaper than 4x ld.f32, making vectorization profitable even with register pressure increase.
The VPlan verification flag at 0x500D2E8 enables VPlan dump/verify paths -- useful for debugging vectorization decisions with -mllvm -vplan-verify-or-dont.
NVPTXTargetTransformInfo Hooks
The loop vectorizer reaches NVIDIA's TTI through Loop->getHeader()->getParent()->getTTI() (recovered as *(**(Loop+32)+72)). Key hooks:
| TTI Method | Address | GPU Behavior |
|---|---|---|
getRegisterBitWidth(Vector) | sub_DFE640 | Returns 32 (fixed) -- single scalar register width |
supportsScalableVectors() | sub_DFE610 | Returns false -- no SVE/RVV equivalent |
getMaxInterleaveFactor() | sub_DFB120 | Queried at TTI+448; register-pressure-bounded |
getMaxInterleaveFactor(vectorized) | sub_DFB730 | Separate limit for vectorized loops |
hasAttribute(47) | sub_B2D610 | "alwaysvectorize" check |
hasAttribute(30) | sub_B2D610 | "noimplicitfloat" check |
The 32-bit register width return is the critical difference from CPU targets. It means the standard VF formula (regWidth / elemSize) produces VF=1 for 32-bit types, VF=2 for 16-bit types, and VF=4 for 8-bit types. Wider vectorization (VF=4 for float) must come from the cost model determining that ld.v4.f32 is profitable despite the VF exceeding the "register width."
The scheduling info at TTI+56 (with issue width at offset +32 and latency at +36 within that sub-structure) feeds interleave count capping. This models the SM's instruction issue pipeline: even if register pressure allows IC=8, the issue pipeline may saturate at IC=4.
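Plugging NVPTX's fixed 32-bit register width into the standard VF formula makes the contrast with CPU targets concrete (a sketch of the arithmetic only, helper name mine):

```c
#include <assert.h>

/* The upstream formula: VF = register bit width / element bit width.
   On NVPTX, reg_bits is always 32 (sub_DFE640). */
static unsigned naive_vf(unsigned reg_bits, unsigned elem_bits) {
    unsigned vf = reg_bits / elem_bits;
    return vf ? vf : 1;
}
```

So the formula alone yields VF=1 for float on NVPTX, versus VF=16 on an AVX-512 target; any wider GPU vectorization has to come from the VPlan cost model pricing ld.v4 below four scalar loads.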
Knobs and Thresholds
| Knob | Global Address | CLI Name | Default | Effect |
|---|---|---|---|---|
| ForceVectorization | qword_500D340[17] | vectorize-loops | true | Master switch for loop vectorization |
| EnableEarlyExitVectorization | byte_500CDA8 | -enable-early-exit-vectorization | false | Gates LLVM 20 early-exit loop vectorization |
| ForceOuterLoopVectorization | byte_500D208 | -force-vector-width-outer | false | Forces VF=4 for outer loops when TTI returns VF<=1 |
| ForceCanReorderFP (selector) | dword_500D508 | -- | 0 | Whether FP reorder override is active |
| ForceCanReorderFP (value) | byte_500D588 | -- | -- | FP reorder override value |
| ForceScalarEpilogue (selector) | dword_500E308 | -- | 0 | Whether scalar epilogue is forced |
| ForceScalarEpilogue (value) | byte_500E388 | -- | -- | Scalar epilogue override value |
| VectorizerMinTripCount | dword_500EAE8 | vectorizer-min-trip-count | 16 (upstream) | Minimum trip count to attempt vectorization |
| CostThreshold | qword_500EA08 | -- | -- | Maximum cost for memory reorder safety check |
| EnableEpilogueVectorization | byte_500ED88 | -enable-epilogue-vectorization | true (upstream) | Enables vectorized epilogue loop |
| EpilogueVectorizationForceVF | qword_500ECA8 | -epilogue-vectorization-force-VF | 0 | Forces specific epilogue VF |
| AggressiveInterleave | byte_500D908 | -- | false | Bypasses IC heuristics, sets IC=max |
| PreferPredicateOverEpilogue | byte_500DAC8 | prefer-predicate-over-epilogue | -- | Uses predication instead of scalar epilogue |
| SmallLoopCost | qword_500DC88 | small-loop-cost | 20 (upstream) | Threshold below which loops get boosted IC |
| ForceTargetMaxScalarInterleave | dword_500E148 | force-target-max-scalar-interleave | 0 | Overrides max IC for scalar loops |
| ForceTargetMaxVectorInterleave | dword_500E068 | force-target-max-vector-interleave | 0 | Overrides max IC for vectorized loops |
NVIDIA vs upstream defaults: The upstream vectorizer-min-trip-count default is 16. The upstream small-loop-cost default is 20. The upstream enable-epilogue-vectorization default is true. NVIDIA preserves these defaults from the knob registration code, but the TTI hooks (particularly getRegisterBitWidth returning 32 and getMaxInterleaveFactor being register-pressure-bounded) shift the effective behavior dramatically. Where a CPU target with AVX-512 might select VF=16 for float, NVPTX typically selects VF=2 or VF=4 -- just enough to use ld.v2/ld.v4 instructions without excessive register pressure.
Diagnostic Strings
All diagnostic strings are embedded in the binary with OptimizationRemarkAnalysis tags. Source: p2-E01-loop-vectorize.txt.
| Tag | Message | Trigger |
|---|---|---|
| UncountableEarlyExitLoopsDisabled | "Auto-vectorization of loops with uncountable early exit is not enabled" | Early-exit loop + byte_500CDA8 knob off |
| LowTripCount | "The trip count is below the minial threshold value." | TC < dword_500EAE8 min threshold (note: "minial" is a typo [sic] in the NVIDIA binary) |
| NoImplicitFloat | "Can't vectorize when the NoImplicitFloat attribute is used" | Function attribute 30 check |
| UnsafeFP | "Potentially unsafe FP op prevents vectorization" | FP safety check failure |
| CantReorderFPOps | "loop not vectorized: cannot prove it is safe to reorder floating-point operations" | FP reorder proof failure |
| CantReorderMemOps | "loop not vectorized: cannot prove it is safe to reorder memory operations" | Memory reorder proof failure |
| VectorizationNotBeneficial | "the cost-model indicates that vectorization is not beneficial" | Cost model: VF=1 wins |
| InterleavingNotBeneficial | "the cost-model indicates that interleaving is not beneficial" | Cost model: IC=1 wins |
| InterleavingNotBeneficialAndDisabled | (appended: " and is explicitly disabled or interleave count is set to 1") | IC=1 + explicitly disabled |
| InterleavingBeneficialButDisabled | (tag only, no message body recovered) | IC>1 but user disabled interleaving |
| InterleavingAvoided | "Ignoring UserIC, because interleaving was avoided up front" | User-specified IC overridden |
| HistogramPreventsScalarInterleaving | "Unable to interleave without vectorization due to constraints on the order of histogram operations" | NVIDIA-specific: histogram loop + scalar IC |
| ScalableVFUnfeasible | "Scalable vectorization requested but not supported by the target" | Scalable VF on NVPTX |
| UncountableEarlyExitUnsupported | "Auto-vectorization of early exit loops requiring a scalar epilogue is unsupported" | Early-exit + epilogue |
| (success remark) | "interleaved loop (interleaved count: N)" | Vectorization/interleaving succeeded via sub_2AC2B40 |
| (metadata) | "llvm.loop.vectorize.followup_all" | Post-vectorization loop metadata tag |
| (metadata) | "llvm.loop.vectorize.followup_epilogue" | Post-vectorization epilogue metadata tag |
| (block name) | "vec.epilog.middle.block" | Epilogue vectorization middle block |
| (block name) | "vec.epilog.vector.body" | Epilogue vectorization body block |
| (block name) | "scev.check" | Runtime SCEV overflow check block (sub_27C1C30) |
| (VPlan debug) | "Initial VPlan" | VPlan builder debug output at 0x2AEFC7B |
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
LoopVectorizePass::processLoop() | sub_2AF1970 | 88 KB | -- |
tryToBuildVPlanWithVPRecipes() | sub_2AEE460 | 56 KB | -- |
Planner::plan() -- generate VPlans for candidate VFs | sub_2AF13F0 | -- | -- |
selectBestVF() -- iterate VPlans, pick lowest cost | sub_2AE08E0 | -- | -- |
computeCostForVF() -- per-VF cost query | sub_2AE0750 | -- | -- |
isBetterThan() -- VF cost comparator | sub_2AB3FE0 | -- | -- |
executeVPlan() -- IR transformation from VPlan | sub_2AE3460 | -- | -- |
selectInterleaveCount() -- IC heuristic | sub_2AED330 | -- | -- |
selectEpilogueVectorizationFactor() | sub_2ABBD40 | -- | -- |
LoopVectorizationCostModel constructor (16 params) | sub_2AB2780 | -- | -- |
selectVectorizationFactor() -- outer loop path | sub_2AB8AC0 | -- | -- |
selectVectorizationFactor() -- hint/pre-check | sub_2AAEAB0 | -- | -- |
computeExpectedScalarCost() | sub_2AAD640 | -- | -- |
LoopVectorizationLegality::init() | sub_31A4FD0 | -- | -- |
canVectorize() -- pre-check | sub_31A91F0 | -- | -- |
canVectorize() -- full check | sub_31AF060 | -- | -- |
getBestPlanFor(VF) -- VPlan lookup | sub_2BF1320 | -- | -- |
cloneVPlan() | sub_2BF7CB0 | -- | -- |
mergeVPlans() -- main + epilogue merge | sub_2AB0350 | -- | -- |
buildInterleaveGroupRecipes() | sub_2C06CE0 | -- | -- |
| VPlan cost annotation pass | sub_2C2E3C0 | -- | -- |
| VPlan simplification / recipe combining | sub_2C32950 | -- | -- |
| VPlan legality re-verification | sub_2C2A390 | -- | -- |
getSmallBestKnownTC() -- trip count upper bound | sub_2AA7EC0 | -- | -- |
tryToBuildRecipesForVF() -- per-VF body builder | sub_2AA9E60 | -- | -- |
finalizeRecipesForVF() -- scaling/widening | sub_2AD9850 | -- | -- |
TTI::getMaxInterleaveFactor() | sub_DFB120 | -- | -- |
TTI::getRegisterBitWidth(Vector) | sub_DFE640 | -- | -- |
TTI::supportsScalableVectors() | sub_DFE610 | -- | -- |
| Emit vectorization success remarks | sub_2AC2B40 | -- | -- |
| VPlan fixup/finalize | sub_ABDAE0 | -- | -- |
Related Pages
- Loop Strength Reduction -- LSR runs after vectorization and must handle the wider induction variables and address expressions that vectorization introduces. NVIDIA's custom LSR is occupancy-aware and interacts with the same register pressure model.
- Register Allocation -- The register pressure that bounds VF and IC decisions is ultimately resolved by the register allocator. VF=4 with IC=2 may request 8x the base register count; the allocator must either accommodate this or spill to local memory.
- Scheduling -- The TTI scheduling info (issue width and latency at TTI+56) that caps interleave count comes from the same target model used by instruction scheduling.
- SelectionDAG -- Vectorized IR produces vector types (<4 x float>) that SelectionDAG must lower to PTX ld.v4/st.v4 instructions.
- SLP Vectorizer -- SLP vectorization (sub_2BD1C50) handles straight-line code and horizontal reductions; loop vectorization handles loop bodies. Both share the same TTI cost model.
What Upstream LLVM Gets Wrong for GPU
Upstream LLVM's LoopVectorize pass was built for CPU SIMD: fill wider vector registers to process more data elements per instruction. On a GPU, every foundational assumption is inverted:
- Upstream assumes SIMD lanes need filling. The CPU vectorizer exists to pack 4/8/16 scalar operations into one vector instruction (SSE/AVX/NEON). On GPU, there are no SIMD lanes in the CPU sense -- the SIMT model already executes 32 threads in lockstep per warp. "Vectorization" on GPU means widening per-thread memory accesses to ld.v2/ld.v4 for coalescing, not filling SIMD lanes.
- Upstream computes VF from vector register width. The standard formula is VF = registerWidth / elementSize (e.g., AVX-512 gives VF=16 for float). NVPTX's getRegisterBitWidth() returns 32 bits -- a single scalar register width -- so this formula always produces VF=1 for 32-bit types. Wider VFs must come entirely from the cost model deciding that ld.v4.f32 is profitable, bypassing the standard VF selection path.
- Upstream ignores register pressure when selecting VF. On CPU, VF=16 using 16 ZMM registers has no throughput penalty -- there is no occupancy concept. On GPU, VF=4 that quadruples live registers can cross an occupancy cliff, losing an entire warp group and halving net throughput. Every VF and IC decision must be bounded by register pressure impact on occupancy.
- Upstream assumes scalable vectors are desirable. LLVM supports SVE/RISC-V V scalable vector types. NVPTX disables them entirely (supportsScalableVectors() = false) because PTX has no scalable vector model -- only fixed-width ld.v2/ld.v4 instructions exist.
- Upstream's interleave count is bounded by CPU port pressure. CPU IC selection considers execution port contention and register file depth (e.g., 16 YMM registers). GPU IC selection is capped by the TTI scheduling model's issue width and latency at TTI+56, reflecting the SM's instruction issue pipeline saturation -- a completely different bottleneck.
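The register-width point can be made concrete. A minimal sketch (our own code, not a decompiled symbol) of the upstream VF = registerWidth / elementSize formula, showing why a 32-bit register width pins VF to 1 for 32-bit elements:

```c
#include <stdint.h>

/* Sketch of the upstream VF formula described above; the function
 * name is ours, not a symbol from the binary. */
static uint32_t upstream_vf(uint32_t register_bits, uint32_t element_bits) {
    uint32_t vf = register_bits / element_bits;
    return vf > 0 ? vf : 1;  /* VF never drops below 1 */
}
```

With NVPTX's 32-bit return, upstream_vf(32, 32) is 1, while an AVX-512 target's upstream_vf(512, 32) is 16 -- which is why CICC's wider VFs must come from the cost model instead.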
Optimization Level Behavior
| Level | Scheduled | Max VF | Interleave | Notes |
|---|---|---|---|---|
| O0 | Not run | N/A | N/A | No optimization passes |
| Ofcmax | Not run | N/A | N/A | Fast-compile skips vectorization entirely |
| Ofcmid | Not run | N/A | N/A | Vectorization not in medium fast-compile tier |
| O1 | Runs (Tier 1) | 4 | Enabled | Single instance after loop canonicalization |
| O2 | Runs (Tier 1) | 4 | Enabled | Same scheduling as O1; benefits from more aggressive scalar optimization preceding it |
| O3 | Runs (Tier 1) | 4 | Enabled | Same as O2; additional Tier 3 loop passes (interchange, distribution) may create more vectorization opportunities |
Loop vectorization is a Tier 1 pass, meaning it runs at O1 and above but not in any fast-compile tier. The maximum VF is effectively capped at 4 by the GPU register pressure constraint -- higher VFs would multiply live registers past occupancy cliffs. The vectorize-loops knob (qword_500D340[17]) can force vectorization even when the cost model says it is unprofitable; this knob defaults to off and is typically used only for debugging. Early-exit vectorization (byte_500CDA8) is gated separately and defaults to disabled. See Optimization Levels for the complete tier structure.
Differences from Upstream LLVM
| Aspect | Upstream LLVM | CICC v13.0 |
|---|---|---|
| Vectorization purpose | Fill SIMD lanes (SSE/AVX/NEON) for data parallelism | Memory coalescing (ld.v2/ld.v4), instruction count reduction, and register-to-memory width matching; no SIMD lanes on GPU |
| Scalable vectors | Supported (SVE, RISC-V V) | Always disabled -- sub_DFE610 returns false for NVPTX; only fixed-width VF=2/4 |
| Register bit width (TTI) | Target-dependent (128/256/512 for x86) | Fixed 32 bits (TypeSize::getFixed(32)) reflecting PTX's 32-bit register model |
| VF selection cost model | SIMD-width-driven: higher VF fills wider vector registers | Occupancy-bounded: VF must not increase register pressure past warp occupancy cliffs; VF=4 is typically the maximum |
| Interleave count | Profile-guided or port-pressure-based (2--8 typical) | Capped by TTI scheduling info at TTI+56; conservative due to register pressure cost per interleaved iteration |
| Early-exit vectorization | Experimental (behind flag) | Present, gated by byte_500CDA8 (-enable-early-exit-vectorization) |
| Convergent call handling | Standard legality rejection | Additional barrier-aware legality: convergent intrinsics (__syncthreads, warp shuffles) block vectorization of the containing loop body |
SLP Vectorizer
NVIDIA-modified pass. See Key Behavioral Differences from Upstream for GPU-specific changes.
The SLP (Superword-Level Parallelism) vectorizer packs independent scalar operations on adjacent data into vector operations. Unlike the loop vectorizer, SLP operates on straight-line code within a single basic block --- it does not require a loop. On NVPTX, the practical payoff is combining two or four scalar loads/stores into ld.v2/ld.v4 (or st.v2/st.v4), and folding arithmetic on adjacent elements into a single wider instruction. CICC runs the SLP vectorizer as part of the combined LoopVectorize / SLPVectorize pass group at step 31 of the O2 pipeline (sub_19B73C0), after SCCP/GlobalOpt and before the post-vectorization GVN cleanup. The pass is registered under the name slp-vectorizer (pipeline slot 350, llvm::SLPVectorizerPass).
| Property | Value |
|---|---|
| Pass name | slp-vectorizer |
| Pipeline slot | 350 (llvm::SLPVectorizerPass) |
| Constructor registration | ctor_517 at 0x560FD0 (12,410 bytes) |
| Option constructor | ctor_248 at 0x4EEF30 (8,219 bytes) |
| Horizontal reduction entry | sub_2BD1C50 (~85 KB, ~3,005 decompiled lines) |
| Straight-line SLP entry | sub_2BCE070 |
| Store-SLP entry | sub_2BCA110 |
| SLP tree code cluster | 0x1BC0000--0x1BFFFFF (~1,353 KB across ~266 files) |
| Key diagnostic strings | "slp-vectorizer", "HorSLPNotBeneficial", "VectorizedHorizontalReduction", "const.rdx", "SLP vectorized with cost", "Cannot SLP vectorize list:", "Stores SLP vectorized with cost" |
SLP vs Loop Vectorization on GPU
The loop vectorizer (see LoopVectorize & VPlan) transforms counted loops by widening the loop body to process multiple iterations per step, driven by VPlan. SLP vectorization is fundamentally different: it searches a single basic block for groups of isomorphic scalar instructions that operate on adjacent memory or independent data, then replaces them with a single vector instruction. No loop structure is required.
On a GPU, SLP opportunities arise in three main patterns:
- Adjacent memory operations. Two consecutive f32 loads from addresses p and p+4 become a single ld.v2.f32. Four consecutive i32 stores become st.v4.b32. This is the highest-value SLP transformation on NVPTX because coalesced memory transactions are critical for throughput.
- Same-typed arithmetic on independent operands. Two fadd instructions with no data dependency between them can become a single vector fadd on <2 x float>. The PTX backend later lowers this back to scalar instructions if the target has no native wide ALU, but the combined form enables better scheduling and may survive to the load/store vectorizer's benefit.
- Texture coordinate packing. Texture/surface sampling requires coordinate tuples (u, v) or (u, v, w). When the scalar coordinates are computed independently, SLP can pack them into a <2 x float> or <4 x float> bundle that feeds directly into the sampling intrinsic, avoiding per-element extract/insert overhead.
NVPTX TTI Hooks Affecting SLP
The SLP vectorizer consults TargetTransformInfo at several decision points. NVIDIA's proprietary TTI implementation differs significantly from the upstream open-source NVPTX backend.
Upstream Open-Source NVPTX TTI (for reference)
| Hook | Return Value | Comment |
|---|---|---|
getRegisterBitWidth(Vector) | 32 bits | "Only <2 x half> should be vectorized" |
getMinVectorRegisterBitWidth() | 32 bits | Matches 32-bit register file |
getNumberOfRegisters() | 1 (all classes) | FIXME in source: "this is conservative" |
getArithmeticInstrCost(i64) | 2x base for ADD/MUL/XOR/OR/AND | Reflects 32-bit ALU emulation |
supportsScalableVectors() | false | No SVE/RVV equivalent in PTX |
With these returns, the standard LLVM VF formula (registerBitWidth / elementBitWidth) produces VF = 1 for f32 and VF = 2 for f16. The open-source backend effectively limits SLP to <2 x half> bundles only.
CICC v13.0 Proprietary TTI
CICC overrides the upstream returns at three levels: the TTI wrapper pass, the SLP tree's internal scheduling-width parameter, and several SLP-specific helper functions that query TTI indirectly.
TTI hooks queried by SLP (directly or via the cost model):
| Hook | Address | Return / Behavior | SLP Impact |
|---|---|---|---|
getRegisterBitWidth(Vector) | sub_DFE640 | TypeSize::getFixed(32) | Formal register width --- same as upstream. But see a2+840 override below. |
getRegisterBitWidth(Scalar) | sub_DFB1B0 | 32 | Confirms 32-bit register file for scalar cost comparison. |
supportsScalableVectors() | sub_DFE610 | false | Scalable VF never attempted. |
getInstructionCost() | sub_20E14F0 (33KB) | Per-opcode latency from scheduling model | Called indirectly through getTreeCost() (sub_2B94A80) for each tree node. |
getInstructionCost() (IR-level) | sub_B91420 | Per-instruction cost estimate | Called 7 times per instruction during per-node SLP cost evaluation. |
hasAttribute(47) | sub_B2D610 | Checks alwaysvectorize | When set, SLP skips profitability check and vectorizes unconditionally. |
hasAttribute(18) | sub_B2D610 | Checks optnone | When set, SLP is entirely disabled. |
The a2+840 scheduling-width override:
The SLP tree object (BoUpSLP, parameter a2 in the horizontal reduction entry sub_2BD1C50) stores a max register pressure / scheduling width at offset +840. This value does NOT come from getRegisterBitWidth(Vector) directly. Instead, it is computed during SLP tree initialization from a combination of the target's scheduling model and available register budget. In the decompiled code, the VF derivation at lines 1354-1578 reads this value and clamps the resulting bit width to [128, 512]:
// VF derivation from a2+840 (decompiled sub_2BD1C50)
uint64_t max_sched_width = *(a2 + 840); // NOT from TTI.getRegisterBitWidth()
uint64_t scalar_width = sub_2B49BC0(a2, first_scalar); // getScalarTypeWidth()
uint64_t vf;
if (scalar_width <= max_sched_width) {
vf = 1 << bsr(max_sched_width / scalar_width); // round-down power-of-2
vf = clamp(vf, 128, 512); // clamp to [128, 512] BITS
} else {
vf = 128;
}
// For f32 (32 bits) with max_sched_width=256: vf = 256/32 = 8 elements
// For f64 (64 bits) with max_sched_width=256: vf = 256/64 = 4 elements
This is the single most important NVIDIA divergence from upstream for SLP: the 32-bit getRegisterBitWidth(Vector) return would produce VF=1 for f32 operations and kill SLP entirely for 32-bit types, but the a2+840 scheduling width allows VF=4 or VF=8 for f32. The result is that CICC's SLP can produce <4 x float> bundles (later lowered to ld.v4.f32 / st.v4.f32) that the open-source backend would never attempt.
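Read literally, the decompiled derivation above mixes element counts and bit widths in the clamp; one consistent reading (an assumption on our part) is that the [128, 512] clamp applies to the bundle bit width, vf * scalar_width. A runnable sketch under that reading, with helper names of our own:

```c
#include <stdint.h>

/* Hedged reconstruction of the a2+840 VF derivation above. Assumes
 * the [128, 512] clamp constrains total bundle bits, not elements. */
static uint64_t round_down_pow2(uint64_t x) {
    uint64_t p = 1;
    while (p * 2 <= x) p *= 2;
    return p;
}

static uint64_t slp_vf_elements(uint64_t max_sched_width, uint64_t scalar_width) {
    if (scalar_width > max_sched_width) {
        uint64_t v = 128 / scalar_width;       /* fall back to a 128-bit bundle */
        return v ? v : 1;
    }
    uint64_t vf = round_down_pow2(max_sched_width / scalar_width);
    if (vf * scalar_width < 128) vf = 128 / scalar_width;  /* clamp low */
    if (vf * scalar_width > 512) vf = 512 / scalar_width;  /* clamp high */
    return vf;
}
```

Under this reading the decompiled examples reproduce: f32 with a 256-bit scheduling width gives 8 elements, f64 gives 4.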
SLP-specific TTI helper functions:
| Function | Address | Upstream Equivalent | Behavior |
|---|---|---|---|
getScalarTypeWidth() | sub_2B49BC0 | DL.getTypeSizeInBits() | Returns bit width of a scalar type for VF computation. |
getNextLegalVF() | sub_2B1E190 | No direct equivalent | Steps down through legal vector factors when current VF is unprofitable. Takes (TTI, type, currentVF), returns next smaller legal VF >= minimum VF. Respects PTX v2/v4 legality constraints. |
adjustVF() | sub_2B1FA70 | Partial in BoUpSLP::buildTree | When SLPMaxVF (qword_500F628) is non-zero and operand_count+1 is a power of 2, returns operand_count directly (non-power-of-2 VF). Otherwise computes a power-of-2 VF. |
isTreeNotBeneficialForArch() | sub_2B2DA40 | Not in upstream | NVIDIA-specific early rejection based on SM reduction type (a1+1576). Rejects trees whose structure is known to be unprofitable on the current GPU architecture. |
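The adjustVF() rule in the table can be sketched as follows (our reconstruction of the described behavior; names are ours, not decompiled symbols):

```c
#include <stdint.h>

/* Sketch of the adjustVF() rule described above: with a non-zero
 * SLPMaxVF, an operand count one below a power of two is used directly
 * as a non-power-of-2 VF; otherwise VF is rounded down to a power of 2. */
static int is_pow2(uint64_t x) { return x != 0 && (x & (x - 1)) == 0; }

static uint64_t adjust_vf(uint64_t operand_count, uint64_t slp_max_vf) {
    if (slp_max_vf != 0 && is_pow2(operand_count + 1))
        return operand_count;                  /* e.g. 3 or 7 lanes */
    uint64_t vf = 1;
    while (vf * 2 <= operand_count) vf *= 2;   /* round down to power of 2 */
    return vf;
}
```

So with SLPMaxVF set, a 3-operand bundle vectorizes at VF=3 rather than being truncated to VF=2.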
Arithmetic Cost Impact on SLP Trees
The TTI cost model for i64 operations directly affects SLP profitability. Since NVPTX GPUs emulate all 64-bit integer arithmetic through pairs of 32-bit operations, the cost differential inflates the scalar cost baseline, making i64 SLP trees more profitable in relative terms:
| Operation | i32 Scalar Cost | i64 Scalar Cost | i64 Vector Cost (v2) | SLP Delta |
|---|---|---|---|---|
| ADD/SUB | 1 | 2 (add.cc + addc) | 4 (two add.cc + addc pairs) | Neutral (2x scalar = 2x vector) |
| MUL | 1 | ~4 (mul.lo + mul.hi + add chain) | ~8 | Neutral |
| Loads | 1 | 1 (ld.b64) | 1 (ld.v2.b64) | Profitable --- single wide load |
| Stores | 1 | 1 (st.b64) | 1 (st.v2.b64) | Profitable --- single wide store |
The asymmetry is clear: SLP profit on NVPTX comes almost entirely from memory coalescing (loads and stores), not from arithmetic. The arithmetic cost for a v2 bundle is roughly 2x the scalar cost for all types, providing no ALU benefit. But a ld.v2.f32 replaces two separate load instructions with one, reducing instruction count and improving coalescing. This is why Store-SLP (sub_2BCA110) and the load/store adjacency heuristics dominate profitable SLP on GPU.
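Plugging the table's costs into a simple delta (scalar total minus vector total; the numbers are the table's estimates, not measurements) makes the asymmetry explicit:

```c
#include <stdint.h>

/* Cost delta for a v2 bundle: two scalar ops replaced by one vector op.
 * Inputs are the per-op costs from the table above. Positive = profit. */
static int64_t slp_v2_delta(int64_t scalar_cost, int64_t vector_cost) {
    return 2 * scalar_cost - vector_cost;
}
```

For i64 ADD, slp_v2_delta(2, 4) is 0 (neutral), while for a load slp_v2_delta(1, 1) is 1: the entire v2 profit comes from collapsing two memory operations into one.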
Maximum Vector Width on NVPTX
PTX supports vector types up to .v4 for most data types, but the actual hardware constraint is tighter:
- v2: Supported for all types (.b8 through .b64, .f16, .f32, .f64). This is the sweet spot for SLP.
- v4: Supported for .b8, .b16, .b32, .f16, .f32. NOT supported for .b64/.f64.
- v8/v16: Not supported in PTX at all. CPU-style AVX-width vectorization is never legal.
The SLP vectorizer's VF selection logic at sub_2BD1C50 lines 1354--1578 computes:
// VF selection pseudocode (from decompiled sub_2BD1C50)
uint64_t max_sched_width = *(a2 + 840); // NOT from TTI.getRegisterBitWidth()
uint64_t scalar_width = getScalarTypeWidth(a2, first_scalar);
uint64_t vf;
if (scalar_width <= max_sched_width) {
vf = 1 << bsr(max_sched_width / scalar_width); // round-down power-of-2
vf = clamp(vf, 128, 512); // clamp to [128, 512] bits
} else {
vf = 128;
}
For f32 (32 bits) with a max scheduling width of 256 bits, this yields VF = 8 elements. However, PTX legalization later splits anything wider than v4 into multiple instructions, so the effective maximum is v4 for 32-bit types and v2 for 64-bit types. The SLP cost model accounts for this split cost.
GPU-Specific Vectorization Constraints
Legal Vector Types on NVPTX
The NVPTX target has exactly ONE vector register class --- Int32HalfRegs (.b32, prefix %hh) --- which holds 32 bits of packed data. The only legal vector types at the SelectionDAG level are:
| Type | Packing | Register | Legal Since |
|---|---|---|---|
v2f16 | Two f16 in 32 bits | %hh | SM 53+ |
v2bf16 | Two bf16 in 32 bits | %hh | SM 80+ |
v2i16 | Two i16 in 32 bits | %hh | SM 53+ |
v4i8 | Four i8 in 32 bits | %hh | SM 70+ |
Every other vector type is illegal and must be split or scalarized during type legalization (sub_2029C10 / sub_202E5A0). This includes <2 x float>, <4 x float>, <2 x i32>, and <2 x double> --- the very types SLP produces for 32-bit and 64-bit operations.
How SLP Vectors Survive to PTX
SLP-produced vector types such as <4 x float> are not killed by type legalization. Instead, the path is:
- SLP vectorizer (IR level) produces <4 x float> loads, stores, and arithmetic in LLVM IR.
- SelectionDAG type legalization splits <4 x float> into four scalar f32 values for arithmetic operations. However, load and store nodes are intercepted by NVPTX's custom lowering (NVPTXTargetLowering::LowerOperation), which converts them to target-specific NVPTX::LD_v4_f32 / NVPTX::ST_v4_f32 pseudo-instructions.
- Instruction selection maps these pseudo-instructions to PTX ld.v4.f32 / st.v4.f32.
- Arithmetic on the vector elements becomes four independent scalar instructions, which the scheduler can interleave with memory operations.
The net effect: SLP's primary benefit on NVPTX is vectorized memory access, while vectorized arithmetic is a wash. The cost model at sub_2B94A80 (getTreeCost) accounts for this by assigning low cost to vector loads/stores and high scalarization overhead to vector arithmetic.
PTX Vector Width Ceiling
PTX .v2 and .v4 load/store support imposes hard ceilings:
| Element Type | Max .vN | Max Bits | SLP VF Ceiling |
|---|---|---|---|
.b8 / .u8 | .v4 | 32 | 4 |
.b16 / .f16 | .v4 | 64 | 4 |
.b32 / .f32 | .v4 | 128 | 4 |
.b64 / .f64 | .v2 | 128 | 2 |
.b128 | .v1 only | 128 | 1 (no vectorization) |
When the SLP VF exceeds the PTX ceiling (e.g., VF=8 for f32 from the [128,512] bit-width clamping), the backend splits the single wide operation into multiple legal operations. The SLP cost model at sub_2B889C0 factors this split cost into the tree evaluation, ensuring that overly wide VFs are rejected if the split overhead eliminates the coalescing benefit.
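The split arithmetic is a ceiling-divide of the SLP VF by the per-type .vN limit (a sketch of the arithmetic described above, not decompiled code):

```c
#include <stdint.h>

/* Number of legal PTX memory instructions after splitting an SLP vector
 * of `vf` elements against the per-type .vN ceiling from the table. */
static uint32_t ptx_split_count(uint32_t vf, uint32_t vn_ceiling) {
    return (vf + vn_ceiling - 1) / vn_ceiling;  /* ceiling division */
}
```

A VF=8 f32 bundle against the .v4 ceiling gives ptx_split_count(8, 4) = 2, i.e. two ld.v4.f32 instructions; when that split overhead cancels the coalescing win, the tree is rejected.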
Algorithm Overview
CICC's SLP vectorizer has three entry points that collectively implement the upstream BoUpSLP / SLPVectorizerPass:
Straight-Line SLP (sub_2BCE070)
Scans each basic block for groups of isomorphic instructions (same opcode, adjacent or compatible operands). Builds a bottom-up SLP tree using sub_2BAACB0 (buildTree), evaluates cost via sub_2B94A80 (getTreeCost), and emits vector code via sub_2BC6BE0 (vectorizeTree) when profitable. Diagnostic: "SLP vectorized with cost N" on success, "Cannot SLP vectorize list:" on failure.
Store-SLP (sub_2BCA110)
Seeds the SLP tree from consecutive stores to adjacent memory addresses. This is the primary entry point for memory coalescing. Diagnostic: "Stores SLP vectorized with cost N".
Horizontal Reduction SLP (sub_2BD1C50)
The most complex path. Handles horizontal reductions (e.g., summing all elements of a vector). Proceeds in six phases:
Phase 0 -- Scalar chain scan. Reads the reduction operand array at a1+304 (pointer) and a1+312 (count). Each bundle entry is 64 bytes. Classifies operands by opcode: values <= 0x1C are simple scalars (add/sub/mul/etc.), values > 0x1C are complex (fcmp, icmp variants). Calls sub_2B0D8B0 (isReductionOp) to validate each operation as a legal reduction (add, fadd, mul, fmul, and, or, xor, smin/smax/umin/umax, fmin/fmax).
Phase 1 -- Hash table construction. Builds two open-addressing hash tables. The "AllOps" table uses 32-byte entries with LLVM-layer sentinels (-4096 / -8192). See Hash Table and Collection Infrastructure for the hash function, probing strategy, and growth/compaction thresholds.
Phase 2 -- Bundle pair extraction. Calls sub_2B5F980 per bundle to classify reduction opcode pairs. When two consecutive bundles both contain fadd reductions (opcode 90), NVIDIA attempts a paired fadd bundle merge via sub_2B3C030/sub_2B25EA0/sub_2B38BA0. This is an NVIDIA-specific optimization for warp-level fadd reductions not present in upstream LLVM.
Phase 3 -- Main vectorization loop. For each bundle, builds candidate operand lists, selects a VF, and tries vectorization with progressively smaller VFs on failure. The VF trial loop uses memoization (sub_2B3C060) to avoid re-trying the same (offset, VF) pair. Key substeps: canVectorize (legality), buildTree, isTreeTinyAndNotFullyVectorizable / isTreeNotBeneficialForArch (early rejection), scheduleBlock, getTreeCost + getReductionCost (profitability).
Phase 4 -- Final reduction codegen. Produces the final horizontal reduction instruction via sub_2B21C80 (createFinalReduction), chaining multiple entries with sub_2B34820 when multiple sub-trees were vectorized.
Phase 5 -- Multi-tree scheduling and cleanup. Builds a multi-tree reduction schedule, iteratively calling sub_2B2F4A0 (reduceTreeLevel) until a single root value remains, then replaceAllUsesWith + eraseFromParent.
Paired fadd Bundle Merging (NVIDIA-Specific)
This optimization is absent from upstream LLVM and targets warp-level floating-point reduction patterns common in CUDA kernels (e.g., block-level sum reductions, dot products, softmax denominators). When two consecutive reduction bundles both contain fadd operations, CICC attempts to merge them into a single wider bundle, doubling the effective vectorization width for the reduction.
Trigger Condition
During Phase 2 of the horizontal reduction path (sub_2BD1C50, lines 921-1098), sub_2B5F980 (classifyReductionPair) is called per bundle and returns a pair of reduction opcodes (reductionOpcodeA, reductionOpcodeB). The merge path activates when:
- Both opcodes in the current bundle equal 90 (0x5A), which is the internal opcode for
faddreduction. - The next consecutive bundle also has both opcodes equal to 90.
- The two bundles are adjacent in the reduction operand array (no intervening non-fadd bundles).
// Trigger check (decompiled from Phase 2, sub_2BD1C50)
if (v83 == v84 && v83 == 90) { // both opcodes in bundle[i] are fadd
if (v83_next == v84_next && v83_next == 90) { // bundle[i+1] also all-fadd
// Try paired merge
sub_2B3C030(bundle_i, bundle_i_plus_1, ...); // tryMergeFaddBundles
}
}
Three-Function Pipeline
The merge proceeds through three functions in sequence:
| Step | Function | Address | Role |
|---|---|---|---|
| 1. Try | tryMergeFaddBundles() | sub_2B3C030 | Checks whether the two bundles' operand lists can be concatenated without violating data dependencies. Verifies that no operand in bundle B depends on the result of bundle A (or vice versa). Returns a candidate merged-bundle descriptor or null on failure. |
| 2. Validate | validateMergedBundle() | sub_2B25EA0 | Confirms that the merged bundle satisfies SLP legality: all operands are isomorphic (same opcode), the combined operand count does not exceed SLPMaxVF limits, and the merged bundle's scheduling pressure stays within a2+840. Also checks that external uses of intermediate reduction values are compatible with the wider bundle. |
| 3. Rewrite | rewriteMergedBundle() | sub_2B38BA0 | Physically merges the two bundle entries in the reduction operand array. The combined bundle gets double the operand count, and the second bundle slot is marked as consumed (skipped in Phase 3). Updates the AllOps hash table entries to point to the new merged bundle. |
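The control flow through the three functions can be sketched as a try/validate/rewrite pipeline. The types and function bodies below are illustrative stand-ins for the behavior the table describes, not decompiled logic:

```c
#include <stdint.h>

/* Stand-in bundle descriptor; the real 64-byte bundle layout is richer. */
typedef struct { uint32_t count; int consumed; } bundle;

/* tryMergeFaddBundles stand-in: concatenate unless a cross-bundle data
 * dependency exists (modeled as a caller-supplied flag). 0 = failure. */
static uint32_t try_merge(const bundle *a, const bundle *b, int cross_dep) {
    return cross_dep ? 0 : a->count + b->count;
}

/* validateMergedBundle stand-in: merged width must fit the budget. */
static int validate_merged(uint32_t merged_count, uint32_t max_width) {
    return merged_count != 0 && merged_count <= max_width;
}

/* rewriteMergedBundle stand-in: bundle A absorbs B; B is marked
 * consumed so the Phase 3 loop skips it. */
static int merge_fadd_bundles(bundle *a, bundle *b,
                              int cross_dep, uint32_t max_width) {
    uint32_t merged = try_merge(a, b, cross_dep);
    if (!validate_merged(merged, max_width))
        return 0;
    a->count = merged;
    b->count = 0;
    b->consumed = 1;
    return 1;
}
```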
Why This Matters on GPU
Consider a warp-level sum reduction of 64 f32 values, structured as two consecutive 32-element fadd reduction trees. Without merging, the SLP vectorizer processes each 32-element tree independently, producing two separate vectorized reduction chains. With merging, the combined 64-element tree exposes a wider VF window, allowing the vectorizer to produce wider v4 bundles and reduce the total number of reduction shuffle steps.
The merged bundle also benefits the final reduction codegen (sub_2B21C80, createFinalReduction): instead of producing two separate reduction results and combining them with a scalar fadd, the merged tree produces a single reduction result directly.
Commutativity Classification
The SM reduction type at a1+1576 drives commutativity via bitmask 0x10804:
bool is_commutative;
if (reduction_type <= 0x10) {
is_commutative = !((1 << reduction_type) & 0x10804);
// Non-commutative types: 2, 11, 16 (the set bits of 0x10804; likely fsub and cmp variants)
} else {
is_commutative = true;
}
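Expanding the bitmask: 0x10804 = (1<<2) | (1<<11) | (1<<16), so reduction types 2, 11, and 16 are the ones the check treats as non-commutative. A runnable version of the snippet above:

```c
#include <stdint.h>

/* Commutativity check from the decompiled snippet above: bitmask
 * 0x10804 marks reduction types 2, 11, and 16 as non-commutative;
 * everything above 0x10 is treated as commutative. */
static int is_commutative(uint32_t reduction_type) {
    if (reduction_type <= 0x10)
        return ((1u << reduction_type) & 0x10804u) == 0;
    return 1;
}
```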
SLP and the Load/Store Vectorizer
CICC runs two distinct passes that vectorize memory operations, and their scopes partially overlap:
| SLP Vectorizer | OldLoadStoreVectorizerPass | |
|---|---|---|
| Pass name | slp-vectorizer | old-load-store-vectorizer |
| Scope | Isomorphic ops in a BB | Adjacent loads/stores only |
| Seed | Any instruction group | Store/load chains |
| Handles arithmetic | Yes | No |
| Handles reductions | Yes (horizontal) | No |
| Pipeline position | Step 31 (with LoopVectorize) | Post-optimization (NVIDIA-specific) |
| Disable flag | vectorize-slp | disable-nvptx-load-store-vectorizer |
The NVIDIA-proprietary old-load-store-vectorizer (llvm::OldLoadStoreVectorizerPass) is a separate pass distinct from LLVM's LoadStoreVectorizerPass. It runs later in the pipeline and handles NVVM-specific intrinsic vectorization (nvvm_load/nvvm_ld, nvvm_store/nvvm_st) via the vect-intrinsics knob. SLP may vectorize the same load/store chains if they also contain arithmetic; the load/store vectorizer catches whatever SLP missed.
Register Pressure Impact
SLP vectorization increases register pressure because vector values occupy wider registers. On NVPTX, a <2 x float> consumes two 32-bit registers (PTX has no native 64-bit float register file for packed types --- the backend lowers <2 x f32> to a pair of .f32 registers). The benefit comes from reduced instruction count and improved memory coalescing, not from register savings.
The SLP cost model accounts for register pressure through a2+840 (max scheduling width), and the profitability check rejects vectorization when the combined cost (tree cost + reduction cost) exceeds the threshold. When register pressure is already high, the TTI cost model inflates the scalarization overhead, making SLP less likely to fire.
SLP Cost Model and TTI Callouts
The SLP profitability decision is the product of two cost functions that both delegate to TTI: getTreeCost() (sub_2B94A80, 71KB) and getReductionCost() (sub_2B28940). Understanding exactly how these call into TTI is essential for predicting when SLP will fire on a given kernel.
getTreeCost() (sub_2B94A80)
This 71KB function walks every node in the SLP tree and accumulates the cost difference between the vectorized form and the original scalar form. For each tree node, it:
- Calls sub_2B889C0 (45KB, the inner cost computation), which dispatches to TTI via sub_B91420 (TTI::getInstructionCost() at the IR level) --- called approximately 7 times per instruction to query costs for the scalar original, the vector alternative, and scalarization overhead (insert/extract elements).
- For load/store nodes, queries the memory cost model, which returns favorable costs for adjacent accesses (reflecting ld.v2/ld.v4 coalescing benefit) and high costs for gather/scatter patterns.
- For shuffle nodes (operand reordering), queries TTI::getShuffleCost(), which on NVPTX returns high cost for any non-identity shuffle --- GPU has no native shuffle-within-register instruction for packed 32-bit values.
- Returns a pair: (vectorCost : i64, isExact : i32). When isExact == 1, the cost is a precise measurement from the scheduling model; the profitability check accepts it unconditionally regardless of the threshold.
getReductionCost() (sub_2B28940)
Called with the TTI pointer (a4) as the second parameter, this function computes the cost of the horizontal reduction itself --- the shuffle-and-reduce tree that turns a vector into a scalar. Parameters:
sub_2B28940(
a1, // HorizontalReduction object
a4, // TargetTransformInfo*
v478, // operand window start
v479, // operand window end
v432, // hasExternalUses flag
v433, // common opcode mask from Phase 1
a2 // BoUpSLP tree
)
// Returns: (reductionCost : i64, costKind : i32)
The reduction cost on NVPTX is typically high because the GPU has no native horizontal reduction instruction for arbitrary vector widths. A <4 x float> fadd reduction requires 2 shuffle-and-add steps (log2(4) = 2), each involving an extractelement and a scalar fadd. The TTI cost model at sub_20E14F0 (33KB) provides the per-step latency from the scheduling model.
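The shuffle-and-reduce step count grows logarithmically with the vector width, which is why wide reductions stay expensive without a native horizontal instruction. A sketch of the arithmetic (not CICC code):

```c
#include <stdint.h>

/* log2(VF) shuffle+op steps for a horizontal reduction, matching the
 * <4 x float> fadd example above (VF=4 -> 2 steps). */
static uint32_t reduction_steps(uint32_t vf) {
    uint32_t steps = 0;
    while (vf > 1) {
        vf >>= 1;
        steps++;
    }
    return steps;
}
```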
Combined Profitability Decision
// Profitability check (decompiled from sub_2BD1C50, lines 2062-2163)
int64_t treeCost = sub_2B94A80(tree, ...); // vector tree cost
int64_t reducCost = sub_2B28940(rd, TTI, ...); // reduction overhead
int64_t combined = treeCost + reducCost; // overflow-checked via __OFADD__
int64_t threshold = -(int64_t)qword_5010428; // SLPCostThreshold, default 0
if (costKind == 1) {
// Exact cost from scheduling model: always accept
goto vectorize;
}
if (combined > threshold) {
// Not profitable: emit "HorSLPNotBeneficial" diagnostic
// Try smaller VFs via getNextLegalVF() loop
goto try_smaller_vf;
}
// Profitable: proceed to vectorizeTree()
The costKind == 1 fast path is notable: when the cost model can determine the exact scheduling benefit (rather than a heuristic estimate), it bypasses the threshold entirely. This typically fires for small, fully-analyzable SLP trees where every instruction's latency is known from the TTI scheduling tables at TTI+56.
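Stripped of the IR plumbing, the decision reduces to a small predicate (our paraphrase of the decompiled logic above; the __OFADD__ overflow check is omitted):

```c
#include <stdint.h>

/* Combined profitability predicate from the pseudocode above.
 * slp_cost_threshold models qword_5010428 (default 0); cost_kind == 1
 * is the exact-cost fast path that bypasses the threshold. */
static int should_vectorize(int64_t tree_cost, int64_t reduc_cost,
                            int32_t cost_kind, int64_t slp_cost_threshold) {
    if (cost_kind == 1)
        return 1;                              /* exact cost: always accept */
    int64_t combined = tree_cost + reduc_cost;
    return combined <= -slp_cost_threshold;    /* cost <= -threshold wins */
}
```

With the default threshold of 0, any non-positive combined cost vectorizes; a positive cost falls through to the getNextLegalVF() step-down loop.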
VF Stepping on Failure
When vectorization at the current VF is unprofitable, the horizontal reduction path does not immediately give up. Instead, it calls sub_2B1E190 (getNextLegalVF) to step down to the next smaller legal VF, then re-tries the entire build-tree / get-cost cycle:
// VF step-down loop (decompiled from sub_2BD1C50, lines 2097-2163)
while (currentVF > minVF) {
currentVF = sub_2B1E190(TTI, elementType, currentVF);
if (sub_2B3C060(&memoSet, {offset, currentVF})) // alreadyTried?
continue;
// Re-try vectorization at new VF
sub_2BAACB0(tree, ops, currentVF, ...); // buildTree
treeCost = sub_2B94A80(tree, ...); // getTreeCost
reducCost = sub_2B28940(rd, TTI, ...); // getReductionCost
combined = treeCost + reducCost;
if (combined <= threshold)
goto vectorize;
}
// All VFs exhausted: emit "HorSLPNotBeneficial"
The memoization set (sub_2B3C060) prevents re-evaluating the same (offset, VF) pair, which is essential because the VF step-down loop can iterate many times for large operand counts.
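The memoization described above can be sketched as a set keyed on (offset, VF) pairs; the struct and method names here are ours, not recovered symbols:

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <utility>

// Minimal sketch of the alreadyTried() memoization (sub_2B3C060):
// each (offset, VF) pair is evaluated at most once across the VF
// step-down loop.
struct VFMemo {
    std::set<std::pair<uint64_t, unsigned>> tried;

    // Returns true if this (offset, VF) pair was already attempted;
    // records it either way.
    bool alreadyTried(uint64_t offset, unsigned vf) {
        return !tried.insert({offset, vf}).second;
    }
};
```

The first query for a pair returns false and records it; every later query for the same pair short-circuits the expensive build-tree / get-cost cycle.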
Configuration Knobs
Upstream LLVM Knobs (present in CICC)
| Knob | Type | LLVM Default | CICC Default | Effect |
|---|---|---|---|---|
| slp-threshold | int | 0 | 0 | Profitability threshold. Vectorize when cost <= -threshold. Default 0 means any non-positive cost is profitable. |
| slp-vectorize-hor | bool | true | true | Enable horizontal reduction vectorization. |
| slp-vectorize-hor-store | bool | false | false | Seed horizontal reduction from stores. |
| slp-max-reg-size | int | 128 | 128 | Maximum vector register size in bits for SLP scheduling. |
| slp-min-reg-size | int | 128 | 128 | Minimum vector register size. |
| slp-schedule-budget | int | 100000 | 100000 | Maximum scheduling region size per block. |
| slp-recursion-max-depth | int | 12 | 12 | Maximum recursion depth for tree building. |
| slp-min-tree-size | int | 3 | 3 | Minimum tree size for full vectorization. |
| vectorize-slp | bool | true | true | Master switch for the SLP pass. |
| view-slp-tree | bool | false | false | Display SLP trees with Graphviz (debug). |
| slp-max-vf | int | 0 | 0 | Maximum vector factor override (0 = unlimited). |
NVIDIA-Specific Globals
| Global | Address | Default | Effect |
|---|---|---|---|
| SLPMaxVF | qword_500F628 | 0 | When zero: minimum VF = 4 elements. When non-zero: minimum VF = 3, and the value caps the maximum VF. Also bypasses the power-of-2 VF requirement. |
| SLPCostThreshold | qword_5010428 | 0 | Cost threshold for horizontal reduction profitability. Test is cost > -(int)threshold. Default 0: any non-positive cost is profitable. |
| Straight-line max VF | qword_500FEE8 | unknown | Maximum VF override for straight-line SLP (sub_2BCE070), separate from horizontal reduction. |
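The SLPMaxVF policy in the table above can be sketched as follows; the struct and function names are ours, and the "no cap" encoding is an assumption:

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the VF policy implied by the SLPMaxVF global (qword_500F628).
struct VFBounds {
    unsigned minVF;       // smallest operand count worth attempting
    unsigned maxVF;       // 0 means "no cap"
    bool requirePow2;     // must VF be a power of two?
};

VFBounds slpVFBounds(uint64_t slpMaxVF) {
    if (slpMaxVF == 0)
        return {4, 0, true};   // default: >= 4 operands, power-of-2 VFs only
    // Knob set: lower the minimum to 3, cap the maximum, allow any VF.
    return {3, static_cast<unsigned>(slpMaxVF), false};
}
```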
Key Behavioral Differences from Upstream
- Minimum VF default. When SLPMaxVF is zero (default), CICC requires at least 4 scalar operands to attempt horizontal reduction vectorization. Upstream LLVM has no such global minimum; it relies on slp-min-tree-size (default 3) instead.
- VF clamping. CICC clamps VF to [128, 512] bits based on the a2+840 scheduling width, then steps down via getNextLegalVF() (sub_2B1E190). Upstream computes VF from TTI::getMaximumVF() or slp-max-reg-size without the explicit bit-width clamping. The [128, 512] range allows VF=4 through VF=16 for f32 types, whereas upstream NVPTX (32-bit register width) would produce VF=1.
- Paired fadd merging. CICC merges consecutive fadd reduction bundles into wider bundles via sub_2B3C030 / sub_2B25EA0 / sub_2B38BA0. This is absent from upstream and is targeted at GPU warp-level reduction patterns. See the dedicated section above.
- Scheduling-width-driven VF (not register-width-driven). The upstream SLP vectorizer derives VF from TTI::getRegisterBitWidth(Vector). CICC stores a separate scheduling width at a2+840 that reflects the available register budget after accounting for live-in pressure. This decouples SLP VF from the register file width, allowing profitable vectorization even though getRegisterBitWidth(Vector) returns 32.
- isTreeNotBeneficialForArch(). CICC adds a GPU-architecture-specific early rejection filter (sub_2B2DA40) that takes the SM reduction type as a parameter. It rejects tree shapes known to be unprofitable on the target SM variant (e.g., trees that would produce reduction patterns not supported by the SM's warp-level primitives).
- O-level gating. SLP vectorization is gated by tier != 1 in the pipeline assembler: it is disabled at O1 and enabled at O2 and O3. At O2/O3, the LoopVectorize/SLP parameter width is set to tier (2 at O2, 3 at O3), affecting the scheduling width multiplier. SM-architecture-dependent thresholds are resolved at runtime via the a2+840 value.
- Non-power-of-2 VF support. When SLPMaxVF (qword_500F628) is non-zero and operand_count + 1 is a power of 2, adjustVF() (sub_2B1FA70) returns operand_count directly, enabling VFs like 3, 5, 7. Upstream LLVM requires power-of-2 VFs except in specific recent patches for fixed-length non-power-of-2 vectorization.
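The non-power-of-2 rule can be sketched as follows. The `operand_count + 1` test comes from the analysis above; the power-of-2 round-down fallback is our assumption, not recovered logic:

```cpp
#include <cassert>
#include <cstdint>

static bool isPow2(uint64_t x) { return x != 0 && (x & (x - 1)) == 0; }

// Sketch of the adjustVF() (sub_2B1FA70) selection rule: with the
// SLPMaxVF knob active, an operand count one short of a power of two
// is used as the VF directly; otherwise round down to a power of two
// (the fallback is hypothetical).
uint64_t adjustVF(uint64_t operandCount, bool slpMaxVFActive) {
    if (operandCount == 0)
        return 0;
    if (slpMaxVFActive && isPow2(operandCount + 1))
        return operandCount;           // e.g. 3 or 7 survive unrounded
    uint64_t vf = 1;
    while (vf * 2 <= operandCount)
        vf *= 2;
    return vf;
}
```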
Diagnostic Strings
| String | Function | Meaning |
|---|---|---|
| "SLP vectorized with cost N" | sub_2BCE070 | Straight-line SLP succeeded |
| "Cannot SLP vectorize list:" | sub_2BCE070 | Straight-line SLP failed legality/cost |
| "Stores SLP vectorized with cost N" | sub_2BCA110 | Store-seeded SLP succeeded |
| "HorSLPNotBeneficial" | sub_2BD1C50 | Horizontal reduction not profitable |
| "Vectorizing horizontal reduction is possible but not beneficial with cost C and threshold T" | sub_2BD1C50 | Full rejection diagnostic with cost details |
| "VectorizedHorizontalReduction" / "Vectorized horizontal reduction with cost C and with tree size N" | sub_2BD1C50 | Horizontal reduction succeeded |
| "const.rdx" | sub_2B21B90 | Intermediate reduction variable name |
| "rdx.shuf.l", "rdx.shuf.r" | (cluster 0x1BDDB00) | Left/right reduction shuffle names |
| "op.rdx", "op.extra" | (cluster 0x1BDDB00) | Reduction operation and extra operation names |
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| HorizontalReduction::tryToReduce() (main horizontal reduction entry) | sub_2BD1C50 | 85 KB | -- |
| Straight-line SLP vectorizer entry | sub_2BCE070 | -- | -- |
| Store-SLP vectorizer entry | sub_2BCA110 | -- | -- |
| BoUpSLP::buildTree() | sub_2BAACB0 | -- | -- |
| BoUpSLP::getTreeCost() | sub_2B94A80 | 71 KB | -- |
| BoUpSLP::vectorizeTree() (codegen) | sub_2BC6BE0 | 71 KB | -- |
| BoUpSLP::computeScheduleData() | sub_2BBDBE0 | 40 KB | -- |
| BoUpSLP::scheduleBlock() | sub_2BBFB60 | 71 KB | -- |
| BoUpSLP::optimizeGatherSequence() | sub_2BB3590 | -- | -- |
| BoUpSLP::reorderInputsIfNecessary() | sub_2BB0460 | -- | -- |
| BoUpSLP::buildExternalUses() | sub_2B4F3D0 | -- | -- |
| getReductionCost() | sub_2B28940 | -- | -- |
| createFinalReduction() | sub_2B21C80 | -- | -- |
| createReductionOp() ("const.rdx") | sub_2B21B90 | -- | -- |
| buildReductionResult() | sub_2B2FE10 | -- | -- |
| reduceTreeLevel() | sub_2B2F4A0 | -- | -- |
| isReductionOp() | sub_2B0D8B0 | -- | -- |
| isHomogeneous() (all ops satisfy predicate) | sub_2B0D880 | -- | -- |
| canVectorize() (legality check) | sub_2B4B450 | -- | -- |
| isTreeTinyAndNotFullyVectorizable() | sub_2B2DB00 | -- | -- |
| isTreeNotBeneficialForArch() | sub_2B2DA40 | -- | -- |
| adjustVF() (vectorization factor selection) | sub_2B1FA70 | -- | -- |
| getNextLegalVF() | sub_2B1E190 | -- | -- |
| getScalarTypeWidth() | sub_2B49BC0 | -- | -- |
| hasVectorizableReductions() | sub_2B6E610 | -- | -- |
| tryMergeFaddBundles() (NVIDIA-specific) | sub_2B3C030 | -- | -- |
| validateMergedBundle() (NVIDIA-specific) | sub_2B25EA0 | -- | -- |
| rewriteMergedBundle() (NVIDIA-specific) | sub_2B38BA0 | -- | -- |
| perBundleVectorize() | sub_2B77B90 | -- | -- |
| emitVectorizedReductionDiagnostic() | sub_2B44ED0 | -- | -- |
| reorderForCanonical() | sub_2B33D00 | -- | -- |
| SLP tree scheduling | sub_2BD7F70 | 46 KB | -- |
| SLP tree cost computation | sub_2B889C0 | 45 KB | -- |
| SLP value rewriting (scalar-to-vector) | sub_2BCFB90 | 44 KB | -- |
| SLP node creation (tree construction) | sub_2BCAEC0 | 42 KB | -- |
| deleteTree() (cleanup on failure) | sub_2B5C350 | -- | -- |
| alreadyTried() (VF memoization) | sub_2B3C060 | -- | -- |
| tryNextVF() (advance or fail) | sub_2B399C0 | -- | -- |
| classifyReductionPair() (per-bundle opcode pair extraction) | sub_2B5F980 | -- | -- |
| hasExternalUses() (external use check for bundles) | sub_2B27F10 | -- | -- |
| getTargetInfo() (TTI accessor) | sub_BD5C60 | -- | -- |
| initDominatorContext() | sub_D5F1F0 | -- | -- |
| hashOperandSlice() (operand slice hash for scheduling cache) | sub_27B0000 | -- | -- |
| Extended opcode classifier (opcodes > 0x1C) | sub_2B15E10 | -- | -- |
| buildOperandOrder() (commutative reorder table) | sub_2B3D4E0 | -- | -- |
| isInScheduledSet() (scheduling membership test) | sub_2B3D560 | -- | -- |
| Reduction use counter (per-operand) | sub_2B54920 | -- | -- |
| TTI::getRegisterBitWidth(Vector) (returns 32) | sub_DFE640 | -- | -- |
| TTI::supportsScalableVectors() (returns false) | sub_DFE610 | -- | -- |
| TTI::getRegisterBitWidth(Scalar) (returns 32) | sub_DFB1B0 | -- | -- |
| TTI::getInstructionCost() (scheduling cost model) | sub_20E14F0 | 33 KB | -- |
| TTI::getInstructionCost() (IR-level variant) | sub_B91420 | -- | -- |
| TTI::hasAttribute(N) (function attribute query) | sub_B2D610 | -- | -- |
Data Structure: HorizontalReduction Object
| Offset | Type | Field |
|---|---|---|
| +0 | ReductionBundle* | Array of reduction bundle structs |
| +8 | u32 | Bundle count |
| +304 | Value** | Pointer to operand arrays (each bundle = 64 bytes) |
| +312 | u32 | Operand array count |
| +384 | void* | Auxiliary dependency table |
| +392 | void* | useDef map (bit 0 = inline/external flag) |
| +400 | void* | useDef map pointer |
| +408 | u32 | useDef map capacity |
| +1568 | Value* | Root function / reduction entry value |
| +1576 | u32 | SM reduction type (arch-specific opcode) |
| +1580 | u8 | Commutative flag |
| +1584 | char* | Output result array |
| +1592 | u32 | Output result count |
| +1596 | u32 | Output result capacity |
| +1600 | char[16] | Inline result storage |
Cross-References
- LoopVectorize & VPlan -- loop-based vectorization, runs alongside SLP in the same pipeline step
- Loop Unrolling -- unrolling exposes more straight-line code for SLP
- Pipeline & Ordering -- SLP placement at pipeline step 31
- GVN -- runs after SLP to clean up redundancies introduced by vectorization
- Optimization Levels -- SLP enabled at tier 2+; width parameter varies by tier
- NVPTX Target Infrastructure -- TTI hook return values that drive SLP VF selection and cost model
- Type Legalization -- vector split/scalarize rules that constrain SLP output legality
- SelectionDAG & NVPTX Lowering -- custom lowering of SLP-produced vector loads/stores to ld.vN/st.vN
- GPU Execution Model -- memory coalescing requirements that motivate SLP on GPU
Loop Strength Reduction (NVIDIA Custom LSR)
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
Upstream source: llvm/lib/Transforms/Scalar/LoopStrengthReduce.cpp (LLVM 20.0.0). CICC ships the stock LLVM LSR at 0x284F650--0x287C150 alongside a completely separate NVIDIA custom formula solver at 0x199A--0x19BF. The custom solver replaces the formula generation and selection phases while reusing LLVM's SCEV infrastructure, IV rewriting, and chain construction.
NVIDIA ships two entirely separate LSR implementations inside cicc v13.0. The first is upstream LLVM's LoopStrengthReducePass (approximately 200 helpers across 0x284F650--0x287C150, compiled from llvm/lib/Transforms/Scalar/LoopStrengthReduce.cpp). The second is a custom 160KB formula solver (sub_19A87A0, 2688 decompiled lines) sitting at 0x199A--0x19BF, wrapped by NVLoopStrengthReduce at sub_19CE990. Both are invoked through the "loop-reduce" pass name in the LLVM new pass manager pipeline, but NVIDIA's overlay replaces the formula generation and selection phases with GPU-aware logic while reusing LLVM's SCEV infrastructure, IV rewriting, and chain construction.
This page documents the NVIDIA overlay -- the most GPU-specific LLVM pass in cicc. If you are reimplementing cicc's optimizer, this is the pass you cannot skip.
Why NVIDIA Rebuilt LSR
The root motivation is a single equation that does not exist on CPUs: register count determines occupancy, and occupancy determines performance. On a GPU, each additional register per thread can cross a discrete occupancy cliff, dropping warp-level parallelism by an entire warp group -- see the GPU Execution Model for the register budget and cliff table.
On a CPU, LSR's primary concern is minimizing the number of live induction variables to reduce register pressure, with a secondary goal of producing address expressions that fold into hardware addressing modes. The cost model compares formulae by counting registers, base additions, immediate encoding costs, and setup instructions. This works because a CPU's register file is fixed (16 GPRs on x86-64) and the cost of spilling to cache is relatively uniform.
On an NVIDIA GPU, four properties break this model:
- Discrete occupancy cliffs. A formula that saves one instruction but adds one register might push the kernel past a cliff and lose 50% throughput. The cliff boundaries and their impact are documented in the occupancy cliff table.
- No equivalent of L1 spill cost. When a GPU "spills," values go to local memory (DRAM, 200-800 cycles), which is orders of magnitude slower than CPU L1 cache.
- Address space semantics. GPU memory is partitioned into address spaces with different widths and hardware addressing modes. Shared memory (addrspace(3)) uses 32-bit pointers with specialized .shared:: load/store instructions. Generic pointers are 64-bit. Strength-reducing a 32-bit shared-memory pointer can produce 64-bit intermediate values that force truncation, defeating the optimization.
- Typed registers. PTX uses typed virtual registers (%r for 32-bit, %rd for 64-bit, %f for float). A 64-bit induction variable costs two 32-bit register slots. On older architectures (sm_3x through sm_5x), 64-bit integer operations are emulated and expensive; on sm_70+, native 64-bit addressing makes them acceptable.
LLVM's stock cost model knows none of this. It calls TTI::isLSRCostLess which compares an 8-field cost tuple ({Insns, NumRegs, AddRecCost, NumIVMuls, NumBaseAdds, ImmCost, SetupCost, ScaleCost}), but the NVPTX TTI implementation cannot express occupancy cliffs, address space constraints, or the sign-extension register savings that matter on GPU. NVIDIA's solution: replace the formula solver entirely, with 11 knobs for fine-grained control.
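For reference, the 8-field tuple compare can be sketched as a lexicographic comparison; the field ordering below is illustrative and not verified against LLVM 20's actual isLSRCostLess implementation:

```cpp
#include <cassert>
#include <cstdint>
#include <tuple>

// Illustrative sketch of the 8-field LSR cost tuple named above, with a
// lexicographic comparison in the spirit of TTI::isLSRCostLess. Field
// order is an assumption for illustration.
struct LSRCost {
    unsigned Insns, NumRegs, AddRecCost, NumIVMuls;
    unsigned NumBaseAdds, ImmCost, SetupCost, ScaleCost;
};

bool isCostLess(const LSRCost &a, const LSRCost &b) {
    return std::tie(a.Insns, a.NumRegs, a.AddRecCost, a.NumIVMuls,
                    a.NumBaseAdds, a.ImmCost, a.SetupCost, a.ScaleCost) <
           std::tie(b.Insns, b.NumRegs, b.AddRecCost, b.NumIVMuls,
                    b.NumBaseAdds, b.ImmCost, b.SetupCost, b.ScaleCost);
}
```

The point of the sketch is the shape of the problem: a flat scalar tuple like this has no field in which an occupancy cliff or an address-space constraint could be expressed.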
Architecture Overview
The NVIDIA LSR overlay is structured as a 7-phase formula solver pipeline. The main entry point is sub_19A87A0, which takes a single argument: a pointer to an LSR state object (referred to as a1 throughout). The state object is large -- relevant fields span from offset 0 through offset 32160.
LSR State Object Layout
| Offset | Type | Field |
|---|---|---|
| +8 | ScalarEvolution* | SCEV analysis handle |
| +32 | LoopInfo* | Loop analysis handle |
| +40 | uint64_t | Target address space identifier |
| +192 | int64_t** | Stride factor table (array of stride values) |
| +200 | uint32_t | Stride factor count |
| +320 | void* | Reuse chain table base |
| +328 | uint32_t | Reuse chain count |
| +368 | LoopRecord* | Loop use-groups array base |
| +376 | uint32_t | Loop use-groups count |
| +32128 | RPTracker | Register pressure tracking structure |
| +32136 | void* | Formula hash table base |
| +32152 | uint32_t | Formula hash table bucket count |
| +32160 | void* | Working formula set |
Each loop record is 1984 bytes, and each use record within a loop is 96 bytes. The loop's use array starts at loop record offset +744, with the use count at +752. These strides -- 1984 bytes per loop record, 96 bytes per use record -- are constant across all 7 phases. The solver iterates loop_count * uses_per_loop in every phase, making the algorithm O(L * U * S) overall, where S is the stride factor table size.
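The fixed-stride record layout can be illustrated with raw offset arithmetic. Only the sizes and the +744 offset come from the analysis; the buffer setup and names are ours:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Illustration of the recovered record strides: 1984-byte loop records,
// a use-array base pointer stored at loop offset +744, and 96-byte use
// records inside that array.
constexpr size_t kLoopRecSize = 1984;
constexpr size_t kUseRecSize  = 96;
constexpr size_t kUseArrayOff = 744;

const uint8_t *useRecord(const uint8_t *loops, size_t loopIdx, size_t useIdx) {
    const uint8_t *loop = loops + loopIdx * kLoopRecSize;
    const uint8_t *uses = nullptr;
    // Load the use-array base pointer stored at loop offset +744.
    std::memcpy(&uses, loop + kUseArrayOff, sizeof(uses));
    return uses + useIdx * kUseRecSize;
}
```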
The 7-Phase Formula Solver Pipeline
Phase 1: Initial Use Setup (lines 471--537)
The solver iterates all loop use-groups and, within each, all individual uses. For each 96-byte use record, it:
- Copies the use record to stack locals (the record contains base SCEV, stride SCEV, flags, formula kind, scaled register array, offset expression, and secondary immediate).
- Calls sub_19930D0 to expand the scaled register array into a working formula operand list.
- Calls sub_19A22F0 (per-register formula generation): iterates over the scaled register count and calls sub_19A1B20 for each register operand to generate one initial formula candidate per operand. If formula_kind == 1 (with-offset mode), it also calls sub_19A1B20 with operand_idx = -1 to generate a formula for the offset expression itself.
- Calls sub_19A23A0 (alternative formula generation): a second pass with different addressing-mode generation logic, likely producing formulae with combined base+offset or folded immediates.
The output of Phase 1 is a set of initial formula candidates, one per (use, scaled register) pair, covering the basic addressing modes.
Phase 2: Expression Folding with Unfolded Offsets (lines 548--662)
This phase targets uses where base == NULL (pure IV uses, no base pointer). It performs two sub-passes:
Sub-pass A (unfold offset into base): For each pure-IV use, calls sub_19A2680 per scaled register to generate candidate formulae that move the offset expression into the base register field. This is the inverse of LLVM's stock "fold offset into immediate" transform -- NVIDIA sometimes wants the offset in a register because GPU addressing modes have limited immediate widths.
Sub-pass B (factor loop bounds into formula): Builds an iterator set from the loop's start bound (+712) and end bound (+720), then calls sub_19A2820 per (scaled register, iterator bound) pair. This generates formulae that factor common strides out of the loop bounds. For example, if the loop runs i = 0..N and a use computes 4*i + base, this phase can factor out the stride 4 and produce a formula with a single-step IV.
Phase 3: Stride Factor Expansion (lines 662--862)
Runs only for loops where the use type is 3 (immediate-only addressing). This is the phase that explores alternative stride factors from the stride factor table (a1+192).
For each use:
- Extract the representative SCEV via sub_1456040.
- Verify bit width is at most 64 via sub_1456C90.
- Verify a single loop bound (start == end, meaning unit stride).
- Reject floating-point constant offsets (type 15).
- For each stride factor S in the table:
  - Compute scaled_stride = S * use_stride.
  - Overflow check: verify S * stride / S == stride and that INT64_MIN is not involved (avoiding signed overflow UB).
  - Validate SCEV representability via sub_1594790.
  - Also validate S * loop_start and S * loop_end.
  - Construct a candidate formula with the factored stride.
  - Run the formula legality check via sub_1995490 (validates against TTI target legality, loop dominance, and SCEV overflow).
  - If legal: rewrite all scaled operands via sub_13A5B60, check value equivalence via sub_1999100, then commit via sub_19A1660.
The overflow guards in this phase are critical. The multiplication overflow check (v94 * v455 / v94 == v455) prevents generating formulae whose stride values cannot be represented in 64-bit arithmetic, which would silently produce wrong results.
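The guard can be sketched as follows; the function name is ours, but the divide-back check mirrors the decompiled v94 * v455 / v94 == v455 pattern:

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the Phase 3 multiplication overflow guard: reject any
// scaled stride whose 64-bit product wrapped, and refuse INT64_MIN
// outright (its negation is unrepresentable).
bool computeScaledStride(int64_t factor, int64_t stride, int64_t *out) {
    if (factor == 0)
        return false;                        // no factor to apply
    if (factor == INT64_MIN || stride == INT64_MIN)
        return false;                        // negating INT64_MIN overflows
    // Multiply in unsigned space (wraps instead of UB), then verify by
    // dividing the product back out.
    int64_t prod = (int64_t)((uint64_t)factor * (uint64_t)stride);
    if (prod / factor != stride)
        return false;                        // wrapped: not representable
    *out = prod;
    return true;
}
```

The unsigned multiply avoids the very signed-overflow UB the original check is guarding against.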
Phase 4: Chain-Based Formula Generation (lines 872--1082)
For each use, the solver attempts chain-based strength reduction: building formulae where one IV feeds the next use through a simple increment rather than a full recomputation.
Key logic:
- Extracts the representative SCEV for the use.
- If formula_kind != 1 (not with-offset) or the formula has a single element, iterates stride factors and builds chained formulae.
- For immediate-type uses (type == 3), also considers promoting to with-offset mode (type == 1).
- Each candidate is validated through sub_1995490.
- Operands are rewritten via sub_145CF80 (SCEV multiply by stride factor).
- The flag at loop record +728 controls address-space-aware chain construction. When set, chains respect address space constraints -- critical for shared memory (see the disable-lsr-for-sharedmem32-ptr knob section).
Phase 5: Reuse Chain Matching (lines 1093--1256)
For uses where base == NULL, the solver attempts to match existing IV chains for reuse rather than creating new ones.
- Extract the representative SCEV and compute its "extent" (value range) via sub_1456E10.
- Iterate the reuse chain table (a1+320 through a1+328).
- For each chain entry, check if the use's extent matches the chain target.
- Validate legality via sub_14A2CF0.
- If matched: rewrite the use's offset via sub_147BE70 (SCEV rebase).
- Register pressure check: validate via sub_19955B0(rp_tracker, scev_value, loop_idx) that the register pressure after adding this formula stays under the limit.
- If RP passes: tag with address space via sub_19932F0, commit via sub_19A1660.
This is where the lsr-check-rp and lsr-rp-limit knobs have direct effect. The sub_19955B0 function compares projected register pressure against the configured ceiling and returns false if the formula would exceed it.
Phase 6: Formula-to-Use Hash Table Construction (lines 1260--1940)
The most complex phase. It builds two hash tables and uses them to identify which formulae serve multiple uses (shared IV expressions).
Hash Table 1 (7-QWORD entries per slot): maps SCEV expression to a linked list of formula candidates.
| Field Offset | Size | Content |
|---|---|---|
| +0 | 8 | Key: SCEV pointer, or sentinel (-8 = empty, -16 = tombstone) |
| +8 | 8 | Formula candidate linked list head |
| +16 | 4 | Candidate count |
| +24 | 8 | Linked list tail |
| +32 | 8 | Previous pointer (for median walk) |
| +40 | 8 | Next pointer (for median walk) |
| +48 | 8 | Total cost accumulator |
Hash Table 2 (2-QWORD entries): maps SCEV expression to a use-count bitmap tracking how many uses reference the expression.
Both tables use the same hash function: (val >> 9) ^ (val >> 4) masked to the bucket count. Probing is quadratic with tombstone support. Resize triggers at 75% load factor via sub_19A82C0.
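The recovered hash and probing scheme can be sketched as a small map; the class is ours, and the sentinel encoding and 75% resize policy are simplified away (this map never resizes):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of the Phase 6 hash scheme: key hash (val >> 9) ^ (val >> 4)
// masked to a power-of-two bucket count, with triangular-step
// (quadratic) probing.
class ScevCountMap {
    static constexpr uint64_t kEmpty = ~0ull;   // stand-in for the -8 sentinel
    std::vector<uint64_t> keys;
    std::vector<unsigned> counts;
public:
    explicit ScevCountMap(size_t pow2Buckets)
        : keys(pow2Buckets, kEmpty), counts(pow2Buckets, 0) {}

    static size_t hashKey(uint64_t val, size_t mask) {
        return ((val >> 9) ^ (val >> 4)) & mask;
    }
    unsigned &operator[](uint64_t key) {
        size_t mask = keys.size() - 1;
        size_t slot = hashKey(key, mask);
        // Probe offsets 0, 1, 3, 6, 10, ... visit every slot of a
        // power-of-two table, so the loop terminates while space remains.
        for (size_t step = 1;; slot = (slot + step++) & mask) {
            if (keys[slot] == key)
                return counts[slot];
            if (keys[slot] == kEmpty) {
                keys[slot] = key;
                return counts[slot];
            }
        }
    }
};
```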
The phase then:
- Inserts every formula from the working set into Hash Table 1 with SCEV normalization via sub_199D980.
- Cross-references into Hash Table 2 for use counting, merging bitmaps via sub_1998630.
- Iterates the formula set again and, for each formula, traverses the linked list of referencing uses.
- Computes combined cost using sub_220EFE0 (reads cost from a binary tree node at +32).
- Finds the median-cost insertion point (threshold at total_cost / 2) -- this is a key difference from upstream LLVM, which always picks the cheapest formula. NVIDIA's median heuristic avoids both extremes: the cheapest formula might use too many registers, while the most register-efficient formula might use too many instructions.
- Builds (register_id, distance) candidate pairs for each (formula, use) combination. If the candidate set exceeds 31 entries, it migrates from an inline SmallVector to a balanced tree set (sub_19A5C50).
The use-count bitmap uses a compact inline representation: if (value & 1), the high bits encode max_reg_id and the remaining bits form the bitmap directly; otherwise, the value is a pointer to a heap-allocated BitVector (size at +16, data at +0). The popcount check at line 1927 (popcount != 1) filters out expressions used by only one use -- they cannot benefit from strength reduction.
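A simplified sketch of that tagged word follows. The low-bit tag and the inline-vs-heap split come from the analysis; the inline max_reg_id field is omitted, and the spill details are ours (heap pointers from new are even, so the tag is unambiguous):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified sketch of the tagged use-count word: low bit set means the
// remaining bits are an inline bitmap; low bit clear means the word is
// a pointer to a heap bit vector. (Spilled storage is leaked here for
// brevity.)
struct UseBitmap {
    uintptr_t word = 1;   // tag set, no bits recorded: inline and empty

    std::vector<bool> *vec() const { return (std::vector<bool> *)word; }

    void set(unsigned bit) {
        const unsigned inlineBits = 8 * sizeof(uintptr_t) - 1;
        if ((word & 1) && bit < inlineBits) {
            word |= (uintptr_t)1 << (bit + 1);   // skip the tag bit
            return;
        }
        if (word & 1) {                           // spill inline -> heap
            auto *v = new std::vector<bool>(1024, false);
            for (unsigned i = 0; i < inlineBits; ++i)
                if (word & ((uintptr_t)1 << (i + 1)))
                    (*v)[i] = true;
            word = (uintptr_t)v;                  // even pointer: tag clear
        }
        (*vec())[bit] = true;
    }
    unsigned popcount() const {
        if (word & 1) {
            unsigned n = 0;
            for (uintptr_t w = word >> 1; w; w >>= 1) n += w & 1;
            return n;
        }
        unsigned n = 0;
        for (bool b : *vec()) n += b;
        return n;
    }
};
```

A popcount() == 1 result is exactly the single-use case the Phase 6 filter discards.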
Phase 7: Final Formula Selection and Commitment (lines 2042--2686)
After hash table cleanup, the solver iterates the candidate triples (register_id, distance, scev_use) and performs the final selection:
for each candidate (reg_id, distance, scev_use):
loop_record = a1->loops[loop_idx]
repr_scev = getStart(scev_use) // sub_1456040
extent = getExtent(repr_scev) // sub_1456E10
offset_expr = getAddExpr(extent, -distance, 0) // sub_15A0680
offset_norm = foldNormalize(scev_ctx, offset_expr) // sub_146F1B0
bit_budget = getBitWidth(offset_norm) // sub_1456C90
for each use in loop_record:
copy 96-byte use record to stack
if formula_kind == 1: // with-offset mode
fold offset into scaled_regs
set formula_kind = 0 // demote to normal
if candidate IV appears in use's scaled_regs:
// Direct replacement path
validate via sub_1995490 (formula legality)
build replacement AddRecExpr: sub_147DD40(scev_ctx, [target_iv], 0, 0)
replace matching operand in formula
// Sign-extension / width-fit check:
value_range = computeRange(replacement)
if abs(distance) < value_range:
tag with address space (sub_19932F0)
commit formula (sub_19A1660)
else if use references a different IV:
// Cross-IV replacement path
alt_offset = stride + num_uses * distance
alt_formula = getAddExpr(extent, -alt_offset, 0)
validate via sub_1995490
// Sign-bit check: if sign(replacement) == sign(distance),
// the formula may wrap -- reject
if sign_bit_matches: continue
// Width-fit check via APInt:
if countLeadingZeros(result) confirms fit in register width:
commit formula
The width-fit checks use full APInt arithmetic (sub_16A4FD0 for copy, sub_16A7490 for shift/add, sub_16A8F40 for negate, sub_16A7400 for absolute value, sub_16A7B50 for bitwise AND, sub_16A57B0 for leading zero count) to determine whether the replacement formula's value range fits in the bit budget. This is essential for correctness: a formula that overflows its register width produces wrong results silently.
Register Pressure Integration
The integration between LSR and register pressure is the single most important difference from upstream LLVM. It works at three levels:
Level 1: Hard Gate (lsr-check-rp + lsr-rp-limit)
Before committing any reuse chain formula (Phase 5) and internally within the legality check sub_1995490, the solver calls sub_19955B0(rp_tracker, scev_value, loop_idx). This function reads the pre-computed per-loop register pressure estimate from offset a1+32128 and compares the projected post-formula RP against lsr-rp-limit. If the projected RP exceeds the limit, the formula is rejected outright -- it does not even enter the candidate set.
This prevents the pathological case where LSR produces a formula that requires one less instruction per iteration but needs two more live registers, pushing the kernel past an occupancy cliff. On GPU, that one extra instruction is vastly cheaper than the occupancy loss.
Level 2: Bit Budget Proxy (Phase 7)
The "bit budget" computed in Phase 7 (v325 = sub_1456C90(offset_norm)) acts as an indirect register pressure proxy. Wider values need more register slots: a 64-bit value occupies two 32-bit register slots on NVPTX. By enforcing that replacement formulae fit within the bit budget, the solver prevents needless register widening.
Level 3: Sign-Extension Credit (count-sxt-opt-for-reg-pressure + lsr-sxtopt)
When lsr-sxtopt is enabled, LSR attempts to fold sign-extension operations into IV expressions, producing narrower IVs. When count-sxt-opt-for-reg-pressure is also enabled, the cost model credits the register pressure savings from eliminated sign-extensions. A formula that requires one more base register but eliminates a sign-extension might be net-neutral or even beneficial in RP terms.
Level 4: Median-Cost Heuristic (Phase 6)
Rather than always selecting the cheapest formula (as upstream LLVM does), NVIDIA uses a median-cost heuristic. The total cost is summed across all uses of a formula, and the selection threshold is total_cost / 2. This balances instruction cost against register pressure: the cheapest formula often has the highest register pressure, while the formula nearest the median typically represents a balanced tradeoff.
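The selection rule can be sketched as a walk over the cost-sorted candidates that stops once the running total crosses half the overall total; the function name, walk direction, and tie-breaking are our simplification:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of the median-cost selection: instead of taking the cheapest
// candidate outright, pick the first candidate (in ascending cost
// order) at which the running cost total reaches total_cost / 2.
size_t pickMedianCostCandidate(const std::vector<int64_t> &costs) {
    std::vector<size_t> order(costs.size());
    for (size_t i = 0; i < order.size(); ++i) order[i] = i;
    std::sort(order.begin(), order.end(),
              [&](size_t a, size_t b) { return costs[a] < costs[b]; });
    int64_t total = 0;
    for (int64_t c : costs) total += c;
    int64_t running = 0;
    for (size_t idx : order) {
        running += costs[idx];
        if (2 * running >= total)    // threshold: total_cost / 2
            return idx;
    }
    return order.empty() ? 0 : order.back();
}
```

With costs {4, 5, 6} this picks the 5-cost candidate, not the cheapest one, which is the balancing behavior described above.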
GPU-Specific Knobs
All 11 knobs are registered at ctor_214_0 (0x4E4B00). They are LLVM cl::opt command-line options injected through NVIDIA's option registration infrastructure.
Complete Knob Reference Table
| Knob | Type | Default | Category | Effect |
|---|---|---|---|---|
disable-unknown-trip-lsr | bool | false | Scope Control | Skips LSR entirely for loops where SCEV cannot determine the trip count. Unknown-trip loops on GPU may be warp-divergent; applying LSR without trip count knowledge can increase register pressure with no loop-count-informed gain. |
lsr-check-rp | bool | true [MEDIUM confidence] | Register Pressure | Master switch for register pressure checking. When disabled, LSR ignores occupancy constraints and behaves more like upstream LLVM. Default inferred from observed RP-aware behavior in O2 compilations; constructor default not directly confirmed. |
lsr-rp-limit | int | ~32-64 [LOW confidence] | Register Pressure | Register pressure ceiling. If current RP for the loop meets or exceeds this value, LSR is skipped for that loop. The threshold is set to coincide with occupancy cliff boundaries. Range estimated from SM occupancy math; actual compiled-in default not extracted from binary. |
filter-bad-formula | bool | true [MEDIUM confidence] | Formula Quality | Enables NVIDIA's custom formula pruning pass. "Bad" formulae are those requiring too many registers or producing address modes unsupported by SASS (for example, formulae that require scaled-index modes that only exist on CPU). Default inferred from observed pruning behavior; constructor value unconfirmed. |
do-lsr-64-bit | bool | arch-dependent | IV Width | Enables LSR for 64-bit induction variables. Default is false on sm_3x through sm_5x (where 64-bit integer ops are emulated), true on sm_70+ (native 64-bit datapath). |
count-sxt-opt-for-reg-pressure | bool | true [MEDIUM confidence] | Register Pressure | When calculating RP cost, credits the register savings from sign-extension eliminations that LSR enables. Default inferred from observed behavior; constructor value unconfirmed. |
lsr-sxtopt | bool | true [MEDIUM confidence] | Sign Extension | Master switch for sign-extension folding within LSR. Folds sign-extension operations into IV expressions to produce narrower IVs, reducing register file consumption. Default inferred from observed behavior; constructor value unconfirmed. |
lsr-loop-level | int | 0 (all) | Scope Control | Restricts LSR to loops at a specific nesting depth. 0 = all levels. 1 = innermost loops only (where address arithmetic is hottest). |
lsr-skip-outer-loop | bool | false | Scope Control | Skips the outer loop's IV when processing nested loops. Prevents strength-reducing the outer IV when the inner loop is the performance bottleneck. |
disable-lsr-for-sharedmem32-ptr | bool | false | Address Space | Disables LSR for pointers into 32-bit shared memory (addrspace(3)). Protects efficient .shared:: addressing modes and bank-conflict-free access patterns. |
disable-lsr-complexity-discount | bool | false | Cost Model | Disables the complexity estimation discount. When the discount is active (this knob is false), the cost model gives a bonus to formulae that reduce addressing complexity even if they use more registers. Disabling forces strict register-count-based comparison. |
Knob Grouping by Function
Register pressure control (4 knobs): lsr-check-rp, lsr-rp-limit, count-sxt-opt-for-reg-pressure, lsr-sxtopt. These collectively determine whether and how aggressively the solver factors occupancy into formula selection. With all four active, NVIDIA's LSR is deeply occupancy-aware. With all four disabled, it degrades toward upstream LLVM behavior.
Scope control (3 knobs): disable-unknown-trip-lsr, lsr-loop-level, lsr-skip-outer-loop. These restrict which loops LSR operates on. They are safety valves: if LSR is hurting a specific kernel, these allow narrowing its scope without disabling it entirely.
Address space control (2 knobs): disable-lsr-for-sharedmem32-ptr, do-lsr-64-bit. These control how LSR interacts with GPU memory semantics. The shared-memory knob protects 32-bit pointer optimality; the 64-bit knob controls IV width policy.
Cost model control (2 knobs): filter-bad-formula, disable-lsr-complexity-discount. These tune the formula evaluation heuristics. The bad-formula filter removes candidates early; the complexity discount adjusts the tradeoff between instruction count and register count.
Address-Space Awareness
Shared Memory 32-Bit Pointer Protection
Shared memory on NVIDIA GPUs uses addrspace(3) with 32-bit addressing. The hardware provides dedicated .shared:: load/store instructions with efficient addressing modes, including bank-conflict-free access patterns tied to pointer alignment.
NVIDIA's LSR overlay tracks address spaces at two levels:
- Loop level: the address space identifier at loop record +40.
- Use level: the alignment constraint at use record +48.
In Phase 4 (chain-based formula generation), line 983 checks use+48 == a1+40 || flag_at_+728. If the use's address space matches the target or the address-space-crossing flag is set, the solver uses address-space-aware chain construction. The sub_19932F0 helper tags committed formulae with the correct address space.
When disable-lsr-for-sharedmem32-ptr is enabled, the solver skips all formulae targeting addrspace(3) pointers. The rationale: strength-reducing a 32-bit shared memory pointer can create 64-bit intermediate values (the IV increment may be computed in 64-bit before truncation to 32-bit). This defeats the optimization and can prevent the backend from using efficient 32-bit .shared:: addressing modes.
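The gate can be pictured as a simple candidate filter. This is a hypothetical Python sketch: the function and field names are invented, and only the skip-addrspace(3) behavior is taken from the analysis above.

```python
# Hypothetical sketch of the disable-lsr-for-sharedmem32-ptr gate.
# Names are invented; the actual solver operates on the binary record
# layouts documented later on this page.

SHARED_ADDRSPACE = 3  # NVPTX shared memory, 32-bit addressing

def keep_formula(formula, disable_sharedmem32_lsr):
    """Return True if the solver may consider this candidate formula."""
    if disable_sharedmem32_lsr and formula["addrspace"] == SHARED_ADDRSPACE:
        # Skipping avoids widening a 32-bit shared pointer into 64-bit
        # intermediates, which would block .shared:: addressing modes.
        return False
    return True

candidates = [
    {"addrspace": 0, "cost": 3},   # generic pointer
    {"addrspace": 3, "cost": 2},   # shared-memory pointer
]
kept = [f for f in candidates if keep_formula(f, disable_sharedmem32_lsr=True)]
```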
64-Bit IV Control
The do-lsr-64-bit knob controls whether LSR generates formulae using 64-bit induction variables. The architecture-dependent default reflects hardware reality:
- sm_30 through sm_52: 64-bit integer operations are emulated (two 32-bit ops + carry). A 64-bit IV costs roughly 2x the register pressure and 2x the instruction cost. LSR is disabled for 64-bit IVs.
- sm_60 through sm_62: Partial native 64-bit support for address computation.
- sm_70 and above: Full native 64-bit addressing and arithmetic. 64-bit IVs become acceptable.
Phase 3 (stride factor expansion) checks the bit width of the representative SCEV (sub_1456C90 must return at most 64). Phase 7's bit budget check ensures replacement formulae fit within the available register width. Together, these prevent 64-bit IV generation on architectures where it is disabled.
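The architecture split above can be sketched as a tiny default-policy predicate. The real decision logic in the binary has not been recovered; in particular, whether the sm_6x range defaults on or off is an assumption here (taken conservatively as off, matching its "partial support" description).

```python
# Illustrative default policy for do-lsr-64-bit, following the sm ranges
# described above. The sm_6x default is an assumption, not recovered fact.

def default_do_lsr_64bit(sm: int) -> bool:
    """True when 64-bit IV formulae are worth considering by default."""
    if sm < 60:
        return False   # sm_3x-5x: 64-bit ops emulated, ~2x cost
    if sm < 70:
        return False   # sm_6x: partial native support (assumed off)
    return True        # sm_70+: full native 64-bit addressing
```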
Sign-Extension Optimization
When lsr-sxtopt is enabled, the solver actively seeks to fold sign-extension operations into IV expressions. On NVPTX, this is important because:
- PTX uses typed registers. A sext i32 %x to i64 creates a new 64-bit value occupying a separate register pair.
- If LSR can express the IV in a narrower type from the start, the sign-extension becomes dead code.
- When count-sxt-opt-for-reg-pressure is also enabled, the cost model credits this saving.
The sign-extension check appears in Phase 7's width-fit verification. After constructing a replacement formula, the solver computes the value range using APInt arithmetic and checks whether abs(distance) < value_range. If the replacement fits, the sign-extension can be eliminated. An additional sign-bit check (line 2545) rejects replacements where the sign bit of the result matches the sign of the distance -- this would cause the formula to wrap, producing incorrect values.
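The width-fit predicate reduces to a range comparison. A minimal Python sketch, emulating the APInt range reasoning in plain integers (the function name is invented; only the abs(distance) < value_range test comes from the analysis above):

```python
# Sketch of Phase 7's width-fit test. Only the abs(distance) < value_range
# predicate is from the recovered code; the name is a hypothetical stand-in.

def fits_without_sext(distance: int, iv_bits: int) -> bool:
    """True if a replacement IV of iv_bits width can absorb `distance`
    without requiring a sign-extension to a wider type."""
    value_range = 1 << (iv_bits - 1)   # magnitude representable in signed iv_bits
    return abs(distance) < value_range
```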
Complexity Discount Heuristic
When disable-lsr-complexity-discount is false (the default), the cost model applies a discount to formulae that reduce addressing complexity, even if they use more registers. "Addressing complexity" here means the number of operations required to compute the effective address for a memory operation.
Consider two formulae for a memory access inside a loop:
- Formula A: base + 4*i -- one multiplication, one addition. Requires a scaled index register.
- Formula B: ptr += 4 each iteration -- one addition per iteration, no multiplication. Requires one increment register.
Formula B is "simpler" in addressing complexity but might use one more register (the incrementing pointer) alongside the existing base. The complexity discount gives Formula B a bonus in the cost model, reflecting the GPU reality that address computation instructions compete with arithmetic instructions for issue slots, while an extra register has low cost when the kernel is not at an occupancy cliff.
When the discount is disabled (the knob is set to true), the cost model falls back to strict register-count comparison, similar to upstream LLVM behavior.
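The effect of the discount on the Formula A vs B comparison can be sketched numerically. The weight below is an invented illustrative constant; the recovered code only establishes that fewer address-computation ops earn a bonus that can outweigh one extra register.

```python
# Sketch of the complexity-discount comparison (lower score wins).
# addr_weight is invented; only the shape of the tradeoff is from the text.

def formula_score(n_regs, n_addr_ops, discount_active, addr_weight=1.5):
    if discount_active:
        # Address complexity counts: penalize per-iteration address ops.
        return n_regs + addr_weight * n_addr_ops
    return n_regs  # strict register-count comparison, upstream-like

# Formula A: base + 4*i -> 1 register, 2 address ops (mul + add)
# Formula B: ptr += 4   -> 2 registers, 1 address op
a_with, b_with = formula_score(1, 2, True), formula_score(2, 1, True)
a_without, b_without = formula_score(1, 2, False), formula_score(2, 1, False)
```

With the discount active B wins (fewer address ops per iteration); with it disabled A wins (fewer registers), matching the described fallback behavior.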
Comparison: NVIDIA LSR vs Upstream LLVM LSR
| Aspect | Upstream LLVM LSR | NVIDIA Custom LSR |
|---|---|---|
| Code size | ~180KB compiled (500+ helpers, 4 mega-functions) | ~160KB compiled (30 functions, main solver 83KB) |
| Binary location | 0x284F650--0x287C150 | 0x199A--0x19BF overlay |
| Cost model | 8-field tuple: {Insns, NumRegs, AddRecCost, NumIVMuls, NumBaseAdds, ImmCost, SetupCost, ScaleCost}. Compared via TTI::isLSRCostLess. | Register-pressure-aware with occupancy ceiling. Median-cost heuristic. Complexity discount. Sign-extension credit. |
| Formula selection | Always picks cheapest formula per cost tuple ordering | Median-cost heuristic: picks near cost midpoint to balance instructions vs registers |
| Register pressure | Counted but not capped. No occupancy awareness | Hard-gated: lsr-check-rp + lsr-rp-limit reject formulae that exceed RP ceiling |
| Address spaces | Single flat address space assumed | Full address-space tracking. Shared memory (addrspace 3) gets special 32-bit protection |
| 64-bit IVs | Always considered if legal | Gated by do-lsr-64-bit with architecture-dependent defaults |
| Sign-extension | Not a first-class concern | Dedicated optimization path with RP credit (lsr-sxtopt, count-sxt-opt-for-reg-pressure) |
| Loop scope | All loops | Filterable by nesting depth (lsr-loop-level) and outer-loop exclusion (lsr-skip-outer-loop) |
| Trip count requirement | Attempts all loops | Can skip unknown-trip loops (disable-unknown-trip-lsr) |
| Hash table | DenseSet<SmallVector<SCEV*>> for uniquification | Custom 7-QWORD-per-entry hash table with quadratic probing, tombstones, 75% load factor resize, linked-list formula chains, and use-count bitmaps |
| Formula phases | Single-pass candidate generation followed by cost-based pruning | 7 sequential phases: initial setup, expression folding, stride expansion, chain generation, reuse matching, hash table construction, final selection |
| SCEV infrastructure | Native | Reused from LLVM (shared SCEV, IV rewriting, chain construction) |
| Tuning knobs | 7 cl::opt knobs (general-purpose: lsr-insns-cost, lsr-filter-same-scaled-reg, lsr-complexity-limit, etc.) | 11 GPU-specific knobs (register pressure, address space, loop scope, cost model) |
What NVIDIA Reuses From Upstream
The NVIDIA overlay does not replace everything. It reuses:
- SCEV infrastructure (0xDB--0xDF range): ScalarEvolution analysis, AddRecExpr construction, range analysis, and trip count computation.
- IV rewriting (sub_1997F10): creates the replacement IV values with the naming convention "IV.S." and "IV.S.next.".
- Chain construction (sub_199EAC0): builds IV chains with the "lsr.chain" naming prefix.
- Formula cost model base (sub_1995010): the underlying cost computation, which NVIDIA then wraps with RP checking and sign-extension credit.
- Terminator folding (sub_287C150): the "lsr_fold_term_cond" transform that folds loop exit comparisons.
What NVIDIA Replaces
- Formula generation (Phases 1--5): entirely custom, with address-space awareness, stride factor expansion, and reuse chain matching with RP validation.
- Formula-to-use mapping (Phase 6): custom hash tables replacing LLVM's DenseSet-based uniquification with a design optimized for linked-list traversal and median-cost computation.
- Final selection (Phase 7): custom selection with width-fit checks, sign-extension validation, and cross-IV replacement -- none of which exist in upstream LLVM.
Key Helper Function Map
For reimplementation reference, the critical helpers and their roles:
| Address | Function | Role |
|---|---|---|
| sub_19A87A0 | Main 7-phase solver | Entry point (83KB, 2688 lines) |
| sub_19CE990 | NVLoopStrengthReduce::run() | Pass wrapper |
| sub_1995490 | Formula legality validator | TTI + SCEV + loop constraint check |
| sub_19955B0 | Register pressure check | Compares projected RP vs limit |
| sub_19932F0 | Address space tagger | Sets addrspace on formula |
| sub_19A1660 | Formula commit | Sorts, deduplicates, inserts into candidate set |
| sub_19A22F0 | Per-register formula gen (Phase 1) | Loops sub_19A1B20 per operand |
| sub_19A2680 | Unfolded-offset formula gen (Phase 2a) | Offset-to-base transform |
| sub_19A2820 | Loop-bound-factored formula gen (Phase 2b) | Stride factoring |
| sub_19A82C0 | Hash table resize | Power-of-two bucket growth |
| sub_199D980 | SCEV normalization | Canonical form for hashing |
| sub_1998630 | Use-count bitmap merge | Inline bitmap + heap fallback |
| sub_1456040 | SCEV getStart() | Extract base from AddRecExpr |
| sub_1456C90 | SCEV getBitWidth() | Register width determination |
| sub_1456E10 | SCEV extent computation | Value range of IV |
| sub_145CF80 | SCEV getMulExpr() | Multiply SCEV by stride factor |
| sub_147BE70 | SCEV rebase | Rewrite base in AddRecExpr |
| sub_147DD40 | AddRecExpr constructor | Build replacement IV chain |
| sub_15A0680 | SCEV getAddExpr() | Add constant offset |
| sub_146F1B0 | SCEV fold/normalize | Simplify expression |
Data Structure Reference
Use Record (96 bytes)
+0 [8] base_scev : SCEV* (NULL for pure-IV uses)
+8 [8] stride_scev : SCEV* (loop stride expression)
+16 [1] flags : bit 0 = is_address, bit 1 = has_offset
+24 [8] formula_kind : 0 = normal, 1 = with-offset, 3 = immediate-only
+32 [8] scaled_regs_ptr : pointer to SmallVector<SCEV*>
+40 [4] scaled_regs_cnt : number of scaled register operands
+48 [32] padding / alignment / additional fields
+80 [8] offset_scev : SCEV* (offset expression)
+88 [8] secondary_imm : secondary immediate value
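As a working aid, the use record layout above can be decoded from raw bytes. This is a hypothetical helper for tooling around the RE effort: the offsets come from the table, but treating pointer fields as little-endian u64s is an assumption about the x86-64 binary.

```python
import struct

# Hypothetical decoder for the 96-byte LSR use record described above.
# Offsets are from the layout table; endianness/width are assumptions.

def parse_use_record(buf: bytes) -> dict:
    assert len(buf) >= 96
    u64 = lambda off: struct.unpack_from("<Q", buf, off)[0]
    flags = buf[16]
    return {
        "base_scev":       u64(0),    # NULL for pure-IV uses
        "stride_scev":     u64(8),
        "is_address":      bool(flags & 0b01),
        "has_offset":      bool(flags & 0b10),
        "formula_kind":    u64(24),   # 0 normal, 1 with-offset, 3 imm-only
        "scaled_regs_ptr": u64(32),
        "scaled_regs_cnt": struct.unpack_from("<I", buf, 40)[0],
        "offset_scev":     u64(80),
        "secondary_imm":   u64(88),
    }

raw = bytearray(96)
raw[16] = 0b11                       # is_address | has_offset
struct.pack_into("<Q", raw, 24, 1)   # formula_kind = with-offset
rec = parse_use_record(bytes(raw))
```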
Loop Record (1984 bytes)
+32 [4] use_type : 0 = generic, 1 = address-check, 3 = immediate
+40 [8] addr_space : address space identifier
+48 [4] alignment : alignment constraint (bytes)
+712 [8] loop_start : SCEV* (loop start bound)
+720 [8] loop_end : SCEV* (loop end bound)
+728 [1] as_aware_flag : address-space-aware LSR active
+729 [1] dead_guard_flag : if set && use_count > 0, skip loop
+744 [8] use_array_ptr : pointer to array of use records
+752 [4] use_count : number of uses in this loop
Reimplementation Notes
- Start with the knob infrastructure. Register all 11 cl::opt knobs before anything else. The pass wrapper (sub_19CE990) reads these early and uses them to gate entire phases.
- The RP tracker must exist before the solver runs. The register pressure estimate at a1+32128 is computed by an earlier pass (likely during loop analysis). The NVIDIA LSR does not compute RP itself -- it only reads and compares.
- The hash function is deterministic. (val >> 9) ^ (val >> 4), masked to the bucket count. Quadratic probing with tombstone support. If you are reimplementing the hash tables, use the same scheme or your formula deduplication will differ.
- The median-cost heuristic is the secret sauce. Upstream LLVM always picks the cheapest formula. NVIDIA picks near the median. This single difference is responsible for most of the occupancy improvements. If you must simplify, keep this heuristic.
- The overflow checks in Phase 3 are load-bearing. The S * stride / S == stride check and the INT64_MIN guard prevent generating formulae with wrapped arithmetic. Removing these checks will produce silently wrong code on kernels with large strides.
- Address space tagging (sub_19932F0) must happen before commit. Every formula committed via sub_19A1660 must carry the correct address space tag. Forgetting this will produce PTX that uses generic loads/stores where shared-memory instructions are required, breaking both performance and correctness.
- The use-count bitmap has two representations. Inline (when value & 1) and heap-allocated. The inline form is fast but limited to small register ID ranges. The heap form uses a BitVector with the size at +16. Both must be supported.
- Phase ordering is strict. The 7 phases must run in order. Later phases depend on candidates generated by earlier ones, and the hash tables in Phase 6 assume all candidates have been generated by Phases 1--5.
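The hash-table scheme described in the notes above can be modeled compactly. A minimal sketch, assuming the recovered parameters (hash (val >> 9) ^ (val >> 4) masked to a power-of-two bucket count, quadratic probing, tombstones, resize at 75% load); the 7-QWORD entry payload and linked-list formula chains are omitted, so this models only the probing discipline:

```python
# Sketch of the solver's formula hash table probing scheme.
# Only the hash, probe sequence, tombstones, and load factor are from
# the analysis; the class shape is an invented stand-in.

EMPTY, TOMBSTONE = object(), object()

class FormulaTable:
    def __init__(self, buckets=16):
        self.slots = [EMPTY] * buckets     # power-of-two bucket count
        self.count = 0

    def _hash(self, val):
        return ((val >> 9) ^ (val >> 4)) & (len(self.slots) - 1)

    def insert(self, val):
        if (self.count + 1) * 4 > len(self.slots) * 3:   # 75% load factor
            self._resize()
        i, step = self._hash(val), 1
        while self.slots[i] not in (EMPTY, TOMBSTONE):
            if self.slots[i] == val:
                return False               # duplicate formula, deduplicated
            i = (i + step) & (len(self.slots) - 1)
            step += 1                      # triangular/quadratic probe
        self.slots[i] = val
        self.count += 1
        return True

    def contains(self, val):
        i, step = self._hash(val), 1
        while self.slots[i] is not EMPTY:  # tombstones do not stop the probe
            if self.slots[i] == val:
                return True
            i = (i + step) & (len(self.slots) - 1)
            step += 1
        return False

    def _resize(self):
        live = [s for s in self.slots if s not in (EMPTY, TOMBSTONE)]
        self.slots = [EMPTY] * (len(self.slots) * 2)   # power-of-two growth
        self.count = 0
        for v in live:
            self.insert(v)
```

Triangular probing (step offsets 1, 2, 3, ...) visits every slot of a power-of-two table, which is why the power-of-two growth in sub_19A82C0 matters.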
Differences from Upstream LLVM
| Aspect | Upstream LLVM | CICC v13.0 |
|---|---|---|
| Formula solver | Single LLVM LoopStrengthReduce with TTI-based cost model | Two implementations: stock LLVM LSR + custom 160 KB NVIDIA formula solver (sub_19A87A0, 2688 lines) that replaces formula generation/selection |
| Cost model | 8-field cost tuple ({Insns, NumRegs, AddRecCost, ...}), no occupancy concept | Occupancy-aware cost: register count evaluated against discrete warp occupancy cliffs where +1 register can halve throughput |
| Address space awareness | No address space semantics in formula selection | Address space tagging (sub_19932F0) ensures formulae preserve shared memory (addrspace 3) 32-bit pointer width; prevents strength-reducing 32-bit pointers into 64-bit generic form |
| Knob count | ~7 general-purpose cl::opt knobs for cost tuning | 11 GPU-specific knobs for fine-grained control (register pressure, address space, loop scope, cost model; e.g. disable-lsr-for-sharedmem32-ptr) |
| Algorithm structure | Monolithic formula generator + greedy selector | 7-phase formula solver pipeline: candidate generation, stride-based filtering, use-group analysis, formula selection, commit, rewrite |
| State object | Modest state for formula tracking | 32,160-byte state object with embedded register pressure tracker, formula hash table, and per-use-group formula arrays |
| Typed register cost | All registers weigh the same | 64-bit IVs cost two 32-bit register slots; emulated on sm_3x--5x; native on sm_70+ but still double the pressure |
StructurizeCFG
Prerequisites: Familiarity with GPU execution model (warp divergence, reconvergence), LLVM dominator tree and post-dominator tree concepts, and the PTX emission pipeline. Understanding of reducible vs. irreducible control flow is assumed.
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
Upstream source: llvm/lib/Transforms/Scalar/StructurizeCFG.cpp (LLVM 20.0.0). The upstream version was originally written for AMDGPU; cicc ships both the stock AMDGPU copy at sub_1F0EBC0 and a separate NVPTX-customized copy at sub_35CC920.
CICC v13.0 ships two copies of the StructurizeCFG pass: an NVPTX-specific version at sub_35CC920 (95 KB, 2,397 decompiled lines) and a stock LLVM/AMDGPU version at sub_1F0EBC0. Both exist because the binary links both the NVPTX backend and the generic LLVM Scalar library; only the NVPTX instance is scheduled in the CUDA compilation pipeline. This page documents the NVPTX version exclusively.
The pass is mandatory for PTX emission. It is registered as "structurizecfg" in the pipeline parser (sub_2377300, sub_233F860) and listed as a required late pass by sub_29882C0 and sub_1A6D600.
Why PTX Requires Structured Control Flow
PTX is a structured instruction set. Unlike x86 or ARM, where a branch can target any address and the hardware resolves control flow at retirement, the NVIDIA GPU execution model imposes three hard constraints:
-
Reconvergence at post-dominators. When a warp diverges (threads take different sides of a branch), the hardware needs a defined reconvergence point where all threads synchronize before continuing. This reconvergence point must be the immediate post-dominator of the branch. An unstructured CFG has no guarantee that such a point exists or is reachable from both sides.
-
No multi-entry loops. A loop header must dominate every block in the loop body. If two distinct blocks serve as loop entries (an irreducible cycle), the hardware has no single point to insert the loop counter logic and the warp-level loop exit barrier. PTX therefore requires all loops to be natural (single-entry, reducible).
-
No exception handling funclets. CUDA device code has no runtime support for stack unwinding, personality routines, or catch dispatch. The funclet-based EH model (Windows SEH, C++ landing pads) produces control flow patterns that cannot be expressed in PTX.
The StructurizeCFG pass converts reducible-but-unstructured flow into structured form by inserting "Flow" blocks that serve as explicit reconvergence points. It rejects irreducible flow and EH funclets with diagnostic remarks rather than attempting to restructure them.
Binary Layout
| Function | Address | Size | Role |
|---|---|---|---|
| sub_35CC920 | 0x35CC920 | 95 KB | Main pass body |
| sub_35CF930 | 0x35CF930 | ~2 KB | Entry gate / dispatch wrapper |
| sub_35CA2C0 | 0x35CA2C0 | ~4 KB | Irreducibility detector |
| sub_35CB4A0 | 0x35CB4A0 | ~8 KB | Uniform branch classifier |
| sub_35CBCD0 | 0x35CBCD0 | ~6 KB | Region structurizer core |
| sub_35CA580 | 0x35CA580 | ~1 KB | Diagnostic emitter |
| sub_35CA9C0 | 0x35CA9C0 | ~1 KB | Hash-set insert for BB tracking |
| sub_35C9CD0 | 0x35C9CD0 | ~2 KB | Edge reroute through new block |
| sub_35C9ED0 | 0x35C9ED0 | ~1 KB | Domtree NCA (nearest common ancestor) walk |
| sub_35C9B40 | 0x35C9B40 | trivial | Successor array offset (return a1 + 8*a3) |
Entry Gate: sub_35CF930
sub_35CF930 is the runOnFunction entry. It implements a multi-stage filter before committing to the expensive structurization:
sub_35CF930(pass, function):
// 1. Early-out for trivially uninteresting functions
if sub_BB98D0(pass, function) fails:
return 0
// 2. Single-block functions need no structurization
bb_list = function + 40
if bb_list points to itself (single block):
return 0
// 3. Query target machine for a structurizer strategy object
strategy = target_machine->vtable[136](...)
// 4. Check enable-shrink-wrap override
switch qword_50400C8:
case 1: goto force_structurize // always run
case 2: return 0 // always skip
case 0: // ask strategy object
if not strategy->vtable[72](function):
return 0 // strategy says skip
// 5. Check function attributes for safe-to-skip markers
for attr_id in [56, 63, 59, 64, 57]:
if sub_B2D610(function, attr_id):
return 0
// 6. Run the actual structurizer
force_structurize:
return sub_35CC920(pass, function)
The attribute IDs likely map to: 56 = convergent, 63 = nodivergencesource, 59 = nounwind, 64 = alwaysinline, 57 = optnone. [MEDIUM confidence] These numeric-to-name associations are inferred from LLVM attribute enumeration ordering in the upstream source and the semantic context of their usage (skip-structurize guard), not from string evidence in the binary. The attribute enum may differ in NVIDIA's fork. Functions carrying any of these are either already guaranteed to have uniform control flow or are explicitly marked as not-to-be-optimized.
CLI Knobs
| Knob | Registration | Type | Default | Effect |
|---|---|---|---|---|
| structurizecfg-skip-uniform-regions | ctor_227 @ 0x4E9E40, ctor_489 @ 0x553F30 | bool | false | When true, regions with only uniform (warp-coherent) branches are left unstructured, avoiding unnecessary code bloat |
| structurizecfg-relaxed-uniform-regions | ctor_489 @ 0x553F30 | bool | true | Allows treating a region as uniform even if sub-regions contain non-uniform branches, provided there is at most one conditional direct child |
| enable-shrink-wrap (qword_50400C8) | ctor_688 @ 0x5A6520 | int (0/1/2) | 0 | 0 = ask TargetRegisterInfo (vtable+72) whether to structurize; 1 = force structurize unconditionally; 2 = skip structurize entirely |
The enable-shrink-wrap knob is stored as a global at qword_50400C8. Despite its name (borrowed from the generic LLVM shrink-wrapping pass infrastructure), it serves as a master override for the structurization decision. Mode 2 effectively disables the pass, which would produce miscompilation for any function with divergent branches -- it exists purely as a debugging/override mechanism.
Irreducibility Detection: sub_35CA2C0
Called early in sub_35CC920 (line ~743 of the decompiled output), this function determines whether the CFG contains irreducible cycles. It detects irreducibility but does not restructure it.
Algorithm
The function receives the RPO-ordered basic block list from the SCC decomposition phase and iterates backwards:
sub_35CA2C0(result, domtree_data, bb_list, bb_count):
for each BB in reverse(bb_list):
for each successor S of BB:
// Probe dominator tree hash table
// Hash: ((ptr >> 9) ^ (ptr >> 4)) & (bucket_count - 1)
dom_node = lookup(domtree_data, S)
// If S does NOT dominate BB, but there is a back-edge
// from BB to S, this is an irreducible cycle
if back_edge(BB, S) and not dominates(S, BB):
return 1 // irreducible
return 0 // reducible
The core invariant: in a reducible CFG, every back-edge target dominates its source. If a back-edge exists where the target does not dominate the source, the loop has multiple entries and is irreducible.
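The invariant can be checked directly on a small CFG model. The sketch below replaces the binary's dominator-tree hash-table probe with plain set lookups and uses a textbook iterative dominator computation; it is illustrative, not a transcription of sub_35CA2C0.

```python
# Sketch of the reducibility invariant: every back-edge target must
# dominate its source. CFG is a dict: node -> list of successors.

def dominators(succs, entry):
    """Map each node to the set of nodes that dominate it (iterative dataflow)."""
    nodes = set(succs)
    preds = {n: set() for n in nodes}
    for n, ss in succs.items():
        for s in ss:
            preds[s].add(n)
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            new = ({n} | set.intersection(*(dom[p] for p in preds[n]))
                   if preds[n] else {n})
            if new != dom[n]:
                dom[n], changed = new, True
    return dom

def is_reducible(succs, entry):
    dom = dominators(succs, entry)
    state = {entry: "active"}            # DFS colors: active = on stack
    stack = [(entry, iter(succs[entry]))]
    while stack:
        node, it = stack[-1]
        s = next(it, None)
        if s is None:
            state[node] = "done"
            stack.pop()
        elif state.get(s) == "active":   # back-edge node -> s
            if s not in dom[node]:       # target doesn't dominate source
                return False             # multiple-entry cycle: irreducible
        elif s not in state:
            state[s] = "active"
            stack.append((s, iter(succs[s])))
    return True
```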
Rejection behavior
When sub_35CA2C0 returns 1 (irreducible detected), the main pass emits:
remark: UnsupportedIrreducibleCFG
"Irreducible CFGs are not supported yet."
via sub_35CA580 and returns without modifying the function. The return value is forced to 0 (no modification made).
This is a critical design choice. LLVM upstream provides a separate FixIrreduciblePass (sub_29D33E0, registered as "fix-irreducible") that performs node-splitting to convert irreducible cycles into reducible ones. However, the NVPTX pipeline in CICC v13.0 does not schedule FixIrreduciblePass before StructurizeCFG. The assumption is that well-formed CUDA C++ source never produces irreducible flow. If it does (extreme goto abuse, or a prior optimization pass introducing an irreducible pattern), the compilation emits the diagnostic and the resulting PTX will likely be rejected by ptxas.
EH Funclet Rejection
During the per-block iteration in the main loop, each basic block is checked for funclet status at offset BB+235 (a boolean flag indicating the block is a catchpad, cleanuppad, or catchret target):
if BB->isEHFunclet(): // *(BB + 235) != 0
emit_diagnostic("UnsupportedEHFunclets",
"EH Funclets are not supported yet.")
clear visited bitvector
bail out
The funclet model (Windows x64, ARM64) structures exception handling into mini-functions that require personality routines and unwind tables. None of this exists in the GPU runtime. If a funclet block appears, it means the frontend erroneously lowered exception handling into device code.
After emitting the diagnostic, the pass checks qword_503FFE8 (a global flag, possibly a debug override). If nonzero, it attempts to find a single-entry point and process the rest of the function; if zero, it bails out entirely.
Uniform Branch Classification: sub_35CB4A0
This function (~500 decompiled lines) classifies whether a branch instruction is warp-uniform (all threads in the warp take the same direction) or divergent. The classification determines whether the region under that branch needs structurization.
Classification logic
sub_35CB4A0(pass_state, BB, ...):
terminator_opcode = BB->opcode_category // BB + 68, unsigned short
// Non-conditional terminators (ret, unreachable, switch) skip analysis
if (terminator_opcode - 1) > 1:
return 0 // not a conditional branch, no structurization needed
// Check function-level flags
func_flags = BB->parent->flags // BB + 32 + 64
// bit 3 (0x08) = hasConvergentCalls
// bit 4 (0x10) = hasDivergentBranches
// Check block-level properties
block_flags = BB->properties // BB + 44
// bit 2 (0x04) = already classified
// bit 3 (0x08) = uses profile data
// Query DivergenceAnalysis
uniformity = sub_2E88A90(divergence_info, BB, mask_bits)
// mask_bits: 0x80000 = uniform, 0x100000 = divergent, 0x80 = other
// Additional uniformity check
is_uniform = sub_2E8B090(divergence_info, BB)
if is_uniform and skip_uniform_regions_enabled:
return 0 // uniform, can skip structurization
return 1 // divergent, needs structurization
When the structurizecfg-skip-uniform-regions knob is active, regions with all-uniform branches are left unmodified. This is sound because uniform branches do not cause warp divergence and therefore do not require explicit reconvergence points. Skipping these regions reduces code bloat from the insertion of unnecessary Flow blocks.
The structurizecfg-relaxed-uniform-regions knob relaxes the uniformity check for sub-regions. In upstream LLVM, hasOnlyUniformBranches refuses to treat a region as uniform if any sub-region contains a non-uniform branch. The relaxed mode allows this if there is at most one conditional direct child, under the reasoning that a single divergent sub-region can be handled by an inner structurization pass invocation.
Region Structurizer Core: sub_35CBCD0
This is the heart of the transformation. When a non-uniform, non-EH block is identified, sub_35CBCD0 processes its region:
sub_35CBCD0(pass_state, BB, context):
// 1. Manage region boundaries
head = pass_state[67] // current region head
tail = pass_state[68] // current region tail
// 2. Iterate successors
for each successor S of BB (via sub_2E313E0):
// 3. Check uniformity of successor edge
if sub_35CB4A0(pass_state, S, ...) returns 0:
continue // uniform edge, skip
// 4. Compute reconvergence point via NCA
nca = sub_35C9ED0(domtree, BB, S)
// NCA = nearest common ancestor in dominator tree
// This is where threads from both sides of the branch
// must reconverge
// 5. Update region boundaries
pass_state[67] = update_head(head, nca)
pass_state[68] = update_tail(tail, nca)
// 6. Update visited-BB bitvector
set_bit(pass_state[91], BB->ordinal)
The NCA computation (sub_35C9ED0) walks the dominator tree upward from both the current block and its successor until finding their nearest common ancestor. This NCA becomes the reconvergence point: the block where the hardware must synchronize all threads before continuing.
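The upward walk is a standard depth-guided two-pointer climb. A minimal sketch, with `parent` and `depth` maps standing in for the binary's domtree node fields (those field names are assumptions):

```python
# Sketch of the sub_35C9ED0-style NCA walk: step the deeper node toward
# the root until both walks meet. parent/depth are illustrative stand-ins.

def domtree_nca(parent, depth, a, b):
    while a != b:
        if depth[a] >= depth[b]:
            a = parent[a]   # climb the deeper (or equal-depth) node
        else:
            b = parent[b]
    return a                # reconvergence point for the divergent branch

# Dominator tree for a simple if-then-else: entry -> cond -> {then, else}
parent = {"entry": None, "cond": "entry", "then": "cond", "else": "cond"}
depth  = {"entry": 0, "cond": 1, "then": 2, "else": 2}
```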
Main Structurization Loop: sub_35CC920
The main pass body executes in four phases.
Phase 1: Initialization (lines 433-648)
// Store analysis results in pass object fields
pass[65] = DivergenceAnalysis + 200
pass[66] = LoopInfo + 200
pass[67] = 0 // current head
pass[68] = 0 // current tail
pass[69] = DomTree + 200
pass[70] = PostDomTree + 200
pass[71] = loop_depth_info
// Compute RPO (reverse post-order)
rpo = sub_2EA7130() -> sub_2EA7B20()
// Build SCC ordering (cross-references RPO with SCC decomposition)
scc_order = sub_357E170(rpo)
// Check for irreducible cycles
if sub_35CA2C0(scc_order, domtree, ...):
emit "UnsupportedIrreducibleCFG"
return 0
Phase 2: Per-block classification (lines 816-2253)
Iterates blocks in reverse RPO order (bottom-to-top):
for each BB in reverse_rpo(scc_order):
// (a) Reject EH funclets
if BB->isEHFunclet:
emit "UnsupportedEHFunclets"
clear bitvector, bail out
// (b) Already marked for structurization
if BB->structurize_flag (BB+216) or BB->flag_262 (BB+262):
sub_35CBCD0(pass, BB, ...) // structurize this region
continue
// (c) Check successors for back-edges to visited blocks
has_loop = false
for each successor S of BB:
if bitvector_test(S->ordinal):
has_loop = true // back-edge detected = loop header
// (d) Classify uniformity of predecessors
needs_structurize = false
for each predecessor P of BB:
if sub_35CB4A0(pass, P, ...):
needs_structurize = true
break
// (e) Apply structurization
if needs_structurize:
sub_35CBCD0(pass, BB, ...)
// (f) Update bitvector
bitvector_set_or_clear(BB->ordinal, needs_structurize)
Phase 3: Domtree-guided reconvergence (lines 2255-2396)
After the per-block loop, if a split point was identified (pass[67] != 0 and pass[68] != 0):
// Walk domtree from split point upward
current = split_point
while current != null:
// Query strategy object for split decisions
if strategy->shouldSplit(current): // vtable+312
sub_35CBCD0(pass, current, ...)
if strategy->shouldSplitChild(current): // vtable+320
// second round for child regions
...
current = domtree_parent(current)
// Store results in function metadata for PTX emission
function_obj[672] = head // reconvergence head
function_obj[680] = tail // reconvergence tail
These stored head/tail values are read by subsequent PTX emission passes to emit the correct convergence/reconvergence annotations in the output PTX.
Phase 4: Cleanup (lines 2383-2396)
Frees the helper object allocated at line 771 (0xA8 bytes), the SCC ordering buffer, and returns the modification flag (0 = no changes, 1 = modified).
Reconvergence Insertion Path
When a non-uniform divergent region is identified between a head block and a tail block, the pass performs the actual CFG transformation:
Step 1: Dominance validation
// Head must dominate tail
if not sub_2E6D360(domtree, head, tail):
skip // invalid region, cannot structurize
// Tail must post-dominate head
if not sub_2EB3EB0(postdomtree, tail, head):
skip
Step 2: Edge classification
Collect successors of the tail into two sets:
- External edges: successors pointing outside the region (into v395/v396)
- Internal edges: successors pointing back inside the region (into v404/v405)
The strategy object (vtable+344) classifies each edge to determine if restructuring is needed.
Step 3: Flow block creation
// Create new "Flow" basic block
new_block = sub_2E7AAE0(function, 0, ...) // BasicBlock::Create
sub_2E33BD0(new_block, insert_point) // insert into BB list
// Copy phi-node entries from original target
for each phi in original_target:
sub_2E33140(phi, ...) // copy incoming value
sub_2E341F0(phi, ...) // update predecessor
Step 4: Edge rerouting
// Reroute edges from old target to new Flow block
sub_2E337A0(old_target, new_block) // replaceAllUsesWith
sub_2E33F80(new_block) // finalize successors
// For each stale edge, update divergence info
for each stale_edge:
sub_35C9CD0(stale_edge, ...)
strategy->updateDivergence(...) // vtable+368
Step 5: Recursive child splitting
If the strategy's shouldSplitChild (vtable+320) returns true, the newly created Flow block itself may need further splitting. This creates another block, reroutes edges again, and recurses. This handles deeply nested divergent regions where a single Flow block is insufficient.
Before/After CFG Example
Consider a function with a divergent if-then-else:
Before structurization:
Entry
/ \
Then Else
\ /
Merge
|
Exit
If the branch at Entry is divergent (some threads go to Then, others to Else), the hardware needs an explicit reconvergence point. After structurization:
After structurization:
Entry
/ T
| \
| Then
| /
Flow1 <- new block: reconvergence for Then
| F \
| Else
| /
Flow2 <- new block: reconvergence for Else
|
Merge
|
Exit
The Flow1 and Flow2 blocks are inserted with conditional branches controlled by PHI networks. Flow1 is the reconvergence point after the Then path: threads that executed Then and threads that skipped it meet there. Flow1 then branches on a PHI condition -- threads that must execute Else proceed to it, while the rest fall through directly to Flow2, the final reconvergence point before Merge.
For a divergent loop:
Before:
Entry
|
Header <--+
/ \ |
Body | |
\ / |
Latch -----+
|
Exit
After:
Entry
|
Header <------+
| |
Body |
| |
FlowLoop |
/ (back) \ |
| +----+
| (exit)
Exit
FlowLoop is a new block whose branch condition is a PHI: true incoming from Body means exit the loop, false means take the back-edge. This inverted convention (true = break, false = continue) matches upstream LLVM's structurization invariant.
Flow Block Insertion Algorithm
The previous sections describe the pass at the function-dispatch level. This section provides the complete algorithmic detail of how Flow blocks are actually created, wired, and how PHI networks are maintained -- the core transformation that converts a reducible-but-unstructured CFG into a fully structured CFG suitable for PTX emission.
Complexity
Let B = number of basic blocks, E = number of CFG edges, and D = depth of the dominator tree.
- Irreducibility detection (sub_35CA2C0): O(B * E) -- for each block in reverse RPO, it probes successors against the dominator tree hash table (O(1) per probe).
- Per-block classification loop: O(B * (P_avg + S_avg)), where P_avg and S_avg are average predecessor and successor counts -- effectively O(B + E).
- Uniform branch classifier (sub_35CB4A0): O(1) per block (a few flag checks and one DivergenceAnalysis query).
- NCA computation (sub_35C9ED0): walks the domtree upward from two nodes until convergence -- O(D) per call.
- Flow block insertion: O(D + PHI_count) each, where PHI_count is the number of PHI nodes at the original merge point (each needs entry copying). Recursive child splitting adds at most O(B) new blocks total across the entire function.
- Bitvector tracking: O(B / 64) per test/set operation.
Overall: O(B * D + E + F * PHI_total), where F = number of Flow blocks created. Since F <= B (one Flow per divergent region) and D = O(B) in the worst case, the theoretical worst case is O(B^2 + E). In practice, CUDA CFGs are shallow (D < 20) and sparsely divergent, making the pass effectively O(B + E).
Conceptual Model
A "Flow block" is a synthetic basic block that serves as an explicit thread reconvergence point. In an unstructured CFG, divergent branches may merge at a common successor without any indication of which predecessor each thread arrived from. The hardware's reconvergence mechanism needs a single merge point where it can resume lockstep execution. Flow blocks provide this by:
- Interposing between the divergent region and its exit.
- Carrying a PHI node whose value encodes the path taken by each thread.
- Branching conditionally on that PHI to either enter the next region body or skip to the next Flow block.
The algorithm processes the function bottom-to-top (reverse RPO), which ensures that inner regions are structurized before outer ones. Each region is defined by a head (dominator) and tail (post-dominator). The output is a function where every conditional branch leads to at most one "then" block followed by a Flow block, guaranteeing single-entry single-exit regions.
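The bottom-to-top ordering can be sketched with a plain post-order DFS. This is a hypothetical stand-in for the RPO computation (sub_2EA7130/sub_2EA7B20); the real pass additionally cross-references the ordering with SCC decomposition.

```python
def reverse_post_order(cfg, entry):
    """Return blocks of a CFG ({block: [successors]}) in reverse
    post-order (RPO). Iterating the RPO list backwards yields the
    bottom-to-top order the structurizer uses: a block is processed
    only after everything reachable from it."""
    visited, post = set(), []

    def dfs(bb):
        visited.add(bb)
        for succ in cfg.get(bb, []):
            if succ not in visited:
                dfs(succ)
        post.append(bb)  # appended after all successors: post-order

    dfs(entry)
    return post[::-1]

# Diamond CFG: Entry -> {A, B} -> Merge
cfg = {"Entry": ["A", "B"], "A": ["Merge"], "B": ["Merge"], "Merge": []}
rpo = reverse_post_order(cfg, "Entry")
# Entry comes first and Merge last; reversed(rpo) is the bottom-up order
```

Processing `reversed(rpo)` guarantees that Merge is classified before A/B, which is exactly why inner regions are structurized before the regions that contain them.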
Top-Level Algorithm: sub_35CC920
This is the complete algorithm for the main pass body, including the Flow block insertion logic interleaved with the classification phases already described above.
sub_35CC920(pass, function):
// ---- Phase 1: Analysis setup ----
div_info = getAnalysis<DivergenceAnalysis>(function) + 200
loop_info = getAnalysis<LoopInfo>(function) + 200
dom_tree = getAnalysis<DominatorTree>(function) + 200
post_dom = getAnalysis<PostDominatorTree>(function) + 200
pass[65] = div_info
pass[66] = loop_info
pass[67] = NULL // region_head
pass[68] = NULL // region_tail
pass[69] = dom_tree
pass[70] = post_dom
// Compute RPO via sub_2EA7130 -> sub_2EA7B20
rpo_list = computeRPO(function)
// Cross-reference RPO with SCC decomposition (sub_357E170)
scc_order = buildSCCOrdering(rpo_list)
// ---- Phase 1b: Reject irreducible ----
if sub_35CA2C0(scc_order, dom_tree) == 1: // irreducible detected
sub_35CA580(pass, "UnsupportedIrreducibleCFG",
"Irreducible CFGs are not supported yet.")
return 0
// ---- Phase 1c: Initialize bitvector ----
bb_count = countBasicBlocks(function)
word_count = (bb_count + 63) >> 6
bitvector = allocate(word_count * 8)
memset(bitvector, 0, word_count * 8)
pass[91] = bitvector // at offset +728
// ---- Phase 2: Bottom-up region identification and Flow insertion ----
modified = false
order = reverse(scc_order) // process bottom-to-top
for each BB in order:
// 2a. Reject EH funclets
if *(BB + 235) != 0: // isEHFunclet flag
sub_35CA580(pass, "UnsupportedEHFunclets",
"EH Funclets are not supported yet.")
resetBitvector(pass)
return 0
// 2b. Already marked for structurization (from prior inner-region pass)
if *(BB + 216) != 0 or *(BB + 262) != 0:
sub_35CBCD0(pass, BB, context)
continue
// 2c. Detect back-edges to already-visited blocks (loop detection)
has_loop_backedge = false
for each successor S of BB:
if bitvectorTest(pass[91], S->ordinal):
has_loop_backedge = true
// 2d. Classify predecessors for divergence
needs_structurize = false
for each predecessor P of BB:
if sub_35CB4A0(pass, P, ...) == 1: // divergent branch
needs_structurize = true
break
// 2e. Structurize the region rooted at BB
if needs_structurize:
sub_35CBCD0(pass, BB, context) // collect region bounds
// If region bounds are valid, insert Flow blocks
head = pass[67]
tail = pass[68]
if head != NULL and tail != NULL:
modified |= insertFlowBlocks(pass, head, tail, function)
// 2f. Update bitvector
if needs_structurize:
bitvectorSet(pass[91], BB->ordinal)
else:
bitvectorClear(pass[91], BB->ordinal)
// ---- Phase 3: Domtree-guided outer-region finalization ----
if pass[67] != NULL and pass[68] != NULL:
current = pass[67] // split_point
while current != NULL:
if strategy->shouldSplit(current): // vtable+312
sub_35CBCD0(pass, current, context)
modified |= insertFlowBlocks(pass, pass[67], pass[68], function)
if strategy->shouldSplitChild(current): // vtable+320
// recurse into child regions
modified |= insertFlowBlocksForChildren(pass, current, function)
current = domtreeParent(dom_tree, current)
// Store reconvergence metadata for PTX emission
*(function + 672) = pass[67] // reconvergence head
*(function + 680) = pass[68] // reconvergence tail
// ---- Phase 4: Cleanup ----
free(scc_order)
free(bitvector)
return modified ? 1 : 0
Flow Block Insertion Detail: insertFlowBlocks
This function (inlined within the Phase 2/Phase 3 loops of sub_35CC920, approximately decompiled lines 980--2027) performs the actual CFG transformation for a single region.
insertFlowBlocks(pass, head, tail, function):
// Step 1: Validate region boundaries via dominator/post-dominator trees
if not dominates(pass[69], head, tail):
return false // head does not dominate tail => not a valid region
if not postDominates(pass[70], tail, head):
return false // tail does not post-dominate head => not a valid region
// Step 2: Classify edges leaving the tail block
external_edges = [] // edges pointing outside the region
internal_edges = [] // edges pointing back inside the region
for each successor S of tail:
if not dominatedBy(S, head) or S == tail:
external_edges.append((tail, S))
else:
internal_edges.append((tail, S))
// Step 3: Query strategy object for each edge
for each edge E in (external_edges + internal_edges):
classification = strategy->classifyEdge(E) // vtable+344
if classification == SKIP:
continue
// else: edge needs restructuring
// Step 4: Create the Flow block
// sub_2E7AAE0 = BasicBlock::Create(context, name_hint, function)
flow_bb = sub_2E7AAE0(function->getContext(), "Flow", function)
// sub_2E33BD0 = insert into function's BB list after tail
sub_2E33BD0(flow_bb, tail->getNextNode())
// Step 5: Build PHI node in the Flow block
// The PHI encodes "which path did threads arrive from?"
// Convention: true (i1 1) = came from the "then" body
// false (i1 0) = skipped the body (fell through)
phi = createPHINode(Type::i1, flow_bb)
phi.addIncoming(ConstantInt::getTrue(), body_block) // threads that executed body
phi.addIncoming(ConstantInt::getFalse(), head_block) // threads that skipped body
// Step 6: Create conditional branch in the Flow block
// Branch on PHI: true -> next_region_or_exit, false -> next_flow_or_exit
createCondBranch(flow_bb, phi, next_target_true, next_target_false)
// Step 7: Reroute original edges through the Flow block
// For each predecessor that previously branched to the original merge:
for each edge (P, original_merge) that should go through flow_bb:
// sub_2E337A0 = replaceAllUsesWith for the branch target
P->getTerminator()->replaceSuccessor(original_merge, flow_bb)
// Step 8: Copy PHI entries from original merge to Flow block
// If the original merge had PHI nodes, their incoming values from
// rerouted predecessors must be transferred.
for each phi_node in original_merge->phis():
value = phi_node->getIncomingValueForBlock(rerouted_pred)
// sub_2E33140 = addIncoming to new PHI at flow_bb
// sub_2E341F0 = removeIncomingValue from original PHI
flow_bb_phi.addIncoming(value, rerouted_pred)
phi_node.removeIncomingBlock(rerouted_pred)
phi_node.addIncoming(flow_bb_phi, flow_bb)
// Step 9: Update dominator tree
// The new Flow block is immediately dominated by head.
// It immediately dominates the original merge (if flow_bb is its
// only predecessor now).
dom_tree->addNewBlock(flow_bb, head)
// Step 10: Update divergence analysis
// sub_35C9CD0 = edge reroute handler
for each rerouted_edge:
sub_35C9CD0(pass, rerouted_edge)
strategy->updateDivergence(rerouted_edge) // vtable+368
// Step 11: Recursive child-split (if needed)
// The strategy may determine that the Flow block itself needs
// further splitting (deeply nested divergent regions).
if strategy->shouldSplitChild(flow_bb): // vtable+320
child_flow = sub_2E7AAE0(function->getContext(), "Flow", function)
sub_2E33BD0(child_flow, flow_bb->getNextNode())
// ... repeat Steps 5-10 for the child Flow block ...
// This recursion terminates when shouldSplitChild returns false.
// Step 12: Expand bitvector if function grew
new_bb_count = countBasicBlocks(function)
if new_bb_count > pass[bb_count_field]:
sub_C8D5F0(pass[91], new_bb_count) // SmallVector::grow
// Initialize new words to 0xFF...FF (conservatively "visited")
// Then clear trailing bits beyond actual block count
return true
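The edge-rerouting and PHI-transfer bookkeeping of Steps 7-8 can be modeled on a toy CFG representation. All names here are hypothetical; the binary manipulates llvm::PHINode objects through the sub_2E33140/sub_2E341F0 helpers rather than dictionaries.

```python
def reroute_through_flow(cfg, phis, merge, rerouted_preds, flow="Flow"):
    """Reroute each predecessor in rerouted_preds from `merge` to a new
    Flow block, moving its PHI incoming entries onto the Flow block.
    cfg:  {block: [successors]}
    phis: {merge_block: {incoming_pred: value}} -- one PHI per merge."""
    cfg[flow] = [merge]  # the Flow block falls through to the merge
    flow_phi = {}
    for pred in rerouted_preds:
        # Step 7: retarget the predecessor's branch at the Flow block
        cfg[pred] = [flow if s == merge else s for s in cfg[pred]]
        # Step 8: transfer this predecessor's incoming value to the Flow PHI
        flow_phi[pred] = phis[merge].pop(pred)
    # The merge PHI now receives its value through the Flow block instead
    phis[merge][flow] = flow_phi
    phis[flow] = flow_phi
    return cfg, phis

cfg = {"Head": ["Body", "Merge"], "Body": ["Merge"], "Merge": []}
phis = {"Merge": {"Head": 0, "Body": 1}}
cfg, phis = reroute_through_flow(cfg, phis, "Merge", ["Head", "Body"])
# Both edges now pass through "Flow"; Merge's PHI has one incoming (Flow)
```

The key invariant the sketch preserves is the one Steps 7-8 must preserve in the real pass: every (predecessor, value) pair that the original merge PHI knew about survives, just routed through one extra hop.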
PHI Network Construction for Nested Regions
When multiple Flow blocks are created for a chain of if-then-else regions, the PHI networks form a cascade. Each Flow block's PHI determines whether threads should enter the next body or skip to the subsequent Flow block.
Consider a three-way branch (implemented as nested if-then-else):
Before:

      Entry
     /  |  \
    A   B   C
     \  |  /
      Merge

After:

      Entry
     T/    \F
     A      |
      \    /
      Flow1
     T/    \F
  cond_B?   |
     |      |
     B      |
      \    /
      Flow2
     T/    \F
     C      |
      \    /
      Flow3
        |
      Merge
The PHI cascade at each Flow block:
Flow1:
%path_A = phi i1 [ true, %A ], [ false, %Entry ]
br i1 %path_A, <continue to cond_B>, <skip to Merge via Flow3>
Flow2:
%path_B = phi i1 [ true, %B ], [ false, %Flow1 ]
br i1 %path_B, <continue to C>, <skip to Merge via Flow3>
Flow3:
%path_C = phi i1 [ true, %C ], [ false, %Flow2 ]
br i1 %path_C, <Merge>, <Merge>
// Flow3's branch is unconditional to Merge (both sides converge)
// but the PHI values propagated through the chain ensure each
// thread sees the correct value at Merge's PHI nodes.
Each Flow block carries exactly one i1 PHI and one conditional branch. The chain length equals the number of divergent exits from the region minus one. The final Flow block has an unconditional branch (or a branch where both targets are the same) because all paths must converge at the region exit.
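The per-thread semantics of the cascade can be checked with a small simulation. This is a hypothetical helper, not decompiled logic: it only evaluates which i1 PHI in the chain is true for a thread that executed a given body (true iff the thread arrived from that Flow's body, false if it arrived on a skip edge).

```python
def flow_chain_values(body_taken, bodies=("A", "B", "C")):
    """For a thread that executed `body_taken`, return the i1 value
    each Flow block's PHI would carry for that thread. Exactly one
    Flow in the chain sees true, since the bodies are exclusive."""
    return {f"Flow{i + 1}": body_taken == b for i, b in enumerate(bodies)}

vals = flow_chain_values("B")
# Only Flow2 (guarding body B) sees true for this thread
```

This mirrors the text above: the chain of one-PHI-one-branch Flow blocks encodes the taken path as a one-hot pattern, which is what lets Merge's PHIs recover the correct per-thread value.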
Loop Flow Block Insertion
For divergent loops, Flow blocks serve double duty: they both gate the loop body and control the back-edge. The algorithm handles loops specially:
insertLoopFlowBlock(pass, header, latch, exit, function):
// The loop has structure: header -> body -> latch -> {header, exit}
// After structurization:
// header -> body -> FlowLoop -> {header (back-edge), exit}
// Step 1: Create FlowLoop block between latch and exit
flow_loop = sub_2E7AAE0(context, "Flow", function)
sub_2E33BD0(flow_loop, latch->getNextNode())
// Step 2: PHI in FlowLoop encodes continue/break decision
// Convention: true = exit the loop, false = take back-edge
// This is INVERTED from what you might expect.
// Rationale: the "default" path (false) continues the loop,
// and the "exception" path (true) exits. This matches
// upstream LLVM's structurization invariant and simplifies
// the PHI lowering in CSSA.
phi_loop = createPHINode(Type::i1, flow_loop)
phi_loop.addIncoming(ConstantInt::getTrue(), exit_pred) // threads exiting
phi_loop.addIncoming(ConstantInt::getFalse(), body_block) // threads continuing
// Step 3: Conditional branch
createCondBranch(flow_loop, phi_loop, exit, header)
// true -> exit, false -> header (back-edge)
// Step 4: Reroute latch
latch->getTerminator()->replaceSuccessor(header, flow_loop)
latch->getTerminator()->replaceSuccessor(exit, flow_loop)
// Step 5: Update loop info
// FlowLoop is inside the loop (it has the back-edge to header).
// LoopInfo must be updated so that FlowLoop is recognized as
// a loop block, otherwise subsequent passes (LICM, LSR) may
// misclassify it.
loop_info->addBlockToLoop(flow_loop, loop)
// Step 6: Domtree update
// FlowLoop is dominated by latch (or by header if the latch
// was the only block between header and exit).
dom_tree->addNewBlock(flow_loop, latch)
The inverted convention (true = break) is critical. It ensures that the "natural" loop iteration (the common case) follows the fall-through path, which maps to the hardware's predicted branch direction. The PTX assembler uses this hint to generate the @p bra instruction with the back-edge as the taken path, minimizing branch misprediction overhead on the GPU.
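The inverted break/continue convention can be sketched in a few lines (hypothetical names, mirroring the phi_loop wiring above):

```python
def flow_loop_phi(came_from, exit_pred="ExitingBlock", body="Body"):
    """Model FlowLoop's i1 PHI: true = break (exit the loop),
    false = continue (take the back-edge). Inverted on purpose so the
    common case -- continuing the loop -- is the false/fall-through path."""
    if came_from == exit_pred:
        return True   # thread is leaving the loop
    if came_from == body:
        return False  # thread takes the back-edge to the header
    raise ValueError("unknown predecessor")

def flow_loop_branch(phi_value, header="Header", exit_bb="Exit"):
    """FlowLoop's conditional branch: true -> exit, false -> header."""
    return exit_bb if phi_value else header
```

A thread arriving from the loop body gets a false PHI value and is routed back to the header; only threads arriving from the exiting predecessor see true and leave the loop.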
Irreducible CFG Rejection: Why FixIrreducible is Not Scheduled
The pass rejects irreducible CFGs rather than attempting to restructure them. This section documents the design rationale and the consequences.
What Makes a CFG Irreducible
A CFG is irreducible if it contains a cycle with multiple entry points -- that is, there exist two blocks A and B in the cycle such that neither dominates the other, yet both can be reached from outside the cycle. The classic example is a goto into the middle of a loop:
Irreducible:

    Entry
    /   \
   v     v
   A --> B
   ^    /
    \  v
     C
Both A and B are reachable from Entry, and both are in the cycle A->B->C->A.
Neither A dominates B nor B dominates A.
In a reducible CFG, every back-edge target dominates its source. This is the invariant that sub_35CA2C0 checks: it iterates blocks in reverse RPO and, for each back-edge (successor that was already visited), verifies that the target dominates the source via the dominator tree hash table.
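The invariant sub_35CA2C0 checks can be sketched directly. This is an illustrative reimplementation, not the decompiled code: it computes dominator sets by naive iteration (the binary queries a hash table over the real dominator tree) and finds back-edges with a DFS.

```python
def dominators(cfg, entry):
    """Naive iterative dominator sets: dom(b) = {b} U intersection of
    dom(p) over all predecessors p."""
    blocks = list(cfg)
    preds = {b: [p for p in blocks if b in cfg[p]] for b in blocks}
    dom = {b: set(blocks) for b in blocks}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            if b == entry or not preds[b]:
                continue
            new = set.intersection(*(dom[p] for p in preds[b])) | {b}
            if new != dom[b]:
                dom[b], changed = new, True
    return dom

def is_reducible(cfg, entry):
    """Reducible iff every back-edge target dominates its source.
    A back-edge is an edge to a block still on the DFS stack."""
    dom = dominators(cfg, entry)
    ok, state = True, {}

    def dfs(b):
        nonlocal ok
        state[b] = "active"
        for s in cfg[b]:
            if state.get(s) == "active":  # back-edge b -> s
                if s not in dom[b]:       # target must dominate source
                    ok = False
            elif s not in state:
                dfs(s)
        state[b] = "done"

    dfs(entry)
    return ok

# Natural loop: Entry -> H -> B -> {H, Exit}  => reducible
loop = {"Entry": ["H"], "H": ["B"], "B": ["H", "Exit"], "Exit": []}
# The classic irreducible example above: cycle A->B->C->A, two entries
irr = {"Entry": ["A", "B"], "A": ["B"], "B": ["C"], "C": ["A"]}
```

On the irreducible example, the DFS finds the back-edge C -> A, but A does not dominate C (C is reachable via Entry -> B), so the check fails -- exactly the condition that triggers the UnsupportedIrreducibleCFG diagnostic.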
The FixIrreducible Pass Exists But Is Not Used
CICC v13.0 links FixIrreduciblePass at sub_29D33E0 (registered as "fix-irreducible" at pipeline-parser index 239). Its core implementation at sub_29D3E80 (60KB) performs controlled node splitting: it duplicates blocks to create a single-entry version of each irreducible cycle. This is the standard compiler technique (T1-T2 node splitting from Hecht and Ullman).
However, the NVPTX pipeline in CICC v13.0 does not schedule FixIrreduciblePass before StructurizeCFG. The pipeline ordering is:
... -> SimplifyCFG -> Sink -> StructurizeCFG -> CSSA -> ISel -> ...
                           ^
                           |
               fix-irreducible is NOT here
Design Rationale
Three factors explain this decision:
1. CUDA source language guarantee. Well-formed CUDA C++ does not produce irreducible control flow. The language has no goto across loop boundaries (the EDG frontend rejects it), and structured constructs (if/for/while/do/switch) always produce reducible CFGs. The only way to get irreducible flow is through extreme goto abuse in C mode or through a buggy optimization pass that introduces one.
2. Code size explosion. Node splitting can exponentially increase code size in pathological cases. For a cycle with N entry points, splitting may duplicate up to 2^N blocks. On a GPU where register pressure is the primary performance limiter, this expansion would be catastrophic -- more blocks means more live ranges, more register pressure, and lower occupancy.
3. Correctness risk. FixIrreduciblePass transforms the CFG before divergence analysis has finalized. If the splitting creates new blocks with divergent branches, those branches would need re-analysis. The interaction between FixIrreducible, DivergenceAnalysis, and StructurizeCFG is not validated in the NVPTX pipeline.
Consequence: Silent Miscompilation Risk
When sub_35CA2C0 detects irreducibility, it emits a diagnostic remark:
remark: UnsupportedIrreducibleCFG
"Irreducible CFGs are not supported yet."
The pass then returns 0 (no modification). The function proceeds through the rest of the pipeline with its irreducible CFG intact. Downstream, one of two things happens:
1. ptxas rejects the PTX. If the irreducible pattern produces a branch target that violates PTX's structured control flow rules, ptxas will emit an error. This is the safe outcome.
2. ptxas silently accepts malformed PTX. If the irreducible pattern happens to look like valid PTX (perhaps it only involves uniform branches), the resulting code may execute with undefined reconvergence behavior. Threads may reconverge at the wrong point, producing silent data corruption. This is the dangerous outcome.
The Stock LLVM Version Has the Same Limitation
The stock LLVM StructurizeCFG at sub_1F0EBC0 (linked from llvm/lib/Transforms/Scalar/StructurizeCFG.cpp) contains identical rejection logic. The AMDGPU backend, which also requires structured control flow, schedules FixIrreduciblePass explicitly before StructurizeCFG. NVIDIA chose not to do this.
| Instance | Address | Size | Irreducible handling |
|---|---|---|---|
| NVPTX custom | sub_35CC920 | 95 KB | Reject with diagnostic |
| Stock LLVM | sub_1F0EBC0 | ~58 KB | Reject with diagnostic |
| FixIrreducible | sub_29D33E0 / sub_29D3E80 | 60 KB | Node splitting (not scheduled) |
The Stock StructurizeCFG Entry Block Handling
The stock LLVM version also includes explicit entry block handling at sub_1A74020 (13KB). When the function's entry block has predecessors (which can happen if the function is a loop body extracted by a prior pass), this function creates a new entry block named "entry" and renames the original to "entry.orig". The NVPTX version at sub_35CC920 handles this inline in Phase 1.
PTX Structured Control Flow Contract
This section documents the precise contract that StructurizeCFG must satisfy for downstream passes to emit correct PTX.
What "Structured" Means for PTX
After StructurizeCFG completes, the function's CFG must satisfy these five invariants:
1. Single-entry regions. Every natural loop has exactly one entry (the loop header dominates all loop blocks). No irreducible cycles exist.
2. Post-dominator reconvergence. For every divergent conditional branch at block B, there exists a block P that post-dominates B and dominates all merge points of the two branch targets. A Flow block is inserted at P if one does not already exist.
3. Linear Flow chain. Between any divergent branch and its reconvergence point, the CFG forms a chain of Flow blocks with single-entry single-exit semantics. Each Flow block has exactly two predecessors (the "then" body exit and the "skip" path) and two successors (the next body entry or the final merge).
4. PHI-encodable path selection. Every Flow block contains an i1 PHI that encodes which path was taken. This PHI is the sole branch condition of the Flow block's terminator. No other computation occurs in Flow blocks.
5. Metadata tagging. Uniform branches are tagged with !structurizecfg.uniform metadata (metadata kind registered at sub_298D780). This prevents CSSA from inserting unnecessary copies at reconvergence points for branches where all threads agree.
Downstream Consumer: CSSA
The CSSA pass (sub_3720740) consumes the structured CFG and inserts explicit copy instructions at every reconvergence point. It relies on:
- The Flow block chain to identify where reconvergence happens.
- The i1 PHI in each Flow block to determine which threads took which path.
- The !structurizecfg.uniform metadata to skip copy insertion for uniform regions.
Without StructurizeCFG, CSSA would not know where to insert copies, and the resulting register allocation would be unsound under warp divergence.
Downstream Consumer: Convergence Control in AsmPrinter
The reconvergence head/tail stored at function offsets +672 and +680 are consumed by the AsmPrinter's convergence control framework (see AsmPrinter). The AsmPrinter emits CONVERGENCECTRL_ENTRY (opcode 24) and CONVERGENCECTRL_LOOP (opcode 33) pseudo-instructions at the boundaries defined by these metadata values. The hardware uses these to program the convergence barrier stack.
Interaction with SIAnnotateControlFlow (AMDGPU Comparison)
AMDGPU uses a different approach: SIAnnotateControlFlow inserts explicit if/else/end_cf intrinsics after StructurizeCFG. NVPTX does not use this -- instead, the convergence information flows through:
- StructurizeCFG (Flow blocks + function metadata)
- CSSA (copy insertion at reconvergence)
- SelectionDAG / ISel (structured branch patterns)
- AsmPrinter (convergence pseudo-instructions)
This four-stage pipeline is NVIDIA-specific. Upstream LLVM for AMDGPU collapses stages 1-2 into StructurizeCFG + SIAnnotateControlFlow and has no equivalent of stage 4.
The Two Binary Instances
CICC v13.0 contains two complete copies of the StructurizeCFG pass because the binary links both the NVPTX backend (custom) and the generic LLVM Scalar library (stock). Only the NVPTX version is scheduled in the pipeline.
| | NVPTX Custom | Stock LLVM |
|---|---|---|
| Main body | sub_35CC920 (95 KB) | sub_1F0EBC0 (~58 KB) |
| Entry gate | sub_35CF930 | (inlined) |
| Region processing | sub_35CBCD0 | sub_1A761E0 (28 KB) |
| Entry block handler | (inlined in Phase 1) | sub_1A74020 (13 KB, strings "entry.orig", "entry") |
| Region-based | Operates on entire function | Operates on individual Region objects |
| Uniform metadata | sub_298D780 ("structurizecfg.uniform") | Same string, different address |
| Registration | sub_29882C0 ("Structurize the CFG") | sub_2988270 ("Structurize control flow") |
| Pipeline parser | Index 413: "structurizecfg" with skip-uniform-regions param | Same index, same params |
The NVPTX version is 37 KB larger because it inlines the entry-block handler and region-processing logic (avoiding virtual dispatch overhead) and adds the CUDA-specific attribute checks (IDs 56, 63, 59, 64, 57) and the convergence metadata writes at offsets +672/+680.
Bitvector Tracking for Region Membership
The pass tracks which basic blocks have been visited using a dynamically sized bitvector stored in the pass object:
| Field | Offset | Meaning |
|---|---|---|
| uint64_t *array | pass + 728 | Pointer to the word array |
| uint64_t word_count | pass + 736 | Current number of 64-bit words |
| uint64_t capacity | pass + 740 | Allocated capacity in words |
| uint64_t bb_count | pass + 792 | Total number of basic blocks |
Index computation for a block with ordinal idx:
word_offset = idx >> 6; // idx / 64
bit_mask = 1ULL << (idx & 63); // idx % 64
// Test
is_visited = (array[word_offset] & bit_mask) != 0;
// Set
array[word_offset] |= bit_mask;
// Clear
array[word_offset] &= ~bit_mask;
When new basic blocks are created during structurization (the function grows), the bitvector is expanded via sub_C8D5F0 (the SmallVector::grow equivalent). New words are initialized to 0xFFFFFFFFFFFFFFFF (all bits set = "visited"), then trailing bits beyond the actual block count are cleared. This ensures newly created blocks are conservatively marked as visited until explicitly processed.
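The grow-and-clear behavior can be mirrored in a small model. This is a sketch of the semantics described above (word-granular all-ones initialization, then clearing trailing bits), not a transcription of sub_C8D5F0.

```python
class BlockBitvector:
    """64-bit-word bitvector with the pass's grow semantics: newly
    allocated words start all-ones ("conservatively visited"), then
    bits at or beyond the real block count are cleared."""

    def __init__(self, bb_count):
        self.bb_count = bb_count
        self.words = [0] * ((bb_count + 63) >> 6)

    def test(self, idx):
        return (self.words[idx >> 6] >> (idx & 63)) & 1 == 1

    def set(self, idx):
        self.words[idx >> 6] |= 1 << (idx & 63)

    def clear(self, idx):
        self.words[idx >> 6] &= ~(1 << (idx & 63))

    def grow(self, new_bb_count):
        new_words = (new_bb_count + 63) >> 6
        # New words start as 0xFFFFFFFFFFFFFFFF: visited until processed
        self.words += [(1 << 64) - 1] * (new_words - len(self.words))
        # Clear trailing bits beyond the actual block count
        for idx in range(new_bb_count, new_words * 64):
            self.clear(idx)
        self.bb_count = new_bb_count

bv = BlockBitvector(10)
bv.set(3)
bv.grow(70)  # second word appended all-ones, bits 70..127 cleared
```

Note that the all-ones initialization is word-granular: blocks added within an already-allocated word keep their previous (zero) bits, matching the description above.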
Hash Table Implementation
The pass uses two DenseSet-style hash tables with LLVM-layer sentinels (-4096 / -8192); see Hash Table and Collection Infrastructure for the hash function, probing, and growth policy. The resize function for this pass is sub_2E61F50. Table v394 tracks BBs already processed during the BFS expansion, and v417 serves as a scratch set for child-split deduplication.
Comparison with Upstream LLVM StructurizeCFG
The NVIDIA version and upstream LLVM share the same fundamental algorithm. Both are derived from the same codebase (confirmed by identical diagnostic strings and strategy-object vtable layouts). The differences are:
Architectural differences
| Aspect | NVIDIA (sub_35CC920) | Upstream LLVM |
|---|---|---|
| Granularity | Operates on entire function, iterating blocks in SCC/RPO order | Operates on individual Region objects, one region per invocation |
| Region discovery | Inline SCC decomposition + domtree walk | Relies on RegionInfo analysis pass |
| Object layout | Pass fields at a1[65..91]; BB flags at +216, +235, +262 | Different offsets reflecting different BasicBlock subclass |
| SCC ordering | sub_357E170 computes RPO/SCC cross-product | Uses scc_iterator from llvm/ADT/SCCIterator.h |
| Strategy object | Queried via vtable+312/320/344/368 | Uses TargetTransformInfo for cost decisions |
Functional differences
1. Irreducibility handling. Both reject irreducible CFGs with the same diagnostic. Neither performs restructuring. Upstream LLVM relies on FixIrreduciblePass being scheduled separately (AMDGPU does this). NVIDIA does not schedule it.
2. EH funclet handling. Both reject funclets. The NVIDIA version checks BB+235 (a wider BasicBlock struct with CUDA-specific fields). Upstream checks via isa<FuncletPadInst>.
3. Uniform region skipping. Both support structurizecfg-skip-uniform-regions. The NVIDIA version integrates DivergenceAnalysis queries inline (sub_2E88A90, sub_2E8B090). Upstream uses UniformityInfo::isUniform(BranchInst*).
4. Metadata tagging. Both use the "structurizecfg.uniform" metadata kind to mark branches that have been classified as uniform, preventing re-analysis in nested region processing.
5. Zero-cost hoisting. Upstream LLVM (recent versions) includes hoistZeroCostElseBlockPhiValues to reduce VGPR pressure from structurization-induced phi nodes. The NVIDIA version may or may not include this optimization; the decompiled code at the corresponding offset shows similar phi-manipulation logic but uses different register-pressure heuristics.
6. Reconvergence metadata. The NVIDIA version writes reconvergence head/tail to function metadata at offsets +672 and +680. This is consumed by downstream PTX emission passes (AsmPrinter, convergence barrier insertion). Upstream LLVM has no equivalent because AMDGPU uses SIAnnotateControlFlow instead.
What NVIDIA did NOT change
The core structurization algorithm is identical: topological ordering of region nodes, iterative flow-block insertion, PHI-node reconstruction via SSAUpdater, and domtree maintenance. The strategy-object interface (shouldSplit, shouldSplitChild, classifyEdge, updateDivergence) has the same vtable layout in both versions. The FlowBlock naming convention ("Flow") is preserved.
Pipeline Position
StructurizeCFG runs late in the NVPTX backend pipeline, after most IR-level optimizations and before machine code generation:
... -> SimplifyCFG -> Sink -> StructurizeCFG -> CSSA -> ISel -> ...
It must run after divergence analysis (so it can query which branches are uniform) and before instruction selection (which assumes structured control flow). The CSSA (Convergent SSA) pass that follows converts phi nodes to respect warp divergence semantics at the reconvergence points that StructurizeCFG inserted.
Summary of Pass Decisions
| Input condition | Action | Diagnostic |
|---|---|---|
| Single-block function | Skip | None |
| Function with convergent/optnone attributes | Skip | None |
| enable-shrink-wrap = 2 | Skip | None |
| Strategy object declines | Skip | None |
| All-uniform branches (with skip-uniform knob) | Skip | None |
| Irreducible CFG detected | Reject | "UnsupportedIrreducibleCFG" |
| EH funclet block detected | Reject | "UnsupportedEHFunclets" |
| Reducible, divergent regions | Restructure | None (new Flow blocks inserted, edges rerouted) |
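The decision table above can be folded into a single dispatch sketch (hypothetical names; it models only the skip/reject/restructure outcomes and the order in which the gate checks them):

```python
def structurize_decision(f):
    """Map input-function properties (a dict of booleans) to the
    pass's (action, diagnostic) pair, per the summary table."""
    # Entry-gate skips are checked before any CFG analysis runs
    skips = ["single_block", "convergent_or_optnone",
             "shrink_wrap_2", "strategy_declines", "all_uniform"]
    if any(f.get(k) for k in skips):
        return ("skip", None)
    # Rejections emit a diagnostic remark and leave the CFG untouched
    if f.get("irreducible"):
        return ("reject", "UnsupportedIrreducibleCFG")
    if f.get("eh_funclet"):
        return ("reject", "UnsupportedEHFunclets")
    # Otherwise: insert Flow blocks and reroute edges
    return ("restructure", None)
```

Placing the skip checks first matches the structure described earlier: the entry gate declines cheaply before irreducibility detection ever runs.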
Common Pitfalls
These are mistakes a reimplementor is likely to make when building an equivalent CFG structurization pass for a GPU target.
1. Attempting to restructure irreducible CFGs instead of rejecting them. The LLVM codebase includes FixIrreduciblePass (sub_29D33E0) which performs T1-T2 node splitting, but NVIDIA deliberately does not schedule it before StructurizeCFG. A reimplementation that adds node splitting to "handle" irreducible CFGs risks exponential code size blowup (2^N blocks for N entry points), catastrophic register pressure increases from the duplicated live ranges, and untested interaction with divergence analysis. The correct approach for an NVPTX target is to reject irreducible CFGs with a diagnostic and rely on the CUDA language guarantee that well-formed source never produces them.
2. Forgetting to update LoopInfo when inserting Flow blocks inside loops. When insertLoopFlowBlock creates a new block between the latch and the exit, that block carries the back-edge to the header and is therefore inside the loop. If LoopInfo is not updated (loop_info->addBlockToLoop), subsequent passes (LICM, LSR, LoopUnroll) will not recognize the Flow block as a loop member and may hoist or sink code across it incorrectly. This is a silent miscompilation: the kernel produces wrong results only for inputs that exercise the divergent loop path.
3. Inverting the Flow block PHI convention. The pass uses true = exit loop (break) and false = continue loop (back-edge) for loop Flow blocks. This is counterintuitive -- most programmers expect true to mean "condition is met, continue." Reversing this convention causes the back-edge to be the taken path for true, which not only produces wrong control flow but also defeats the branch prediction hint that maps the fall-through (false) path to the common-case loop continuation. A reimplementation must match the exact convention documented in the upstream LLVM structurization invariant.
4. Not writing reconvergence metadata to function offsets +672/+680. The AsmPrinter's convergence control framework reads the head and tail stored at these offsets to emit CONVERGENCECTRL_ENTRY and CONVERGENCECTRL_LOOP pseudo-instructions. A reimplementation that structures the CFG correctly but does not write these metadata values will cause the AsmPrinter to emit PTX without convergence barriers. On architectures with hardware convergence tracking (SM 7.0+), this can lead to threads reconverging at incorrect points, producing silent data corruption.
5. Skipping structurization for regions where all branches appear uniform but sub-regions contain divergent branches. The structurizecfg-relaxed-uniform-regions knob allows skipping outer regions when they have at most one conditional direct child. A reimplementation that skips any region marked "uniform" without checking sub-region divergence will fail to insert Flow blocks for inner divergent branches, leaving the PTX with unstructured control flow that ptxas may reject or (worse) silently miscompile.
Cross-References
- CSSA -- the Conventional SSA pass that consumes Flow blocks to insert warp-safe copies
- AsmPrinter -- convergence control pseudo-instruction emission consuming the +672/+680 metadata
- GPU Execution Model -- warp divergence and reconvergence fundamentals
- Branch Folding -- may eliminate redundant Flow blocks after code generation
- Hash Infrastructure -- details on the DenseSet implementation used by the BB tracking tables
- Pipeline -- exact position of structurizecfg in the pass ordering
- Knobs -- structurizecfg-skip-uniform-regions, structurizecfg-relaxed-uniform-regions, enable-shrink-wrap
- Upstream LLVM source: llvm/lib/Transforms/Scalar/StructurizeCFG.cpp
Differences from Upstream LLVM
| Aspect | Upstream LLVM (AMDGPU) | CICC v13.0 (NVPTX) |
|---|---|---|
| Binary copies | Single StructurizeCFG in LLVM Scalar library | Two copies: NVPTX-specific at sub_35CC920 (95 KB) and stock LLVM/AMDGPU at sub_1F0EBC0; only NVPTX instance scheduled |
| Divergence query | Queries AMDGPU divergence analysis | Queries NVPTX warp divergence analysis; uniform branch skip via structurizecfg-skip-uniform-regions knob |
| Flow block metadata | Flow blocks inserted without convergence metadata | Inserts convergence control metadata at offsets +672/+680 on Flow blocks, consumed by AsmPrinter for warp reconvergence pseudo-instructions |
| Relaxed uniform regions | Not present | structurizecfg-relaxed-uniform-regions knob allows less aggressive structurization when all branches in a region are provably uniform |
| Irreducible CFG handling | Rejects; AMDGPU schedules FixIrreduciblePass (T1/T2 node splitting) before StructurizeCFG | Same rejection logic, but the "UnsupportedIrreducibleCFG" diagnostic is NVPTX-specific and FixIrreducible is never scheduled |
| Skip conditions | Skip for single-block functions | Extended skip: single-block, convergent/optnone attributes, enable-shrink-wrap = 2, and strategy object decline |
| Mandatory status | Required for AMDGPU but can be skipped via flag | Mandatory for PTX emission: registered as required late pass by both sub_29882C0 and sub_1A6D600 |
Machine-Level Passes
Machine-level passes in CICC v13.0 operate on MachineFunction / MachineBasicBlock / MachineInstr representations after SelectionDAG instruction selection has converted LLVM IR into target-specific pseudo-instructions. On a conventional CPU target, these passes ultimately produce native machine code; on NVPTX, they produce PTX assembly -- a virtual ISA with unlimited virtual registers and a structured instruction set. This distinction is fundamental: NVPTX's "machine code" still uses virtual registers (%r0, %f1, %p3), and the final PTX text is consumed by ptxas which performs the actual register allocation against the hardware register file. The machine-level passes in CICC therefore serve a different purpose than on CPU: they optimize register pressure (to maximize occupancy), structure control flow (PTX requires structured CFG), compute .local memory frame layouts, and prepare clean PTX for ptxas to finish.
| Pass pipeline parser (MF) | sub_235E150 (53KB) |
| Master pass registry | sub_2342890 (102KB) |
| Codegen pass config | ctor_335_0 at 0x507310 (88 strings) |
| NVPTX target pass config | ctor_358_0 at 0x50E8D0 (43 strings) |
| Total registered MF passes | 51 (stock LLVM) + 13 (NVIDIA custom) |
| Total MF analyses | 14 registered |
| Pipeline configuration | sub_2166D20 (addISelPasses), sub_2166ED0 (addPreRegAlloc), sub_21668D0 (addPostRegAlloc) |
Why Machine Passes Matter on GPU
In upstream LLVM for x86 or AArch64, the machine pass pipeline assigns physical registers, inserts spill code, schedules instructions for pipeline hazards, and emits relocatable object code. On NVPTX, none of this maps directly:
- No physical register file. PTX registers are virtual. The greedy register allocator in CICC does not assign physical registers -- it tracks register pressure per class and enforces the -maxreglimit (default 70) that controls SM occupancy. When the allocator "spills," it moves values to .local memory rather than to stack slots addressed by %rsp.
- No prolog/epilog in the traditional sense. There is no call stack with push/pop sequences. PrologEpilogInserter in CICC computes .local frame offsets for spilled virtual registers and inserts ld.local / st.local pairs.
- Structured control flow is mandatory. PTX requires structured control flow (bra, @p bra, bra.uni). The StructurizeCFG pass runs before instruction selection, and BranchFolding must preserve the structured property.
- Instruction scheduling targets ptxas, not hardware. Machine scheduling optimizes the instruction stream that ptxas will consume. Since ptxas performs its own scheduling against the actual hardware pipeline, CICC's scheduling focuses on register pressure reduction (nvptx-sched4reg) and exposing parallelism that ptxas can exploit.
- Two peephole levels. CICC runs both the stock LLVM PeepholeOptimizer (which operates on generic MachineInstr patterns) and the NVIDIA-specific NVPTXPeephole (sub_21DB090), which handles PTX-specific patterns such as redundant cvta instructions, predicate folding, and address space conversions.
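The occupancy consequence of the -maxreglimit cap can be illustrated with a back-of-the-envelope model. The 64K-entry register file and 32-warp ceiling below are typical Turing-class figures used for illustration, not values recovered from the binary:

```python
def occupancy_from_reg_limit(regs_per_thread, regfile=65536, warp_size=32, max_warps=32):
    """Illustrative model: a per-thread register cap bounds resident warps on one SM."""
    if regs_per_thread == 0:
        return max_warps, 1.0
    # Each resident warp consumes warp_size * regs_per_thread registers.
    warps = min(max_warps, regfile // (regs_per_thread * warp_size))
    return warps, warps / max_warps

# At CICC's default cap of 70 registers/thread:
warps, occ = occupancy_from_reg_limit(70)   # -> 29 warps (~91% occupancy)
```

Lowering the cap to 32 registers/thread lets the warp ceiling, not the register file, become the limiter -- which is exactly the trade-off the remat and pressure-aware passes below are negotiating.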
Pipeline Flow
SelectionDAG ISel
│
▼
FinalizeISel ─── expand pseudo-instructions from ISel
│
▼
┌─────────────────────────────────────┐
│ Pre-RA Optimization │
│ ┌─ EarlyTailDuplicate │
│ ├─ EarlyMachineLICM │
│ ├─ MachineCSE (RP-aware) │
│ ├─ MachineSink (gated by knob) │
│ ├─ PeepholeOptimizer │
│ ├─ NVPTXPeephole ★ │
│ ├─ DeadMachineInstrElim │
│ └─ MachineCopyPropagation │
└─────────────────────────────────────┘
│
▼
TwoAddressInstruction ─── convert 3-addr to 2-addr form
│
▼
PHIElimination (CSSA/deSSA) ─── lower MachineInstr PHIs to copies
│
▼
┌─────────────────────────────────────┐
│ Register Allocation │
│ ┌─ LiveIntervals + SlotIndexes │
│ ├─ RegisterCoalescing │
│ ├─ RAGreedy (pressure-driven) │
│ ├─ NVPTXBlockRemat ★ │
│ └─ StackSlotColoring │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Post-RA Optimization │
│ ┌─ ExpandPostRAPseudos │
│ ├─ MachineLICM (post-RA) │
│ ├─ MachineSink (post-RA, gated) │
│ ├─ MachineCopyPropagation │
│ ├─ BranchFolding / TailMerge │
│ ├─ MachineBlockPlacement │
│ └─ MachinePipeliner (SMS) │
└─────────────────────────────────────┘
│
▼
PrologEpilogInserter ─── .local frame layout
│
▼
MachineOutliner ─── OUTLINED_FUNCTION_ stub creation
│
▼
NVPTXProxyRegErasure ★ ─── remove redundant cvta.to.local
│
▼
AsmPrinter ─── PTX text emission
Passes marked with ★ are NVIDIA-custom. The exact ordering varies by optimization level; at -O0, most pre-RA and post-RA optimization passes are skipped and RegAllocFast replaces RAGreedy.
Pipeline Configuration Functions
The NVPTX backend configures the machine pass pipeline through three key functions:
sub_2166D20 -- addISelPasses(): Configures passes before instruction selection. Diagnostic string: "\n\n*** Final LLVM Code input to ISel ***\n". Adds: alloca hoisting, ISel DAG printer (conditional), NVPTXProxyRegErasure, NVPTXLowerArgs, NVPTX-specific ISel.
sub_2166ED0 -- addPreRegAlloc(): Configures machine passes before register allocation. Diagnostic strings: "After Pre-RegAlloc TailDuplicate", "After codegen DCE pass", "After Machine LICM, CSE and Sinking passes", "After codegen peephole optimization pass". Adds: TailDuplicate, codegen DCE, Machine LICM + CSE + Sinking (conditional on byte_4FD1980, byte_4FD18A0, byte_4FD1A60), codegen peephole.
sub_21668D0 -- addPostRegAlloc(): Configures post-register-allocation passes. Diagnostic strings: "After Machine Scheduling", "After StackSlotColoring". Adds: Machine scheduling (2 modes controlled by dword_4FD26A0 -- value 1 selects simple scheduling, otherwise full pipeline), Stack slot coloring, nvptx-mem2reg (conditional on byte_4FD25C0).
Machine Pass Inventory
NVIDIA-Custom Machine Passes
| Pass ID | Class / Address | Pipeline Position | Description |
|---|---|---|---|
nvptx-peephole | sub_21DB090 | Pre-RA | PTX-specific peephole: folds redundant address space conversions (cvta), optimizes predicate patterns, simplifies PTX-specific instruction sequences. Controlled by enable-nvvm-peephole (default: on). |
nvptx-remat-block | sub_217DBF0 | During RA | Machine-level block rematerialization. Iterative "pull-in" algorithm that recomputes values near their use rather than loading from spill slots. Two-phase candidate selection with a "second-chance" heuristic. See Rematerialization. |
machine-rpa | sub_21EAA00 | Analysis (pre-RA) | Machine Register Pressure Analysis. Provides per-basic-block pressure data consumed by MachineCSE, scheduling, and rematerialization. |
extra-machineinstr-printer | sub_21E9E80 | Diagnostic | Prints per-function register pressure statistics. Debug-only pass for tuning pressure heuristics. |
nvptx-mem2reg | sub_21F9920 | Pre-RA | Machine-level mem2reg: promotes .local memory loads/stores back to virtual registers when profitable. Conditional on byte_4FD25C0 (nv-disable-mem2reg inverts). |
ldgxform | sub_21F2780 | Pre-RA | Transforms qualifying global memory loads into ld.global.nc (LDG -- load through read-only data cache). Splits wide vector loads for hardware constraints. |
nvptx-prolog-epilog | sub_21DB5F0 | Post-RA | NVPTX-specific PrologEpilog pass. Works alongside or replaces the stock PEI to handle PTX frame semantics where there is no traditional stack pointer. |
nvptx-proxy-reg-erasure | sub_21DA810 | Late post-RA | Removes redundant cvta.to.local instructions left by address space lowering. |
nvptx-assign-valid-global-names | sub_21BCD80 | Pre-emission | Sanitizes symbol names to comply with PTX naming rules (no @, $, or other characters illegal in PTX identifiers). |
nvptx-replace-image-handles | sub_21DBEA0 | Pre-emission | Replaces IR-level texture/surface handle references with PTX-level .tex / .surf declarations. |
nvptx-image-optimizer | sub_21BCF10 | Pre-emission | Texture/surface instruction optimization: coalesces related texture operations, validates image type consistency for tex, suld, sust, suq. |
alloca-hoisting | sub_21BC7D0 | Early post-ISel | Hoists alloca instructions to the entry basic block, enabling the frame layout pass to assign fixed offsets. |
generic-to-nvvm | sub_215DC20 | Early post-ISel | Converts generic address space (0) references to global address space (1). Runs before instruction selection on some pipelines, but also present as a machine-level fixup. |
param-opt | sub_2203290 | Post-ISel | Optimizes ld.param instructions. NVIDIA-custom pass for parameter load coalescing and redundant parameter load elimination. |
nvptx-trunc-opts | sub_22058E0 | Post-ISel | Optimizes redundant ANDb16ri instructions [sic: binary string reads "instrunctions"] generated during i16 truncation patterns. |
redundant-move-elim | sub_2204E60 | Post-ISel | Removes redundant register-to-register moves left by instruction selection. |
Stock LLVM Machine Passes (NVPTX Configuration)
| Pass ID | Class | NVIDIA Modification | Notes |
|---|---|---|---|
finalize-isel | FinalizeISelPass | None | Expands ISel pseudo-instructions; mandatory first MF pass. |
early-tailduplication | EarlyTailDuplicatePass | None | Pre-RA tail duplication. Can be disabled via disable-early-taildup. |
early-machinelicm | EarlyMachineLICMPass | Gated | Controlled by enable-mlicm. Hoists loop-invariant machine instructions before register allocation. |
machine-cse | MachineCSEPass | Modified | NVIDIA adds register-pressure-aware CSE (rp-aware-mcse, pred-aware-mcse, copy-prop-mcse). Uses MRPA (sub_2E5A4E0) for incremental pressure tracking. See Instruction Scheduling. |
machine-sink | MachineSinkingPass | Gated | Disabled by default on NVPTX; enabled via nvptx-enable-machine-sink. When active, sinks instructions closer to uses to reduce register pressure. |
peephole-opt | PeepholeOptimizerPass | None | Stock LLVM peephole: folds redundant copies, simplifies compare-and-branch patterns, optimizes sub-register operations. Can be disabled via disable-peephole. |
dead-mi-elimination | DeadMachineInstrElimPass | None | Eliminates dead machine instructions. Can be disabled via disable-machine-dce. |
machine-cp | MachineCopyPropagationPass | None | Propagates copies to reduce move instructions. Can be disabled via disable-copyprop. |
machinelicm | MachineLICMPass | Gated | Post-RA variant. Controlled by disable-postra-machine-licm. NVIDIA adds sink-insts-to-avoid-spills to trade hoisting for spill reduction. |
two-address-instruction | TwoAddressInstructionPass | None (stock) | Converts three-address instructions to two-address form by inserting copies. sub_1F53550 (79KB, 2470 lines). Shared between cicc and libNVVM (twin at sub_F4EA80). |
phi-node-elimination | PHIEliminationPass | Modified | NVIDIA's CSSA/deSSA method selection via usedessa (default 2). Controls how machine-level PHI nodes are lowered to copies; affects register allocation quality. See cssa-coalesce, cssa-verbosity. |
register-coalescer | RegisterCoalescerPass | Custom NVPTX variant | The NVPTX backend has its own register coalescing framework at 0x349--0x34B (separate from LLVM's stock coalescer at 0xB40000). Uses interference oracle sub_349D6E0, open-addressing hash with (reg >> 9) ^ (reg >> 4). See Register Coalescing. |
greedy | RAGreedyPass | Modified | Pressure-driven rather than assignment-driven. Dual instances (legacy + new PM). Core at sub_2F49070 (82KB). See Register Allocation. |
stack-coloring | StackColoringPass | None | Colors stack slots to reduce .local memory usage by sharing slots with non-overlapping lifetimes. |
stack-slot-coloring | StackSlotColoringPass | None | Secondary stack slot optimization. Can be disabled via disable-ssc. |
post-ra-pseudos | ExpandPostRAPseudosPass | None | Expands post-RA pseudo-instructions (e.g., COPY to actual move). |
post-RA-sched | PostRASchedulerPass | Gated | Post-RA instruction scheduling. Controlled by disable-post-ra. |
machine-scheduler | MachineSchedulerPass | Modified | NVIDIA adds nvptx-sched4reg mode for register-pressure-driven scheduling. Pre-RA scheduling variant. |
postmisched | PostMachineSchedulerPass | None | Post-RA machine scheduling with ScheduleDAGMILive (sub_355F610, 64KB). Controlled by misched-postra. |
early-ifcvt | EarlyIfConverterPass | None | If-conversion before register allocation. Can be disabled via disable-early-ifcvt. |
machine-combiner | MachineCombinerPass | None | Combines machine instructions using target-defined patterns. Knob: machine-combiner-inc-threshold. |
block-placement | MachineBlockPlacement | None (stock) | Profile-guided basic block ordering. sub_3521FF0 (82KB). Uses ext-TSP and chain-based algorithms. See Block Placement. |
machine-outliner | MachineOutliner | None | Creates OUTLINED_FUNCTION_ stubs for repeated instruction sequences. sub_3537010 (77KB). See MachineOutliner. |
prologepilog | PrologEpilogInserter | Modified | NVIDIA's PEI (sub_35B1110, 68KB) computes .local memory frame offsets. Frame objects are 40-byte records with offset, size, alignment, and spill-slot flags. See PrologEpilogInserter. |
opt-phis | OptimizePHIsPass | None | Optimizes machine-level PHI nodes (removes trivially dead or redundant PHIs). |
tailduplication | TailDuplicatePass | None | Post-RA tail duplication. Controlled by disable-tail-duplicate. |
detect-dead-lanes | DetectDeadLanesPass | None | Detects unused sub-register lanes; minimal impact on NVPTX since register classes are fully disjoint. |
rename-independent-subregs | RenameIndependentSubregsPass | None | Splits sub-register live ranges into independent virtual registers. |
localstackalloc | LocalStackSlotAllocationPass | None | Allocates local frame indices for large stack objects. |
machine-latecleanup | MachineLateInstrsCleanupPass | None | Late-stage dead instruction cleanup. |
machine-pipeliner | MachinePipeliner | None (stock) | Swing Modulo Scheduling for loop bodies. sub_3563190 (58KB). See below. |
Per-Pass Algorithm Descriptions
NVPTXPeephole (sub_21DB090) -- PTX-Specific Peephole Optimizer
Registration: sub_21DB090 at 0x21DB090, pass ID "nvptx-peephole". Enabled by default; controlled by enable-nvvm-peephole.
This pass runs pre-RA and performs pattern-matching rewrites on MachineInstr sequences that are specific to the NVPTX target. Unlike the stock LLVM PeepholeOptimizer (which operates on generic copy/compare patterns), NVPTXPeephole handles PTX address space semantics and predicate register idioms.
Patterns handled:
- Redundant cvta elimination. When address space lowering inserts cvta.to.global or cvta.to.shared followed by an operation that already operates in the correct address space, the cvta is dead. The pass scans for cvta instructions whose result is used only by instructions with matching address space qualifiers, and deletes the cvta.
- Predicate folding. PTX predicates (%p0, %p1, ...) are first-class. The pass identifies patterns where a setp instruction produces a predicate that is consumed by exactly one @p bra, and folds them into a conditional branch with an embedded comparison.
- Address space conversion simplification. When generic-to-nvvm inserts an addrspacecast and the consuming instruction directly emits the correct address qualifier (.global, .shared, .local, .const), the intermediate cast is redundant.
// Pseudocode: NVPTXPeephole main loop
fn nvptx_peephole(MF: &mut MachineFunction) -> bool {
let mut changed = false;
for mbb in MF.basic_blocks() {
let mut dead_list = vec![];
for mi in mbb.instrs() {
match mi.opcode() {
NVPTX::CVTAToGeneric | NVPTX::CVTAToGlobal
| NVPTX::CVTAToShared | NVPTX::CVTAToLocal => {
if single_user_in_matching_addrspace(mi) {
propagate_operand_and_kill(mi);
dead_list.push(mi);
changed = true;
}
}
NVPTX::SETP_* => {
if let Some(bra) = single_predicate_consumer(mi) {
fold_setp_into_branch(mi, bra);
dead_list.push(mi);
changed = true;
}
}
_ => {}
}
}
for mi in dead_list { mi.erase_from_parent(); }
}
changed
}
NVPTXBlockRemat (sub_217DBF0) -- Machine-Level Block Rematerialization
Registration: sub_217DBF0 at 0x217DBF0, pass name "NVPTX Specific Block Remat", pass ID "nvptx-remat-block". Knob constructor at ctor_361_0 (0x5108E0). Main engine: sub_2186D90 (47KB, ~1742 decompiled lines).
This is NVIDIA's custom register-pressure-reduction pass. It re-computes values at their use sites instead of keeping them live across long spans. The algorithm is iterative with a two-phase candidate selection including a "second-chance" heuristic for marginal candidates.
Knobs (16 total):
| Global Variable | CLI Flag | Default | Description |
|---|---|---|---|
dword_4FD3820 | nv-remat-block | 14 | Bitmask controlling remat modes (bits 0-3) |
dword_4FD3740 | nv-remat-max-times | 10 | Max iterations of the outer remat loop |
dword_4FD3660 | nv-remat-block-single-cost | 10 | Max cost per single live value pull-in |
dword_4FD3580 | nv-remat-block-map-size-limit | 6 | Map size limit for single pull-in |
dword_4FD3040 | nv-remat-block-max-cost | 100 | Max total clone cost per live value reduction |
dword_4FD3120 | nv-remat-block-liveout-min-percentage | 70 | Min liveout % for special consideration |
unk_4FD3400 | nv-remat-block-loop-cost-factor | 20 | Loop cost multiplier |
unk_4FD3320 | nv-remat-default-max-reg | 70 | Default max register pressure target |
unk_4FD2EC0 | nv-remat-block-load-cost | 10 | Cost assigned to load instructions |
unk_4FD3860 | nv-remat-threshold-for-spec-reg | 20 | Threshold for special register remat |
byte_4FD2E80 | nv-dump-remat-block | off | Debug dump toggle |
byte_4FD2DA0 | nv-remat-check-internal-live | off | Check internal liveness during MaxLive |
qword_4FD2C20 | max-reg-kind | 0 | Kind of max register pressure info |
qword_4FD2BE0 | no-mi-remat | (list) | Skip remat for named functions |
word_4FD32F0 | load-remat | on | Enable load rematerialization |
word_4FD3210 | vasp-fix1 | off | VASP fix (volatile/addsp) |
Algorithm pseudocode (sub_2186D90):
fn nvptx_block_remat(MF: &mut MachineFunction) -> bool {
// (A) INITIALIZATION
let target = max_reg_override.unwrap_or(nv_remat_default_max_reg); // default 70
if MF.block_count() == 1 { return false; }
if function_name in no_mi_remat_list {
log("Skip machine-instruction rematerialization on {name}");
return false;
}
// (B) LIVEOUT FREQUENCY COUNTING
for bb in MF.blocks() {
for reg in bb.live_out() {
freq_map[reg] += 1;
}
}
// Normalize: freq_pct = (100 * count) / num_blocks
// (C) OUTER ITERATIVE LOOP
let mut iteration = 0;
let mut overall_changed = false;
loop {
iteration += 1;
if iteration > nv_remat_max_times { break; } // default 10
// Phase 1: COMPUTE MAX-LIVE
let max_live = sub_2186590(MF); // scan all blocks
log("Max-Live-Function({num_blocks}) = {max_live}");
if target >= max_live { break; } // no pressure problem
let mut changed = false;
// Phase 2: FOR EACH OVER-PRESSURE BLOCK
for bb in blocks_where(pressure > target) {
let excess = bb.pressure - target;
// Phase 3: CLASSIFY LIVE-OUT REGISTERS
let (pullable, non_pullable) = classify_liveout(bb);
// sub_217E810 (MULTIDEF check) -- must have single unique def
// sub_2181550 (recursive pullability, depth <= 50)
log("Pullable: {pullable.len()}");
// Phase 4: SECOND-CHANCE HEURISTIC (sub_2181870)
if excess > pullable.len() && second_chance_list.not_empty() {
second_chance_promote(&mut pullable, &mut non_pullable);
// Re-evaluates rejected candidates with relaxed criteria
// Uses visit-count mechanism to prevent infinite loops
// Hash: h(regID) = 37 * regID, open-addressing
log("ADD {n} candidates from second-chance");
}
log("Total Pullable before considering cost: {pullable.len()}");
// Phase 5: COST ANALYSIS (sub_2183E30)
let candidates = pullable.filter_map(|reg| {
let cost = compute_remat_cost(reg); // 0 = cannot remat
(cost > 0).then(|| (reg, cost))
});
// Phase 6: SELECT BY COST-BENEFIT (cheapest first)
candidates.sort_by_key(|(_, cost)| *cost); // selection sort
let mut final_list = vec![];
for (reg, cost) in candidates {
if cost > nv_remat_block_single_cost { break; } // default 10
let width = if reg_class_size(reg) > 32 { 2 } else { 1 };
final_list.push(reg);
if final_list.len() >= excess { break; }
}
log("Really Final Pull-in: {final_list.len()} ({total_cost})");
// Phase 7: EXECUTE REMATERIALIZATION
for reg in &final_list {
clear_from_liveout(bb, reg); // sub_217F620
}
bb.pressure -= final_list.len();
propagate_backward(bb, &final_list); // sub_2185250
// Clone defining instructions at use sites
// sub_21810D0 replaces register references
changed = true;
}
overall_changed |= changed;
if !changed { break; }
}
// (D) DEAD INSTRUCTION REMOVAL -- cascading deletion
remove_dead_instructions(); // sub_217DA10
overall_changed
}
MULTIDEF detection (sub_217E810): Returns the defining instruction if the register has exactly one non-dead, non-debug definition. Rejects instructions with hazardous descriptor flags (desc->flags & 0x3F80), opcodes in the non-rematerializable set (memory ops 534-609, texture ops 680-681, atomics 817-832, barriers 2913-2918, surface ops 3281-3287, 3449-3454, large MMA blocks 4423-4447), and instructions with tied extra defs.
Recursive pullability (sub_2181550): Walks the operand chain up to depth 50, checking each operand register against the non-pullable set and the MULTIDEF oracle. All operands in the chain must be single-def, safe-opcode, and themselves pullable.
Cost model: sub_2183E30 computes the clone cost of rematerializing a register. Load instructions cost nv-remat-block-load-cost (default 10). Instructions in loops are penalized by nv-remat-block-loop-cost-factor (default 20x). Double-wide registers (class size > 32) count as 2 for pressure and have 2x cost.
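The cost model can be sketched with the documented knob defaults. This is a simplified reconstruction: the is_load / in_loop classification stands in for the real MachineInstr and loop-info queries inside sub_2183E30:

```python
LOAD_COST = 10        # nv-remat-block-load-cost default
LOOP_FACTOR = 20      # nv-remat-block-loop-cost-factor default
SINGLE_COST_CAP = 10  # nv-remat-block-single-cost default

def remat_cost(is_load, in_loop, reg_class_bits, base_cost=1):
    """Clone cost of rematerializing one value (0 would mean 'cannot remat')."""
    cost = LOAD_COST if is_load else base_cost
    if in_loop:
        cost *= LOOP_FACTOR       # cloning into a loop repeats the work
    if reg_class_bits > 32:
        cost *= 2                 # double-wide registers count and cost double
    return cost

def accepted(cost):
    # Phase 6 gate: only candidates at or below the per-value cap are pulled in.
    return 0 < cost <= SINGLE_COST_CAP
```

Under these defaults a straight-line load sits exactly at the cap (cost 10, accepted), while the same load inside a loop costs 200 and is rejected -- which matches the pass's bias toward cheap, loop-free recomputation.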
Machine Register Pressure Analysis (sub_21EAA00) -- MRPA
Registration: sub_21EAA00 at 0x21EAA00, pass name "Register pressure analysis on Machine IRs", pass ID "machine-rpa". Main analysis body: sub_21EEB40 (68KB). Incremental updater: sub_2E5A4E0 (48KB). Backend variant: sub_1E00370 (78KB).
MRPA is NVIDIA's custom analysis pass that provides per-basic-block register pressure data. Unlike LLVM's stock RegisterPressure tracking (which is tightly coupled to the scheduler), MRPA is consumed by multiple clients: RP-aware MachineCSE, instruction scheduling, and the block rematerialization pass.
Architecture:
The MRPA system has two modes:
- Full recomputation (sub_21EEB40): walks every instruction in every basic block, tracking register births (defs) and deaths (last uses), and recording the peak pressure per register class per block.
- Incremental update (sub_2E5A4E0): when a single instruction is moved or deleted (e.g., by MachineCSE), MRPA updates the affected blocks' pressure without rescanning the entire function.
Incremental update algorithm (sub_2E5A4E0):
fn mrpa_incremental_update(context, bb, instruction_delta) {
// DenseMap hash: (ptr >> 9) ^ (ptr >> 4)
// Empty sentinel: -8, Tombstone: -16
// Minimum 64 buckets, always power-of-2
// 1. Build worklist of affected BBs via DFS
let worklist = dfs_from(bb, context.visited_set);
// 2. For each BB: create/update tracking entry
for bb in worklist {
let entry = context.pressure_map.get_or_insert(bb);
// 3. Filter schedulable instructions via sub_2E501D0
for mi in bb.instrs().filter(schedulable) {
// 4. For each virtual register operand (40-byte entries):
for operand in mi.operands() {
sub_2EBEF70(operand); // find existing rename mapping
sub_2EBEE10(operand); // query register info
sub_2EBE820(operand); // attempt rename if profitable
sub_2EBF120(operand); // free old register after rename
}
// 5. Check register class constraints via sub_E922F0
// 6. Validate pressure feasibility via sub_2E4F9C0
}
// 7. Erase unprofitable instructions via sub_2E88E20
}
}
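The DenseMap layout described in the comments -- hash (ptr >> 9) ^ (ptr >> 4), empty/tombstone sentinels -8/-16, power-of-two bucket counts with a 64-bucket floor -- behaves like this minimal sketch. The quadratic probing sequence is an assumption based on stock LLVM DenseMap behavior, not recovered from the binary:

```python
EMPTY, TOMBSTONE = -8, -16

def dm_hash(ptr):
    # Recovered hash: mixes the pointer at two shift widths.
    return ((ptr >> 9) ^ (ptr >> 4)) & 0xFFFFFFFF

class DenseMapSketch:
    def __init__(self, nbuckets=64):
        # Minimum 64 buckets, always a power of two (per the recovered comments).
        assert nbuckets >= 64 and nbuckets & (nbuckets - 1) == 0
        self.keys = [EMPTY] * nbuckets
        self.vals = [None] * nbuckets
        self.mask = nbuckets - 1

    def _probe(self, key):
        # Quadratic probing: with power-of-two sizes this visits every bucket.
        i, step = dm_hash(key) & self.mask, 1
        while self.keys[i] not in (EMPTY, TOMBSTONE, key):
            i = (i + step) & self.mask
            step += 1
        return i

    def insert(self, key, val):
        i = self._probe(key)
        self.keys[i], self.vals[i] = key, val

    def get(self, key):
        i = self._probe(key)
        return self.vals[i] if self.keys[i] == key else None
```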
Verification: When verify-update-mcse is enabled (qword_501F8A8, default OFF), MRPA runs a full recomputation after every incremental update and compares results. Mismatch triggers: "Incorrect RP info from incremental MRPA update" via sub_C64ED0. The print-verify knob (qword_501F7C8) controls whether detailed per-register-class diagnostic output is printed on mismatch.
Diagnostic output (sub_21E9A60): The companion pass extra-machineinstr-printer at sub_21E9E80 prints: "Max Live RRegs: {n}\tPRegs: {m}\nFunction Size: {s}" for each function, providing per-function register pressure statistics for tuning.
LDG Transform (sub_21F2780) -- Read-Only Data Cache Load Transformation
Registration: sub_21F2780 at 0x21F2780, pass name "Ldg Transformation", pass ID "ldgxform". Transformation body: sub_21F2C80 (19KB). Vector splitting engine: sub_21F3A20 (44KB).
This pass transforms qualifying global memory loads into ld.global.nc (LDG) instructions, routing them through the read-only texture cache (L1 on Kepler+, unified L1/tex on Maxwell+). The transformation is profitable for read-only data because the texture cache has separate bandwidth from the L1 data cache, effectively doubling memory throughput for qualifying loads.
Algorithm:
fn ldgxform(MF: &mut MachineFunction) -> bool {
let mut changed = false;
for mi in MF.all_instrs() {
if !is_global_load(mi) { continue; }
if is_volatile(mi) { continue; }
if !pointer_is_readonly(mi.address_operand()) { continue; }
// Replace ld.global with ld.global.nc (LDG)
mi.set_opcode(ldg_variant(mi.opcode()));
// Split wide loads if necessary
if load_width(mi) > hardware_max_ldg_width() {
// sub_21F2C80: LDG split transformation
// Tags: ".ldgsplit", ".load", ".ldgsplitinsert"
let (lo, hi) = split_wide_load(mi);
// Insert: lo = ldg.64 [addr]
// hi = ldg.64 [addr + 8]
// result = INSERT_SUBREG lo, hi
changed = true;
}
changed = true;
}
changed
}
Vector splitting (sub_21F3A20, 44KB): This is the third-largest function in the 0x21F range. NVPTX supports limited native vector widths (typically .v2 and .v4 of 32-bit elements). When wider vectors (e.g., v8f32, v16f16) appear, this engine splits them into legal widths. Operations handled:
- vecBitCast: bitcast between vector types
- splitVec: split a vector into sub-vectors
- extractSplitVec / insertSplitVec: element access on split vectors
- splitVecGEP: GEP computation on split vector elements
The split width depends on TargetOpt.HasLDG (stored at target options offset 5, extracted from p2h-01 analysis). When LDG is available, 128-bit loads (LDG.128) are preferred, resulting in .v4.b32 patterns.
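The chunking can be sketched as follows. The 16-byte chunk (LDG.128, emitted as .v4.b32) matches the description above; the 8-byte fallback when LDG is unavailable is an assumption for illustration:

```python
def split_wide_load(total_bytes, has_ldg=True, elem_bytes=4):
    """Split one wide vector load into legal-width pieces: (offset, size, ptx_suffix)."""
    chunk = 16 if has_ldg else 8   # LDG.128 preferred when TargetOpt.HasLDG is set
    pieces, off = [], 0
    while off < total_bytes:
        size = min(chunk, total_bytes - off)
        suffix = f".v{size // elem_bytes}.b32" if size % elem_bytes == 0 else ".b8"
        pieces.append((off, size, suffix))
        off += size
    return pieces

# A v8f32 load (32 bytes) splits into two .v4.b32 (LDG.128) pieces:
pieces = split_wide_load(32)
```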
NVPTXMem2Reg (sub_21F9920) -- Machine-Level Mem2Reg
Registration: sub_21F9920 at 0x21F9920, pass name "Mem2Reg on Machine Instructions to remove local stack objects", pass ID "nvptx-mem2reg". Main body: sub_21FA880 (22KB), engine: sub_21FC920 (33KB). Controlled by byte_4FD25C0 (inverted by nv-disable-mem2reg, default: enabled).
Standard LLVM mem2reg operates on LLVM IR alloca instructions. This NVIDIA-custom pass operates on MachineInstr -- specifically on ld.local / st.local pairs that access __local_depot frame slots. After register allocation, some values that were spilled to .local memory can be promoted back to virtual registers if their access pattern is simple enough (single def, multiple uses, no aliasing stores).
Algorithm:
fn nvptx_machine_mem2reg(MF: &mut MachineFunction) -> bool {
if nv_disable_mem2reg { return false; } // byte_4FD25C0
let mut changed = false;
for frame_idx in MF.frame_info().stack_objects() {
if !is_local_depot_slot(frame_idx) { continue; }
// Collect all loads and stores to this frame slot
let stores = find_stores_to(MF, frame_idx);
let loads = find_loads_from(MF, frame_idx);
if stores.len() != 1 { continue; } // must be single-def
let store = stores[0];
let src_reg = store.source_register();
// Check: no aliasing stores between def and uses
// Check: store dominates all loads
if !dominates_all(store, &loads) { continue; }
// Promote: replace all ld.local with the source register
for load in &loads {
replace_load_with_reg(load, src_reg);
load.erase_from_parent();
}
store.erase_from_parent();
MF.frame_info().remove_object(frame_idx);
changed = true;
}
changed
}
This pass is positioned in addPostRegAlloc(), meaning it runs after the greedy register allocator has already assigned slots. It acts as a cleanup: register allocation may have conservatively spilled values that turn out to be unnecessary after coalescing and copy propagation eliminate intermediate uses.
GenericToNVVM (sub_215DC20) -- Address Space Normalization
Registration: sub_215DC20 at 0x215DC20, pass name "Ensure that the global variables are in the global address space", pass ID "generic-to-nvvm". Pass descriptor: 80-byte allocation. Factory: sub_215D530 (allocates 320-byte state with two 128-bucket DenseMaps). New PM variant: sub_305ED20.
CUDA and LLVM IR use address space 0 (generic) as the default for globals, but NVPTX requires globals in address space 1. This pass rewrites every GlobalVariable in address space 0 to address space 1, inserting addrspacecast instructions at all use sites.
Algorithm:
fn generic_to_nvvm(M: &mut Module) -> bool {
let mut gv_map = DenseMap::new(128); // old -> new Value mapping
let mut const_map = DenseMap::new(128); // old -> new Constant mapping
for gv in M.globals().filter(|g| g.address_space() == 0) {
// 1. Clone to address space 1
let new_gv = GlobalVariable::new(
gv.value_type(), gv.is_constant(), gv.linkage(),
gv.initializer(), gv.name(), /*addrspace=*/ 1
);
new_gv.set_alignment(gv.alignment());
// 2. Insert addrspacecast(1 -> 0) at each use
let cast = ConstantExpr::addrspace_cast(new_gv, gv.type());
// 3. Replace all uses
gv.replace_all_uses_with(cast);
// 4. Track in map and erase original
gv_map.insert(gv, new_gv);
gv.erase_from_parent();
}
// Cleanup: sub_215D780 iterates gv_map, properly ref-counting Values
cleanup_gv_map(&gv_map);
!gv_map.is_empty()
}
NVPTXProxyRegErasure (sub_21DA810) -- Redundant cvta.to.local Removal
Registration: sub_21DA810 at 0x21DA810, pass name "NVPTX optimize redundant cvta.to.local instruction".
This late post-RA pass removes cvta.to.local instructions that are left over from address space lowering. After frame layout is complete, local memory addresses are known, and cvta.to.local (which converts a generic pointer to a .local pointer) is redundant when the address is already known to be in .local space. The pass is simple: scan for cvta.to.local MachineInstrs, verify the source is already a .local address, replace uses with the source operand, delete the cvta.
NVPTXAssignValidGlobalNames (sub_21BCD80) -- PTX Name Sanitization
Registration: sub_21BCD80 at 0x21BCD80, pass name "Assign valid PTX names to globals", pass ID "nvptx-assign-valid-global-names".
PTX has stricter naming rules than LLVM IR. Characters like @, $, . (in certain positions), and Unicode are illegal in PTX identifiers. This pass walks all GlobalValues in the module and replaces illegal characters with safe alternatives (typically _). It also handles name demangling artifacts and ensures the final names are unique after sanitization.
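The described behavior can be sketched as a replace-then-uniquify pass. The exact legal-character set and the uniquing scheme used by cicc are assumptions; this follows the rules stated above ( @, $, etc. become _ , and sanitized names must stay unique):

```python
import re

def sanitize_ptx_name(name, taken):
    """Map a symbol name to a PTX-legal identifier, uniquifying on collision."""
    safe = re.sub(r"[^A-Za-z0-9_]", "_", name)   # replace illegal characters
    if safe and safe[0].isdigit():
        safe = "_" + safe                        # identifiers must not start with a digit
    candidate, n = safe, 0
    while candidate in taken:                    # collision after sanitization
        n += 1
        candidate = f"{safe}_{n}"
    taken.add(candidate)
    return candidate
```

Two distinct IR names that collapse to the same sanitized string (e.g. foo@bar and foo$bar) must not collide in the PTX module, hence the suffix loop.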
NVPTXImageOptimizer (sub_21BCF10) -- Texture/Surface Optimization
Registration: sub_21BCF10 at 0x21BCF10, pass name "NVPTX Image Optimizer". Type validation helper: sub_21DD1A0 (16KB).
This pre-emission pass optimizes texture and surface access patterns. It validates image type consistency for tex, suld, sust, and suq operations, emitting errors for mismatches: "Invalid image type in .tex", "Invalid image type in .suld", "Invalid image type in suq.", "Invalid image type in .sust". The pass coalesces related texture operations when they access the same texture handle with compatible coordinates and can be merged into wider vector fetches.
NVPTXReplaceImageHandles (sub_21DBEA0) -- Image Handle Lowering
Registration: sub_21DBEA0 at 0x21DBEA0, pass name "NVPTX Replace Image Handles".
Replaces IR-level texture/surface handle references (which are LLVM Value pointers to @texture_handle globals) with PTX-level .tex / .surf declarations and integer handle indices. This is a pre-emission pass that bridges the gap between LLVM IR's opaque handle model and PTX's explicit texture declaration model.
AllocaHoisting (sub_21BC7D0) -- Entry Block Alloca Hoisting
Registration: sub_21BC7D0 at 0x21BC7D0, pass name "Hoisting alloca instructions in non-entry blocks to the entry block", pass ID "alloca-hoisting". Registration helper: sub_21BC5A0.
PTX requires that all local memory declarations be hoisted to the function entry. This pass scans all basic blocks for alloca instructions and moves them to the entry block. This enables the frame layout pass (PrologEpilogInserter) to assign fixed offsets to all stack objects -- a requirement because PTX emits .local .align N .b8 __local_depotX[SIZE] at the function prologue and all local accesses are indexed from this single base.
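The fixed-offset layout that hoisting enables can be sketched as a simple bump allocator over the depot. Declaration-order packing and the align-to-max-alignment total are simplifying assumptions about the frame layout pass:

```python
def layout_local_depot(objects):
    """Assign offsets in __local_depot for (size, align) stack objects.
    Returns (per-object offsets, total depot size)."""
    offsets, cursor, max_align = [], 0, 1
    for size, align in objects:
        cursor = (cursor + align - 1) & ~(align - 1)  # align cursor up
        offsets.append(cursor)
        cursor += size
        max_align = max(max_align, align)
    depot_size = (cursor + max_align - 1) & ~(max_align - 1)
    return offsets, depot_size

# e.g. an i32, an f64 slot, and an i16 -> offsets 0, 8, 16 in a 24-byte depot
layout = layout_local_depot([(4, 4), (8, 8), (2, 2)])
```

All subsequent ld.local / st.local accesses are then emitted as fixed indices from the single __local_depotX base.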
ParamOpt (sub_2203290) -- Parameter Load Optimization
Registration: sub_2203290 at 0x2203290, pass name "Optimize NVPTX ld.param", pass ID "param-opt".
NVPTX-custom pass that optimizes ld.param instructions generated during kernel argument passing. When a kernel parameter is loaded multiple times (common when the same argument is used in different basic blocks), this pass eliminates redundant loads by propagating the first load's result to subsequent uses. Related knob: remat-load-param ("Support remating const ld.param that are not exposed in NVVM IR").
NVPTXTruncOpts (sub_22058E0) -- i16 Truncation Optimization
Registration: sub_22058E0 at 0x22058E0, pass name "Optimize redundant ANDb16ri instrunctions" [sic], pass ID "nvptx-trunc-opts".
When LLVM lowers trunc i32 to i16 operations, the NVPTX backend emits an AND.b16 with mask 0xFFFF to ensure the high bits are zero. In many cases this AND is redundant -- the producing instruction already guarantees a 16-bit result. This pass pattern-matches ANDb16ri instructions with the 0xFFFF immediate and removes them when the source provably fits in 16 bits.
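The redundancy test can be modeled with known-bits information. A minimal sketch, assuming a 32-bit known-zero mask as the input; this is illustrative, not the decompiled predicate:

```rust
// Sketch of the redundancy check behind nvptx-trunc-opts: an AND with mask
// 0xFFFF is removable when the producing instruction already guarantees the
// high bits are zero.
fn and_ffff_redundant(known_zero_mask: u32) -> bool {
    // known_zero_mask has a bit set for every bit proven to be 0.
    // If all bits above bit 15 are proven zero, masking with 0xFFFF
    // changes nothing and the AND can be deleted.
    known_zero_mask & 0xFFFF_0000 == 0xFFFF_0000
}

fn main() {
    // An 8-bit zero-extended value: bits 8..31 known zero -> AND removable.
    assert!(and_ffff_redundant(0xFFFF_FF00));
    // Nothing known about the value: the AND must stay.
    assert!(!and_ffff_redundant(0));
}
```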
RP-Aware MachineCSE (NVIDIA-Modified machine-cse)
Stock LLVM MachineCSE eliminates redundant machine instructions by matching instruction patterns within dominance regions. NVIDIA adds three extensions via ctor_302_0 (0x4FEB70, 7.8KB, 14 strings):
RP-aware CSE (rp-aware-mcse): Before eliminating a common subexpression, queries MRPA (sub_2E5A4E0) for the current register pressure. If eliminating the CSE candidate would increase pressure beyond the target (because the shared result must stay live longer), the CSE is suppressed. This prevents the classic GPU problem where CSE reduces instruction count but increases register pressure, reducing occupancy.
Predicate-aware CSE (pred-aware-mcse): Extends RP awareness to predicate registers (PTX %p class). Predicate registers are a scarce resource (maximum 7 per thread on most architectures), so predicate pressure is tracked separately from general-purpose register pressure.
Copy-prop CSE (copy-prop-mcse): Embeds copy propagation within the CSE framework. When CSE eliminates an instruction, the resulting COPY instructions can often be propagated immediately rather than waiting for the separate MachineCopyPropagation pass.
Incremental MRPA integration: The MCSE pass uses qword_501F988 (incremental-update-mcse, default ON) to incrementally update MRPA as CSE decisions are made, avoiding full recomputation per CSE candidate.
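The RP-aware gating decision reduces to a simple comparison against the pressure target. A minimal sketch with hypothetical field names and numbers; the real pass queries MRPA per region:

```rust
// Minimal model of the rp-aware-mcse gate: a CSE candidate is accepted only
// if keeping the shared result live does not push register pressure past
// the target.
struct PressureState {
    current: u32, // current max pressure in the region (from MRPA)
    target: u32,  // pressure target (e.g. nv-remat-default-max-reg = 70)
}

fn allow_cse(state: &PressureState, extra_live: u32) -> bool {
    state.current + extra_live <= state.target
}

fn main() {
    let s = PressureState { current: 68, target: 70 };
    assert!(allow_cse(&s, 2));  // fits within the target: CSE proceeds
    assert!(!allow_cse(&s, 3)); // would exceed the target: CSE suppressed
}
```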
MachinePipeliner (SMS) Detail
The Swing Modulo Scheduler at sub_3563190 performs software pipelining -- overlapping successive loop iterations to hide latency. It operates on a single loop body at the MachineInstr level:
- DAG construction: builds a data dependency graph with sub_2F97F60, computes latencies via sub_3559990, adds edges via sub_3542B20.
- MII computation: RecMII (recurrence-based) via sub_354CBB0, ResMII (resource-based) via sub_35449F0. MII = max(RecMII, ResMII).
- Early exits: MII == 0 is invalid; MII > SwpMaxMii (default 27, -pipeliner-max-mii) aborts.
- II search: starts at MII, tries up to pipeliner-ii-search-range (default 10, qword_503E428) consecutive II values. First valid schedule wins.
- Schedule construction: ASAP via sub_354BFF0, ALAP via sub_354BFF0, topological sort, core SMS node placement via sub_354C3A0, then finalization.
- Kernel generation: three code generation backends selected by priority -- annotation-only (pipeliner-annotate-for-testing), MVE-based (pipeliner-mve-cg, default enabled), and experimental peeling (pipeliner-experimental-cg).
The pipeliner stores its schedule context as a 616-byte (0x268) structure with four SmallVectors and per-BB data at 256-byte stride. Maximum pipeline stages: SwpMaxStages (default 3, -pipeliner-max-stages).
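The MII computation and II search described above can be sketched as follows; `schedule_at_ii` stands in for the real SMS scheduling attempt and is an assumption of this sketch:

```rust
// MII = max(RecMII, ResMII), then try consecutive II values starting at MII.
fn min_initiation_interval(rec_mii: u32, res_mii: u32) -> u32 {
    rec_mii.max(res_mii)
}

fn find_ii(mii: u32, max_mii: u32, search_range: u32,
           schedule_at_ii: impl Fn(u32) -> bool) -> Option<u32> {
    if mii == 0 || mii > max_mii {
        return None; // MII == 0 is invalid; MII > SwpMaxMii aborts
    }
    // Try up to `search_range` consecutive II values; first valid wins.
    (mii..mii + search_range).find(|&ii| schedule_at_ii(ii))
}

fn main() {
    assert_eq!(min_initiation_interval(4, 6), 6);
    // Pretend only II >= 8 admits a valid schedule.
    assert_eq!(find_ii(6, 27, 10, |ii| ii >= 8), Some(8));
    // MII beyond SwpMaxMii (default 27) aborts immediately.
    assert_eq!(find_ii(30, 27, 10, |_| true), None);
}
```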
Core scheduling pipeline (10 sequential calls):
| Step | Function | Purpose |
|---|---|---|
| 1 | sub_35476E0 | DAG construction / dependency analysis |
| 2 | sub_35523F0 | Recurrence detection / RecMII computation |
| 3 | sub_35546F0 | Resource usage / ResMII computation |
| 4 | sub_3543340 | MII = max(RecMII, ResMII) finalization |
| 5 | sub_35630A0 | Node ordering / priority assignment |
| 6 | sub_35568E0 | Schedule table initialization |
| 7 | sub_35433F0 | Pre-scheduling transforms |
| 8 | sub_3557A10 | Instruction ordering/selection (heuristic) |
| 9 | sub_354A760 | Schedule finalization / modulo expansion |
| 10 | sub_355F610 | ScheduleDAGMILive integration (64KB) |
Instruction selection heuristic (sub_3557A10):
Priority ordering: (1) deeper instructions first (offset 240 = latency/depth), (2) target priority table at a1+3944 (16-byte entries: [start, end, priority, window_width]), (3) narrower schedule windows first. Latency recomputation via sub_2F8F5D0 during comparison.
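The three-level comparison can be sketched as a comparator; field names are illustrative stand-ins for the offsets described above:

```rust
use std::cmp::Ordering;

// Sketch of the SMS selection heuristic: deeper instructions first, then the
// target priority table value, then narrower schedule windows.
struct Candidate {
    depth: u32,          // offset 240: latency/depth
    table_priority: u32, // from the 16-byte priority table entries
    window_width: u32,   // schedule window width (ALAP - ASAP)
}

fn pick_first(a: &Candidate, b: &Candidate) -> Ordering {
    b.depth.cmp(&a.depth)                              // deeper first
        .then(b.table_priority.cmp(&a.table_priority)) // higher table priority
        .then(a.window_width.cmp(&b.window_width))     // narrower window first
}

fn main() {
    let wide = Candidate { depth: 5, table_priority: 1, window_width: 3 };
    let narrow = Candidate { depth: 5, table_priority: 1, window_width: 2 };
    // Equal depth and priority: the narrower window schedules first.
    assert_eq!(pick_first(&narrow, &wide), Ordering::Less);
}
```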
Error messages:
- "Invalid Minimal Initiation Interval: 0" -- MII computation returned zero
- "Minimal Initiation Interval too large: MII > SwpMaxMii. Refer to -pipeliner-max-mii." -- loop is too complex
- "Unable to find schedule" -- no valid II found within search range
- "No need to pipeline - no overlapped iterations in schedule." -- numStages == 0
- "Too many stages in schedule: numStages > SwpMaxStages. Refer to -pipeliner-max-stages." -- pipeline depth exceeded
PrologEpilogInserter (sub_35B1110) -- .local Frame Layout
Address: sub_35B1110 (68KB, 2388 decompiled lines). Stack frame: 0x490 bytes of local state. This is NVIDIA's monolithic PEI for PTX. Unlike a traditional PEI that emits push/pop sequences and adjusts %rsp, this one computes .local memory frame offsets.
10-phase structure:
| Phase | Lines | Description |
|---|---|---|
| 1 | 443-490 | Target/subtarget retrieval, initial setup |
| 2 | 491-566 | Callee-saved register determination |
| 3 | 567-730 | Pre-pass: collect fixed objects from frame info |
| 4 | 733-1070 | Stack object offset assignment (main layout engine) |
| 5 | 1078-1600 | General local variable layout |
| 6 | 1688-1795 | Frame-pointer stack area |
| 7 | 1803-1872 | Prolog/epilog instruction insertion per BB |
| 8 | 1873-2132 | Scavenger / frame-index elimination |
| 9 | 2270-2304 | Stack-size warning & diagnostic reporting |
| 10 | 2305-2388 | Cleanup & deallocation |
Frame object record (40 bytes):
| Offset | Size | Field |
|---|---|---|
| +0 | 8 | Byte offset in .local memory (assigned by PEI) |
| +8 | 8 | Object size in bytes |
| +16 | 1 | Alignment (log2) |
| +20 | 1 | isDead flag (skip if set) |
| +32 | 1 | isSpillSlot flag |
| +36 | 1 | Category byte (0/1/2/3) |
Stack layout algorithm (Phase 4):
fn assign_frame_offsets(MF: &MachineFunction, frame: &mut FrameInfo) {
let grows_neg = frame.stack_direction == 1;
let mut offset = frame.initial_offset;
let mut max_align = frame.max_alignment;
// Fixed objects first
for obj in frame.fixed_objects() {
if obj.is_dead { continue; }
let align = 1 << obj.log2_align;
offset = align_to(offset, align);
obj.offset = if grows_neg { -offset } else { offset };
offset += obj.size;
max_align = max(max_align, align);
}
// Callee-saved register region
for csr in frame.callee_saved_range() {
if csr.is_dead || csr.size == -1 { continue; }
let align = 1 << csr.log2_align;
offset = align_to(offset, align);
csr.offset = if grows_neg { -offset } else { offset };
offset += csr.size;
}
// General locals: three category buckets, each via sub_35B0830
for category in [1, 2, 3] {
for obj in frame.objects_of_category(category) {
let align = 1 << obj.log2_align;
offset = align_to(offset, align);
obj.offset = if grows_neg { -offset } else { offset };
offset += obj.size;
}
}
frame.stack_size = offset;
}
The final PTX emission (sub_2158E80) uses these offsets to emit: .local .align N .b8 __local_depotX[SIZE]; at the function prologue, and ld.local / st.local instructions reference [%SPL + offset] where %SPL is the local stack pointer register.
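The bump-and-align layout from the pseudocode above can be exercised standalone. A self-contained sketch (general-locals path only, no categories or fixed objects):

```rust
// Each object is aligned up, placed at the running offset, and the offset
// advances by the object's size; the final offset is the depot size.
fn align_to(offset: u64, align: u64) -> u64 {
    (offset + align - 1) & !(align - 1) // align must be a power of two
}

fn layout(objects: &[(u64, u64)]) -> (Vec<u64>, u64) {
    // objects: (size, align) pairs; returns (per-object offsets, frame size)
    let mut offset = 0u64;
    let mut offsets = Vec::new();
    for &(size, align) in objects {
        offset = align_to(offset, align);
        offsets.push(offset);
        offset += size;
    }
    (offsets, offset)
}

fn main() {
    // A 1-byte flag, an 8-byte-aligned i64, then a 4-byte i32.
    let (offs, total) = layout(&[(1, 1), (8, 8), (4, 4)]);
    assert_eq!(offs, vec![0, 8, 16]);
    assert_eq!(total, 20); // -> .local .align 8 .b8 __local_depot0[20]
}
```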
ScheduleDAGMILive (sub_355F610) -- Post-RA Instruction Ordering
Address: sub_355F610 (64KB). This is the post-RA machine instruction scheduler, consuming either the pipeliner's output or standalone scheduling regions.
Data structures:
- SUnit (Scheduling Unit): 88 bytes per instruction
- Instruction-to-node hash map: 632-byte entries
- RP tracking structure: 112 bytes (offsets 32-48: per-class pressure current, offsets 56-72: per-class pressure limits)
Scheduling flow:
- Initialize RP tracking via sub_3551AB0 (if pipeliner-register-pressure is set)
- Set per-class pressure defaults via sub_2F60A40
- Walk BB instruction list, build instruction-to-node hash map (632-byte entries)
- Compute ASAP via sub_354BFF0 -> earliest cycle per instruction
- Compute ALAP via sub_354BFF0 -> latest cycle per instruction
- Place instructions via sub_354C3A0 (returns success/failure)
- Calculate stage count: (lastCycle - firstCycle) / II
- Verify placement via sub_355C7C0
- Build stage descriptors via sub_355D7E0 (80 bytes per stage)
Machine-Level Analysis Infrastructure
Machine passes depend on a set of analysis passes that compute liveness, dominance, and frequency information over the MachineFunction representation.
| Analysis ID | Class | Description |
|---|---|---|
slot-indexes | SlotIndexesAnalysis | Assigns a dense integer index to every instruction slot in the function. All liveness computations reference slot indexes rather than instruction pointers, enabling O(log n) interval queries. |
live-intervals | LiveIntervalsAnalysis | Computes live ranges for every virtual register as a set of [start, end) slot-index intervals. The LiveRangeCalc engine (sub_2FC4FC0, 12.9KB) manages 296-byte segment entries with inline small-object buffers for endpoint, register mask, kill-set, and use-def chain data. See LiveRangeCalc. |
live-reg-matrix | LiveRegMatrixAnalysis | Tracks physical register unit interference. On NVPTX, used primarily for register-class-level pressure tracking rather than physical unit assignment. |
machine-dom-tree | MachineDominatorTreeAnalysis | Dominance tree over MachineBasicBlock graph. Required by LICM, CSE, sinking, and register allocation. |
machine-post-dom-tree | MachinePostDominatorTreeAnalysis | Post-dominance tree. Used by block placement (sub_3521FF0 stores at this+544). |
machine-loops | MachineLoopAnalysis | Loop detection on the machine CFG. Used by LICM, block placement, and the pipeliner. |
machine-block-freq | MachineBlockFrequencyAnalysis | Block frequency estimates (profile-guided or static). Block placement uses this at this+528 to drive chain construction. |
machine-branch-prob | MachineBranchProbabilityAnalysis | Branch probability data. Block placement stores at this+536. |
machine-trace-metrics | MachineTraceMetricsAnalysis | Trace-based metrics (critical path length, resource depth). Used by MachineCombiner and if-conversion. |
machine-opt-remark-emitter | MachineOptRemarkEmitterAnalysis | Optimization remark emission for machine passes. |
edge-bundles | EdgeBundlesAnalysis | Groups CFG edges into bundles for spill placement. |
spill-code-placement | SpillPlacementAnalysis | Determines optimal spill/reload points using edge bundles and frequency data. |
regalloc-evict | RegAllocEvictionAdvisorAnalysis | Advises the greedy allocator on which live range to evict. |
regalloc-priority | RegAllocPriorityAdvisorAnalysis | Assigns allocation priority to live ranges. |
virtregmap | VirtRegMapAnalysis | Maps virtual registers to their assigned physical registers (or spill slots). |
machine-rpa ★ | sub_21EAA00 | NVIDIA-custom machine register pressure analysis. Provides per-BB pressure data consumed by RP-aware MCSE, scheduling, and rematerialization. |
Machine Pass Knobs Summary
NVIDIA Target Pass Enable/Disable
| Knob | Type | Default | Effect |
|---|---|---|---|
enable-nvvm-peephole | bool | true | Enable NVPTX-specific peephole optimizer |
nvptx-enable-machine-sink | bool | false | Enable MachineSink on NVPTX (off by default due to pressure concerns) |
enable-mlicm | bool | (opt-level dependent) | Enable MachineLICM on NVPTX |
enable-mcse | bool | (opt-level dependent) | Enable MachineCSE on NVPTX |
nv-disable-mem2reg | bool | false | Disable machine-level mem2reg |
nv-disable-remat | bool | false | Disable all NVIDIA rematerialization passes |
enable-new-nvvm-remat | bool | (varies) | Enable new NVVM remat, disable old |
usedessa | int | 2 | Select deSSA method for PHI elimination |
cssa-coalesce | int | (varies) | Controls PHI operand coalescing aggressiveness |
Stock LLVM Codegen Controls
| Knob | Type | Default | Effect |
|---|---|---|---|
disable-machine-dce | bool | false | Disable dead machine instruction elimination |
disable-machine-licm | bool | false | Disable pre-RA MachineLICM |
disable-postra-machine-licm | bool | false | Disable post-RA MachineLICM |
disable-machine-cse | bool | false | Disable MachineCSE |
disable-machine-sink | bool | false | Disable MachineSink (NVPTX also gates via nvptx-enable-machine-sink) |
disable-postra-machine-sink | bool | false | Disable post-RA MachineSink |
disable-branch-fold | bool | false | Disable BranchFolding / tail merge |
disable-tail-duplicate | bool | false | Disable post-RA tail duplication |
disable-early-taildup | bool | false | Disable pre-RA tail duplication |
disable-block-placement | bool | false | Disable MachineBlockPlacement |
disable-copyprop | bool | false | Disable MachineCopyPropagation |
disable-ssc | bool | false | Disable Stack Slot Coloring |
disable-post-ra | bool | false | Disable post-RA scheduler |
disable-early-ifcvt | bool | false | Disable early if-conversion |
disable-peephole | bool | false | Disable stock LLVM peephole optimizer |
enable-machine-outliner | enum | (varies) | disable / enable / guaranteed beneficial |
misched-postra | bool | false | Run MachineScheduler post-RA |
optimize-regalloc | bool | true | Enable optimized register allocation path |
verify-machineinstrs | bool | false | Run MachineVerifier after each pass |
NVIDIA RP-Aware MachineCSE Knobs
| Knob | Type | Default | Effect |
|---|---|---|---|
rp-aware-mcse | bool | (varies) | Enable register-pressure-aware MachineCSE |
pred-aware-mcse | bool | (varies) | Enable predicate-register-pressure-aware MCSE |
copy-prop-mcse | bool | (varies) | Enable copy propagation within MachineCSE |
incremental-update-mcse | bool | true | Incrementally update MRPA during MCSE |
verify-update-mcse | bool | false | Debug: verify incremental MRPA updates against full recomputation |
print-verify | bool | false | Debug: print detailed RP mismatch diagnostic |
cta-reconfig-aware-mrpa | bool | (varies) | CTA reconfiguration aware machine RP analysis |
NVPTXBlockRemat Knobs
| Knob | Type | Default | Effect |
|---|---|---|---|
nv-remat-block | int | 14 | Bitmask controlling remat modes (bits 0-3) |
nv-remat-max-times | int | 10 | Max iterations of the outer remat loop |
nv-remat-block-single-cost | int | 10 | Max cost per single live value pull-in |
nv-remat-block-map-size-limit | int | 6 | Map size limit for single pull-in |
nv-remat-block-max-cost | int | 100 | Max total clone cost per live value reduction |
nv-remat-block-liveout-min-percentage | int | 70 | Min liveout % for special consideration |
nv-remat-block-loop-cost-factor | int | 20 | Loop cost multiplier |
nv-remat-default-max-reg | int | 70 | Default max register pressure target |
nv-remat-block-load-cost | int | 10 | Cost assigned to load instructions |
nv-remat-threshold-for-spec-reg | int | 20 | Threshold for special register remat |
nv-dump-remat-block | bool | false | Debug dump toggle |
load-remat | bool | true | Enable load rematerialization |
Pipeliner Knobs
| Knob | Type | Default | Effect |
|---|---|---|---|
enable-pipeliner | bool | true | Enable the MachinePipeliner pass |
pipeliner-max-mii | int | 27 | Maximum Minimal Initiation Interval before abort |
pipeliner-max-stages | int | 3 | Maximum pipeline stages |
pipeliner-ii-search-range | int | 10 | Number of consecutive II values to try |
pipeliner-register-pressure | bool | false | Enable RP tracking during pipelining |
pipeliner-register-pressure-margin | int | 5 | RP margin before pipeliner backs off |
pipeliner-ignore-recmii | bool | false | Zero out RecMII, use only ResMII |
pipeliner-annotate-for-testing | bool | false | Annotate schedule without modifying code |
pipeliner-experimental-cg | bool | false | Use experimental peeling code generator |
pipeliner-mve-cg | bool | true | Use MVE code generator (default path) |
outliner-benefit-threshold | int | 1 | Minimum size in bytes for outlining candidate |
Register Pressure Target Knobs
| Knob | Type | Default | Effect |
|---|---|---|---|
reg-target-adjust | int | 0 | Adjust register pressure target (-10 to +10) |
pred-target-adjust | int | 0 | Adjust predicate register pressure target (-10 to +10) |
fca-size | int | 8 | Max size of first-class aggregates in bytes |
remat-load-param | bool | (varies) | Support remating const ld.param not exposed in NVVM IR |
cta-reconfig-aware-rpa | bool | (varies) | CTA reconfiguration aware register pressure analysis |
Function Address Map
| Address | Size | Function | Role |
|---|---|---|---|
sub_215DC20 | -- | GenericToNVVM registration | Address space normalization |
sub_215D530 | 320B state | GenericToNVVM factory | Allocates pass state with 2 DenseMaps |
sub_215D780 | -- | GenericToNVVM cleanup | GVMap iteration and Value ref-counting |
sub_2166D20 | 1.5KB | addISelPasses | Pre-ISel pass configuration |
sub_2166ED0 | 1.6KB | addPreRegAlloc | Pre-RA pass configuration |
sub_21668D0 | 1.2KB | addPostRegAlloc | Post-RA pass configuration |
sub_217D300 | -- | BlockRemat pass name | "NVPTX Machine Block Level Rematerialization" |
sub_217DBF0 | -- | BlockRemat registration | "nvptx-remat-block" |
sub_217E810 | 5.2KB | MULTIDEF detection | Single-def checker with opcode exclusion table |
sub_2181550 | ~3KB | Recursive pullability | Depth-limited chain validation (depth <= 50) |
sub_2181870 | 19KB | Second-chance heuristic | Re-evaluates rejected remat candidates |
sub_2183E30 | -- | Cost evaluator | Computes clone cost for rematerialization |
sub_2184890 | 12KB | Remat allocation helper | Simulates pressure after remat |
sub_2185250 | 17KB | Liveness propagation | Core instruction cloning/replacement engine |
sub_2186590 | -- | Max-live computation | Per-block pressure scan |
sub_2186D90 | 47KB | BlockRemat main engine | Iterative pull-in algorithm (1742 lines) |
sub_21810D0 | 9.4KB | Instruction replacement | Replaces register uses after remat |
sub_21BC5A0 | -- | AllocaHoisting name | Pass name registration |
sub_21BC7D0 | -- | AllocaHoisting registration | "alloca-hoisting" |
sub_21BCD80 | -- | ValidGlobalNames registration | "nvptx-assign-valid-global-names" |
sub_21BCF10 | -- | ImageOptimizer registration | "NVPTX Image Optimizer" |
sub_21DA810 | -- | ProxyRegErasure | Redundant cvta.to.local removal |
sub_21DB090 | -- | NVPTXPeephole registration | "nvptx-peephole" |
sub_21DB5F0 | -- | NVPTXPrologEpilog registration | "NVPTX Prolog Epilog Pass" |
sub_21DBEA0 | -- | ReplaceImageHandles registration | "NVPTX Replace Image Handles" |
sub_21DD1A0 | 16KB | Image type validation | tex/suld/sust/suq type checking |
sub_21E9A60 | 4.9KB | RP stats printer | "Max Live RRegs: " / "PRegs: " |
sub_21E9E80 | -- | ExtraMachineInstrPrinter registration | "extra-machineinstr-printer" |
sub_21EAA00 | -- | MRPA registration | "machine-rpa" |
sub_21EEB40 | 68KB | MRPA full recomputation | Per-BB pressure computation |
sub_21F2780 | -- | LdgXform registration | "ldgxform" |
sub_21F2C80 | 19KB | LDG split body | .ldgsplit / .ldgsplitinsert |
sub_21F3A20 | 44KB | Vector splitting engine | splitVec / vecBitCast / extractSplitVec |
sub_21F9920 | -- | NVPTXMem2Reg registration | "nvptx-mem2reg" |
sub_21FA880 | 22KB | Mem2Reg body | Machine-level mem2reg driver |
sub_21FC920 | 33KB | Mem2Reg engine | Promotion/replacement logic |
sub_2200150 | 78KB | DAGToDAG ISel main | Hash-table pattern matching (h = (37*idx) & (size-1)) |
sub_2203290 | -- | ParamOpt registration | "param-opt" |
sub_2204E60 | -- | Redundant move elim | "Remove redundant moves" |
sub_22058E0 | -- | TruncOpts registration | "nvptx-trunc-opts" |
sub_2E5A4E0 | 48KB | MRPA incremental updater | Incremental RP tracking for MCSE |
sub_1E00370 | 78KB | MRPA backend variant | Alternative RP tracker |
sub_35B1110 | 68KB | PrologEpilogInserter | .local frame layout (2388 lines) |
sub_3563190 | 58KB | MachinePipeliner | Swing Modulo Scheduling |
sub_355F610 | 64KB | ScheduleDAGMILive | Post-RA instruction ordering |
sub_3557A10 | -- | SMS instruction selection | Scheduling heuristic |
Global Variable Reference
| Variable | Type | Default | Role |
|---|---|---|---|
byte_4FD1980 | byte | (opt-level) | MachineLICM enable flag |
byte_4FD18A0 | byte | (opt-level) | MachineCSE enable flag |
byte_4FD1A60 | byte | (opt-level) | MachineSink enable flag |
byte_4FD25C0 | byte | (opt-level) | nvptx-mem2reg enable |
byte_4FD2160 | byte | -- | Extra ISel pass enable |
byte_4FD2E80 | byte | off | nv-dump-remat-block |
dword_4FD26A0 | dword | -- | Scheduling mode (1 = simple, else = full) |
dword_4FD3740 | dword | 10 | nv-remat-max-times |
dword_4FD3820 | dword | 14 | nv-remat-block mode bitmask |
dword_4FD33C0 | dword | 70 | nv-remat-default-max-reg (global) |
qword_501F988 | qword | 1 | incremental-update-mcse |
qword_501F8A8 | qword | 0 | verify-update-mcse |
qword_501F7C8 | qword | 0 | print-verify |
Cross-References
- SelectionDAG -- the ISel pass that produces MachineInstrs consumed by machine passes
- Register Allocation -- pressure-driven greedy allocator with NVPTX register classes
- Register Coalescing -- NVPTX-custom copy elimination framework
- PrologEpilogInserter & Frame Layout -- .local memory frame computation
- MachineOutliner -- suffix-tree-based code size reduction
- Block Placement -- profile-guided basic block ordering
- Instruction Scheduling -- MRPA, MachinePipeliner, ScheduleDAGMILive
- Rematerialization -- NVIDIA's custom machine-level remat
- NVVM Peephole -- IR-level NVVM peephole (distinct from machine-level nvptx-peephole)
- AsmPrinter & PTX Emission -- final pass: MachineInstr to PTX text
- Code Generation -- pipeline overview including ISel and DAG infrastructure
- StructurizeCFG -- mandatory CFG structurization (runs before ISel, feeds machine passes)
- Hash Infrastructure -- DenseMap hash function (ptr >> 9) ^ (ptr >> 4) used throughout MRPA
- Register Classes -- NVPTX register class definitions consumed by all machine passes
SelectionDAG & Instruction Selection
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
Upstream source: Target-independent DAG infrastructure: llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp, DAGCombiner.cpp, LegalizeDAG.cpp, LegalizeTypes.cpp, SelectionDAGBuilder.cpp, SelectionDAGISel.cpp. NVPTX target: llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp, NVPTXISelDAGToDAG.cpp, NVPTXInstrInfo.td (LLVM 20.0.0).
LLVM version note: The target-independent SelectionDAG infrastructure at 0xF05000--0xF70000 appears to be stock LLVM 20 with no detectable NVIDIA modifications. All NVIDIA customization lives in the NVPTX target range (0x3290000--0x35FFFFF) via virtual dispatch through NVPTXTargetLowering and NVPTXDAGToDAGISel. The intrinsic lowering switch covers IDs up to 14196 (0x3774), far exceeding upstream NVPTX, which covers approximately IDs 0--300.
CICC v13.0 contains a complete NVPTX SelectionDAG backend derived from LLVM 20.0.0, with substantial NVIDIA customizations for GPU-specific lowering, the PTX .param-space calling convention, tensor core intrinsic selection, and a 343KB intrinsic lowering mega-switch covering over 200 CUDA intrinsic IDs. The SelectionDAG pipeline converts LLVM IR into machine-level PTX instructions through four major phases: type legalization, operation legalization, DAG combining, and pattern-based instruction selection.
The NVPTX SelectionDAG backend spans roughly 4MB of code across two address ranges: 0xF05000--0xF70000 for the target-independent DAG infrastructure (combining, known-bits, node management) and 0x3290000--0x35FFFFF for the NVPTX-specific lowering, instruction selection, and register allocation. The infrastructure range is stock LLVM with no detectable NVIDIA modifications; all NVIDIA customization lives in the latter range via target hooks and virtual dispatch.
| Component | Location |
|---|---|
| LowerOperation dispatcher | sub_32E3060 (111KB, 3,626 lines) |
| LowerCall (.param ABI) | sub_3040BF0 (88KB, 2,909 lines) |
| Intrinsic lowering switch | sub_33B0210 (343KB, 9,518 lines) |
| ISel::Select driver | sub_3090F90 (91KB, 2,828 lines) |
| LegalizeTypes | sub_20019C0 (348KB, 10,739 lines) |
| LegalizeOp dispatcher | sub_1FCE100 (91KB, ~100 opcodes) |
| LegalizeOp action dispatch | sub_1FFB890 (137KB, 967 cases) |
| DAG combiner visitor | sub_F20C20 (64KB) |
| DAG combiner orchestrator | sub_F681E0 (65KB) |
| DAGCombiner::combine (NVPTX) | sub_3425710 (142KB, "COVERED"/"INCLUDED" tracing) |
| PerformDAGCombine (NVPTX) | sub_33C0CA0 (62KB) |
| DAG combine: post-legalize | sub_32EC4F0 (92KB) |
| computeKnownBits (NVPTX) | sub_33D4EF0 (114KB, 3,286 lines) |
| Inline asm lowering | sub_2079C70 (83KB, 2,797 lines) |
| Inline asm constraints (NVPTX) | sub_338BA40 (79KB) |
| NVPTXTargetLowering init | sub_3056320 (45KB, constructor) |
| Type legalization setup | sub_3314670 (73KB, table population) |
| Upstream | lib/CodeGen/SelectionDAG/, lib/Target/NVPTX/NVPTXISelLowering.cpp |
Complexity
Let N = number of DAG nodes and E = number of edges (use-def relationships). The SelectionDAG pipeline runs eight sequential phases. SelectionDAGBuilder converts IR instructions to DAG nodes in O(I) where I = LLVM IR instruction count. Each DAG Combiner pass is worklist-driven: O(N) nodes are visited, each matched against pattern rules in O(1) via opcode dispatch; ReplaceAllUsesWith is O(U) per node where U = uses. The three combiner passes total O(3 * N * U_avg).

Type legalization (sub_20019C0, 348KB) iterates until all types are legal -- each iteration processes O(N) nodes, and convergence is guaranteed in O(T) iterations where T = max type-promotion depth (typically 2--3 for GPU types). Operation legalization (sub_1FFB890, 137KB) visits each node once: O(N). The action table lookup is O(1) via the 2D array at TLI + 259 * VT + opcode + 2422.

ISel pattern matching (sub_3090F90, 91KB) visits each node once in topological order: O(N). Per-node matching is O(P) where P = number of patterns for that opcode, but NVPTX patterns are organized by opcode-indexed tables, making this effectively O(1) for common opcodes. The DAG worklist uses ((addr >> 9) ^ (addr >> 4)) & (cap - 1) hashing for O(1) amortized membership tests. Overall: O(I + N * U_avg * 3 + N * T + N), which simplifies to O(N * U_avg) in practice. The intrinsic lowering mega-switch (343KB, 200+ IDs) adds O(1) per intrinsic call via the jump table, not O(200).
Pipeline Position
The SelectionDAG phases execute in a fixed sequence after SelectionDAGBuilder (sub_2081F00) converts LLVM IR into an initial DAG:
- SelectionDAGBuilder -- IR-to-DAG lowering, visitor dispatch at sub_2065D30
- DAG Combiner (sub_F681E0 / sub_F20C20) -- initial algebraic simplification
- DAGTypeLegalizer (sub_20019C0) -- iterates to fixpoint until all types are legal; see Type Legalization
- DAG Combiner -- second pass after type legalization
- LegalizeDAG (sub_1FCE100 dispatcher, sub_1FFB890 action engine) -- legalizes operations on legal types
- DAG Combiner -- third pass after operation legalization
- NVPTXTargetLowering::PerformDAGCombine (sub_33C0CA0) -- NVPTX-specific post-legalize combines
- Instruction Selection (sub_3090F90) -- see ISel Patterns
Type Legalization
Type legalization (sub_20019C0) is the largest single function in the SelectionDAG pipeline at 348KB. Unlike upstream LLVM, which splits legalization across LegalizeIntegerTypes.cpp, LegalizeFloatTypes.cpp, and LegalizeVectorTypes.cpp, NVIDIA ships all type-legalization logic inlined into a single monolithic dispatch. This may be an LTO artifact or a deliberate choice for branch-prediction locality.
The master switch dispatches on approximately 50 ISD opcodes. Type legalization actions follow the standard LLVM model:
- Promote -- widen small types to register width (e.g., i8 to i32) via ANY_EXTEND/ZERO_EXTEND, perform the operation, then TRUNCATE the result.
- Expand -- split wide types into halves (e.g., i128 into two i64 values) using shift-and-OR sequences.
- Soften -- emulate unsupported FP types through integer libcall sequences.
- Scalarize/Split Vector -- decompose illegal vector types into scalar element operations.
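The Promote and Expand strategies can be modeled on plain integers. A sketch, not the binary's code: an i8 add promoted through i32, and an i128 split into two i64 halves with shift-and-OR reassembly:

```rust
// Promote: ANY_EXTEND both operands, operate at register width, TRUNCATE back.
fn promote_add_i8(a: u8, b: u8) -> u8 {
    let wide = (a as u32).wrapping_add(b as u32);
    wide as u8 // truncation keeps the low 8 bits
}

// Expand: split a wide value into (hi, lo) halves...
fn expand_i128(v: u128) -> (u64, u64) {
    ((v >> 64) as u64, v as u64)
}

// ...and reassemble with a shift-and-OR sequence.
fn rebuild_i128(hi: u64, lo: u64) -> u128 {
    ((hi as u128) << 64) | lo as u128
}

fn main() {
    assert_eq!(promote_add_i8(200, 100), 44); // 300 mod 256
    let v = 0x0123_4567_89AB_CDEF_0011_2233_4455_6677u128;
    let (hi, lo) = expand_i128(v);
    assert_eq!(rebuild_i128(hi, lo), v);
}
```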
The legality table lives inside NVPTXTargetLowering at offset +2422, organized as a 2D array indexed by 259 * VT + opcode. The 259-byte row stride accommodates LLVM's ~250 generic opcodes plus approximately 10 NVPTX target-specific opcodes. A secondary condition-code action table at offset +18112 uses 4-bit packed nibbles indexed by (VT_row + 15 * CC).
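The flat-array indexing scheme can be sketched as follows; the table contents and the `Action` encoding below are illustrative, not dumped from the binary (only the 259-byte row stride and the action codes 0--4 come from the analysis above):

```rust
// Legality-table lookup: a flat byte array indexed by 259 * VT + opcode.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Action { Legal, Custom, Expand, LibCall, Promote }

const ROW_STRIDE: usize = 259; // ~250 generic + ~10 target opcodes per VT row

fn op_action(table: &[u8], vt: usize, opcode: usize) -> Action {
    match table[ROW_STRIDE * vt + opcode] {
        0 => Action::Legal,
        1 => Action::Custom,
        2 => Action::Expand,
        3 => Action::LibCall,
        _ => Action::Promote,
    }
}

fn main() {
    // Tiny two-VT table, all-legal except one Custom entry.
    let mut table = vec![0u8; ROW_STRIDE * 2];
    table[ROW_STRIDE * 1 + 0x4C] = 1; // pretend SELECT on VT 1 is Custom
    assert_eq!(op_action(&table, 1, 0x4C), Action::Custom);
    assert_eq!(op_action(&table, 0, 0x4C), Action::Legal);
}
```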
The SimpleVT type encoding appears as a recurring pattern throughout the function (at least 11 instances of the same bitwidth-to-VT mapping):
| SimpleVT | Type | SimpleVT | Type |
|---|---|---|---|
| 1 | i1 | 7 | i128 |
| 3 | i8 | 8 | f16 |
| 4 | i16 | 9 | f32 |
| 5 | i32 | 10 | f64 |
| 6 | i64 | 14--109 | vector types |
The vector type range 14--109 maps fixed-width (14--55) and scalable (56--109) vector MVTs to their scalar element types through a ~100-case switch block that appears six times in the function body. The definitive MVT::getSizeInBits() mapping (confirmed at sub_1FDDC20) is:
| MVT Range | Bits | Description |
|---|---|---|
| 0, 1 | 0 | Other, Glue |
| 2 | 1 | i1 |
| 3 | 8 | i8 |
| 4, 8 | 16 | i16, f16 |
| 5, 9 | 32 | i32, f32 |
| 6, 10 | 64 | i64, f64 |
| 7 | 128 | i128 |
| 11 | 80 | ppcf128 / x87 f80 |
| 14--23 | varies | 2-element vectors |
| 24--109 | varies | 3+ element vectors |
| 111--114 | 0 | token, metadata, untyped |
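The scalar rows of the table above transcribe directly into a match; the vector rows (14--109) vary by element count and are omitted here:

```rust
// MVT::getSizeInBits() for scalar types, per the mapping confirmed at
// sub_1FDDC20. Vector MVTs are element-count dependent and return None.
fn mvt_size_in_bits(vt: u32) -> Option<u32> {
    match vt {
        0 | 1 => Some(0),     // Other, Glue
        2 => Some(1),         // i1
        3 => Some(8),         // i8
        4 | 8 => Some(16),    // i16, f16
        5 | 9 => Some(32),    // i32, f32
        6 | 10 => Some(64),   // i64, f64
        7 => Some(128),       // i128
        11 => Some(80),       // ppcf128 / x87 f80
        111..=114 => Some(0), // token, metadata, untyped
        _ => None,            // vector types: size depends on element count
    }
}

fn main() {
    assert_eq!(mvt_size_in_bits(7), Some(128));
    assert_eq!(mvt_size_in_bits(9), Some(32));
    assert_eq!(mvt_size_in_bits(20), None); // a vector MVT
}
```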
Type legalization workers fan out from several dispatch functions:
| Dispatcher | Role | Size | Cases |
|---|---|---|---|
sub_201E5F0 | Promote/expand secondary dispatch | 81KB | 441 case labels, 6 switches |
sub_201BB90 | ExpandIntegerResult | 75KB | 632 case labels |
sub_2000100 | PromoteIntegerResult | 45KB | recursive self-calls |
sub_2029C10 | SplitVectorResult | 5KB (dispatcher) | ~190 cases |
sub_202E5A0 | SplitVectorOperand | 6KB (dispatcher) | ~157 cases |
sub_2036110 | ScalarizeVectorResult | dispatch | "Do not know how to scalarize..." |
sub_2035F80 | ScalarizeVectorOperand | dispatch | "Do not know how to scalarize..." |
For complete detail, see Type Legalization.
Operation Legalization
LegalizeOp Dispatcher: sub_1FCE100
The top-level operation legalizer (sub_1FCE100, 91KB) is a massive switch on SDNode::getOpcode() (read as *(uint16_t*)(node + 24)) that dispatches approximately 100 ISD opcodes to dedicated per-opcode handler functions. The switch covers all major categories:
| Opcode | ISD Name | Handler | Size |
|---|---|---|---|
| 0x02 | EntryToken | sub_1F823C0 | |
| 0x03--0x04 | TokenFactor | sub_1F73660 | |
| 0x32 | CopyFromReg | sub_1F78510 | |
| 0x33 | CopyToReg | sub_1F987D0 | |
| 0x34 | MERGE_VALUES | sub_1FC08F0 | |
| 0x35 | ADD | sub_1FA8F90 | 31KB |
| 0x36 | SUB | sub_1FAA420 | 26KB |
| 0x37 | MUL | sub_1FAB9E0 | |
| 0x38 | SDIV/UDIV | sub_1FABFF0 | |
| 0x39--0x3A | SREM/UREM | sub_1F99DA0 | |
| 0x3B | AND | sub_1FD2F20 | |
| 0x3C | OR | sub_1FD2A20 | |
| 0x40 | SHL | sub_1FA27D0 | |
| 0x41 | SRA | sub_1FA2510 | |
| 0x42 | SRL | sub_1F71080 | |
| 0x43 | ROTL | inline | builds opcode 65 target node |
| 0x44 | ROTR | sub_1FA2D60 | |
| 0x47 | CTLZ | sub_1FA7370 | |
| 0x49 | CTPOP | sub_1FA2A00 | |
| 0x4A | BSWAP | inline | 16-bit width check |
| 0x4B | BITREVERSE | inline | |
| 0x4C | SELECT | sub_1FAC480 | 78KB |
| 0x4D | SELECT_CC | sub_1FAE680 | 87KB |
| 0x4E | SETCC | sub_1FB04B0 | 26KB |
| 0x4F | VSELECT | sub_1FCC170 | |
| 0x63 | SIGN_EXTEND | sub_1F8D440 | 22KB |
| 0x65 | ZERO_EXTEND | sub_1F74E80 | |
| 0x68 | TRUNCATE | sub_1F912F0 | 77KB |
| 0x69 | FP_ROUND | sub_1F97850 | 27KB |
| 0x6A | FP_EXTEND | sub_1FC15C0 | 36KB |
| 0x6C | BITCAST | sub_1F94350 | 22KB |
| 0x6D | LOAD | inline | alignment+memtype checks |
| 0x70 | STORE | sub_1F766E0 | |
| 0x72--0x75 | ATOMIC_FENCE..LOAD | sub_1FAA010 | |
| 0x76 | ATOMIC_STORE | sub_1FBDC00 | 76KB |
| 0x77 | ATOMIC_LOAD_ADD | sub_1FB1F30 | 37KB |
| 0x78 | ATOMIC_LOAD_SUB | sub_1FBB600 | 44KB |
| 0x7A | ATOMIC_LOAD_AND | sub_1FB8710 | 47KB |
| 0x7B | ATOMIC_LOAD_OR | sub_1FBA730 | 24KB |
| 0x7C | ATOMIC_LOAD_XOR | sub_1FB6C10 | 39KB |
| 0x86 | INTRINSIC_WO_CHAIN | sub_1F9E480 | 47KB |
| 0x87 | INTRINSIC_W_CHAIN | sub_1F9D3D0 | 26KB |
| 0x88 | INTRINSIC_VOID | sub_1F9CFD0 | |
| 0x8E | BUILD_VECTOR | sub_1FA3B00 | 26KB |
| 0x8F | INSERT_VECTOR_ELT | sub_1FA4AC0 | 67KB |
| 0x90 | EXTRACT_VECTOR_ELT | sub_1FA0CA0 | 20KB |
| 0x91 | CONCAT_VECTORS | sub_1FB3BB0 | 65KB |
| 0x94 | EXTRACT_SUBVECTOR | sub_1FB5FC0 | 19KB |
| 0x9A | DYNAMIC_STACKALLOC | sub_1F8F600 | |
| 0x9E | BR_CC | sub_1F8B6C0 | |
Opcodes not listed (0x00--0x01, 0x05--0x31, 0x3D--0x3F, 0x46, 0x48, 0x51--0x62, etc.) return immediately with code 0 (legal, no transformation needed).
Action Dispatch Engine: sub_1FFB890
The operation legalization action engine (sub_1FFB890, 137KB) determines what to do for each DAG node based on the target's action table, then executes the chosen strategy. It reads the per-opcode action byte from NVPTXTargetLowering + 2422 using the formula *(uint8_t*)(TLI + 259 * VT + opcode + 2422):
| Action | Code | Behavior |
|---|---|---|
| Legal | 0 | Return immediately -- node is natively supported |
| Custom | 1 | Call NVPTXTargetLowering::LowerOperation (vtable slot #164, offset +1312); if NULL returned, fall through to expand |
| Expand | 2 | Try LegalizeTypes, then ExpandNode (sub_1FF6F70) as fallback |
| LibCall | 3 | Call ExpandNode directly for libcall substitution |
| Promote | 4 | Find a larger legal type and rebuild the node |
The function contains 967 case labels dispatching on opcode. When LowerOperation returns NULL (the custom lowering cannot handle the node), the framework falls through to the expansion path. When it returns a different node, ReplaceAllUsesWith (sub_1D44C70) splices the replacement into the DAG and marks the old node as dead (tombstone value -2 in the worklist hash set).
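The action lookup itself is a flat byte-table index. A minimal sketch of the recovered formula follows — the layout constants come from the decompilation, but the bytearray model, helper name, and usage values are illustrative, not recovered data:

```python
# Sketch of the recovered action-table lookup: the per-{VT, opcode} action
# byte lives at TLI + 259*VT + opcode + 2422. The TLI object is modeled
# here as a flat bytearray.
LEGAL, CUSTOM, EXPAND, LIBCALL, PROMOTE = range(5)

ACTION_TABLE_OFFSET = 2422   # recovered base offset inside NVPTXTargetLowering
OPCODES_PER_VT = 259         # recovered row stride (one action byte per opcode)

def op_action(tli: bytearray, vt: int, opcode: int) -> int:
    """Mirror of *(uint8_t*)(TLI + 259 * VT + opcode + 2422)."""
    return tli[ACTION_TABLE_OFFSET + OPCODES_PER_VT * vt + opcode]

# Hypothetical usage: mark {VT=7, opcode=0x4C (SELECT)} as Custom; every
# other slot defaults to 0 (Legal), matching the fall-through behavior.
tli = bytearray(ACTION_TABLE_OFFSET + OPCODES_PER_VT * 16)
tli[ACTION_TABLE_OFFSET + OPCODES_PER_VT * 7 + 0x4C] = CUSTOM
```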
The promote path contains approximately 30 opcode-specific expansion strategies covering integer arithmetic, FP operations, vector operations, bitcasts, shifts, and NVPTX-specific operations. For FP promotion, the pattern is: FP_EXTEND both operands to the promoted type, apply the original operation, then FP_ROUND the result back.
Worklist management uses sub_1FF5010 with a DenseSet-like structure. The hash function for SDNode pointers follows the standard LLVM pattern: ((addr >> 9) ^ (addr >> 4)) & (capacity - 1).
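A minimal model of that pointer hash, assuming (as in LLVM's DenseSet) that the table capacity is a power of two so the mask acts as a modulo:

```python
def sdnode_hash(addr: int, capacity: int) -> int:
    """Recovered SDNode pointer hash: ((addr >> 9) ^ (addr >> 4)) & (capacity - 1).
    The two shifts discard low alignment bits and fold higher address bits
    into the bucket index; capacity must be a power of two."""
    assert capacity & (capacity - 1) == 0 and capacity != 0
    return ((addr >> 9) ^ (addr >> 4)) & (capacity - 1)
```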
Load/Store Legalization
The largest individual per-opcode handlers deal with memory operations:
| Handler | Opcode | Size | Behavior |
|---|---|---|---|
| sub_1FC2C30 | LOAD (complex) | 70KB | Extending loads, vector loads, memory type conversion |
| sub_1FC66B0 | Load/Store vectorization | 68KB | Offset-based coalescing with introsort (sub_1F6CA30) |
| sub_1FC9570 | STORE legalization | 60KB | Alignment checks, store splitting, scatter sequences |
The load/store vectorization helper sorts operands by memory offset to detect coalescing opportunities, then creates vector load/store sequences when contiguous accesses are found. This is important for NVPTX because PTX supports ld.v2/ld.v4/st.v2/st.v4 instructions that load/store 2 or 4 elements in a single transaction.
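The coalescing step can be sketched roughly as follows. This is a simplified model of the sort-then-group idea only — the real legalizer also checks alignment, address space, and chain dependencies, and the function name is illustrative:

```python
def coalesce(offsets, elem_size=4):
    """Greedy coalescing sketch: sort accesses by byte offset, then group
    runs of contiguous elements into v4, v2, or scalar memory ops,
    mirroring the ld.v4/ld.v2 preference."""
    offs = sorted(offsets)
    groups, i = [], 0
    while i < len(offs):
        for width in (4, 2, 1):   # prefer v4, then v2, then scalar
            run = offs[i:i + width]
            if len(run) == width and all(
                    run[k] == run[0] + k * elem_size for k in range(width)):
                groups.append((run[0], width))   # (base offset, lane count)
                i += width
                break
    return groups
```

For six contiguous 4-byte accesses, this yields one v4 group and one v2 group rather than six scalar transactions.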
Atomic Legalization
All atomic operations (ATOMIC_STORE through ATOMIC_LOAD_XOR, opcodes 0x72--0x7C) follow a shared structural pattern:
- Check operation legality via sub_1D16620 (isAtomicStoreLegal / isOperationLegalOrCustom)
- If legal, emit the operation directly
- If custom, call NVPTXTargetLowering::LowerOperation for scope-aware NVPTX atomics
- Build atomic fence pairs around the operation when needed
- Lower to target-specific NVPTX atomic operations with CTA/GPU/SYS scope
The ATOMIC_LOAD_SUB handler at sub_1FBB600 converts subtraction to atom.add of the negated operand when the target lacks native atom.sub.
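In two's-complement arithmetic this rewrite is exact. A sketch, with illustrative names and register width (the handler itself builds DAG nodes, not strings):

```python
def lower_atomic_sub(operand: int, bits: int = 32):
    """Rewrite atom.sub(x) as atom.add(-x): two's-complement negation
    makes the add modularly equivalent to the subtraction."""
    neg = (-operand) & ((1 << bits) - 1)
    return ("atom.add", neg)
```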
NVPTX Custom Lowering: sub_32E3060
The LowerOperation dispatcher (sub_32E3060, 111KB) handles NVPTX-specific ISD opcode lowering. This is the second-largest function in the 0x32XXXXX range. It operates through a multi-phase approach rather than a clean switch-on-opcode, with approximately 620 local variables and a 0x430-byte stack frame.
The dispatcher is reached via vtable slot #164 (offset +1312) of the NVPTXTargetLowering object whenever the operation legalizer encounters action code 1 (Custom).
Supported Opcodes
| Opcode | ISD Node | Lowering Strategy |
|---|---|---|
| 51 | UNDEF | Direct pass-through via getNode(UNDEF) |
| 156 | BUILD_VECTOR | Iterates operands, detects all-same, calls dedicated handler |
| 186 | VECTOR_SHUFFLE | Three-level approach by result count (1, 2, 3+) |
| 234 | EXTRACT_VECTOR_ELT | Three sub-paths: predicate check, direct sub-register, general extract |
Additionally, the function handles load/store lowering (sub_32D2680, 81KB companion), integer/FP operation legalization (sub_32983B0, 79KB), address space casts (sub_32C3760, 54KB), bitcast/conversion (sub_32C7250, 57KB), and conditional/select patterns (sub_32BE8D0, 54KB). These large helper functions are called from within sub_32E3060's dispatch logic.
BUILD_VECTOR Lowering
BUILD_VECTOR (opcode 156) lowering begins by iterating all operands to detect the all-same (splat) case. When all elements are the same value, the lowering produces a single scalar load followed by register-class-appropriate replication. When elements differ, it falls through to a per-element insert chain.
For NVPTX, BUILD_VECTOR is significant because PTX has no native vector construction instruction -- vectors are built by storing elements into .param space and reloading as a vector type, or through register-pair packing for 2-element vectors.
VECTOR_SHUFFLE Three-Level Lowering
Vector shuffle lowering (lines 2665--3055 of the decompilation) implements a three-level strategy based on the result element count:
Level 1 -- Single-result shuffle. When the shuffle produces a single element, the lowering extracts the source element directly via EXTRACT_VECTOR_ELT and wraps it in a BUILD_VECTOR if needed. This avoids any actual shuffle machinery.
Level 2 -- Two-result shuffle. The handler uses a two-phase identity/extract detection with BitVector tracking. Phase A scans the shuffle mask to identify which source elements map to which result positions. Phase B determines whether each result position is an identity (element already in the correct position in one of the source vectors) or requires extraction. Results that are identities are left in place; non-identity elements are extracted and inserted.
Level 3 -- General shuffle (3+ results). Falls back to a BUILD_VECTOR-based reconstruction. Each result element is individually extracted from the appropriate source vector using EXTRACT_VECTOR_ELT, then all elements are combined via BUILD_VECTOR. For certain mask patterns, pairwise shuffle via sub_32B2430 is attempted first as an optimization.
EXTRACT_VECTOR_ELT Three Sub-Paths
EXTRACT_VECTOR_ELT (opcode 234) lowering takes one of three paths based on the extraction context:
- Predicate extraction. When extracting from a vector of i1 (predicates), the lowering produces a bitwise test on the packed predicate register. This is NVPTX-specific: PTX stores predicate vectors packed into integer registers.
- Direct sub-register extraction. When the element index is a compile-time constant and the element aligns with a register boundary, the lowering generates a direct sub-register reference. This maps to PTX's mov.b32 or mov.b64 for extracting elements from packed register pairs.
- General extraction. For non-constant indices or non-aligned elements, the lowering stores the entire vector to local memory, computes the byte offset from the index, and loads the element back. This generates st.local + ld.local sequences, which is expensive but handles all cases.
Supporting NVPTX Lowering Functions
The custom lowering infrastructure at 0x3290000--0x32FFFFF consists of 13 large functions totaling ~850KB:
| Function | Size | Role |
|---|---|---|
| sub_32E3060 | 111KB | Master LowerOperation dispatcher |
| sub_32A1EF0 | 109KB | Custom type promotion for NVPTX types |
| sub_32EC4F0 | 92KB | Post-legalize DAG combine |
| sub_32FE970 | 88KB | Vector operation splitting/scalarization |
| sub_32D2680 | 81KB | Load/store DAG lowering (address space, alignment) |
| sub_32983B0 | 79KB | Integer/FP operation legalization |
| sub_32B8A20 | 71KB | NVVM intrinsic lowering (tex/surf/special) |
| sub_32CBCB0 | 57KB | Extended type legalization |
| sub_32C7250 | 57KB | Bitcast/conversion lowering |
| sub_32A9030 | 55KB | Vector operation lowering |
| sub_32C3760 | 54KB | Address space cast / pointer lowering |
| sub_32BE8D0 | 54KB | Conditional/select lowering |
| sub_32B6540 | 50KB | Special register / intrinsic lowering |
Common helpers shared across all functions in this cluster:
| Range | Role |
|---|---|
| sub_325Fxxx | EVT/MVT type utilities |
| sub_326xxxx | DAG node creation (getNode variants) |
| sub_327xxxx | DAG memory node creation |
| sub_328xxxx | Target-specific node creation |
| sub_33Exxxx | NVPTX-specific node builders |
| sub_33Fxxxx | NVPTX instruction node helpers |
| sub_340xxxx | NVPTX constant/register node helpers |
| sub_341xxxx | NVPTX chain/glue node construction |
The .param-Space Calling Convention
PTX does not use registers for argument passing. Instead, all arguments flow through .param memory space, a compiler-managed address space specifically for call sites. LowerCall (sub_3040BF0, 88KB) implements this convention by emitting a structured sequence of NVPTXISD custom DAG nodes.
Call Sequence DAG Structure
```
CallSeqBegin(315, seq_id, 0)
DeclareScalarParam(506, align=4, idx=0, size=32)   // scalar arg
DeclareParam(505, align=4, idx=1, size=N)          // struct arg (byval)
StoreV1(571, ...)                                  // 8 bytes at a time
StoreV2(572, ...)                                  // or 2-element vector
DeclareRetScalarParam(508, 1, 32, 0)               // return decl
CallProto(518, callee, ...)
CallStart(514, ...)                                // actual call
LoadRetParam(515, 1, 0, ...)                       // load return value
CallSeqEnd(517, ...)
CallSeqEnd_Outer(316, ...)
```
Each call increments a monotonic sequence counter at NVPTXTargetLowering + 537024 (offset 134256 * 4), used to match CallSeqBegin/CallSeqEnd pairs and generate unique .param variable names (e.g., __param_0, __param_1, etc.).
Scalar Widening Rules
Scalar arguments narrower than 32 bits are widened to 32 bits; values between 32 and 64 bits are widened to 64 bits. This matches the PTX ABI requirement that .param scalars have a minimum 32-bit size:
| Source Width | Widened To | PTX Type |
|---|---|---|
i1 (1 bit) | i32 (32 bit) | .param .b32 |
i8 (8 bit) | i32 (32 bit) | .param .b32 |
i16 (16 bit) | i32 (32 bit) | .param .b32 |
i32 (32 bit) | i32 (no change) | .param .b32 |
i64 (64 bit) | i64 (no change) | .param .b64 |
f16 (16 bit) | i32 (32 bit) | .param .b32 |
f32 (32 bit) | f32 (no change) | .param .f32 |
f64 (64 bit) | f64 (no change) | .param .f64 |
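A compact model of the widening rules above (an illustrative helper that reproduces the table, not a recovered function; mid-size widths like i48 are omitted for brevity):

```python
def widen_param(width: int, is_float: bool):
    """Widen a scalar for .param passing: anything under 32 bits (including
    f16) becomes a .b32 slot; 32/64-bit values keep their width and type."""
    if width < 32:
        return 32, ".b32"
    if is_float:
        return width, f".f{width}"
    return width, f".b{width}"
```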
Vector Parameter Passing
Vector arguments use StoreV1/StoreV2/StoreV4 (opcodes 571--573) mapping to PTX st.param.b32, st.param.v2.b32, st.param.v4.b32 and their 64-bit variants. The element count determines the opcode:
| Opcode | Name | PTX | Description |
|---|---|---|---|
| 571 | StoreV1 | st.param.b32 / .b64 | Single element store |
| 572 | StoreV2 | st.param.v2.b32 / .v2.b64 | 2-element vector store |
| 573 | StoreV4 | st.param.v4.b32 / .v4.b64 | 4-element vector store |
For byval struct arguments, the lowering decomposes the aggregate into chunks that fit the largest available vector store. An 80-byte struct, for example, might be lowered as five StoreV4.b32 operations (5 x 4 x 4 = 80 bytes).
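The greedy chunking can be sketched as follows — a simplified model assuming 4-byte elements and ignoring alignment padding and sub-element tails; the function name is illustrative:

```python
def decompose_byval(size_bytes: int, elem: int = 4):
    """Decompose a byval aggregate into .param stores, preferring the
    widest available vector store: StoreV4 (16B), StoreV2 (8B), StoreV1 (4B).
    Returns (opcode name, byte offset) pairs in ascending offset order."""
    ops, off = [], 0
    for lanes in (4, 2, 1):
        chunk = lanes * elem
        while size_bytes - off >= chunk:
            ops.append((f"StoreV{lanes}", off))
            off += chunk
    return ops
```

An 80-byte struct decomposes into exactly five StoreV4 operations, matching the 5 x 4 x 4 = 80 example above.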
NVPTXISD DAG Node Opcodes
The complete set of NVPTXISD opcodes used in call lowering:
| Opcode | Name | Role |
|---|---|---|
| 315 | CallSeqBegin | Marks start of call parameter setup (maps to ISD opcode) |
| 316 | CallSeqEnd | Outer end-of-call marker (maps to ISD opcode) |
| 505 | DeclareParam | Declares a byval .param aggregate parameter |
| 506 | DeclareScalarParam | Declares a scalar .param parameter with width+alignment |
| 508 | DeclareRetScalarParam | Declares the return value .param parameter |
| 510 | CallDirect | Direct call with prototype |
| 511 | CallDirectNoProto | Direct call without prototype (old-style C) |
| 512 | CallIndirect | Indirect call (function pointer) with prototype |
| 513 | CallIndirectNoProto | Indirect call without prototype |
| 514 | CallStart | The actual call instruction |
| 515 | LoadRetParam | Loads return value from .param space |
| 517 | CallSeqEnd (inner) | Inner end-of-call marker |
| 518 | CallProto | Call prototype declaration (type signature) |
| 571--573 | StoreV1/V2/V4 | Stores to .param space |
Four Call Flavors
Call dispatch is selected by prototype availability and call directness:
| Opcode | Name | When Used |
|---|---|---|
| 510 | CallDirect | Direct call to a named function with a known prototype |
| 511 | CallDirectNoProto | Direct call without prototype (K&R C style, rare in CUDA) |
| 512 | CallIndirect | Function pointer call with known prototype |
| 513 | CallIndirectNoProto | Function pointer call without prototype |
In CUDA code, CallDirect (510) dominates because the vast majority of device function calls are direct with full prototypes. CallIndirect (512) appears when calling through __device__ function pointers. The no-prototype variants are legacy paths that may not be exercisable from CUDA C++ but are retained for C compatibility.
Libcall Generation
When the lowering needs to synthesize a library call (e.g., for __divdi3 software division), it attaches "nvptx-libcall-callee" metadata set to "true" on the callee. This metadata string was extracted from the binary at sub_3040BF0. The metadata tells later passes that the callee is a compiler-generated runtime helper rather than user code.
The primary helpers called from LowerCall:
| Helper | Role |
|---|---|
| sub_302F170 | Parameter marshaling setup |
| sub_3031480 | Argument type coercion |
| sub_3031850 | Scalar widening |
| sub_30351C0 | Struct decomposition for byval args |
| sub_303E700 | Return value handling |
DAG Combining
The DAG combiner runs three times during the SelectionDAG pipeline: once after initial DAG construction, once after type legalization, and once after operation legalization. The combiner consists of a target-independent framework and NVPTX-specific target hooks.
Target-Independent Combiner Framework
The combiner orchestrator (sub_F681E0, 65KB) manages the worklist-driven iteration over all DAG nodes:
```
function DAGCombine(dag):
    worklist = dag.allNodes()              // linked list iteration
    visited = SmallPtrSet()
    while worklist not empty:
        node = worklist.pop()
        if visited.count(node): continue
        visited.insert(node)               // sub_C8CA60 / sub_C8CC70
        result = visitNode(node)           // sub_F20C20
        if result != node:
            ReplaceAllUsesWith(node, result)   // sub_F162A0
            add users of result to worklist
            mark node dead
```
The worklist operates on the SDNode linked list. Nodes are processed via sub_C8CA60 (SmallPtrSet::count for visited check) and sub_C8CC70 (SmallPtrSet::insert with vector growth for worklist membership). The exclusion list at this + 64 (with count at this + 76) prevents certain nodes from being visited.
Global flag byte_4F8F8E8 enables verbose/debug tracing of the combining process.
Visitor: sub_F20C20
The per-node combine visitor (sub_F20C20, 64KB) implements six sequential optimization phases for each node:
Phase 1: Opcode-specific combine. Calls sub_100E380, the target-independent combine dispatcher, which switches on the node's opcode and applies algebraic simplifications (e.g., x + 0 -> x, x & -1 -> x, x * 1 -> x). For NVPTX, this also invokes the target-specific combine hook via vtable dispatch.
Phase 2: Known-bits narrowing. For nodes with constant operands, the combiner builds APInt masks and calls sub_11A3F30 (computeKnownBits / SimplifyDemandedBits) to narrow constants. When all high bits of a result are known-zero, the operation can be narrowed to a smaller type. Two global cl::opt flags gate this phase: qword_4F8B3C8 controls strict-FP known-bits combining, and qword_4F8B548 controls 2-operand reassociation.
Phase 3: Operand type-narrowing loop. For each operand, the combiner computes the legalized type, skips zero-constant operands, creates legalized replacements, and inserts SIGN_EXTEND/TRUNCATE cast nodes as needed. This handles the common case where an operation was originally on i64 but only uses the low 32 bits.
Phase 4: All-constant-operand fold. Detects when every operand is a ConstantSDNode (opcode 17) and calls sub_1028510 for full constant-fold evaluation. The constant check uses a 4x-unrolled loop for performance. The operand count is extracted via the 0x7FFFFFF mask from the packed SDNode header.
Phase 5: Division-by-constant strength reduction. Replaces division by power-of-two constants with shift+mask sequences via APInt shift/mask computation. Division by non-power-of-two constants uses the magic-number reciprocal multiplication technique: x / C becomes (x * M) >> shift where M is the multiplicative inverse.
Phase 6: Vector stride / reassociation patterns. Attempts associative FP decomposition via sub_F15980, with fast-math flag propagation when both sub-results are known non-negative. This handles patterns like (a + b) + c -> a + (b + c) when nsz and arcp flags permit.
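The Phase 5 magic-number rewrite for unsigned division can be sketched with the standard round-up construction. This is a simplified illustration, not the recovered code: LLVM's actual UnsignedDivisionByConstantInfo handles additional adjustment cases for divisors where the round-up multiplier alone is insufficient.

```python
def magic_udiv(d: int, bits: int):
    """Choose (m, s) so that x // d == (x * m) >> (bits + s) for 0 <= x < 2**bits.
    Round-up variant: s = ceil(log2 d), m = floor(2**(bits+s) / d) + 1.
    Exact when m*d - 2**(bits+s) <= 2**s; other divisors need an extra
    adjustment step that this sketch omits."""
    s = max(1, (d - 1).bit_length())       # ceil(log2 d) for d >= 2
    m = (1 << (bits + s)) // d + 1
    return m, s

def udiv_by_magic(x: int, m: int, s: int, bits: int) -> int:
    # Hardware realizes this as a widening multiply plus a right shift.
    return (x * m) >> (bits + s)
```

Power-of-two divisors skip this entirely and become plain shifts, as described above.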
ReplaceAllUsesWith: sub_F162A0
The combiner's RAUW implementation walks the use-list and hashes each user into a worklist map using the standard DenseMap infrastructure with LLVM-layer sentinels (-4096 / -8192). See Hash Table and Collection Infrastructure for the hash function and growth policy.
Supporting Combine Functions
| Function | Size | Role |
|---|---|---|
| sub_F0F270 | 25.5KB | Pattern matcher (STORE/BITCAST/CONSTANT) |
| sub_F24210 | 34.6KB | DAG simplification pass |
| sub_F2B940 | 29.8KB | Truncation/extension chain combines |
| sub_F29CA0 | 26.9KB | Node morphing / operand updating |
| sub_F27020 | 25KB | Specific operation combines |
| sub_F2D1B0 | 22.2KB | Comparison combines |
| sub_F2DD30 | 11.5KB | Shift combines |
| sub_F62E00 | 46.7KB | Address/memory operation combines |
| sub_F657D0 | 26.1KB | Vector operation combines |
| sub_F6C1B0 | 15.7KB | TokenFactor chain management |
SDNode Data Structure
The combiner manipulates SDNodes using these field offsets (reconstructed from access patterns throughout the combining code):
| Offset | Size | Field |
|---|---|---|
| -8 | 8 | Operand list pointer (when bit 6 of byte +7 is set) |
| 0 | 8 | First operand / use chain linked list |
| +4 | 4 | Packed: NumOperands (bits 0--26), Flags (bits 27--31) |
| +7 | 1 | Extra flags (bit 6 = has operand pointer at -8) |
| +8 | 8 | ValueType / MVT |
| +16 | 8 | Use chain (next user pointer, 0 if none) |
| +24 | 2 | Opcode (uint16_t) |
| +32 | 4 | Result type info |
| +36 | 4 | DebugLoc / location ID |
| +40 | 8 | Chain operand |
| +48 | 8 | Value pointer / type info |
| +72 | 4 | NumResults |
| +80 | 4 | Additional operand count / mask index |
Operand stride is 32 bytes. Access pattern: node - 32 * (node[+4] & 0x7FFFFFF) yields the first operand.
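A sketch of that address computation — a model of the recovered layout operating on plain integers, since the real code walks raw memory:

```python
OPERAND_STRIDE = 32          # recovered per-operand record size
NUM_OPS_MASK = 0x7FFFFFF     # low 27 bits of the packed word at node+4

def first_operand_addr(node_addr: int, packed_word: int) -> int:
    """Operands precede the node in memory:
    first operand = node - 32 * NumOperands."""
    return node_addr - OPERAND_STRIDE * (packed_word & NUM_OPS_MASK)
```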
NVPTX Target-Specific Combines: sub_33C0CA0
NVPTXTargetLowering::PerformDAGCombine (sub_33C0CA0, 62KB) provides NVPTX-specific algebraic optimizations. This function is called from the target-independent combiner framework via vtable dispatch. It receives an SDNode and returns either NULL (no transformation) or a replacement node.
The function calls sub_2FE8D10 (13x), sub_2FE6CC0 (12x), sub_30070B0 (14x), and sub_2D56A50 (9x), with 27 calls into sub_B2D*/B2C* for debug value builders.
A secondary NVPTX DAG combine function at sub_32EC4F0 (92KB) handles post-legalize optimization, operating after the main legalization pass. It calls into the same shared DAG construction helpers (sub_2FE3480, sub_2FE6750, sub_325F5D0, sub_3262090).
The NVIDIA-side DAGCombiner at sub_3425710 (142KB) includes debug tracing with "COVERED: " and "INCLUDED: " prefix strings, confirming it was built with NVIDIA's internal debug infrastructure. This function calls sub_C8D5F0 (31x for type action checks), sub_2E79000 (14x for value type access), and sub_3423E80 (8x for combine helper dispatch).
NVPTX Address Spaces
Address space constants appear throughout the SelectionDAG lowering. See Address Spaces for the master table and SelectionDAG Address Space Encoding for the backend-specific secondary encoding used in .param passing conventions.
In LowerCall, pointer arguments undergo addrspacecast to generic (AS 0) via sub_33F2D30. The pointer size for AS 5 follows a power-of-two encoding: sizes 1, 2, 4, 8, 16, 32, 64, 128 bytes map to codes 2, 3, 4, 5, 6, 7, 8, 9.
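That size encoding is simply log2(size) + 2. A sketch (the function name is illustrative):

```python
def as5_size_code(size_bytes: int) -> int:
    """Power-of-two sizes 1..128 bytes map to codes 2..9:
    code = log2(size) + 2. bit_length() of 2**k is k+1, so we add 1."""
    assert size_bytes != 0 and size_bytes & (size_bytes - 1) == 0
    return size_bytes.bit_length() + 1
```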
Address space handling permeates the entire lowering infrastructure. Functions sub_33067C0 (74KB), sub_331F6A0 (62KB), sub_331C5B0 (60KB), and sub_33D4EF0 (114KB) all contain address-space-aware logic for NVPTX memory operations, global address lowering, argument handling, and complex pattern matching respectively.
Intrinsic Lowering
The intrinsic lowering mega-switch (sub_33B0210, 343KB) dispatches over 200 distinct NVPTX intrinsic IDs into DAG node construction. The switch covers intrinsic IDs 0--0x310 in the main body, with high-ID ranges for texture/surface operations extending to ID 14196 (0x3774). The function contains approximately 1,000 local variables and calls sub_338B750 (getValue helper) 195 times, sub_3406EB0 (getNode) 116 times, and sub_337DC20 (setValue) 100 times.
Key intrinsic categories:
| Category | ID Range | Handler | Count |
|---|---|---|---|
| Math ops (rounding modes) | 2, 10, 12, 20, 21, 63, ... | sub_33FA050 | ~20 |
| WMMA / MMA (tensor core) | 0xA4--0xA8, 0x194--0x1EC | sub_33A64B0 | 95 |
| Texture sampling | 0x5D--0x8D | sub_33A4350 | 50 |
| Surface read/write | 0x8E--0x90 | sub_33A3180 | 3 |
| Warp shuffle | 0xD4, 0xD5, 0xDF, 0xE0 | sub_33FAF80 | 4 |
| Vote intrinsics | 0xE1--0xE6 | sub_339CDA0 / sub_339E310 | 6 |
| Atomics | 0xEB--0xF8 | sub_3405C90 / sub_340AD50 | ~14 |
| cp.async / TMA | 0x175--0x17C | sub_33AD3D0 | ~8 |
| MMA sm90+ (Hopper wgmma) | 0x183--0x191 | sub_33AC8F0 | 15 |
| Texture/surface handle | 10578 | inline | nvvm_texsurf_handle |
The WMMA/MMA block is the largest single-handler group: 95 consecutive case labels (intrinsic IDs 404--492) all delegate to sub_33A64B0, covering wmma.load, wmma.store, wmma.mma, mma.sync (sm70+), mma.sp (sm80+), and mma.f64 (sm90+). The warp shuffle intrinsics map to specific NVPTXISD opcodes: __shfl_down_sync to 277, __shfl_up_sync to 275, __shfl_xor_sync to 278, and __shfl_sync to 276.
Math intrinsics encode explicit rounding modes via an inner opcode table. For example, ADD_RN (round-to-nearest) maps to opcode 252, ADD_RZ (round-toward-zero) to 249, ADD_RM (round-toward-minus-infinity) to 245, and ADD_RP (round-toward-plus-infinity) to 270.
NVIDIA-specific intrinsic IDs include high-value entries: ID 10578 handles nvvm_texsurf_handle, IDs 8920/8937--8938 handle texture/surface operations. The overflow path at sub_33A1E80 handles intrinsic IDs that fall outside the main switch range.
NVPTX computeKnownBits
The NVPTX target provides a custom computeKnownBitsForTargetNode implementation (sub_33D4EF0, 114KB) that propagates bit-level information through 112 opcode cases in the SelectionDAG. This function calls sub_969240 (SDNode accessor) 399 times and itself recursively 99 times. It supports demanded-bits pruning via an APInt mask parameter and caps recursion at depth 6 (matching LLVM's default MaxRecursionDepth).
Notable NVPTX-specific known-bits behaviors:
- Memory operation type inference (opcode 0x12A): Propagates known bits through load operations based on extension mode (zero-extend, sign-extend, any-extend) encoded in the node flags byte at bits [2:3]. Handles ld.global.u32 vs ld.global.s32 vs ld.global.b32 distinctions.
- Texture/surface fetch results (opcodes 0x152--0x161): Sets known bits in the range [elementSize..width] based on the result type, encoding the known bit-width of texture fetch results.
- Constant pool integration (opcode 0x175): Uses LLVM's ConstantRange class to derive known bits from constant pool values, chaining fromKnownBits through intersect to toKnownBits.
- Target fence at opcode 499 (ISD::BUILTIN_OP_END): All opcodes above 499 delegate to the TargetLowering virtual method; below that, the generic ISD switch handles everything.
APInt values with width at most 64 bits use inline storage; wider values trigger heap allocation. The constant 0x40 (64) appears hundreds of times as the inline/heap branch condition.
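The load-extension inference described above reduces to a simple mask computation: for a zero-extending load, every bit above the loaded width is known zero. A sketch in the spirit of LLVM's KnownBits pair (names and register width are illustrative):

```python
def known_bits_zext_load(load_width: int, reg_width: int = 32):
    """ld.global.u8/u16 style: bits [load_width, reg_width) are known zero.
    Returns (known_zero, known_one) masks; nothing is known one for a
    zero-extending load of unknown data."""
    full = (1 << reg_width) - 1
    low = (1 << load_width) - 1
    return full & ~low, 0
```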
The target-independent known-bits infrastructure at 0xF50000--0xF60000 includes:
| Function | Size | Role |
|---|---|---|
| sub_F5A610 | 36.7KB | computeKnownBits for generic ISD opcodes (depth limit at a4 == 48) |
| sub_F5F040 | 52.4KB | Extended known-bits with recursive expansion limit: (v74-1)*v77 > qword_4F8BF28 |
| sub_F5CD10 | 26.6KB | DAG combine using known-bits results |
| sub_F54050 | 17.8KB | Known-bits for multi-result nodes |
| sub_F54F50 | 10.7KB | Known-bits for vector operations |
Global qword_4F8BF28 is a threshold that limits recursive known-bits expansion to prevent combinatorial blowup.
Inline Assembly Lowering
Inline assembly lowering spans two locations in the binary: the target-independent SelectionDAGBuilder::visitInlineAsm at sub_2079C70 (83KB) and the NVPTX-specific constraint handler at sub_338BA40 (79KB).
Target-Independent Framework: sub_2079C70
The inline assembly visitor (sub_2079C70, 83KB, 2,797 lines) lowers LLVM IR asm statements into ISD::INLINEASM (opcode 193) or ISD::INLINEASM_BR (opcode 51) DAG nodes. The function allocates an 8.4KB stack frame and processes operands in five phases:
1. Initialization. Parses the asm string and metadata. Looks up "srcloc" metadata on the asm instruction for error location reporting.
2. Constraint pre-processing. Each constraint string is parsed into a 248-byte record. Constraints are classified as: immediate ('i', flag 0x20000), memory ('m', flag 0x30000), or register (determined by target).
3. Tied operand resolution. Input operands tied to output operands (e.g., "=r" and "0") are matched and validated for type compatibility. Diagnostic: "inline asm not supported yet: don't know how to handle tied indirect register inputs".
4. Per-operand lowering. Each operand is lowered to an SDValue. Register operands go through TargetLowering::getRegForInlineAsmConstraint() (virtual dispatch). Diagnostics: "couldn't allocate output register for constraint '", "couldn't allocate input reg for constraint '".
5. DAG node finalization. All operands are assembled into an INLINEASM SDNode with chain and flag operands.
The function uses a 16-entry inline operand buffer (7,088 bytes on stack), reflecting the assumption that CUDA inline asm rarely exceeds 16 operands. Each operand working structure is 440 bytes. Overflow triggers heap reallocation via sub_205BBA0.
Diagnostic strings found in the binary:
| String | Condition |
|---|---|
| "couldn't allocate output register for constraint '" | Register constraint unsatisfiable |
| "couldn't allocate input reg for constraint '" | Input constraint unsatisfiable |
| "Don't know how to handle indirect register inputs yet..." | Indirect tied operand |
| "inline asm error: This value type register class is not natively supported!" | Unsupported type for register |
| "invalid operand for inline asm constraint '" | Generic operand mismatch |
| "Indirect operand for inline asm not a pointer!" | Non-pointer indirect operand |
NVPTX Constraint Handler: sub_338BA40
The NVPTX-specific inline asm constraint handler (sub_338BA40, 79KB) is part of the NVPTXTargetLowering class. It processes constraint strings specific to the NVPTX backend:
- Simplified constraint model. NVPTX recognizes single-character 'i' (immediate) and 'm' (memory) constraints through sub_2043C80, avoiding the complex multi-character constraint tables used by x86/ARM backends.
- Register class mapping. The function maps MVT values to NVPTX register classes using a 544-case switch (confirmed at sub_204AFD0, 60KB): MVTs 0x18--0x20 map to Int32Regs, 0x21--0x28 to Int64Regs, 0x29--0x30 to Float32Regs, 0x31--0x36 to Float64Regs, 0x37 to Int128Regs, 0x56--0x64 to 2-element vector registers.
- Convergent flag handling (bit 5). Ensures barrier semantics are preserved for inline asm, checked via operand bundle attribute or function-level convergent.
- Scalar-to-vector conversion. The string "non-trivial scalar-to-vector conversion" indicates that the handler attempts to pack scalar inline-asm results into vector register classes when the output constraint specifies a vector type.
Additional support at sub_2046E60 emits ", possible invalid constraint for vector type" when a vector type is used with an incompatible constraint.
ISel Pattern Matching Driver
The instruction selection driver (sub_3090F90) manages the top-level selection loop rather than performing pattern matching directly. It builds a cost table for function arguments using a hash table with hash function key * 37, processes the topological worklist using a min-heap priority queue, and calls the actual pattern matcher (sub_308FEE0) for each node.
The driver maintains an iteration budget of 4 * numInstructions * maxBlockSize to guard against infinite loops. When the budget is exceeded, selection terminates for the current function.
For complete ISel detail, see ISel Pattern Matching & Instruction Selection.
NVPTXTargetLowering Initialization
The NVPTXTargetLowering constructor (sub_3056320, 45KB + sub_3314670, 73KB) populates the legalization action tables that drive all subsequent SelectionDAG processing. It calls sub_302E500, sub_302F030, sub_3030230, and sub_3034720 to register legal/custom/expand actions for each {ISD_opcode, MVT} pair.
Key aspects of the initialization:
- Subtarget-gated feature checks. Offsets +2843, +2584, and +2498 in the subtarget object encode SM-version-dependent feature availability. These control which types and operations are marked Legal vs. Custom vs. Expand.
- Vector support. NVPTX has limited native vector support. Most vector operations are marked Custom or Expand, forcing them through the custom lowering at sub_32E3060.
- Atomic support. The string "vector atomics not supported on this architecture!" at sub_3048C30 confirms SM-version-gated vector atomic support, likely SM 90+ (Hopper) or SM 100+ (Blackwell).
- Address space assertions. AS values (generic=0, global=1, shared=3, const=4, local=5) are encoded directly into the legalization tables, with different legal operation sets per address space.
What Upstream LLVM Gets Wrong for GPU
Upstream LLVM's SelectionDAG framework was designed for CPU ISAs where register classes overlap and share a unified physical register file. The NVPTX target breaks these assumptions at every level:
- Upstream assumes register classes interfere with each other. On x86, GR32 is a sub-register of GR64; allocating eax constrains rax. The interference graph, coalescing, and copy elimination infrastructure all assume overlapping classes. NVPTX has nine completely disjoint classes (%r, %f, %fd, %p, etc.) with zero cross-class interference. The DAG's register pressure tracking, copy coalescing hints, and class constraint propagation solve a problem that does not exist on this target.
- Upstream assumes function calls are cheap register shuffles. CPU calling conventions move arguments through registers (rdi, rsi, etc.) or a stack backed by L1 cache. NVPTX function calls go through the .param address space with explicit DeclareParam/st.param/ld.param sequences -- O(n) memory operations per argument. The LowerCall function in cicc is 88KB (vs. upstream's few KB) because it must handle four call flavors, monotonic .param naming, and "nvptx-libcall-callee" metadata for synthesized calls.
- Upstream assumes a small set of intrinsics. Upstream NVPTX intrinsic lowering covers approximately IDs 0--300. CICC's intrinsic mega-switch at sub_33B0210 (343KB) handles IDs up to 14196, covering cp.async, TMA, WGMMA, and the full SM 90/100 tensor operation set. The upstream framework's assumption that intrinsic lowering is a small switch case is off by two orders of magnitude.
- Upstream assumes vector types are natively supported. CPU targets have native vector registers (XMM/YMM/ZMM, NEON Q-registers). NVPTX has no native vector registers -- most vector operations are marked Custom or Expand, forcing them through 111KB of custom lowering at sub_32E3060. The "legalize then select" pipeline spends most of its time decomposing vectors that never should have been formed.
- Upstream assumes known-bits propagation is a small target hook. Upstream NVPTX's computeKnownBitsForTargetNode handles fewer than 20 opcodes. CICC's version at sub_33D4EF0 (114KB, 112 opcode cases) propagates bits through texture fetches, address space loads, and NVPTX-specific operations -- a 50x expansion that upstream's hook interface was never designed to support cleanly.
Differences from Upstream LLVM
The NVPTX SelectionDAG backend in cicc v13.0 diverges from upstream LLVM NVPTX in several structural and behavioral ways. This section catalogs the known differences.
Structural Divergences
Monolithic type legalizer. Upstream LLVM splits type legalization across four source files (LegalizeIntegerTypes.cpp, LegalizeFloatTypes.cpp, LegalizeVectorTypes.cpp, LegalizeTypes.cpp). In cicc, all four are collapsed into a single 348KB function (sub_20019C0), likely an LTO artifact. The behavioral result is identical, but the code layout makes the function nearly impossible to patch incrementally.
Dual-address ISel infrastructure. The NVPTX lowering code exists at two address ranges (0x32XXXXX and 0x33XXXXX), with functions at sub_32E3060 (LowerOperation) and sub_3377410 (secondary dispatch) forming a two-level dispatch. Upstream NVPTX uses a single LowerOperation method. The binary has a secondary overflow path for intrinsic IDs that fall outside the main switch range.
142KB NVPTX DAGCombiner. The function sub_3425710 includes "COVERED:" and "INCLUDED:" debug trace strings not present in any upstream LLVM release. This is NVIDIA internal instrumentation for tracking combine coverage during development.
Two inline asm subsystems. The target-independent visitInlineAsm at sub_2079C70 (83KB) and the NVPTX-specific constraint handler at sub_338BA40 (79KB) total 162KB. The upstream NVPTX inline asm support is approximately 200 lines of code. The cicc version is vastly more complex, likely handling NVIDIA-internal PTX inline asm patterns.
Behavioral Divergences
Calling convention. Upstream LLVM NVPTX uses a simplified LowerCall that handles only the standard .param space protocol. CICC's sub_3040BF0 (88KB) adds "nvptx-libcall-callee" metadata for synthesized libcalls, monotonic sequence counters for unique .param names, and four call flavors (with/without prototype x direct/indirect). The upstream has two flavors.
Intrinsic count. The cicc intrinsic lowering switch (sub_33B0210, 343KB) handles intrinsic IDs up to 14196 (0x3774), with dedicated handlers for cp.async/TMA and WGMMA instructions. Upstream LLVM's NVPTX intrinsic lowering covers approximately IDs 0--300. The extended range covers SM 90 (Hopper) and SM 100 (Blackwell) tensor operations.
Vector shuffle lowering. The three-level shuffle lowering (identity detection, BitVector tracking, BUILD_VECTOR fallback) is more sophisticated than upstream NVPTX, which typically scalarizes all shuffles unconditionally.
Atomic scope awareness. CICC's atomic lowering at sub_3048C30 (86KB) supports CTA/GPU/SYS scope atomics with SM-version gating. Upstream LLVM NVPTX handles basic atomics but lacks the full scope hierarchy.
Known-bits propagation. The NVPTX computeKnownBitsForTargetNode at sub_33D4EF0 (114KB, 112 opcode cases, 399 SDNode accesses, 99 recursive calls) is far more extensive than the upstream version, which typically handles fewer than 20 target-specific opcodes. The cicc version propagates bits through texture fetches, address space loads, and NVPTX-specific operations.
PerformDAGCombine depth. The NVPTX-specific combine at sub_33C0CA0 (62KB) plus the post-legalize combine at sub_32EC4F0 (92KB) total 154KB. Upstream NVPTXISelLowering::PerformDAGCombine is approximately 2KB.
Address space 101. CICC uses address space 101 as an alternative .param encoding (seen in sub_33067C0), which does not exist in upstream LLVM NVPTX. This may be an internal convention for distinguishing kernel .param from device-function .param.
Unchanged from Upstream
The following components appear to be stock LLVM with no NVIDIA modifications:
- SelectionDAG core infrastructure at `0xF05000`--`0xF70000` (combining, known-bits, node management)
- DAG node hashing with `((a3 >> 4) ^ (a3 >> 9)) & (capacity - 1)` at `sub_F4CEE0`
- Constrained FP intrinsic lowering at `sub_F47010` (36KB, `"round.tonearest"`, `"fpexcept.ignore"`)
- `ReplaceAllUsesWith` implementation at `sub_F162A0`
- All SDNode creation, deduplication, and lifecycle management
Function Map
| Function | Address | Size |
|---|---|---|
| SelectionDAGLegalize::LegalizeOp dispatcher (~100 opcodes) | sub_1FCE100 | 91KB |
| SelectionDAGLegalize action dispatch (967 cases) | sub_1FFB890 | 137KB |
| Legalization worklist management | sub_1FF5010 | -- |
| ExpandNode fallback | sub_1FF6F70 | -- |
| DAGCombiner::visitNode (6-phase per-node combine) | sub_F20C20 | 64KB |
| DAGCombiner::combine orchestrator (worklist management) | sub_F681E0 | 65KB |
| ReplaceAllUsesWith (hash: ((id >> 9) ^ (id >> 4))) | sub_F162A0 | -- |
| Combine pattern matcher (STORE/BITCAST/CONSTANT) | sub_F0F270 | 25.5KB |
| Target-independent opcode-specific combine dispatcher | sub_100E380 | -- |
| All-constant-operand fold evaluation | sub_1028510 | -- |
| Vector stride / reassociation combine | sub_F15980 | -- |
| Generic computeKnownBits | sub_F5A610 | 36.7KB |
| Extended known-bits (recursive expansion limit) | sub_F5F040 | 52.4KB |
| SelectionDAG::getNode / CSE hash table | sub_F4CEE0 | 41.3KB |
| DAG node builder (operand/result setup) | sub_F49030 | 38.2KB |
| Constrained FP intrinsic lowering | sub_F47010 | 36.4KB |
| NVPTXTargetLowering::LowerOperation dispatcher | sub_32E3060 | 111KB |
| LowerOperation secondary dispatch (overflow) | sub_3377410 | 75KB |
| NVPTX custom type promotion | sub_32A1EF0 | 109KB |
| NVPTX post-legalize DAG combine | sub_32EC4F0 | 92KB |
| NVPTX vector operation splitting | sub_32FE970 | 88KB |
| NVPTX load/store lowering | sub_32D2680 | 81KB |
| NVPTX integer/FP legalization | sub_32983B0 | 79KB |
| NVPTX intrinsic lowering (tex/surf) | sub_32B8A20 | 71KB |
| NVPTX vector operation lowering | sub_32A9030 | 55KB |
| NVPTX addrspacecast / pointer lowering | sub_32C3760 | 54KB |
| NVPTX conditional/select lowering | sub_32BE8D0 | 54KB |
| NVPTX special register lowering | sub_32B6540 | 50KB |
| NVPTXTargetLowering::PerformDAGCombine | sub_33C0CA0 | 62KB |
| NVPTX DAGCombiner with "COVERED"/"INCLUDED" tracing | sub_3425710 | 142KB |
| NVPTXTargetLowering::LowerCall | sub_3040BF0 | 88KB |
| NVPTX atomic operation lowering | sub_3048C30 | 86KB |
| NVPTXTargetLowering constructor (action setup) | sub_3056320 | 45KB |
| Type legalization table population | sub_3314670 | 73KB |
| Intrinsic lowering mega-switch | sub_33B0210 | 343KB |
| NVPTX computeKnownBitsForTargetNode | sub_33D4EF0 | 114KB |
| NVPTX inline asm constraint handler | sub_338BA40 | 79KB |
| SelectionDAGBuilder::visitInlineAsm | sub_2079C70 | 83KB |
| NVPTX visitNVVMTexSurf handler | sub_2077400 | 20KB |
| NVPTX argument passing / type coercion | sub_2072590 | 38KB |
| NVPTXDAGToDAGISel::Select driver | sub_3090F90 | 91KB |
| Address space / memory operation support | sub_33067C0 | 74KB |
| Global address lowering | sub_331F6A0 | 62KB |
| Formal arguments / return lowering | sub_3349730 | 82KB |
| Call lowering (visitCall / LowerCallTo) | sub_332FEA0 | 79KB |
Reimplementation Checklist
- NVPTXTargetLowering with legality tables. Populate the 2D action table at offset +2422 (259-byte row stride, indexed by `259 * VT + opcode`) with per-SM-version legal/custom/expand/promote actions for all ISD opcodes and NVPTX-specific opcodes. Include the condition-code action table at offset +18112 and the SM-gated type legality rules (f16 on SM 53+, v2f16 on SM 70+, bf16 on SM 80+).
- LowerOperation dispatcher (111KB equivalent). Implement the master `LowerOperation` switch dispatching ~3,626 lines of GPU-specific lowering for loads, stores, calls, atomics, vector operations, and address space casts, including the `.param`-space calling convention with DeclareParam/StoreV1-V4/LoadRetParam sequences.
- Intrinsic lowering mega-switch (343KB equivalent). Build the intrinsic lowering function covering 200+ CUDA intrinsic IDs (up to ID 14196/0x3774), organized as a jump table with per-intrinsic lowering handlers for tensor core, warp, surface/texture, and math intrinsics.
- PerformDAGCombine for NVPTX. Implement the NVPTX-specific DAG combines (62KB) that run after operation legalization, including load/store vectorization (offset-based coalescing with sorting for `ld.v2`/`ld.v4`/`st.v2`/`st.v4` detection), NVPTX-specific algebraic simplifications, and the "COVERED"/"INCLUDED" tracing infrastructure.
- ISel::Select pattern matching (91KB equivalent). Implement the top-down instruction selection driver that visits DAG nodes in topological order, matching against NVPTX-specific patterns via opcode-indexed tables, with special handling for tensor core instructions, inline assembly constraints, and multi-result nodes.
- computeKnownBits for NVPTX (114KB). Implement the NVPTX-specific known-bits analysis covering `ctaid`, `tid`, `ntid`, address space pointer width constraints, and GPU-specific intrinsic range information to enable downstream optimization.
Cross-References
- Type Legalization -- detailed 348KB monolith documentation
- ISel Pattern Matching -- instruction selection patterns and matching
- Register Allocation -- follows ISel in the pipeline
- Address Spaces -- consolidated AS reference
- Register Classes -- NVPTX register class definitions
- NVPTX Opcodes -- MachineInstr opcode reference
- NVPTXTargetMachine -- target machine and TTI hooks
- Emission -- PTX emission from MachineInstrs
- Tensor Core Intrinsics -- WMMA/MMA intrinsic detail
- Surface/Texture Intrinsics -- tex/surf lowering
Type Legalization
Prerequisites: Familiarity with SelectionDAG, NVPTX register classes, and LLVM type system basics. Understanding of the compilation pipeline up to instruction selection is assumed.
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
Type legalization is the SelectionDAG phase that rewrites every DAG node whose result or operand type is illegal for the target into equivalent sequences of legal-type operations. In upstream LLVM this logic spans four source files (LegalizeTypes.cpp, LegalizeIntegerTypes.cpp, LegalizeFloatTypes.cpp, LegalizeVectorTypes.cpp) totaling roughly 16,000 lines. In CICC v13.0, NVIDIA ships all of it as a single 348KB monolithic function -- sub_20019C0 -- the largest function in the SelectionDAG address range and among the largest in the entire binary. Operation legalization follows in a separate 169KB function (sub_1FFB890), and vector split/scalarize dispatchers fan out into an additional 25+ worker functions.
The monolithic structure is either an LTO inlining artifact (all four upstream .cpp files collapsed by link-time optimization) or a deliberate choice for branch-prediction locality. The functional behavior is a faithful reproduction of upstream LLVM's DAGTypeLegalizer, but the legality tables, legal-type set, and vector legalization rules are heavily NVPTX-specific.
| Type legalizer monolith | sub_20019C0 (348KB, 10,739 lines) |
| Operation legalizer | sub_1FFB890 (169KB) |
| SplitVectorResult | sub_2029C10 (dispatcher, 190 cases) |
| SplitVectorOperand | sub_202E5A0 (dispatcher, 157 cases) |
| ScalarizeVectorResult | sub_2036110 |
| ScalarizeVectorOperand | sub_2035F80 |
| WidenVector | sub_2036AE0 (31KB, limited NVPTX usage) |
| ExpandIntegerResult | sub_201BB90 (75KB, 632 case labels) |
| PromoteIntegerResult | sub_2000100 (45KB) |
| PerformExpensiveChecks | sub_2010FB0 (62KB, debug verifier) |
| NVPTXTargetLowering init | sub_3314670 (73KB, table population) |
| Upstream | LegalizeTypes.cpp, LegalizeIntegerTypes.cpp, LegalizeFloatTypes.cpp, LegalizeVectorTypes.cpp |
Pipeline Position
Type legalization runs as the first major SelectionDAG transformation after the initial DAG is built by SelectionDAGBuilder (sub_2081F00). The full sequence:
- SelectionDAGBuilder converts LLVM IR to an initial DAG with potentially illegal types
- DAG Combiner (`sub_F20C20`) runs initial combines
- DAGTypeLegalizer (`sub_20019C0`) iterates until all types are legal -- this page
- LegalizeDAG (`sub_1FFB890`) legalizes operations on now-legal types
- DAG Combiner runs again to clean up
- Instruction selection (`sub_3090F90`) pattern-matches the final legal DAG
The type legalizer iterates to a fixpoint: each pass may create new nodes with illegal types (e.g., splitting a vector creates two half-width vectors that may themselves be illegal), so the worklist loops until every node in the DAG has only legal result and operand types.
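The fixpoint shape can be modeled with a tiny worklist: processing one item may enqueue new, still-illegal items, and the loop runs until the worklist drains. A minimal sketch -- the bit widths and the "legal if at most 64 bits" rule here are illustrative stand-ins, not CICC's actual criteria:

```c
#include <assert.h>

/* Toy fixpoint worklist: an item is a bit width; "legal" means <= 64 bits.
   Splitting a too-wide item enqueues two halves, which may themselves be
   illegal and get re-processed -- the same shape as the type legalizer's
   iterate-until-all-legal loop. Returns the number of split steps taken. */
static unsigned legalize_widths(unsigned *work, unsigned n, unsigned cap) {
    unsigned steps = 0;
    while (n > 0) {
        unsigned w = work[--n];        /* pop one pending value */
        if (w <= 64)
            continue;                  /* already legal: nothing to do */
        ++steps;
        if (n + 2 <= cap) {            /* split: push both halves back */
            work[n++] = w / 2;
            work[n++] = w / 2;
        }
    }
    return steps;
}
```

Starting from a single 256-bit item, the loop performs three splits (256 into 2x128, then each 128 into 2x64) before every queued width is legal.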
NVPTX Legal Type Model
The legal type set is defined in the NVPTXTargetLowering constructor (sub_3314670, 73KB) which populates the action table at offset +2422. NVPTX has a narrow set of legal types dictated by the PTX register file:
| Register Class | Legal MVTs |
|---|---|
| Int1Regs (%p) | i1 |
| Int16Regs (%rs) | i16 |
| Int32Regs (%r) | i32 |
| Int64Regs (%rd) | i64 |
| Float32Regs (%f) | f32 |
| Float64Regs (%fd) | f64 |
| Int16HalfRegs (%h) | f16, bf16 |
| Int32HalfRegs (%hh) | v2f16, v2bf16, v2i16, v4i8 |
| Int128Regs (%rq) | i128 (SM 70+) |
For the complete register class table (vtable addresses, PTX types, encoded IDs, copy opcodes) see Register Classes.
The critical constraint: Int32HalfRegs is the only vector register class. It holds exactly 32 bits of packed data. The only legal vector types are those that pack into 32 bits:
- `v2f16` -- two `f16` values in one 32-bit register
- `v2bf16` -- two `bf16` values (SM 80+)
- `v2i16` -- two `i16` values in one 32-bit register
- `v4i8` -- four `i8` values in one 32-bit register
Every other vector type (v4f32, v2f32, v8i32, v4f16, v2f64, etc.) is illegal and must be split, scalarized, or expanded during type legalization. There is no packed float32 SIMD on NVPTX -- this is a fundamental architectural constraint.
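The "exactly 32 bits of packed data" constraint is easiest to see with integer lanes: a v2i16 value is one 32-bit register, and a packed add is two independent 16-bit adds on its halves -- the i16 analogue of what `add.f16x2` provides for f16 lanes. A sketch of the lane model (not CICC code):

```c
#include <assert.h>
#include <stdint.h>

/* Two 16-bit lanes packed into one 32-bit register, Int32HalfRegs-style. */
static uint32_t pack2x16(uint16_t lo, uint16_t hi) {
    return (uint32_t)lo | ((uint32_t)hi << 16);
}

/* Lane-wise add on the packed representation: each half wraps
   independently; no carry crosses the lane boundary. */
static uint32_t add_v2i16(uint32_t a, uint32_t b) {
    uint16_t lo = (uint16_t)(a + b);                  /* low lane  */
    uint16_t hi = (uint16_t)((a >> 16) + (b >> 16));  /* high lane */
    return pack2x16(lo, hi);
}
```

Note the high-lane sum deliberately excludes any carry out of the low lane; that independence is what makes the packed form equivalent to two scalar operations.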
SM-Gated Type Legality
The legal type set changes with the SM version. The constructor at sub_3314670 queries subtarget features and conditionally marks types legal or illegal:
| SM Range | Legal Types Added | Legalization Change |
|---|---|---|
| SM < 53 | (base: i1, i16, i32, i64, f32, f64) | f16 ops promoted to f32; no legal vectors |
| SM 53--69 | Scalar f16 | v2f16 legal for ld/st but packed arithmetic is Custom/Expand |
| SM 70+ | v2f16 packed arithmetic, i128 | f16x2 PTX instructions (add.f16x2, mul.f16x2, fma.rn.f16x2) |
| SM 80+ | v2bf16 | bf16x2 PTX instructions |
| SM 100+ | e2m1x2 (FP4), e2m3x2 (FP6), e3m2x2 (FP6), ue8m0x2 | Additional packed narrow FP types for tensor core feeders |
On SM 70+, v2f16 operations marked Legal or Custom in the action table map directly to packed PTX instructions, delivering 2x throughput versus scalarized f16. This is why CUDA __half2 operations are efficient: the type stays packed through the entire pipeline. In contrast, float4 is always fully scalarized to four independent f32 operations on every SM generation.
The Legality Table
Primary Action Table (offset +2422)
The core data structure is a 2D array inside NVPTXTargetLowering:
action = *(uint8_t *)(TLI + 259 * VT + opcode + 2422)
Where:
- TLI = pointer to the `NVPTXTargetLowering` object (loaded from `this->TLI` at `a1[1]`)
- VT = `SimpleVT` enum value (1--10 for scalar types, 14--109 for vector types)
- opcode = ISD opcode (0--258), capped at `0x102` by a guard check
- 259 = row stride (256 generic opcodes + 3 metadata bytes per VT row)
The action byte encodes:
| Value | Action | Meaning |
|---|---|---|
| 0 | Legal | Node is natively supported -- return immediately |
| 1 | Custom | Call NVPTXTargetLowering::LowerOperation (vtable slot #164, offset +1312) |
| 2 | Expand | Call LegalizeTypes, then ExpandNode (sub_1FF6F70) as fallback |
| 3 | LibCall | Call ExpandNode directly for library-call substitution |
| 4 | Promote | Find a larger legal type and rebuild the node at that type |
The legality check uses (action & 0xFB) == 0 as the "legal" predicate. This means bit 2 is a don't-care -- a node with action byte 0x04 is still treated as legal in certain fast-path checks, which is the standard LLVM encoding where bit 2 flags "custom-but-legal" operations.
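The lookup and the masked legality predicate can be restated directly. In this sketch the +2422 base is absorbed into a plain 2D array and the table contents are illustrative (nothing here is dumped from the binary); the 259-byte stride and the `0xFB` mask match the recovered formula:

```c
#include <assert.h>
#include <stdint.h>

enum { Legal = 0, Custom = 1, Expand = 2, LibCall = 3, Promote = 4 };

/* [VT][opcode] action bytes; zero-initialized entries read as Legal.
   The (9, 56) entry below is a made-up example. */
static uint8_t action_table[110][259] = {
    [9] = { [56] = Custom },
};

/* Equivalent of *(uint8_t *)(TLI + 259 * VT + opcode + 2422). */
static uint8_t get_operation_action(unsigned vt, unsigned opcode) {
    return action_table[vt][opcode];
}

/* Fast-path legality predicate: bit 2 (0x04) is a don't-care, so a
   Promote byte (0x04) still reads as "legal" here. */
static int is_legal_action(uint8_t action) {
    return (action & 0xFB) == 0;
}
```

The last assertion below is the interesting one: `0x04` passes the fast-path check even though it is not literally `Legal`, exactly as described above.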
Type-Supported Flag Array (offset +120)
A second structure at TLI + 8*VT + 120 is a pointer array: non-null means the type VT is natively supported by the target. This provides a fast "is this type legal at all?" check before the per-opcode lookup.
Promotion Action Table (offset +2681)
A 1D table indexed by opcode only (no VT dimension):
action = *(uint8_t *)(TLI + opcode + 2681)
Used for four specific opcodes: BSWAP (43), CTLZ (44), CTTZ (45), and BITREVERSE (199). Also used for opcode 204 (CONCAT_VECTORS) when the operand type is zero. This table encodes whether these operations should be promoted regardless of operand type.
FSINCOS Action Table (offset +3976)
Another 1D table for FSINCOS (opcode 211):
action = *(uint8_t *)(TLI + opcode + 3976)
FSINCOS has unique legalization requirements because it produces two results (sin and cos simultaneously).
Condition Code Action Table (offset +18112)
A packed 4-bit nibble table for condition-code-dependent operations (FP_TO_SINT, FP_TO_UINT, SELECT_CC, BR_CC):
base = (VT_id >> 3) + 15 * condcode_type + 18112
action = (*(uint32_t *)(TLI + base * 4 + 12) >> (4 * (VT_id & 7))) & 0xF
The 15-entry stride per condition code allows per-CC/per-VT legalization decisions. Each nibble stores a 4-bit action code, so two VT actions pack into one byte. This is the standard LLVM condition-code action encoding, but the table is populated with NVPTX-specific rules (e.g., PTX's limited set of comparison predicates determines which CCs are legal for which types).
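The nibble extraction can be sketched as a standalone decoder. The word/nibble arithmetic follows the recovered formula (with the +18112/+12 byte offsets absorbed into a plain array base); the table contents in the test are invented:

```c
#include <assert.h>
#include <stdint.h>

/* Each 32-bit word packs eight 4-bit actions. (vt_id >> 3) selects the
   word within a 15-word-per-condition-code stride; (vt_id & 7) selects
   the nibble inside that word. */
static uint8_t cc_action(const uint32_t *table, unsigned vt_id, unsigned cc) {
    unsigned word = (vt_id >> 3) + 15 * cc;           /* 15-entry CC stride */
    return (table[word] >> (4 * (vt_id & 7))) & 0xF;  /* extract one nibble */
}
```

Two VT actions per byte, eight per word: a dense encoding for a table indexed by both condition code and value type.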
SimpleVT Type Encoding
Types throughout the legalizer are encoded as a single byte, the SimpleVT enum:
| SimpleVT | Type | SimpleVT | Type |
|---|---|---|---|
| 0 | extended/custom | 7 | i128 |
| 1 | i1 | 8 | f16 |
| 2 | i2 (rare) | 9 | f32 |
| 3 | i8 | 10 | f64 |
| 4 | i16 | 14--55 | fixed-width vectors |
| 5 | i32 | 56--109 | scalable vectors |
| 6 | i64 | | |
The bitwidth-to-SimpleVT conversion pattern appears as a recurring code fragment at least 11 times in sub_20019C0:
```c
// Reconstructed from decompilation -- 11 instances in the function
if (bits == 32) VT = 5;                    // i32
else if (bits > 32) {
    VT = 6;                                // i64 tentative
    if (bits != 64) {
        VT = 0;                            // extended type
        if (bits == 128) VT = 7;           // i128
    }
} else {
    VT = 3;                                // i8 tentative
    if (bits != 8) VT = 4 * (bits == 16);  // i16 or 0
}
```
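Flattened into a function, the same mapping reads straight-line (this restates the fragment above; widths outside {8, 16, 32, 64, 128} fall to 0, the extended-type marker, just as in the decompilation):

```c
#include <assert.h>

/* Bit width -> SimpleVT id, per the recurring fragment:
   i8=3, i16=4, i32=5, i64=6, i128=7, anything else = 0 (extended). */
static unsigned simple_vt_for_bits(unsigned bits) {
    if (bits == 32) return 5;        /* i32 */
    if (bits > 32) {
        if (bits == 64)  return 6;   /* i64 */
        if (bits == 128) return 7;   /* i128 */
        return 0;                    /* extended type */
    }
    if (bits == 8) return 3;         /* i8 */
    return bits == 16 ? 4 : 0;       /* i16, or extended */
}
```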
The vector type range 14--109 maps to scalar element types through a ~100-case switch block that also appears six times in the function body:
| MVT Range | Scalar Element | Description |
|---|---|---|
| 14--23 | i2 (VT 2) | Fixed-width v2i2..v1024i2 |
| 24--32 | i8 (VT 3) | Fixed-width v2i8..v256i8 |
| 33--40 | i16 (VT 4) | Fixed-width v2i16..v64i16 |
| 41--48 | i32 (VT 5) | Fixed-width v2i32..v64i32 |
| 49--54 | i64 (VT 6) | Fixed-width v2i64..v32i64 |
| 55 | i128 (VT 7) | Fixed-width v2i128 |
| 56--61 | i2 (VT 2) | Scalable nxv2i2..nxv64i2 |
| 62--67 | i8 (VT 3) | Scalable nxv2i8..nxv64i8 |
| 68--73 | i16 (VT 4) | Scalable nxv2i16..nxv64i16 |
| 74--79 | i32 (VT 5) | Scalable nxv2i32..nxv64i32 |
| 80--85 | i64 (VT 6) | Scalable nxv2i64..nxv64i64 |
| 86--88 | f16 (VT 8) | Scalable nxv2f16..nxv8f16 |
| 89--93 | f32 (VT 9) | Scalable nxv2f32..nxv32f32 |
| 94--97 | f64 (VT 10) | Scalable nxv2f64..nxv16f64 |
| 98--100 | f16 (VT 8) | Fixed-width v2f16..v8f16 (additional) |
| 101--105 | f32 (VT 9) | Fixed-width v2f32..v32f32 (additional) |
| 106--109 | f64 (VT 10) | Fixed-width v2f64..v16f64 (additional) |
This switch implements getVectorElementType() on the decompiled SimpleVT enum. Its six-fold repetition in the monolith accounts for a significant fraction of the function's 348KB size.
The Four Legalization Actions
Promote (Type Widening)
Promotion widens a narrow type to the nearest legal register width. The pattern is consistent across integer and FP promotion:
```
promoted_vt = TLI.getTypeToPromoteTo(opcode, VT)              // sub_1F40B60
extended    = DAG.getNode(ANY_EXTEND, DL, promoted_vt, input) // opcode 143
result      = DAG.getNode(original_op, DL, promoted_vt, extended, ...)
truncated   = DAG.getNode(TRUNCATE, DL, original_vt, result)  // opcode 145
```
For integer promotion, ANY_EXTEND (opcode 143) or ZERO_EXTEND (opcode 144) widens the input depending on whether the high bits need defined values (unsigned operations use ZERO_EXTEND). For FP promotion, the pattern uses FP_EXTEND/FP_ROUND instead:
```
ext0 = DAG.getNode(FP_EXTEND, DL, promoted_vt, op0)
ext1 = DAG.getNode(FP_EXTEND, DL, promoted_vt, op1)
res  = DAG.getNode(FADD, DL, promoted_vt, ext0, ext1)
out  = DAG.getNode(FP_ROUND, DL, original_vt, res)
```
The promote path in sub_1FFB890 contains approximately 30 opcode-specific expansion strategies. The custom-promotion BST (red-black tree at TLI + 9257/9258) stores (opcode, VT) pairs that override the default promotion target. When no BST entry exists, a linear scan walks upward from the current VT until it finds a type where the action is not Custom (i.e., Legal or Expand).
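The extend/operate/truncate pattern can be checked with plain integers: for wrap-around operations like add, performing the operation at a wider width and truncating back is bit-identical, which is why ANY_EXTEND's undefined high bits are acceptable. A small model (the junk constant simulates ANY_EXTEND's unspecified upper bits):

```c
#include <assert.h>
#include <stdint.h>

/* Model of the promote path: an i8 add rebuilt at i32 and truncated back.
   The low 8 bits of the wide sum equal the wrapped i8 sum regardless of
   what the high 24 bits contain, because addition carries only upward. */
static uint8_t promoted_add_i8(uint8_t a, uint8_t b) {
    uint32_t ext_a = (uint32_t)a | 0xABCD0000u; /* ANY_EXTEND: garbage high bits */
    uint32_t ext_b = (uint32_t)b;
    uint32_t wide  = ext_a + ext_b;             /* op at the promoted type */
    return (uint8_t)wide;                       /* TRUNCATE back to i8 */
}
```

For operations where high bits do matter (unsigned compares, shifts right), ZERO_EXTEND is required instead, which is exactly the distinction drawn above.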
Expand (Type Splitting)
Expansion splits a wide type into two halves and reassembles the result:
```
// i128 ADD expansion (simplified)
lo_a = DAG.getNode(EXTRACT_ELEMENT, DL, i64, a, 0)  // low half
hi_a = DAG.getNode(EXTRACT_ELEMENT, DL, i64, a, 1)  // high half
lo_b = DAG.getNode(EXTRACT_ELEMENT, DL, i64, b, 0)
hi_b = DAG.getNode(EXTRACT_ELEMENT, DL, i64, b, 1)
lo_r = DAG.getNode(ADD, DL, i64, lo_a, lo_b)
carry = ...                                         // carry detection via SETCC
hi_r = DAG.getNode(ADD, DL, i64, hi_a, hi_b)
hi_r = DAG.getNode(ADD, DL, i64, hi_r, carry)
result = DAG.getNode(BUILD_PAIR, DL, i128, lo_r, hi_r)
```
For CTLZ (case 53), expansion builds an all-ones mask, AND chain, and shift sequence. For SINT_TO_FP/UINT_TO_FP (cases 59/60), the helper sub_20B5C20 performs iterative two-way splitting: it finds the half-type, builds the pair, and recursively legalizes each half.
The ExpandIntegerResult handler at sub_201BB90 (75KB, 632 case labels) is itself a major function that dispatches expansion for specific opcodes including STORE (case 77), shifts (81--93), and atomics.
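One conventional realization of the SETCC carry in the expansion sketch above is an unsigned-less-than test on the low-half sum. A concrete model of the whole EXTRACT_ELEMENT/ADD/BUILD_PAIR sequence (one standard carry idiom, not necessarily the binary's exact node order):

```c
#include <assert.h>
#include <stdint.h>

/* i128 modeled as a BUILD_PAIR of two i64 halves. */
typedef struct { uint64_t lo, hi; } u128;

static u128 add_u128(u128 a, u128 b) {
    u128 r;
    r.lo = a.lo + b.lo;            /* lo_r = ADD(lo_a, lo_b) */
    uint64_t carry = r.lo < a.lo;  /* SETCC ult: wrapped iff sum < addend */
    r.hi = a.hi + b.hi + carry;    /* hi_r = ADD(ADD(hi_a, hi_b), carry) */
    return r;
}
```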
Soften (Float-to-Integer Emulation)
Softening converts unsupported FP operations to integer-based library call sequences. On NVPTX this primarily affects f128 (which has no hardware support on any SM) and f16 on SM < 53. The softened path at sub_2019DA0 (18KB) dispatches via the SoftenedFloats DenseMap.
The FADD/FMUL cases (74/75 in the main switch) compute twice the bit width, find the promoted FP type, and build SUB (opcode 54) / SRL (opcode 123) chains that implement the FP operation in integer arithmetic.
Scalarize and Split Vector
Vector legalization proceeds through recursive halving:
```
v8f32 -> split     -> 2x v4f32
v4f32 -> split     -> 2x v2f32
v2f32 -> scalarize -> 2x f32    (v2f32 is NOT legal on NVPTX)
v4f16 -> split     -> 2x v2f16  (LEGAL on SM 70+ -- stops here)
v8f16 -> split     -> 2x v4f16 -> 4x v2f16
v4i8  -> LEGAL (packed in Int32HalfRegs, no split needed)
v8i8  -> split     -> 2x v4i8   (one split, then legal)
```
The splitting strategy follows LLVM's standard approach:
- Determine half type: `v4f32` splits to `v2f32` via `EVT::getVectorVT(scalar_element, count/2)` (`sub_1F58CC0`)
- Split operands: look up the `SplitVectors` DenseMap to get `{Lo, Hi}` halves from the input's own legalization
- Apply operation: `Lo_result = DAG.getNode(opcode, DL, half_type, Lo_op1, Lo_op2)`, and similarly for `Hi`
- Record result: store `{Lo_result, Hi_result}` in the `SplitVectors` DenseMap via `sub_20167D0`
The critical observation for NVPTX: v2f32 is not legal (no 64-bit packed float register class), so v4f32 ends up fully scalarized to 4x f32. In contrast, v4f16 on SM 70+ splits to 2x v2f16 which is legal, enabling the f16x2 packed instruction path.
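The halving policy condenses into a recursive count of how many final operations a vector type legalizes into, using this page's NVPTX rules (legal packed forms: v2f16 on SM 70+, v2i16, v4i8; v2bf16 on SM 80+ is omitted for brevity). A sketch under those assumptions -- real legalization tracks DAG nodes, not counts, and treats SM 53--69 v2f16 arithmetic as Custom/Expand:

```c
#include <assert.h>

/* Is v<n x elt_bits> a legal packed NVPTX type? Only 32-bit-total packs
   qualify; f16 pairs need SM 70+ for packed arithmetic. */
static int is_legal_vec(unsigned elt_bits, unsigned n, int is_float, unsigned sm) {
    if (elt_bits * n != 32) return 0;
    if (is_float) return elt_bits == 16 && sm >= 70;   /* v2f16 */
    return (elt_bits == 16 && n == 2) || (elt_bits == 8 && n == 4);
}

/* How many final operations does one vector op legalize into? */
static unsigned final_op_count(unsigned elt_bits, unsigned n, int is_float, unsigned sm) {
    if (n == 1) return 1;                              /* scalar: done */
    if (is_legal_vec(elt_bits, n, is_float, sm)) return 1;
    if (n == 2) return 2;                              /* scalarize */
    return 2 * final_op_count(elt_bits, n / 2, is_float, sm); /* split */
}
```

This reproduces the traces above: v4f32 costs 4 operations on every SM, while v4f16 costs 2 on SM 70+ and 4 when packed f16 arithmetic is unavailable.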
Master Opcode Dispatch (sub_20019C0)
The main body of sub_20019C0 is a switch on *(int16_t *)(node + 24) -- the ISD opcode of the current SDNode. Approximately 50 cases are handled:
| Case | ISD Opcode | Action |
|---|---|---|
| 10 | LOAD | legalizeLoad -- type-aware load splitting |
| 11 | STORE | Iterative type demotion loop (see below) |
| 20--21, 26 | Generic arithmetic | Promote via sub_1D38BB0 (getConstant) |
| 27 | EXTRACT_ELEMENT | Split + re-extract |
| 29 | BUILD_PAIR | Promote to i32 |
| 48 | BITCAST | Promote or expand depending on isSimple() |
| 49 | EXTRACT_SUBVECTOR | Extract + rebuild via TRUNCATE (opcode 145) |
| 50 | INSERT_SUBVECTOR | Low/upper split via ANY_EXTEND (143) / ZERO_EXTEND_INREG (144) |
| 51 | CONCAT_VECTORS | Iterate operands, copy each to result list |
| 53 | CTLZ / CTPOP | Expand via mask-then-shift (AND=120, ADD=52) |
| 54 | ATOMIC_CMP_SWAP | Full promote path: check legality table, fallback to libcall |
| 55--56 | SIGN_EXTEND_INREG / SMIN | Legality check via TLI + 259*VT + opcode + 2422 |
| 57--58 | FP_TO_SINT / FP_TO_UINT | Chain of promote + expand nodes |
| 59--60 | SINT_TO_FP / UINT_TO_FP | Iterative split via sub_20B5C20 |
| 70, 72 | FMINNUM / FMAXNUM | BUILD_PAIR (opcode 0x89) reassembly |
| 74--75 | FADD / FMUL | Promote to wider FP type |
| 77 | FMA | Extend operands, FMA at wider type, round back |
| 105 | BUILD_VECTOR | Delegate to sub_1FEC5F0 |
| 106 | EXTRACT_VECTOR_ELT | Check vector element count, dispatch |
| 108 | MGATHER / MSCATTER | Load/store with alignment fixup via sub_20BD400 |
| 110 | VSELECT | Element-by-element type demotion loop |
| 112--113 | SETCC | Legality check with swapped-direction fallback |
| 114--117 | VECREDUCE_* | Opcode lookup in dword_42FEAE0, chain to VECREDUCE |
| 122--124 | SHL / SRL / SRA | Iterative width expansion |
| 125--126 | ROTL / ROTR | 4-way split: shift + mask + OR |
| 136 | BR_CC | Uses CC action table at offset +18112 |
| 152 | ATOMIC_LOAD_* | Delegate to sub_20B7F50 (atomic promote) |
| 153 | ATOMIC_CMP_SWAP_WITH_SUCCESS | Full CAS expansion with APInt mask |
| 199--200 | INTRINSIC_W_CHAIN / INTRINSIC_WO_CHAIN | TLI+112 check, intrinsic lowering dispatch |
| 211 | UNDEF | Replicate zero-constant to fill operand count |
| 243 | TOKEN_FACTOR | Duplicate single operand to all slots |
Cases not listed fall through to LABEL_25 (node already legal or handled by a different legalization category).
Store Iterative Demotion (Case 11)
The STORE case contains an explicit type-walking loop that searches downward for a legal store type:
```c
// Reconstructed from case 11, lines ~2077-2095
while ((vt_byte - 8) > 1) {              // while VT is not f16(8) or f32(9)
    --vt_byte;                           // try next smaller type
    if (TLI.getTypeAction(VT))           // sub_1D16180
        if (TLI.isOperationLegal(STORE, VT))
            break;                       // found a legal store type
}
```
This walks i64 -> i32 -> i16 -> i8 (or f64 -> f32 -> f16) until it finds a type the target can store natively, then emits a truncating store sequence via sub_1D3C080 (getTruncStore).
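The demotion loop amounts to walking SimpleVT ids downward until a predicate accepts the type. A minimal sketch with a stand-in legality predicate (the real loop consults getTypeAction and the action table; the bitmask here is purely illustrative):

```c
#include <assert.h>

/* Walk downward through SimpleVT ids (i64=6 -> i32=5 -> i16=4 -> i8=3)
   until the target reports a legal store type. legal_mask is a stand-in:
   bit k set means SimpleVT k is storable natively. */
static unsigned demote_store_vt(unsigned vt, unsigned legal_mask) {
    while (vt > 1 && !((legal_mask >> vt) & 1u))
        --vt;                 /* try the next smaller type */
    return vt;
}
```

The chosen type then drives a truncating store: the value is stored at the demoted width, matching the getTruncStore call described above.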
Atomic CAS Expansion (Cases 54, 153)
Atomic operations receive extensive legalization because PTX has limited atomic type support. The CAS expansion at case 153 (ATOMIC_CMP_SWAP_WITH_SUCCESS) builds APInt masks via sub_16A4EF0, constructs compare-and-swap loops, and handles the success flag as a separate result. The helper sub_20B7E10 decides whether to use a CAS loop or a direct atomic based on the target SM's capabilities.
Vector Legalization Workers
SplitVectorResult (sub_2029C10)
This thin dispatcher reads the opcode from *(uint16_t *)(node + 0x18), subtracts base 0x30 (48), and dispatches across 190 cases (opcodes 48--237) to SplitVecRes_XXX workers. Key handler categories:
| Handler | Cases | Description |
|---|---|---|
| sub_20230C0 | FADD--FREM, SHL/SRA/SRL, int arith | Generic binary op split: split both inputs, apply op to each half |
| sub_2028A10 | CONCAT, INSERT_ELT, load/store variants | Unary/multi-input split with reassembly |
| sub_2025910 | Strict FP (cases 81--98) | Strict FP split with exception chain propagation |
| sub_2023B70 | BUILD_VECTOR (case 104) | Split BUILD_VECTOR into two half-width constructs |
| sub_2023F80 | CONCAT inner (case 107) | Trivial: return two operands as Lo and Hi |
| sub_20293A0 | VECTOR_SHUFFLE (case 110, 10KB) | Decompose shuffle into sub-shuffles on half-width vectors |
| sub_20251A0 | VSELECT, EXTRACT_ELT | Split condition mask along with operands |
| sub_2025380 | Extending loads (cases 149--151) | Split load into two half-width loads |
Four handlers in the 0x214xxxx range are NVPTX-specific split workers not present in upstream:
| Handler | Opcode | NVPTX-Specific Behavior |
|---|---|---|
| sub_2146BB0 | CONCAT_VECTORS | Checks VT range 0x0E--0x6D for packed-type dispatch |
| sub_2146C90 | SELECT_CC / BR_CC (2.7KB) | Multi-operand split with per-operand type classification |
| sub_2147770 | FP_ROUND-like | NVPTX-specific FP rounding split |
| sub_2147AE0 | BITCAST | NVPTX-specific bitcast split for packed registers |
After a handler returns, the dispatcher stores the {Lo, Hi} result pair in the SplitVectors DenseMap via sub_20167D0 (hash = 37 * key, quadratic probing, rehash at 75% load).
Fatal error on unhandled opcode: "Do not know how to split the result of this operator!" via sub_16BD130.
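The map's probing scheme (hash = 37 * key, quadratic probing, rehash at 75% load) can be sketched as a flat open-addressing table. Key 0 serves as the empty sentinel here for simplicity; LLVM's real DenseMap uses dedicated empty/tombstone keys and differs in detail:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct { uintptr_t key, lo, hi; } Entry;      /* key -> {Lo, Hi} */
typedef struct { Entry *slots; unsigned cap, count; } SplitMap;

static void map_init(SplitMap *m, unsigned cap) {     /* cap: power of two */
    m->slots = calloc(cap, sizeof(Entry));
    m->cap = cap;
    m->count = 0;
}

static Entry *find_slot(SplitMap *m, uintptr_t key) {
    unsigned i = (37u * (unsigned)key) & (m->cap - 1); /* hash = 37 * key */
    for (unsigned probe = 1; ; ++probe) {
        Entry *e = &m->slots[i];
        if (e->key == 0 || e->key == key)
            return e;
        i = (i + probe) & (m->cap - 1);                /* quadratic probing */
    }
}

static void map_put(SplitMap *m, uintptr_t key, uintptr_t lo, uintptr_t hi) {
    if (4 * (m->count + 1) > 3 * m->cap) {             /* rehash at 75% load */
        SplitMap grown;
        map_init(&grown, m->cap * 2);
        for (unsigned j = 0; j < m->cap; ++j)
            if (m->slots[j].key != 0)
                *find_slot(&grown, m->slots[j].key) = m->slots[j];
        grown.count = m->count;
        free(m->slots);
        *m = grown;
    }
    Entry *e = find_slot(m, key);
    if (e->key == 0)
        m->count++;
    e->key = key;
    e->lo = lo;
    e->hi = hi;
}
```

With a power-of-two capacity, the triangular probe increments (1, 2, 3, ...) visit every slot, and the 75% cap guarantees the probe loop terminates.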
SplitVectorOperand (sub_202E5A0)
Same dispatch pattern as SplitVectorResult but for operand-side legalization. Base opcode 0x65 (101), range 157 (opcodes 101--258). Notable inline handling for FP_EXTEND/FP_ROUND (cases 146--147, 152--153) that compares source and destination type sizes to choose the correct split strategy:
```c
// Inline in SplitVectorOperand, cases 146-147
src_size = getSizeInBits(src_vt);   // sub_2021900
dst_size = getSizeInBits(dst_vt);
if (dst_size < src_size)
    SplitVecOp_VSELECT(...)         // sub_202D8A0 -- shrinking
else
    SplitVecOp_Generic(...)         // sub_202A670 -- standard split
```
After the handler, ReplaceAllUsesOfValueWith (sub_2013400) substitutes the old node with the split result.
Scalarize and Widen
ScalarizeVectorResult (sub_2036110) handles vector types that reduce to scalar. ScalarizeVectorOperand (sub_2035F80) has 80 cases starting from base opcode 106. These cover the final step when splitting has reduced a vector to width 1 or 2 elements, and those elements must become individual scalars.
WidenVector (sub_2036AE0, 31KB) sees limited use on NVPTX. Widening is only useful when the wider type is legal:
- Widening `v1f16` to `v2f16` is useful (promotes to legal packed type)
- Widening `v3i8` to `v4i8` is useful (promotes to legal packed type)
- Widening `v3f32` to `v4f32` is not useful (`v4f32` is still illegal)
The WidenVector path uses the MVT lookup table at word_4305480 to determine element counts and find the nearest wider legal vector type.
Operation Legalization (sub_1FFB890)
After type legalization, operation legalization processes each node through a per-opcode action lookup. The same primary action table is used:
action = *(uint8_t *)(TLI + 259 * VT + opcode + 2422)
The dispatch:
| Action | Code | Path |
|---|---|---|
| Legal | 0 | Return immediately |
| Custom | 1 | TLI->LowerOperation(node, DAG) via vtable slot #164 (offset +1312) |
| Expand | 2 | sub_20019C0 (LegalizeTypes), then sub_1FF6F70 (ExpandNode) as fallback |
| LibCall | 3 | sub_1FF6F70 (ExpandNode) directly |
| Promote | 4 | Find larger legal type, rebuild node |
| Special | 5+ | sub_1FF9780 (ExpandLoad) or sub_1FF5310 (LegalizeLoadOps) for load/store variants |
When Custom lowering returns NULL, the framework falls through to expansion. When it returns a different node, ReplaceAllUsesWith splices the replacement into the DAG and marks the old node dead (tombstone value -2 in the worklist hash set).
The operation legalizer also contains an outer switch on the ISD opcode (v11 = *(uint16_t *)(node + 24)) for opcode-specific handling before the table lookup. Shift/rotate opcodes (81--98) are remapped to internal opcode numbers before the table lookup (e.g., case 81 maps to internal opcode 76, case 82 to 77). The opcode-specific dispatch covers approximately 30 opcode groups.
How CUDA Vector Types Get Legalized
Tracing common CUDA types through the full legalization pipeline:
float4 (v4f32) -- fully scalarized on every SM:
- SplitVectorResult: v4f32 -> 2x v2f32
- ScalarizeVectorResult: v2f32 -> 2x f32 (no packed f32 register class)
- Final: 4 independent f32 scalar operations
- PTX: 4 separate add.f32/mul.f32 instructions
half2 (__half2 / v2f16) -- stays packed on SM 70+:
- Legal type, no splitting needed
- Final: single v2f16 packed operation
- PTX: add.f16x2, mul.f16x2, fma.rn.f16x2
__nv_bfloat162 (v2bf16) -- legal on SM 80+:
- Same as half2 but with bf16x2 PTX instructions
float2 (v2f32) -- scalarized, not packed:
- ScalarizeVectorResult: v2f32 -> 2x f32
- No 64-bit packed float register class exists
v4f16 on SM 70+:
- SplitVectorResult: v4f16 -> 2x v2f16 (legal -- stops here)
- Final: 2x f16x2 packed operations (2x throughput vs scalarized)
v4f16 on SM < 53:
- Split: v4f16 -> 2x v2f16
- Scalarize: each v2f16 -> 2x f16
- Promote: each f16 -> FP_EXTEND -> f32
- Final: 4x f32 operations with FP_EXTEND/FP_ROUND wrappers
double2 (v2f64):
- Scalarize: v2f64 -> 2x f64 (splitting would give v1f64, which is scalar)
Tensor core fragments bypass vector legalization entirely. WMMA/MMA intrinsics represent matrix fragments as individual scalar registers, not LLVM vector types. However, packed conversion types used with tensor cores (e4m3x2, e5m2x2, e2m1x2, etc.) do pass through legalization and map to Int32HalfRegs.
Verification Infrastructure
sub_2010FB0 (62KB) implements DAGTypeLegalizer::PerformExpensiveChecks, gated by the enable-legalize-types-checking flag (registered at ctor_341). It validates nine DenseMap categories that track the state of every legalized value:
| Map | Content |
|---|---|
| PromotedIntegers | Values widened to a larger integer type |
| ExpandedIntegers | Values split into two halves |
| SoftenedFloats | FP values converted to integer representation |
| PromotedFloats | FP values widened to a larger FP type |
| ExpandedFloats | FP values split into halves |
| ScalarizedVectors | Vectors reduced to scalar elements |
| SplitVectors | Vectors split into {Lo, Hi} pairs |
| WidenedVectors | Vectors widened to a larger legal type |
| ReplacedValues | Values replaced by RAUW |
Diagnostic strings on verification failure: "Processed value not in any map!", "Value in multiple maps!", "Value with legal type was transformed!".
DAG Node Builder Subroutines
Key subroutines called from the type legalizer for constructing replacement DAG nodes:
| Function | Upstream Equivalent | Notes |
|---|---|---|
| sub_1D309E0 | DAG.getNode(opc, DL, VT, op) | 1-operand (TRUNCATE, ANY_EXTEND, etc.) |
| sub_1D332F0 | DAG.getNode(opc, DL, VT, op1, op2) | 2-operand |
| sub_1D3A900 | DAG.getNode(opc, DL, VT, op1, op2, op3) | 3-operand (FMA) |
| sub_1D38BB0 | DAG.getConstant(val, DL, VT) | Integer constant creation |
| sub_1D38970 | DAG.getConstant(APInt) | Wide constant / all-ones mask |
| sub_1D364E0 | DAG.getUNDEF(VT) | Undefined value |
| sub_1D37440 | DAG.getSetCC(DL, VT, LHS, RHS, CC) | Comparison node |
| sub_1D36A20 | DAG.getSelectCC(DL, VT, ..., CC) | Select-on-comparison |
| sub_1D3BC50 | DAG.getExtLoad(opc, DL, VT, ...) | Extending load |
| sub_1D3C080 | DAG.getTruncStore(...) | Truncating store |
| sub_1D23890 | DAG.ReplaceAllUsesWith(old, new) | RAUW for result replacement |
| sub_1FEB8F0 | MVT::getSizeInBits(SimpleVT) | Bit width from SimpleVT |
| sub_1F58D40 | EVT::getSizeInBits() | Bit width from extended VT |
| sub_1F58D30 | EVT::getVectorNumElements() | Vector element count |
| sub_1F40B60 | TLI.getTypeToPromoteTo(opc, VT) | Promotion target lookup |
| sub_1D16180 | TLI.getTypeAction(VT) | Action for type |
| sub_1D16EF0 | TLI.getCondCodeAction(CC, VT) | Condition code legality |
Result Accumulation and Worklist
Results from each legalization step are accumulated into a SmallVector of {SDValue, SDValue} pairs (node pointer + result index). The vector grows via sub_16CD150 (SmallVector::grow()) when count exceeds capacity. After each pass, new nodes feed back into the worklist for iterative re-legalization until fixpoint -- all types are legal.
The worklist hash set uses open addressing with hash function ((id >> 9) ^ (id >> 4)) & (size - 1) and grows at 75% load factor. Dead nodes are marked with sentinel -2 (tombstone). The DenseMap instances used by the split/scalarize infrastructure use hash 37 * key with quadratic probing.
Differences from Upstream LLVM
| Aspect | Upstream LLVM 20 | CICC v13.0 |
|---|---|---|
| Source organization | 4 files, ~16,000 lines total | 1 monolithic function, 10,739 lines (348KB) |
| Vector legal types | Target-dependent, often includes v4f32, v2f64 | Only v2f16, v2bf16, v2i16, v4i8 (32-bit packed) |
| v2f32 | Legal on most targets (x86, ARM) | Illegal -- scalarized |
| Scalable vectors | Actively used (AArch64 SVE) | Encoded in tables but no SM target uses them |
| i128 | Expanded on most targets | Legal on SM 70+ (Int128Regs / .b128 / %rq) |
| NVPTX-specific split handlers | N/A | 4 functions in 0x214xxxx range for packed-type dispatch |
| Custom-promotion BST | Standard red-black tree | Same, at TLI offsets +9257/+9258 |
| Type-supported flag array | Pointer array at known offset | At TLI + 8*VT + 120 |
| CC action table | 4-bit packed nibbles | Same encoding, NVPTX-specific CC legal set |
The monolithic structure means that code changes to any legalization category (integer promote, float soften, vector split) require recompilation of the entire 348KB function. In upstream LLVM, these are independent compilation units.
Configuration
| Knob | Location | Default | Description |
|---|---|---|---|
| enable-legalize-types-checking | ctor_341 | false | Enables PerformExpensiveChecks debug verifier |
No CICC-specific legalization knobs beyond the standard LLVM flag were found. The ptxas assembler has a related knob MercuryDisableLegalizationOfTexToURBound for texture-to-uniform-register legalization, but this operates at the assembler level, not in CICC.
Key Functions
| Function | Address | Size | Role |
|---|---|---|---|
| Type legalizer monolith | sub_20019C0 | 348KB | DAGTypeLegalizer::run() master dispatch |
| PromoteIntegerResult | sub_2000100 | 45KB | Integer type promotion |
| PromoteFloatResult | sub_2019DA0 | 18KB | Float type promotion / softening |
| ExpandFloatResult | sub_201B410 | 11KB | Float type expansion |
| ExpandIntegerResult | sub_201BB90 | 75KB | Integer type expansion (632 case labels) |
| Promote+expand dispatch | sub_201E5F0 | 81KB | Secondary dispatch (441 case labels) |
| PerformExpensiveChecks | sub_2010FB0 | 62KB | Debug verifier for 9 DenseMap categories |
| SplitVectorResult | sub_2029C10 | 5KB | Dispatcher for 190 opcode cases |
| SplitVectorOperand | sub_202E5A0 | 6KB | Dispatcher for 157 opcode cases |
| SplitVecRes_BinOp | sub_20230C0 | -- | Generic binary op split |
| SplitVecRes_VECTOR_SHUFFLE | sub_20293A0 | 10KB | Shuffle decomposition |
| ScalarizeVectorResult | sub_2036110 | -- | Vector-to-scalar reduction |
| ScalarizeVectorOperand | sub_2035F80 | -- | Operand scalarization (80 cases) |
| WidenVector | sub_2036AE0 | 31KB | Vector widening (limited NVPTX use) |
| Operation legalizer | sub_1FFB890 | 169KB | LegalizeOp per-node action dispatch |
| ExpandNode | sub_1FF6F70 | 43KB | Full node expansion fallback |
| ExpandLoad | sub_1FF9780 | 55KB | Load legalization |
| LegalizeLoadOps | sub_1FF5310 | 41KB | Store splitting/coalescing |
| NVPTX split: CONCAT | sub_2146BB0 | 219B | NVPTX-specific CONCAT_VECTORS split |
| NVPTX split: SELECT_CC | sub_2146C90 | 2.7KB | NVPTX-specific SELECT_CC split |
| NVPTX split: FP_ROUND | sub_2147770 | -- | NVPTX-specific FP rounding split |
| NVPTX split: BITCAST | sub_2147AE0 | -- | NVPTX-specific bitcast split |
| NVPTXTargetLowering init | sub_3314670 | 73KB | Populates legality tables |
| FP conversion split helper | sub_20B5C20 | -- | Iterative SINT_TO_FP/UINT_TO_FP |
| Atomic promote helper | sub_20B7F50 | -- | ATOMIC_LOAD promotion |
| CAS expansion decision | sub_20B7E10 | -- | CAS loop vs direct atomic |
| Gather/scatter alignment | sub_20BD400 | -- | MGATHER/MSCATTER alignment fixup |
Reimplementation Checklist
- NVPTX legal type model. Define the narrow set of legal types dictated by PTX register classes (i1, i16, i32, i64, f32, f64, f16, bf16, v2f16, v2bf16, v2i16, v4i8, i128), with SM-gated legality: f16 arithmetic on SM 53+, v2f16 packed ops on SM 70+, v2bf16 on SM 80+, FP4/FP6 packed types on SM 100+.
- Primary legality table population. Build the 2D action table at TLI + 259 * VT + opcode + 2422 with per-opcode-per-type action bytes (0=Legal, 1=Custom, 2=Expand, 3=LibCall, 4=Promote), plus the type-supported flag array at offset +120, the promotion action table at offset +2681, and the condition-code action table at offset +18112 with 4-bit packed nibbles.
- Four legalization actions. Implement Promote (widen via ANY_EXTEND/ZERO_EXTEND, operate, TRUNCATE), Expand (split via shift-and-OR for integers, libcall for floats), Soften (integer emulation of unsupported FP types), and Scalarize/Split-Vector (decompose illegal vectors into scalar or half-width vector operations).
- Iterative fixpoint loop. Run the type legalizer worklist until every node in the DAG has only legal result and operand types, since each pass may create new nodes with illegal types (e.g., splitting a vector creates half-width vectors that may themselves require further splitting).
- Vector legalization for NVPTX. Handle the critical constraint that Int32HalfRegs is the only vector class (32 bits total): scalarize all vectors wider than 32 bits (v4f32, v2f32, v8i32, etc.) while keeping v2f16/v2bf16/v2i16/v4i8 legal. Implement the SplitVectorResult/SplitVectorOperand/ScalarizeVector dispatchers with their 190+/157+/~100 case switches.
- SimpleVT type encoding. Implement the bitwidth-to-SimpleVT conversion (11 instances in NVIDIA's monolith) and the ~100-case vector-element-type switch (6 instances) mapping MVT ranges 14--109 to their scalar element types.
Cross-References
- SelectionDAG & Instruction Selection -- parent page covering the full SelectionDAG pipeline
- NVPTX Target Infrastructure -- NVPTXTargetLowering constructor and TTI hooks
- SM 70--89, SM 90, SM 100 -- per-SM legal type details
- DAG Node -- SDNode layout (opcode at +24, operands at +32, type at +40)
- Hash Infrastructure -- DenseMap mechanics used throughout legalization
ISel Pattern Matching & Instruction Selection
Prerequisites: Familiarity with SelectionDAG, Type Legalization, and DAG Node Layout. Understanding of the Pattern Database structure and NVPTX opcodes is recommended.
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
The NVPTX instruction selector in cicc v13.0 translates legal SelectionDAG nodes into target MachineInstr opcodes through a three-level dispatch hierarchy totaling approximately 900KB of code. At the top sits NVPTXDAGToDAGISel::Select (sub_3090F90, 91KB), which builds a per-function cost table, manages a priority-queue-driven topological worklist, and calls the pattern matcher (sub_308FEE0) for every node. The pattern matcher fans out to a hand-written NVPTX-specific select switch (sub_347A8D0, 309KB) and a TableGen-generated SelectCode function (sub_348D3E0, 256KB). Surrounding this core are six NVPTX-specific sub-selectors covering memory operations, texture/surface fetches, complex addressing modes, vector patterns, and atomics. NVIDIA's key delta from upstream LLVM is (1) a compressed per-SM-variant legality table that gates which target opcodes exist on which GPU architecture, (2) a secondary 4-bit packed bitfield for fine-grained operand-class legality, and (3) the iteration budget that prevents the selector from looping indefinitely on pathological DAGs.
| ISel driver | sub_3090F90 (91KB, 2,828 lines) |
| Pattern matcher entry | sub_308FEE0 |
| NVPTX Select switch | sub_347A8D0 (309KB -- largest ISel function) |
| SelectCode (TableGen) | sub_348D3E0 (256KB -- auto-generated) |
| Vector/SIMD patterns | sub_3475BB0 (89KB) |
| Memory operation patterns | sub_306D850 (77KB) |
| Complex addressing modes | sub_30811D0 (77KB) |
| Addressing mode helper | sub_30783B0 (39KB) |
| Texture/surface ISel | sub_306A930 (52KB) |
| Atomic lowering | sub_3048C30 (86KB) |
| Constraint table | word_3F3E6C0 (see Pattern Database) |
| Compressed legality table | Base + 6414, 500-byte stride per SM variant |
| Secondary 4-bit bitfield | Base + 521536 |
| Legalize action table | Object + 72760, 4-bit packed |
| Knob registration | ctor_286 at 0x4FA0C0 (5KB) |
| Upstream LLVM source | lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp, lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp |
ISel Driver: sub_3090F90
The top-level driver is not the pattern matcher itself; it is the orchestration loop that feeds nodes to the matcher in the right order and maintains shared state. It breaks into three phases.
Phase 1: Function Argument Cost Table
Before selecting any instructions, the driver builds a DenseMap-style hash table at this + 408 that maps function argument indices to their byte sizes. The hash table uses LLVM's standard integer-key hash function key * 37, open addressing with linear probing, and the tombstone sentinel -2. Growth triggers at 75% load factor (4 * (count + 1) >= 3 * capacity).
// Phase 1: build argument cost table
hash_table = this->arg_cost_map; // at this + 408
for each argument A in function->args():
byte_size = alignTo(getSizeInBits(A.type) / 8, A.alignment)
key = A.index
slot = (key * 37) & (capacity - 1)
while hash_table[slot] is occupied and != key:
slot = (slot + 1) & (capacity - 1)
hash_table[slot] = { key, byte_size }
if load_factor > 0.75: rehash()
The table layout:
| Field | Offset from this | Description |
|---|---|---|
| data | +416 | Pointer to hash bucket array |
| count | +424 | Number of live entries |
| tombstone_count | +428 | Number of tombstone slots |
| capacity | +432 | Total bucket count (power of 2) |
If the function has a non-void return type, the driver also inserts the return value sizes into the same table, computing aligned_size = ((((size + 7) >> 3) + (1 << align) - 1) >> align) << align for each return element (bits to bytes, then round up to the 2^align boundary; the parenthesization is our reading of the decompiled expression). The return-type attribute check uses attribute kind 81 (likely sret).
Phase 2: Return Value Processing
For non-void functions, the driver iterates each return value element via:
- sub_A74710(attribute, 81) -- checks for sret attribute
- sub_A748A0(index) -- gets return type at given index
- sub_AE5020(dataLayout, type) -- computes ABI alignment
- sub_9208B0(dataLayout, type) -- computes size in bits
Each return value's aligned byte size is inserted into the argument cost table, so the pattern matcher can look up the cost of materializing any function parameter or return value during instruction selection.
Phase 3: Topological Selection Loop
The main selection loop processes DAG nodes in topological order using a min-heap priority queue where priority equals topological order (lower number = earlier in the DAG, processed first). The iteration is bounded by an explicit budget.
// Phase 3: main ISel loop
sub_308B6F0(this); // initialize worklist from DAG
budget = 4 * numInstructions * maxBlockSize
iteration = 0
while heap is not empty:
node = heap.extractMin() // sub_3089BD0: heap-sift-down
sub_308FEE0(this, node, &tmp) // pattern matcher dispatch
if this->selectionChanged: // byte at this + 400
re-scan affected nodes
iteration++
if iteration > budget:
break // anti-infinite-loop guard
sub_308AB30(this) // cleanup
sub_264E600(this) // deallocate worklist
sub_308B100(this) // destroy hash table
The min-heap stores (SDNode*, priority) pairs at 16-byte stride. The heap-sift-down operation (sub_3089BD0) maintains the heap invariant after extraction. The selectionChanged flag at this + 400 is set by the pattern matcher when it replaces a node, signaling the driver to re-examine downstream users.
The iteration budget formula 4 * numInstructions * maxBlockSize is an NVIDIA addition -- upstream LLVM's SelectionDAGISel does not have this guard. It prevents pathological DAGs (for example, from heavily-inlined device functions with thousands of parameters) from causing the selector to spin indefinitely when combine/legalize/select cycles interact.
Pattern Matcher Dispatch: sub_308FEE0
The pattern matcher is called once per SDNode. It reads the node's opcode at *(node + 24) and dispatches through a multi-level decision tree:
- Quick-reject filter. If the node is already selected (machine opcode bit set in flags), return immediately.
- NVPTX-specific hand-written patterns. Calls sub_347A8D0 for NVPTX custom opcodes (NVPTXISD range >= 499). This handles texture loads, MMA instructions, atomic operations, .param-space loads/stores, and other GPU-specific patterns.
- TableGen auto-generated matcher. Calls sub_348D3E0 (SelectCode) for standard ISD opcodes. This function is mechanically generated from the .td pattern files in the NVPTX backend and contains a massive switch table mapping DAG patterns to MachineInstr opcodes.
- Complex pattern matching. For load/store addressing modes, calls sub_30811D0 (77KB) and sub_30783B0 (39KB), which match base + offset, base + scaled_index, and address-space-qualified patterns.
- Fallback. If no pattern matches, the node is marked as "failed ISel" and the driver may retry after DAG combining.
NVPTX Select Switch: sub_347A8D0 (309KB)
This is the largest single ISel function, containing the hand-written pattern matching for all NVIDIA-specific DAG nodes. It calls sub_969240 263 times (SDNode accessor), is self-recursive 42 times, and dispatches to:
| Sub-selector | Size | Coverage |
|---|---|---|
| sub_3447D70 | 32KB | Specific pattern sub-dispatch |
| sub_3441190 | -- | Pattern helpers |
| sub_343FD60 | -- | Type-aware matching |
| sub_3475BB0 | 89KB | Vector/SIMD patterns (v2, v4 packed types) |
The function switches on the SDNode opcode to handle:
- Load/store with address spaces -- selects between ld.global, ld.shared, ld.local, ld.param, ld.const, and generic-space loads, each requiring different PTX instructions.
- Texture/surface operations -- dispatches to sub_306A930 for tex, suld, sust instruction patterns.
- MMA/WMMA/tensor ops -- selects the correct mma.sync, wmma.mma, wgmma variant based on operand types and SM architecture.
- Atomic operations -- selects between atom.global.add, atom.shared.cas, red.global.add, etc., with scope qualifiers (.cta, .gpu, .sys).
- Barrier/fence operations -- selects bar.sync, bar.warp.sync, membar.cta, membar.gl, membar.sys.
SelectCode (TableGen): sub_348D3E0 (256KB)
This auto-generated function implements the standard LLVM TableGen pattern matching algorithm. It is a giant switch-table compiled from the .td instruction pattern files in lib/Target/NVPTX/*.td. The function:
- Calls sub_969240 45 times and sub_32889F0 38 times (opcode/type checkers).
- Contains no string literals (purely mechanical code).
- Works in tandem with sub_347A8D0: the hand-written selector handles NVPTX custom nodes first, and anything that falls through goes to SelectCode.
The auto-generated matcher encodes patterns as a sequence of opcode checks, type checks, and operand recursive matches. When a full pattern matches, it calls MorphNodeTo to convert the SDNode into a MachineSDNode with the target opcode and register operands.
Compressed Instruction Legality Table
NVIDIA's instruction selector uses a per-SM-variant legality table to determine whether a given target opcode is legal on the current GPU architecture. This table is checked during instruction selection to gate SM-specific instructions (for example, wgmma instructions are illegal on SM 70 but legal on SM 90+).
The table lives at a fixed offset from the base of the ISel object, accessed by sub_376DE90:
legality = *(uint8_t*)(base + 500 * arch_variant + opcode + 6414)
| Field | Encoding |
|---|---|
| Base offset | 6414 bytes from object base |
| Row stride | 500 bytes per architecture variant |
| Index | 500 * arch_variant + opcode |
| Value 0 | Illegal -- this opcode does not exist on this SM |
| Value 1 | Custom -- requires custom lowering before emission |
| Value 2 | Legal -- can be emitted directly |
The arch_variant value selects which row of the table to consult. Each row contains 500 entries, one per target opcode. The table is read-only after initialization and occupies approximately num_variants * 500 bytes in the .data section.
Secondary 4-bit Packed Bitfield
A second legality table at base + 521536 provides fine-grained operand-class legality using 4-bit packed nibbles:
byte_offset = (opcode_class >> 3) + 36 * arch_id - arch_id
nibble = (*(uint8_t*)(base + 521536 + byte_offset) >> (4 * (opcode_class & 7))) & 0xF
The offset simplification 36 * arch_id - arch_id equals 35 * arch_id, giving a 35-byte stride per architecture variant. The shift count 4 * (opcode_class & 7) runs up to 28 bits, which implies the load is at least 32 bits wide despite the byte-typed pointer in the decompiled expression: each 32-bit group packs eight 4-bit legality fields, selected by the low three bits of opcode_class. The 4-bit values encode a richer set of actions than the primary table's 3-value encoding.
Legalize Action Table
The operation legalization subsystem (separate from the ISel legality table above) uses a 4-bit packed action table at object offset 72760 to determine how to legalize each (opcode, type) pair:
index = type_bits + 15 * opcode + 18112
action = (*(uint32_t*)(object + 4 * index + 72760) >> (4 * (type & 7))) & 0xF
| Action | Value | Behavior |
|---|---|---|
| Legal | 0 | Node is natively supported |
| Promote | 1 | Widen to a larger legal type |
| Custom | 5 | Call NVPTXTargetLowering::LowerOperation via vtable slot 164 |
| ExpandInteger | 9 | Split wide integers into halves |
| ExpandFloat | 13 | Emulate unsupported FP via libcalls |
| SplitVector | 14 | Decompose illegal vector into legal sub-vectors |
This table is distinct from the type-legality table at TLI + 2422 (described in SelectionDAG), which uses a 259-byte stride and encodes the simpler 5-action set (Legal/Custom/Expand/LibCall/Promote). The table at +72760 is the operation-level action table used during the LegalizeOp phase, while the +2422 table is the type-level action table used during LegalizeTypes.
NVPTX-Specific Pattern Categories
Memory Operations: sub_306D850 (77KB)
Selects PTX load/store instructions with the correct address space qualifier, vector width, and volatility. The function handles the full matrix of {ld,st} x {.global,.shared,.local,.param,.const,.gen} x {.b8,.b16,.b32,.b64,.b128} x {.v1,.v2,.v4} x {.volatile,.relaxed,.acquire,.release} instruction variants. Address space is determined by querying the pointer operand's address space attribute through the DAG.
The memory pattern matching also covers:
- Vector loads/stores -- ld.global.v2.b32, ld.global.v4.b32, and their 64-bit variants, selected based on the vector element count (1, 2, or 4).
- Parameter loads -- ld.param.b32 and st.param.b32 for call ABI (see SelectionDAG: .param ABI).
- Generic-space loads with addrspacecast -- when the address space is generic (AS 0), the selector checks whether the source can be proven to be in a specific space and emits a non-generic load if so.
Texture/Surface Instructions: sub_306A930 (52KB)
Selects tex, suld, and sust instructions from DAG nodes produced by the intrinsic lowering mega-switch. The selector dispatches through helper functions:
| Helper | Purpose |
|---|---|
| sub_2FE5F00 | Texture fetch type selection |
| sub_2FE5F30 | Surface read type selection |
| sub_2FE5F60 | Surface write type selection |
| sub_2FE69A0 | Texture sampler mode selection |
| sub_2FE6CC0 | Unified texture/surface dispatch |
Texture instructions have complex operand requirements: sampler reference, texture reference, coordinate type (1D/2D/3D/cube), data type (f32/i32/f16), and optional LOD/gradient parameters. The selector maps each combination to a specific PTX tex.1d.v4.f32.f32 (or similar) opcode.
Complex Addressing Modes: sub_30811D0 (77KB)
Matches addressing patterns for load/store operands. NVPTX supports a limited set of addressing modes compared to x86:
- Register + immediate offset -- [%r1 + 16], the most common PTX addressing mode.
- Register -- [%r1], zero-offset variant.
- Immediate -- [0x1000], absolute address (rare on GPU).
- Register + register -- not directly supported in PTX; decomposed into add + register addressing.
The complex pattern matcher at sub_30811D0 calls seven helper functions (sub_307B990 through sub_307FEF0) to decompose DAG address expressions into base-register + offset pairs. When the offset is a constant that fits in the PTX immediate field, it folds into the instruction encoding. When the offset is too large or non-constant, it generates a separate add instruction and uses register addressing.
MMA / Tensor Core Instructions
Tensor core instruction selection is split across the intrinsic lowering stage (which generates NVPTXISD nodes from wmma.load, wmma.mma, mma.sync, wgmma intrinsics) and the ISel stage (which selects the specific PTX opcode). The ISel switch in sub_347A8D0 handles these by checking:
- SM architecture -- wmma requires SM 70+, mma.sync requires SM 75+, wgmma requires SM 90+.
- Matrix dimensions -- m16n16k16, m8n8k4, m16n8k8, etc.
- Data types -- f16, bf16, tf32, f64, i8, i4, b1, fp8 (SM 90+), fp4 (SM 100+).
- Accumulator type -- f16 or f32 for half-precision MMA.
The architecture check consults the compressed legality table to determine whether a given MMA variant is legal on the target SM.
Atomic Operations: sub_3048C30 (86KB)
Atomic instruction selection generates atom.{scope}.{op}.{type} instructions. The selector handles:
| Operation | PTX | NVPTXISD opcodes |
|---|---|---|
| Compare-and-swap | atom.cas | 462 |
| Add (int) | atom.add | 294--297 |
| Min (signed) | atom.min | 302--305 |
| Max (signed) | atom.max | 314--317 |
| Exchange | atom.exch | (via generic path) |
| AND/OR/XOR | atom.and / atom.or / atom.xor | (via generic path) |
The selector checks "vector atomics not supported on this architecture!" for vector-width atomics and gates them behind an SM version check (likely SM 90+). Scope qualifiers (.cta, .gpu, .sys) are determined from the memory ordering of the LLVM atomic instruction.
Vector / SIMD Patterns: sub_3475BB0 (89KB)
Handles vector-type instruction selection for NVPTX's limited vector support (v2 and v4 packed types). The function calls sub_969240 121 times and is self-recursive 28 times. It selects between:
- Packed register operations -- add.v2.f32, mul.v2.f32 when the SM supports native vector operations.
- Scalarized fallback -- decomposes vector operations into per-element scalar operations when the vector type is not natively supported.
- mov.v2 / mov.v4 -- register-to-register vector moves for shuffles and extracts.
Knobs
The ISel subsystem registers its knobs at ctor_286 (0x4FA0C0, 5KB):
| Knob | Type | Description |
|---|---|---|
| fast-isel-abort | int | Abort mode for FastISel failures (0=silent, 1=warn, 2=abort) |
| fast-isel-report-on-fallback | bool | Report when FastISel falls back to SelectionDAG |
| use-mbpi | bool | Use Machine Branch Probability Info during ISel |
| dag-disable-combine | bool | Disable DAG combining entirely |
| pre-RA-sched | enum | Pre-RA scheduler variant: "default", "list-burr", "source", "list-hybrid", "list-ilp" |
Note that cicc does not use FastISel for GPU code generation. The fast-isel-* knobs exist because the upstream LLVM SelectionDAGISel framework registers them unconditionally, but the NVPTX backend always takes the full SelectionDAG path. The dag-disable-combine flag is the only ISel-phase knob that has a meaningful effect on NVPTX code generation; setting it skips the DAG combiner entirely, which produces worse code but can be useful for debugging.
Differences from Upstream LLVM
| Aspect | Upstream LLVM 20.0 | NVIDIA cicc v13.0 |
|---|---|---|
| Iteration budget | No explicit budget; relies on DAG invariants to terminate | Budget = 4 * numInstructions * maxBlockSize |
| Argument cost table | Not present in SelectionDAGISel | Hash table with key * 37 hash for argument byte sizes |
| Legality table | Simple isLegal() callback per target | Compressed 500-stride table + 4-bit packed secondary table |
| FastISel | Used for -O0 on most targets | Never used; always full SelectionDAG |
| ISel function size | Typical NVPTX Select() is ~50KB upstream | 309KB hand-written + 256KB TableGen = 565KB total |
| Memory patterns | Standard load/store | 5 address spaces, each with distinct PTX encoding |
| Texture/surface | Not present in upstream NVPTX (handled by intrinsics only) | 52KB dedicated sub-selector for tex/suld/sust |
| Atomic patterns | Standard expansion via AtomicExpandPass | 86KB custom selector with scope qualifiers and architecture gating |
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| NVPTXDAGToDAGISel::Select -- ISel driver | sub_3090F90 | 91KB | -- |
| Pattern matcher entry (dispatches to Select switch and SelectCode) | sub_308FEE0 | -- | -- |
| NVPTX hand-written Select switch | sub_347A8D0 | 309KB | -- |
| TableGen-generated SelectCode | sub_348D3E0 | 256KB | -- |
| Vector/SIMD pattern selection | sub_3475BB0 | 89KB | -- |
| Memory operation patterns (ld/st with address spaces) | sub_306D850 | 77KB | -- |
| Complex addressing mode matching | sub_30811D0 | 77KB | -- |
| Addressing mode helper (base + offset extraction) | sub_30783B0 | 39KB | -- |
| Texture/surface instruction selection | sub_306A930 | 52KB | -- |
| Atomic operation selection | sub_3048C30 | 86KB | -- |
| Sub-selector for specific NVPTX patterns | sub_3447D70 | 32KB | -- |
| Pattern matching helpers | sub_3472970 | 36KB | -- |
| Operand matching | sub_343A2E0 | 49KB | -- |
| Compressed legality table lookup | sub_376DE90 | -- | -- |
| Initialize topological worklist | sub_308B6F0 | -- | -- |
| Min-heap sift-down (priority queue) | sub_3089BD0 | -- | -- |
| ISel cleanup | sub_308AB30 | -- | -- |
| Hash table destruction | sub_308B100 | -- | -- |
Cross-References
- SelectionDAG & Instruction Selection -- parent page covering the full SelectionDAG pipeline (type legalization, operation legalization, DAG combining, and the ISel overview)
- Pattern Database / Constraint Table -- the per-instruction operand constraint table at word_3F3E6C0
- DAG Node Layout -- SDNode structure definition
- NVPTX Target Infrastructure -- target machine, subtarget features, and register classes
- Hash Infrastructure -- the key * 37 integer hash used throughout cicc
- Tensor / MMA Builtins -- intrinsic lowering for MMA operations that feed into ISel
- Surface & Texture Builtins -- intrinsic lowering for texture/surface operations
- Atomics Builtins -- intrinsic lowering for atomic operations
InstrEmitter
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
LLVM version note: SDNode field layout matches LLVM 20.0.0 base. NVIDIA merges the upstream EmitNode/EmitSpecialNode split into a single monolithic function, adds a dedicated CopyToReg handler, an extended MachineInstr flag at bit 36, and a triple vtable dispatch for GPU pseudo-expansion.
InstrEmitter is the final translation layer between LLVM's SelectionDAG representation and the machine-level MachineInstr pipeline. After instruction selection has converted LLVM IR into a DAG of target-specific SDNodes, and after scheduling has linearized those nodes into a sequence, InstrEmitter walks the scheduled sequence and converts each SDNode into one or more MachineInstrs inserted into the current MachineBasicBlock. In CICC v13.0, the emitter lives at sub_2EDDF20 (11,722 bytes) and is called by ScheduleDAGSDNodes::EmitSchedule (sub_2EE0CF0). NVIDIA's build contains three key modifications relative to upstream LLVM: a dedicated CopyToReg handler factored out for NVPTX's physical-register-heavy parameter ABI, a triple vtable dispatch pattern that gates custom pseudo-expansion for GPU-specific instructions, and an extended MachineInstr flag at bit 36 (0x1000000000) not present in stock LLVM.
| EmitNode / EmitMachineNode | sub_2EDDF20 (11,722 bytes, 872-byte stack frame) |
| EmitSchedule (top-level driver) | sub_2EE0CF0 (59KB) |
| EmitCopyToReg handler | sub_2ED95B0 |
| EmitSubregNode | sub_2EDB7A0 |
| EmitCopyToRegClassOp | sub_2EDD7E0 |
| ProcessOperands / EmitMachineNode core | sub_2ED3660 |
| getRegForValue | sub_2E8B400 |
| isDeadNode predicate | sub_2DADC00 |
| MinRCSize threshold | 4 (upstream default, unchanged) |
| VReg hash load factor | 3/4 (rehash when count * 4 >= capacity * 3) |
| Hash function | key * 37, masked by capacity - 1 |
| SDOperand stride | 40 bytes (0x28) per entry |
Emission Architecture
In upstream LLVM, InstrEmitter::EmitNode is a trivial dispatcher: if the SDNode carries a target-specific (machine) opcode, it calls EmitMachineNode; otherwise it calls EmitSpecialNode for ISD-level pseudo-operations. CICC merges both paths into a single monolithic function (sub_2EDDF20) that dispatches on the raw 16-bit opcode at SDNode offset +0x44. The entry point performs a bit-table test against a 64-bit immediate (0x80001078000) to classify opcodes <= 0x2B as "special" ISD nodes requiring dedicated handling; everything above falls through to the generic machine emission path.
The driver, ScheduleDAGSDNodes::EmitSchedule (sub_2EE0CF0), iterates the scheduled SUnit sequence. For each SUnit, it first walks the glue chain backwards (via SDNode::getGluedNode) and emits each glued predecessor before emitting the SUnit's own node. This guarantees that glued instructions appear as a contiguous sequence in the MachineBasicBlock, which is critical for NVPTX where texture sampling sequences must remain bundled with their address computation.
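The glue-first emission order can be sketched with a toy model. This is a minimal illustration, not the binary's code: the Node struct and emitWithGlue are invented stand-ins for SDNode and the driver's backward glue walk (real SDNodes reach their glued predecessor via SDNode::getGluedNode).

```cpp
#include <vector>

// Hypothetical stand-in for an SDNode that may be glued to a predecessor.
struct Node {
    int id;
    const Node *glued; // glued predecessor, or nullptr
};

// Emit glued predecessors before the node itself, so a glue bundle
// lands as a contiguous run in the basic block.
void emitWithGlue(const Node *n, std::vector<int> &order) {
    if (!n) return;
    emitWithGlue(n->glued, order); // predecessors first
    order.push_back(n->id);        // then the node itself
}
```

Walking a three-node glue chain from its final node yields the predecessors in program order, which is exactly the contiguity NVPTX texture sequences rely on.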
The Emission Algorithm
The combined EmitNode function proceeds through fourteen phases in the binary; the condensed flow below groups them into seven logical steps:
EmitNode(InstrEmitter *self, SDNode *node):
// Phase 1: Early exit for dead nodes
if !self->forceEmit && node->useCount <= 1:
return false // single-use folded into consumer
// Phase 2: Glue chain traversal
root = node
while root->predecessor has chain/glue bit set:
root = strip_tag(root->predecessor)
if root->hasChainResult:
walk further to data-producing node
// Phase 3: Opcode dispatch
opc = node->opcode // uint16 at +0x44
switch opc:
0x0E (CopyToReg): call EmitCopyToReg(self, node)
0x13 (TokenFactor): skip entirely
0x14 (CopyFromReg): goto copyfromreg_path
0x0F, 0x10, 0x1C, 0x2B: special ISD handling
default: goto generic_emission
// Phase 4: Generic machine emission
desc = TII->get(opc)
MI = BuildMI(MBB, node->debugLoc, desc)
CreateVirtualRegisters(node, MI, desc)
for each operand in node->operands:
AddOperand(MI, operand)
MI.setMemRefs(node->memoperands)
MBB->insert(InsertPos, MI)
// Phase 5: Custom inserter check (triple vtable dispatch)
if TII->vtable[0xB8] != sub_2ED11C0: // not default
call custom inserter for NVPTX pseudos
if TII->vtable[0x348] != sub_2ED11F0:
call expandPostRAPseudo
if TII->vtable[0x160] != sub_2ED11E0:
call sub-register inserter
// Phase 6: Implicit physreg defs
collect UsedRegs from glue chain (CopyFromReg, RegisterSDNode)
mark unused implicit defs as dead
// Phase 7: Post-emission dead copy elimination
for each emitted copy:
if copy result has no remaining uses:
eraseFromParent(copy MI)
Opcode Dispatch Details
The bit-table dispatch uses a 64-bit immediate as a compressed lookup: bt 0x80001078000, opcode. The bits that are set correspond to ISD opcodes that need special (non-generic) handling:
| Opcode | ISD Value | Handler |
|---|---|---|
| 0x0E | ISD::CopyToReg | sub_2ED95B0 -- dedicated handler |
| 0x0F | ISD::EH_LABEL / special | Label emission path |
| 0x10 | ISD::INLINEASM | Inline assembly emission |
| 0x13 | ISD::TokenFactor | Skipped (ordering-only, no MI) |
| 0x14 | ISD::CopyFromReg | Physical-to-virtual register copy |
| 0x1C | ISD::LIFETIME_START/END | Frame index annotation |
| 0x2B | ISD::PSEUDO_PROBE | Profiling probe emission |
For opcodes above 0x2B, the emitter falls through to the generic path that calls TII->get(opc) to obtain the MCInstrDesc and builds a MachineInstr from its operand descriptors.
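The bit-table classification can be sketched in a few lines. This is illustrative only: the mask here is rebuilt from the opcode list in the table above rather than taken as the binary's literal 0x80001078000, and isSpecialOpcode is an invented name for the inlined test.

```cpp
#include <cstdint>

// Special ISD opcodes from the dispatch table above (illustrative subset).
constexpr uint16_t kSpecialOpcodes[] = {0x0E, 0x0F, 0x10, 0x13, 0x14, 0x1C, 0x2B};

// Build a 64-bit membership mask, mirroring how a `bt imm64, opcode`
// test classifies opcodes that fit in one machine word.
constexpr uint64_t buildMask() {
    uint64_t m = 0;
    for (uint16_t opc : kSpecialOpcodes)
        m |= (1ULL << opc);
    return m;
}
constexpr uint64_t kSpecialMask = buildMask();

// true -> special ISD handling; false -> generic machine emission path.
inline bool isSpecialOpcode(uint16_t opc) {
    return opc <= 0x2B && ((kSpecialMask >> opc) & 1);
}
```

A single 64-bit immediate replaces a jump table for the low opcode range; everything above 0x2B falls through without touching memory.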
CopyToReg Emission
CopyToReg (sub_2ED95B0) handles the common case of copying a value from a virtual register into a physical register. Upstream LLVM handles this inline within EmitSpecialNode; NVIDIA factors it into a separate function, likely for code size reasons given how frequently CopyToReg appears in NVPTX code. NVPTX's parameter-passing convention maps kernel parameters to fixed physical registers %r1--%r255, which generates large CopyToReg cascades at function entry and before calls.
The handler:
- Reads the destination register from SDNode->operand(1) (a RegisterSDNode).
- If the destination is virtual and the source is an IMPLICIT_DEF, emits IMPLICIT_DEF dest directly instead of a COPY.
- Otherwise resolves the source value to a virtual register via getVR (which consults the VRBaseMap).
- If source and destination are the same register, does nothing (copy coalesced away).
- Emits COPY dest, src.
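The handler's decision logic reduces to a small function. This is a simplified model of sub_2ED95B0's control flow as described above; the Action enum, parameter names, and flat register numbers are invented for illustration.

```cpp
#include <cstdint>

enum class Action { Nothing, EmitImplicitDef, EmitCopy };

// Hypothetical model of the CopyToReg lowering decisions.
Action lowerCopyToReg(uint32_t dstReg, uint32_t srcReg,
                      bool dstIsVirtual, bool srcIsImplicitDef) {
    // Virtual dest fed by IMPLICIT_DEF: emit IMPLICIT_DEF dest directly.
    if (dstIsVirtual && srcIsImplicitDef)
        return Action::EmitImplicitDef;
    // Same register: the copy has already been coalesced away.
    if (dstReg == srcReg)
        return Action::Nothing;
    // Otherwise: COPY dest, src.
    return Action::EmitCopy;
}
```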
CopyFromReg Emission
CopyFromReg (opcode 0x14) is the reverse: it copies a physical register into the virtual register domain. The CICC implementation at sub_2EDDF20 offset 0x2EDF423 follows a multi-step process:
- Extract the source register from SDNode->operand(1). If virtual, insert the SDValue-to-VReg mapping directly into VRBaseMap and return.
- If physical, determine the correct register class:
  - Query all users of this CopyFromReg. If the sole user is a CopyToReg to a virtual register in the same class, reuse that destination register.
  - Otherwise compute UseRC as the intersection of all user register class constraints via TRI->getCommonSubClass.
  - Fall back to TRI->getMinimalPhysRegClass(SrcReg, VT).
- If copying the physical register is impossible or expensive (RC->expensiveOrImpossibleToCopy()), use the physical register directly.
- Otherwise emit COPY VRBase, SrcReg where VRBase is a new virtual register in DstRC.
The register class membership test at 0x2EDF4C2 uses LLVM's compressed bit-vector representation:
bool RegisterClass::contains(unsigned Reg) {
unsigned class_idx = Reg >> 3;
if (class_idx >= desc->num_classes)
return false;
return (desc->class_table[class_idx] >> (Reg & 7)) & 1;
}
NVPTX Custom Pseudo-Expansion
The triple vtable dispatch pattern is the emitter's most distinctive NVIDIA modification. After inserting a MachineInstr for a target-specific opcode, the emitter checks three separate vtable slots to determine whether the instruction requires custom expansion:
Vtable slot 0xB8: EmitInstrWithCustomInserter
Default stub: sub_2ED11C0 (returns false). When the NVPTX target overrides this for a given opcode, the custom inserter replaces the pseudo MachineInstr with an expanded sequence. Approximately 15--20 NVPTX pseudo-instructions use this path:
- Texture load operations (tex.1d, tex.2d, tex.3d) -- these expand into address register setup, sampler state configuration, and the actual texture fetch instruction.
- Surface operations (sust, suld) -- surface load/store instructions that need coordinate clamping and format conversion.
- Warp-level intrinsics (shfl, vote, match) -- instructions that require lane mask setup and predicate register manipulation.
- Atomic operations -- certain atomics expand into compare-and-swap loops on older architectures.
Vtable slot 0x348: expandPostRAPseudo
Default stub: sub_2ED11F0. This handles pseudo-instructions that can only be expanded after register allocation has assigned physical registers. In NVPTX this is less common since the PTX virtual register model defers most allocation to ptxas.
Vtable slot 0x160: sub-register insertion
Default stub: sub_2ED11E0. Handles INSERT_SUBREG and related patterns that need target-specific lowering.
All three stubs are adjacent in memory (within 48 bytes of each other), confirming they are trivial return-false implementations in the NVPTXInstrInfo class.
Register Class Assignment During Emission
When creating virtual registers for SDNode results, CreateVirtualRegisters (sub_2E8B400 path) performs:
- For each result value of the SDNode, obtain the register class from TII->getRegClass(II, i).
- Refine based on the value type: if the type is legal, compute TLI->getRegClassFor(VT, isDivergent) and intersect with the instruction constraint via TRI->getCommonSubClass.
- The divergence flag (SDNode::isDivergent) is critical in NVPTX: divergent values must go into general-purpose registers (not uniform/constant registers), which affects class selection.
- If a result's sole consumer is a CopyToReg to a virtual register in a compatible class, reuse the CopyToReg destination directly to avoid a redundant copy.
- Create the virtual register via MRI->createVirtualRegister(RC) and add it as a def operand on the MachineInstr.
The MinRCSize threshold (4, unchanged from upstream) prevents over-constraining: if the intersection of all register class constraints would yield a class with fewer than 4 registers, the emitter inserts a COPY to a less-constrained virtual register instead.
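The MinRCSize guard can be modeled with register classes as bitsets. A minimal sketch under invented types -- commonSubClass and the 64-register universe are illustrative, not the binary's representation:

```cpp
#include <bitset>
#include <cstddef>
#include <optional>

// Toy register class = set of register numbers.
using RegClass = std::bitset<64>;

// Upstream threshold quoted above: never constrain below 4 registers.
constexpr std::size_t MinRCSize = 4;

// Intersect an instruction constraint with a type-derived class. If the
// result would have fewer than MinRCSize registers, give up -- the
// emitter then inserts a COPY to a less-constrained vreg instead.
std::optional<RegClass> commonSubClass(const RegClass &a, const RegClass &b) {
    RegClass inter = a & b;
    if (inter.count() < MinRCSize)
        return std::nullopt;
    return inter;
}
```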
Implicit Def/Use Handling
After inserting a MachineInstr, the emitter processes implicit physical register definitions. This is essential for GPU instructions that clobber status registers or have side effects beyond their explicit operands.
The flow collects UsedRegs by scanning:
- Implicit defs beyond explicit results: if NumResults > NumDefs, the extra results correspond to implicit physical register definitions from MCInstrDesc::implicit_defs(). For each such def that has at least one use, a CopyFromReg is emitted to capture the value.
- Glue chain uses: the emitter walks the glue chain upward from the current node, collecting physical registers referenced by CopyFromReg nodes and RegisterSDNode operands.
- Dead marking: MachineInstr::setPhysRegsDeadExcept(UsedRegs) marks any implicit def that is NOT in UsedRegs as dead, allowing the register allocator and later passes to ignore it.
NVIDIA Extended Flag: Bit 36 (0x1000000000)
Standard LLVM MachineInstr flags occupy bits 0--31 of the flags word (is_def, is_implicit, is_dead, is_kill, is_undef, is_early_clobber, etc.). CICC extends this to a 64-bit flags field and reserves bit 36 (0x1000000000) for an NVIDIA-specific purpose. The flag is queried via sub_2E88A90 (hasProperty) with argument rsi = 0x1000000000, edx = operand_index.
Where Bit 36 Is Checked
There are exactly two call sites within sub_2EDDF20:
Site 1 -- Generic emission path (0x2EDE50A--0x2EDE523)
0x2EDE4EF: mov eax, [r13+2Ch] ; load SDNode property flags
0x2EDE4F3: test eax, 0x20000 ; bit 17 = hasDebugValue?
0x2EDE4F8: jnz skip_flag_check ; if set, skip the bit-36 test
0x2EDE4FA: test al, 4 ; bit 2 = isTied
0x2EDE4FC: jnz loc_2EDF064 ; tied operand -> different path
0x2EDE502: test al, 8 ; bit 3 = hasGlue
0x2EDE504: jz loc_2EDF064 ; no glue -> different path
0x2EDE50A: mov edx, 1 ; operand index = 1
0x2EDE50F: mov rdi, r13 ; SDNode*
0x2EDE512: mov rsi, 0x1000000000 ; bit 36 flag mask
0x2EDE51C: call sub_2E88A90 ; hasProperty(node, flag, idx)
0x2EDE521: test al, al
0x2EDE523: jnz loc_2EDE086 ; if set -> skip emission entirely
Site 2 -- CopyFromReg-adjacent path (0x2EDEE5D--0x2EDEE86)
0x2EDEE5D: test al, 4 ; bit 2 = isTied
0x2EDEE5F: jnz loc_2EDEFA2 ; tied -> sub-register path
0x2EDEE65: test al, 8 ; bit 3 = hasGlue
0x2EDEE67: jz loc_2EDEFA2 ; no glue -> sub-register path
0x2EDEE6D: mov edx, 1 ; operand index = 1
0x2EDEE72: mov rdi, r13 ; SDNode*
0x2EDEE75: mov rsi, 0x1000000000 ; bit 36 flag mask
0x2EDEE7F: call sub_2E88A90 ; hasProperty(node, flag, idx)
0x2EDEE84: test al, al
0x2EDEE86: jnz loc_2EDE100 ; if set -> skip (no MI emitted)
Guard Conditions and Semantics
Both sites share the same guard pattern: the flag is only checked when the SDNode's property byte at +0x2C satisfies bit_3_set AND NOT bit_2_set -- i.e., the node has a glue result chain but is not a tied operand. This narrows the check to nodes that participate in glue chains: typically multi-instruction sequences like texture fetches, surface operations, and warp-level intrinsics where a chain of SDNodes must emit as a contiguous bundle.
When hasProperty(node, 0x1000000000, 1) returns true, the emitter skips the node entirely. The operand index of 1 means the flag is checked on the first data operand (operand 0 is typically the chain input). The effect is that nodes carrying bit 36 on operand 1 are treated as "already materialized" -- their value has been produced by a preceding glued instruction and does not require a separate MachineInstr.
The most likely interpretation of bit 36 is "implicit glue consumer already emitted": when a glued predecessor has already produced the value as a side effect (e.g., a texture fetch that writes both the result and a predicate), the glue consumer SDNode carries bit 36 to tell the emitter that no additional COPY or MI is needed. This is consistent with the check position immediately after getRegForValue succeeds -- the VReg mapping exists, the glue chain has been walked, and the emitter is about to create a potentially redundant MI.
sub_2E88A90 Calling Convention
The function serves as a universal property query across the emitter and other codegen passes. Observed flag values and their meanings:
| Flag Value | Bit | Meaning | Call Sites |
|---|---|---|---|
| 0x80 | 7 | isCall | Instruction scheduler (sub_2EE40E0) |
| 0x200 | 9 | isReservedReg | Branch folding (sub_2F33DD0) |
| 0x80000 | 19 | isImplicit | InstrEmitter generic path, StructurizeCFG |
| 0x100000 | 20 | isSimple / isMachineReg | InstrEmitter CopyFromReg, dead copy pass |
| 0x400000 | 22 | isSubRegister | InstrEmitter sub-register resolution |
| 0x40000000 | 30 | isAllocatable | InstrEmitter CopyFromReg class check |
| 0x1000000000 | 36 | NVIDIA: implicit glue consumer | InstrEmitter only (2 sites) |
The function signature is bool hasProperty(SDNode *node, uint64_t flag_mask, unsigned operand_idx). It reads the MCInstrDesc via [node+10h] -> [desc+18h], extracts a bit field by shifting right by the appropriate amount, and ANDs with 1 to produce a boolean result.
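The extraction described above -- shift right to the queried bit, AND with 1 -- can be sketched directly. This assumes a single-bit query mask and omits the pointer chasing through [node+10h] -> [desc+18h]; hasProperty here takes the already-loaded descriptor word.

```cpp
#include <cstdint>

// Sketch of the flag test in sub_2E88A90, given the 64-bit descriptor
// flags word. Precondition: flagMask has exactly one bit set.
inline bool hasProperty(uint64_t descFlags, uint64_t flagMask) {
    unsigned bit = 0;
    while (((flagMask >> bit) & 1) == 0)
        ++bit; // locate the queried bit position
    return (descFlags >> bit) & 1;
}
```

For the NVIDIA flag, the query is hasProperty(flags, 0x1000000000), i.e. a test of bit 36.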
Internal Data Structures
InstrEmitter Object Layout
The InstrEmitter instance carries three hash tables for tracking the SDNode-to-MachineInstr mapping:
| Offset | Name | Entry Size | Purpose |
|---|---|---|---|
| +0x410 | VReg Map (Table A) | 16 bytes | SDNode result to virtual register |
| +0x460 | MI Map (Table B) | 40 bytes | Glue chain to MachineInstr mapping |
| +0x4D0 | Result Map (Table C) | 32 bytes | SDNode to result number |
| +0x4E0 | forceEmit flag | 1 byte | When set, emit even dead nodes |
All three use LLVM's DenseMap implementation with open addressing and linear probing. The hash function is key * 37 (LLVM's DenseMapInfo<unsigned>::getHashValue). Empty sentinel: 0xFFFFFFFF. Tombstone: 0xFFFFFFFE. Table C uses an extended sentinel 0xFFFFFFFFFFFFF000. Rehash triggers at 3/4 load factor: entry_count * 4 >= capacity * 3. Growth is handled by sub_2E29BA0 which doubles capacity and rehashes.
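The table mechanics -- key * 37 hash masked by a power-of-two capacity, linear probing, 0xFFFFFFFF empty sentinel, rehash at 3/4 load with capacity doubling -- can be modeled in a compact sketch. This is a toy reimplementation for illustration (tombstones and the extended Table C sentinel are omitted), not the binary's DenseMap.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct VRegMap {
    static constexpr uint32_t kEmpty = 0xFFFFFFFF; // empty sentinel
    std::vector<uint32_t> keys, vals;
    std::size_t count = 0;

    // Capacity must stay a power of two so `& (size - 1)` works as a mask.
    explicit VRegMap(std::size_t cap = 16) : keys(cap, kEmpty), vals(cap, 0) {}

    std::size_t slot(uint32_t key) const {
        std::size_t i = (key * 37u) & (keys.size() - 1); // key * 37 hash
        while (keys[i] != kEmpty && keys[i] != key)
            i = (i + 1) & (keys.size() - 1);             // linear probe
        return i;
    }

    void insert(uint32_t key, uint32_t val) {
        if ((count + 1) * 4 >= keys.size() * 3)          // 3/4 load factor
            grow();
        std::size_t i = slot(key);
        if (keys[i] == kEmpty) ++count;
        keys[i] = key;
        vals[i] = val;
    }

    bool find(uint32_t key, uint32_t &out) const {
        std::size_t i = slot(key);
        if (keys[i] != key) return false;
        out = vals[i];
        return true;
    }

    void grow() { // double capacity and rehash, as sub_2E29BA0 does
        VRegMap bigger(keys.size() * 2);
        for (std::size_t i = 0; i < keys.size(); ++i)
            if (keys[i] != kEmpty) bigger.insert(keys[i], vals[i]);
        *this = bigger;
    }
};
```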
SDOperand Output Record
Each emitted result is recorded in a 40-byte (0x28) structure:
struct EmitResultRecord { // 40 bytes
SDNode *producer; // +0x00: SDNode that produced this result
int32_t src_vreg; // +0x08: source virtual register (-1 if physical)
int32_t dst_vreg; // +0x0C: destination virtual register (-1 if unassigned)
TargetRegisterClass *RC; // +0x10: register class pointer (or NULL)
unsigned sub_reg_idx; // +0x18: sub-register index (or 0)
uint32_t flags; // +0x20: tied, early_clobber, implicit bits
};
SDNode Field Offsets
Confirmed SDNode field layout from the binary (matches LLVM 20.0.0 base with minor NVIDIA extensions):
| Offset | Type | Field |
|---|---|---|
| +0x00 | tagged ptr | Chain/glue link (low 3 bits = type tag) |
| +0x08 | uint32 | Use count / reference count |
| +0x20 | ptr | Operand array pointer |
| +0x28 | uint32 | Operand count (low 24 bits) |
| +0x2C | uint8 | Property flags (bit 2 = isTied, bit 3 = hasGlue) |
| +0x30 | tagged ptr | First predecessor link |
| +0x38 | tagged ptr | Glue result chain |
| +0x44 | uint16 | Opcode |
| +0x78 | uint32 | Reference count (dead node detection) |
Tagged pointers are stripped throughout with AND 0xFFFFFFFFFFFFFFF8 (clear low 3 bits). Physical registers are encoded with bit 31 set (negative int32); extraction uses AND 0x7FFFFFFF followed by a shift-left by 4 to index the register descriptor table.
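These bit manipulations translate directly into helpers. A minimal sketch: the function names are invented, and the 16-byte descriptor stride is inferred from the shift-left-by-4 described above.

```cpp
#include <cstdint>

// Clear the low 3 tag bits of a chain/glue link.
inline uint64_t stripTag(uint64_t taggedPtr) {
    return taggedPtr & 0xFFFFFFFFFFFFFFF8ULL;
}

// Physical registers are encoded with bit 31 set (negative int32).
inline bool isPhysReg(int32_t reg) {
    return reg < 0;
}

// Extract the register index and scale it to a byte offset into the
// register descriptor table (assumed 16-byte entries).
inline uint64_t physRegDescOffset(int32_t reg) {
    uint32_t idx = static_cast<uint32_t>(reg) & 0x7FFFFFFF; // drop bit 31
    return static_cast<uint64_t>(idx) << 4;                 // * 16
}
```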
Dead Copy Elimination
After the main emission loop completes, a dedicated cleanup pass (Phase 12 in the binary, offset 0x2EE0816--0x2EE09AC) scans all emitted result records and eliminates redundant COPY instructions. This is notably aggressive compared to upstream LLVM, which defers dead copy removal to a separate DeadMachineInstrElimination pass later in the pipeline. CICC performs it inline because NVPTX's SelectionDAG generates massive numbers of redundant copies when lowering kernel parameter loads -- each parameter maps to a fixed physical register (%r1--%r255 corresponding to PTX parameter registers), and the DAG legalizer inserts CopyFromReg nodes for every parameter access.
Dead Copy Elimination Algorithm
The algorithm walks the emitted result record array (0x28-byte stride, accumulated during Phases 4--11) and classifies each record for deletion or preservation.
DeadCopyElimination(InstrEmitter *self, ResultRecord *records, int count):
// records is at [rbp-0x250], count at [rbp-0x248]
// stride = 0x28 (40 bytes per record)
end = records + count * 0x28
cursor = records
while cursor < end:
MI = cursor->producer // [rbx+0x00]: the MachineInstr*
TII = self->TargetInstrInfo // [r14+0x08]
// Step 1: Classify by opcode
if MI->opcode == 0x14: // CopyFromReg
// CopyFromReg-specific path: virtual dispatch to target
vtable = TII->vtable
result = vtable[0xF0]( // ~30th virtual method
MI, // the CopyFromReg MI
&cursor[0x08], // source vreg slot
/* additional args */
)
// This checks whether the target considers the copy
// sinkable or rematerializable -- NVPTX overrides this
// for parameter register copies that are trivially dead
else:
// Generic MI path: check via vtable[0x350]
result = TII->vtable[0x350](MI, cursor, ...)
// Step 2: Check source register kill flags
src_reg = cursor->src_vreg // [rbx+0x08]
if src_reg < 0: // physical register (sign bit set)
clearKillFlags(self->MRI, src_reg) // sub_2EBF120
// Step 3: Check dest register kill flags
dst_reg = cursor->dst_vreg // [rbx+0x0C]
if dst_reg < 0: // physical register
clearKillFlags(self->MRI, dst_reg) // sub_2EBF120
// Step 4: Determine if MI is dead
// Check opcode: if (MI->opcode - 1) <= 1 (opcode 1 or 2)
// then check MI->operand[0] byte [+0x40] bit 4 (0x10)
// which indicates "result consumed by inline fold"
opc = MI->opcode
if (opc == 1 || opc == 2): // COPY or REG_SEQUENCE
if MI->operands[0].flags & 0x10: // inline folded
goto mark_dead
// Step 5: Property gate
flags_2c = MI->flags_2c // [rdi+2Ch]
if !(flags_2c & 0x04): // bit 2 not set
// Check TSFlags bit 20 via descriptor
desc = MI->MCInstrDesc // [rdi+10h]
tsflags = desc->TSFlags // [desc+18h]
is_simple = (tsflags >> 20) & 1
if !is_simple:
goto emit_and_advance // not a candidate
// (falls through only when bit 2 set OR TSFlags bit 20 set)
// Step 6: Check hasProperty(0x100000, 1) -- isMachineReg
has_prop = hasProperty(MI, 0x100000, 1) // sub_2E88A90
if !has_prop:
// MI is deletable: call eraseFromParent
eraseFromParent(MI) // sub_2E88E20
advance cursor by 0x28
continue
mark_dead:
// Step 7: Liveness check via isUnusedReg
unused = isUnusedReg(MI) // sub_2E8B100
if unused:
// Still has a def -- erase immediately
eraseFromParent(MI) // sub_2E88E20
else:
// Defer: add to dead list for bulk deletion
addToDeadList(self->deadList, MI) // sub_2ED56A0
// deadList is at InstrEmitter+0x4A0
advance cursor by 0x28
Glue Chain Walk in Dead Copy Context
After the per-record loop, the emitter performs a secondary traversal for CopyFromReg records that survived deletion. For each surviving copy whose SDNode has a glue result ([r13+38h] != 0):
- Walk the glue chain backward via [r13+0] & 0xFFFFFFFFFFFFFFF8 (strip tag bits).
- For each predecessor in the chain, check [rax+2Ch] & 4 -- if the predecessor has been scheduled (bit 2 set), continue walking.
- If the predecessor has an unresolved glue reference ([r13+38h] non-null) and the predecessor's MI has zero uses after copy elimination, mark it for deferred deletion too.
This secondary walk catches cascading dead copies: when a CopyFromReg is deleted, its glued predecessor may also become dead.
Deferred Deletion via Dead List
MIs added to InstrEmitter+0x4A0 via sub_2ED56A0 are not deleted immediately. Instead, they are accumulated and deleted in bulk during Phase 14 (final cleanup at 0x2EE0C0B). The dead list is a SmallVector<MachineInstr*> with 8 inline entries (64 bytes inline buffer), growing via sub_C8D5F0 if needed. Bulk deletion avoids iterator invalidation during the emission loop and is more cache-friendly for large basic blocks.
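The dead list's small-vector behavior -- 8 inline slots, heap growth only on overflow -- is easy to model. This is a toy stand-in, not LLVM's SmallVector; the DeadList name and layout are invented for illustration.

```cpp
#include <cstddef>
#include <vector>

struct DeadList {
    void *inlineBuf[8];       // 64-byte inline buffer (8 pointers)
    std::size_t size = 0;
    std::vector<void *> heap; // used only after the inline buffer fills

    void push(void *mi) {
        if (size < 8)
            inlineBuf[size] = mi; // common case: no allocation
        else
            heap.push_back(mi);   // spill to heap on the 9th entry
        ++size;
    }

    void *at(std::size_t i) const {
        return i < 8 ? inlineBuf[i] : heap[i - 8];
    }

    bool spilledToHeap() const { return size > 8; }
};
```

For typical basic blocks the dead list never leaves its inline buffer, so deferring deletions costs no allocations.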
Why NVPTX Needs Aggressive Dead Copy Elimination
NVPTX kernel signatures routinely have 20--60 parameters, each lowered through a CopyFromReg from a fixed physical register. The SelectionDAG legalizer creates CopyFromReg SDNodes for each parameter load, but many parameters are only used in a subset of the kernel's basic blocks. Without immediate dead copy elimination, a kernel with 50 parameters would carry 50 COPY MachineInstrs at function entry, most of which are dead in any given block. The standard LLVM DeadMachineInstrElimination pass would eventually clean these up, but doing so immediately during emission:
- Reduces the MachineBasicBlock size that subsequent passes (register allocation, scheduling) must process.
- Avoids creating unnecessary VReg-to-PhysReg interference entries in the register allocator.
- Prevents false register pressure signals from dead copies during the MRPA (Machine Register Pressure Analysis) pass that NVIDIA uses for scheduling decisions.
NVIDIA-Specific Emission Patterns
Parameter Cascade Emission
NVPTX kernel entry functions map each parameter to a physical register via a cascade of CopyFromReg SDNodes. During emission, this produces a dense block of COPY MachineInstrs at the top of the entry MachineBasicBlock. The emitter handles this pattern specially:
- When EmitSchedule processes the first SUnit, it detects a sequence of CopyFromReg nodes whose source registers are consecutive physical parameter registers (%r1, %r2, ...).
- Each CopyFromReg is processed through the Phase 5 path (at 0x2EDF423). The register class resolution at 0x2EDF4C2 uses the compressed bit-vector test to verify the destination belongs to the Int32Regs or Int64Regs class.
- Dead copy elimination (Phase 12) immediately removes copies whose destinations have no users, reducing the entry block size before subsequent passes see it.
Texture/Surface Glue Bundle Emission
Texture and surface operations are emitted as glue bundles: a chain of SDNodes connected by glue edges that must produce a contiguous sequence of MachineInstrs. The emitter walks the glue chain backward from the final node and emits predecessors first. The bit 36 flag is critical here: when a texture fetch produces both a data result and a predicate condition, the predicate-producing node carries bit 36 on its data operand, telling the emitter that the preceding glued instruction already materialized the value and no separate COPY is needed.
The triple vtable dispatch at the end of emission (Phase 5 in the algorithm) handles the expansion of texture pseudo-instructions: EmitInstrWithCustomInserter (vtable 0xB8) replaces the texture pseudo-MI with the actual address setup, sampler configuration, and fetch instruction sequence.
Multi-Result SDNode Self-Recursion
When an SDNode produces multiple results (e.g., a div+rem pair or a load-with-predicate), the emitter calls itself recursively at sub_2EDDF20 to emit MIs for each additional result. The self-recursive call shares the same InstrEmitter instance and hash tables. This is a CICC-specific pattern; upstream LLVM handles multi-result nodes in a loop within EmitMachineNode rather than via recursion. The recursive approach simplifies the handling of multi-result nodes that themselves have glue chains (e.g., a texture fetch that returns 4 components).
Opcode-1/Opcode-2 Inline Fold Detection
During the dead copy scan (Phase 12, offset 0x2EE08A0--0x2EE08BA), the emitter checks if the MI's opcode is 1 or 2 (COPY or REG_SEQUENCE). For these opcodes, it reads the first operand's byte at [operand_array + 0x40] and tests bit 4 (0x10). This bit indicates the result was consumed via an inline fold -- the consumer instruction selected a pattern that folds the copy directly into its own operand. When this bit is set, the COPY MI is marked dead regardless of its use count, because the consuming instruction no longer references it.
0x2EE08A0: movzx eax, word ptr [rdi+44h] ; MI->opcode
0x2EE08A4: sub eax, 1 ; opcode - 1
0x2EE08A7: cmp eax, 1 ; is it 1 (COPY) or 2 (REG_SEQUENCE)?
0x2EE08AA: ja not_copy ; no -> skip
0x2EE08AC: mov rax, [rdi+20h] ; MI->operands array
0x2EE08B0: test byte ptr [rax+40h], 0x10 ; bit 4 = inline fold consumed
0x2EE08B4: jnz mark_dead ; if folded -> dead
NVIDIA Modifications vs Stock LLVM
| Area | Upstream LLVM | CICC v13.0 |
|---|---|---|
| EmitNode dispatch | Two separate functions: EmitMachineNode + EmitSpecialNode | Single merged function sub_2EDDF20 with bit-table dispatch |
| CopyToReg | Inline in EmitSpecialNode | Factored into dedicated sub_2ED95B0 |
| Custom inserter check | Single vtable call to EmitInstrWithCustomInserter | Triple vtable dispatch (0xB8, 0x348, 0x160) |
| Extended MI flags | Standard LLVM flag set (32 bits) | Bit 36 (0x1000000000) for NVPTX-specific semantics |
| Dead copy elimination | Post-emission pass in ScheduleDAGSDNodes | Inlined aggressive cleanup within EmitNode |
| Stack frame | ~300--400 bytes typical | 872 bytes (multiple inline SmallVectors and hash tables) |
| Self-recursion | Not self-recursive | Self-recursive for multi-result SDNode chains |
| Inline fold detection | Not present at this stage | Opcode-1/2 fold bit check during dead copy scan |
| Glue chain secondary walk | Not present | Cascading dead copy detection through glue predecessors |
Complexity
- Main emission loop: O(N) in the number of scheduled SDNodes.
- Hash table lookups: O(1) amortized with rehashing at 3/4 load.
- Dead copy elimination: O(C * U) where C = copies emitted, U = average uses per register.
- Glue chain traversal: O(G) per node where G = glue chain length (typically 1--5).
- Memory: O(N) for the three hash tables + O(R) for result records.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| InstrEmitter::EmitNode | sub_2EDDF20 | -- | Main entry, 11,722 bytes |
| ScheduleDAGSDNodes::EmitSchedule | sub_2EE0CF0 | -- | Top-level driver, 59KB |
| EmitCopyToReg | sub_2ED95B0 | -- | Dedicated CopyToReg handler |
| getRegForValue | sub_2E8B400 | -- | SDValue to VReg mapping |
| isUnusedReg | sub_2E8B100 | -- | Dead register predicate |
| isDeadNode | sub_2DADC00 | -- | Dead SDNode predicate |
| eraseFromParent | sub_2E88E20 | -- | MachineInstr deletion |
| hasProperty | sub_2E88A90 | -- | Register/operand flag query |
| getVRegDef | sub_2EBEE10 | -- | Virtual register definition lookup |
| isPhysReg | sub_2EBEF70 | -- | Physical vs virtual register check |
| replaceRegWith | sub_2EBECB0 | -- | Virtual register substitution |
| clearKillFlags | sub_2EBF120 | -- | Remove kill annotations |
| Sub-register resolution | sub_2ED7930 | -- | SUBREG_TO_REG handling |
| EmitSubregNode | sub_2EDB7A0 | -- | Sub-register copy emission |
| EmitCopyToRegClassOp | sub_2EDD7E0 | -- | Class-constrained copy |
| ProcessOperands | sub_2ED3660 | -- | EmitMachineNode core |
| isAllocatableInClass | sub_2E6D360 | -- | Register class membership |
| DenseMap::find | sub_2E5E6D0 | -- | SDNode-to-MI lookup |
| addToDeadList | sub_2ED56A0 | -- | Queue MI for deletion |
| DenseMap::grow | sub_2E29BA0 | -- | Hash table resize |
| NVPTXInstrInfo default | sub_2ED11C0 | -- | EmitInstrWithCustomInserter stub |
| NVPTXInstrInfo default | sub_2ED11E0 | -- | getInsertSubreg stub |
| NVPTXInstrInfo default | sub_2ED11F0 | -- | expandPostRAPseudo stub |
| operand comparison | sub_2ED1840 | -- | Operand equality helper |
| MI builder | sub_2ED19B0 | -- | Additional MachineInstr construction |
| register mapping | sub_2ED41E0 | -- | Register mapping utility |
| register info query | sub_2ED4900 | -- | Register info accessor |
| MI property query | sub_2ED5D10 | -- | MachineInstr property reader |
| emission utility | sub_2EDA920 | -- | Additional emission helper |
| setDesc | sub_2EAB0C0 | -- | Sets MI operand descriptors during emission |
| addOperand | sub_2E31210 | -- | Appends operand to MachineInstr |
| MI manipulation | sub_2E31DD0 | -- | Additional MI manipulation utility |
| TRI utility | sub_2E4EE60 | -- | TargetRegisterInfo helper |
| NVPTXRegisterInfo | sub_2E4F5F0 | -- | Register class query vtable method |
Differences from Upstream LLVM
| Aspect | Upstream LLVM | CICC v13.0 |
|---|---|---|
| EmitNode structure | Separate EmitNode and EmitSpecialNode dispatchers | Merged into single monolithic function (sub_2EDDF20, 11,722 bytes) with bit-table opcode classification |
| CopyToReg handling | Inline within EmitSpecialNode | Factored out to dedicated handler (sub_2ED95B0) for NVPTX's physical-register-heavy .param ABI |
| MachineInstr flags | Standard flag bits (up to bit ~20) | Extended flag at bit 36 (0x1000000000) not present in stock LLVM; marks NVIDIA-specific instruction properties |
| Pseudo-expansion | Single vtable dispatch for target pseudo-instructions | Triple vtable dispatch pattern gating custom expansion for GPU-specific pseudo-instructions |
| Dead node predicate | Standard isDeadNode check | Custom sub_2DADC00 predicate with NVPTX-specific liveness criteria |
| VReg hash table | Standard DenseMap for value-to-VReg mapping | Custom hash with key * 37 and 3/4 load factor rehash policy |
Cross-References
- SelectionDAG & Instruction Selection -- the DAG construction and pattern-matching phase that produces the SDNodes consumed by InstrEmitter
- Instruction Scheduling -- ScheduleDAGSDNodes::EmitSchedule calls InstrEmitter after linearizing the scheduled sequence
- Register Allocation -- the VRegs created by InstrEmitter flow into the register allocator
- Register Coalescing -- coalesces the COPY instructions emitted here
- AsmPrinter & PTX Body Emission -- the final consumer of the MachineInstrs produced by InstrEmitter
TwoAddressInstruction
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
LLVM version note: Structurally identical to LLVM 20.0.0 TwoAddressInstructionPass.cpp. NVIDIA extensions are limited to deeper EXTRACT_SUBREG handling for multi-register results (texture/tensor/warp ops), extended LiveVariables maintenance, OptimizationRemarkEmitter integration, and the standard optnone/fast-compile gate.
The TwoAddressInstruction pass converts three-address MachineInstrs into two-address form by inserting COPY pseudo-instructions so that tied operand constraints are satisfied before register allocation. In upstream LLVM, many CPU targets have instructions where one source operand must be the same physical register as the destination (x86 addl %esi, %edi means %edi = %edi + %esi); the pass rewrites A = B op C into A = COPY B; A op= C. On NVPTX this pass is largely a formality -- PTX instructions are three-address and the virtual register file has no physical-register constraints -- but it still performs essential bookkeeping: eliminating REG_SEQUENCE and INSERT_SUBREG pseudo-instructions, building copy-equivalence maps for downstream coalescing, and handling the tied operands that arise from multi-result NVPTX intrinsics (texture loads, tensor core operations, warp-level collectives). CICC's binary is structurally identical to stock LLVM, with extended EXTRACT_SUBREG handling for multi-register results, deeper LiveVariables maintenance, OptimizationRemarkEmitter integration, and the standard NVIDIA optnone/fast-compile gate.
| Pass name | "Two-Address instruction pass" |
| Pass ID | "twoaddressinstruction" |
| Pipeline slot | "two-address-instruction" (MachineFunction pass #521) |
| runOnMachineFunction | sub_1F53550 (79KB, 2,470 lines) |
| tryInstructionTransform | sub_1F4EF20 (28KB, 1,127 lines) |
| processTiedPairs | sub_1F50270 (63KB, 2,209 lines) |
| Cluster address range | 0x1F4D000 -- 0x1F56000 |
| libNVVM twin | sub_F4EA80 (2,455 lines, structurally identical) |
| Verification string | "After two-address instruction pass" |
| Ordering | After PHI elimination, before RegisterCoalescer |
Why This Pass Exists on NVPTX
PTX is a three-address virtual ISA -- every arithmetic instruction takes separate dst, src0, src1 operands, and the hardware register allocator inside ptxas handles physical assignment. On a CPU target like x86, the TwoAddress pass is critical because most ALU instructions destroy one source register. On NVPTX, the pass fires primarily for three categories:
- Pseudo-instruction lowering. REG_SEQUENCE, INSERT_SUBREG, and EXTRACT_SUBREG are LLVM-internal pseudo-opcodes that must be eliminated before register allocation regardless of target. The TwoAddress pass rewrites INSERT_SUBREG into COPY and expands REG_SEQUENCE into per-subreg copies.
- Multi-result intrinsics. NVPTX texture/surface loads return v4f32 or v2f64 as multi-register results. Warp-level operations (wmma, mma) produce multi-register outputs. These get lowered into chains of EXTRACT_SUBREG pseudo-instructions that the pass must decompose into individual COPYs, one per extracted component.
- Inline assembly tied operands. CUDA inline asm blocks with "+r" (read-write) constraints produce tied operands where the output register must match the input. The pass inserts a COPY from the input virtual register to the output register to satisfy the constraint.
For most ordinary NVPTX arithmetic instructions, collectTiedOperands finds nothing and the pass skips the instruction after updating the distance map and processing any copy-equivalence information. The pass is not a no-op, but the heavy transformation paths (commutation, 3-address conversion, load unfolding) almost never fire for GPU code.
Algorithm
The pass iterates over every MachineBasicBlock and every MachineInstr within it, maintaining per-block data structures that are cleared at block boundaries.
for each MBB in MF:
clear DistanceMap, SrcRegMap, DstRegMap, SrcEqClassMap, DstEqClassMap, Processed
dist = 0
for each MI in MBB:
skip bundle internals
skip COPY (opcode 12) and SUBREG_TO_REG (opcode 13)
skip if MI is in the "reprocess" set
if MI is EXTRACT_SUBREG (opcode 14):
// NVPTX extended path -- multi-result decomposition
// See detailed algorithm below
decomposeExtractSubreg(MI)
continue
if MI is REG_SEQUENCE (opcode 15):
// Standard LLVM: expand into per-subreg COPYs
eliminateRegSequence(MI)
continue
DistanceMap[MI] = ++dist
// Build copy-equivalence classes for downstream coalescing
processCopy(MI) // tracks COPY, REG_SEQUENCE, INSERT_SUBREG chains
// Collect (srcIdx, dstIdx) pairs for all tied operands
if not collectTiedOperands(MI, TiedOperandMap):
continue
// Single-pair fast path: attempt commutation / 3-addr conversion
if TiedOperandMap has exactly 1 register with 1 pair:
if tryInstructionTransform(MI, srcIdx, dstIdx, dist):
continue // constraint eliminated without COPY
// General path: insert COPYs for all remaining tied pairs
for each (reg, pairs) in TiedOperandMap:
processTiedPairs(MI, pairs, dist)
// Rewrite INSERT_SUBREG to COPY after tied constraints satisfied
if MI is INSERT_SUBREG:
remove operands 3 and 1
rewrite descriptor to COPY
tryInstructionTransform (sub_1F4EF20)
This is the optimization core. When OptLevel != None, it attempts to satisfy a tied constraint without inserting a COPY, in priority order:
- Commutation. If swapping operands makes src match dst, commute the instruction via TII->commuteInstruction(). On NVPTX, most arithmetic instructions are commutative, so this is the most frequent success path. Upstream uses isProfitableToCommute(), which walks up to MaxDataFlowEdge (default 3) dataflow edges to evaluate benefit.
- 3-address conversion. Call TII->convertToThreeAddress() to produce a true three-operand form. On NVPTX this is essentially dead code -- PTX instructions are already three-address -- but the infrastructure exists because the pass is shared LLVM code.
- Rescheduling. When twoaddr-reschedule is enabled (default true), attempt to move the kill of the source register closer to the current instruction (rescheduleMIBelowKill) or move the current instruction below the kill (rescheduleKillAboveMI). This can eliminate the need for a copy by making the source register die at the tied use.
- Load unfolding. For instructions with folded loads where the source is not killed, unfold the load into a separate MOV + arithmetic pair. Not applicable on NVPTX (no load folding).
- COPY insertion. If all optimization attempts fail, fall through to processTiedPairs, which inserts an explicit COPY.
The function calls itself recursively (22 cross-references including a recursive self-call at sub_1F4EF20) for transitive constraint resolution -- when unfolding creates a new instruction that itself has tied operands, the resolution recurses.
EXTRACT_SUBREG Multi-Result Decomposition Algorithm
This is the most substantial NVIDIA extension to the upstream pass. The code lives at lines 821--994 of sub_1F53550 (decompilation line numbers from the 2,470-line function body). Standard LLVM handles single-result EXTRACT_SUBREG; the NVPTX version handles multi-result instructions where the InstrEmitter has produced a single EXTRACT_SUBREG pseudo with multiple operand pairs representing all extracted components.
Why Multi-Result EXTRACT_SUBREG Exists
When InstrEmitter::EmitNode (sub_2EDDF20, 872-byte stack frame, self-recursive for multi-result SDNode chains) lowers a multi-result NVPTX intrinsic, it produces a single MachineInstr with opcode 14 (EXTRACT_SUBREG) carrying N operand pairs -- one per result component. Each pair contains a def register (the extracted component destination) and a use register (the source super-register) plus a subreg index encoding which component to extract. The TwoAddress pass must decompose this single multi-operand pseudo into N separate COPY instructions.
The major producer categories:
| Producer | Handler | ID range | Typical result width |
|---|---|---|---|
| Texture/surface loads | sub_33A4350 | 50 IDs (0x5D--0x8D) | v4f32, v2f64, v4i32 |
| WMMA / MMA operations | sub_33A64B0 | 95 IDs (0xA4--0xA8, 0x194--0x1EC) | 2--8 register fragments |
| Multi-element surface ops | case 0xA2 | single | loop over elements |
| MMA sm90+ (wgmma) | sub_33AC8F0 | 0x183--0x191 | 8--16 register fragments |
| TMA operations | sub_33AD3D0 | 0x179--0x17C | varies |
| Async copy | sub_33ADA20 | 0x17F--0x182 | 2 results (data + token) |
The DAG-level builders that produce multi-result nodes are sub_3411BE0 (multi-result DAG node), sub_33FC220 (multi-result variadic node), and sub_33F7800 (multi-result alternate form). The type list is built by sub_1D25C30 (SelectionDAG::getVTList for multi-result).
Operand Memory Layout
Each MachineOperand occupies 40 bytes in memory (stride 40 per operand in the operand array):
| Offset within operand | Size | Field |
|---|---|---|
| +0 | byte | Flags byte 0: bit 0 = isDef |
| +2 | word | Flags word: bits 4--11 = subreg class index, bits 8--19 = subreg index |
| +3 | byte | Flags byte 3: bit 4 = isTied, bit 6 = earlyTied |
| +4 | byte | Flags byte 4: bit 0 = isTied flag (secondary) |
| +8 | int64 | Register number (virtual reg > 0, physical reg < 0) |
The subreg index is extracted by the formula:
subregIdx = (*(uint32_t*)(operand + 0) >> 8) & 0xFFF
This 12-bit field encodes which sub-register of the source to extract: sub0, sub1, sub2, sub3, etc. For a v4f32 texture result, the values are typically 1 through 4.
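The bit-field extraction can be written as a one-line helper. This is a standalone sketch of the recovered formula, not code from the binary:

```python
def subreg_index(flags: int) -> int:
    """Extract the 12-bit subreg index from bits 8-19 of an operand's
    first flags dword, mirroring (flags >> 8) & 0xFFF."""
    return (flags >> 8) & 0xFFF
```

For a v4f32 result, the four def operands would carry field values 1 through 4.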
Decomposition Pseudocode
// sub_1F53550 lines 821-994: EXTRACT_SUBREG handler (opcode == 14)
decomposeExtractSubreg(MI):
numOps = MI.getNumOperands() // v405
pairIdx = 0 // v286, stride-2 counter
while pairIdx < numOps:
defOp = MI.getOperand(pairIdx) // base + pairIdx * 40
useOp = MI.getOperand(pairIdx + 1) // base + (pairIdx+1) * 40
dstReg = defOp.getReg() // *(int64*)(defOp + 8)
srcReg = useOp.getReg() // *(int64*)(useOp + 8)
// Extract subreg index from def operand flags (bits 8-19)
subregIdx = (defOp.flags >> 8) & 0xFFF
// Check if this operand is already tied (bit 0 of byte +4)
alreadyTied = (defOp.flagsByte4 & 1) != 0
// === CREATE COPY INSTRUCTION ===
// sub_1E0B640(MBB, insertPoint, MI.getDebugLoc(), 0)
// This is BuildMI -- allocates a new MachineInstr with opcode COPY
newCOPY = BuildMI(MBB, MI, MI.getDebugLoc(), TII.get(TargetOpcode::COPY))
// Insert into block's instruction list
if MI.isBundledWithSucc():
sub_1DD6E10(MBB, MI, newCOPY) // insertBefore (bundled variant)
else:
sub_1DD5BA0(MBB, MI, newCOPY) // standard list insert
// Add def operand: destination register with subreg class encoding
// sub_1E1A9C0(newCOPY, dstReg, flags_with_subregclass)
newCOPY.addOperand(MachineOperand::CreateReg(dstReg, /*isDef=*/true))
// Add use operand: source register
// sub_1E1A9C0(newCOPY, srcReg, flags_use)
newCOPY.addOperand(MachineOperand::CreateReg(srcReg, /*isDef=*/false))
// === EARLY TIED OPTIMIZATION ===
// When this is NOT the first pair (pairIdx > 0) and the instruction
// has tied constraints, check if a later pair shares the same dest
// register. If so, mark the first operand of this COPY with isTied,
// allowing the register coalescer to merge them without an extra COPY.
if pairIdx > 0:
earlyTiedCheck = (defOp.flagsByte3 >> 6) & 1 // bit 6
isTiedCheck = (defOp.flagsByte3 >> 4) & 1 // bit 4
if earlyTiedCheck AND isTiedCheck:
newCOPY.getOperand(0).setTied() // set bit 0 of byte +4
// === OPTIMIZATION REMARK ===
if ORE != null: // pass object offset +272
sub_1DCCCA0(ORE, dstReg, MI, newCOPY) // emit copy remark
remarkData = sub_1DCC790(ORE, dstReg) // lookup remark data
sub_1F4C640(remarkData) // filter/emit remark
sub_1DCBB50(ORE) // push to output
if newCOPY.isInsideBundle():
walk to bundle head via successor chain
if sub_1E1AFE0(bundleHead): // hasProperty check
sub_1DCC370(ORE, remarkNode) // append to list
// === LIVEVARIABLES UPDATE ===
if LV != null: // pass object offset +280
sub_1DBF6C0(LV, MBB, MI, newCOPY, ...)
// This calls the full update chain:
// sub_1DBA290: createNewVarInfo for newCOPY's def register
// sub_1DBB110: initVarInfo (initialize kill/def lists)
// sub_1DB3C70: findKill (locate kill point in block)
// sub_1DB4410: addKill (update kill tracking for srcReg)
// sub_1DB8610: addNewBlock (update block-level liveness)
pairIdx += 2 // v286 += 2 (stride-2)
// === CLEANUP ===
// Remove all operands from original MI, then erase it
sub_1E16240(MI) // RemoveOperand (bulk)
MI.eraseFromParent()
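The stride-2 walk above can be condensed into a runnable sketch. Operands are modeled here as (register, flags) tuples; the names are illustrative, not recovered from the binary:

```python
def decompose_extract_subreg(operands):
    """Walk def/use operand pairs with stride 2, emitting one COPY record
    (dst, src, subreg_idx) per pair, as the EXTRACT_SUBREG handler does."""
    copies = []
    for pair_idx in range(0, len(operands), 2):
        dst_reg, def_flags = operands[pair_idx]        # def operand
        src_reg, _use_flags = operands[pair_idx + 1]   # use operand
        subreg_idx = (def_flags >> 8) & 0xFFF          # bits 8-19 of flags
        copies.append((dst_reg, src_reg, subreg_idx))
    return copies
```

A v4f32 texture load (4 def/use pairs against one super-register) yields 4 COPY records, one per extracted component.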
earlyTied Optimization Detail
The earlyTied optimization is a critical performance path. Consider a v4f32 texture load producing 4 results. Without earlyTied, the decomposition creates 4 independent COPY instructions. The register coalescer must then discover independently that some of these COPYs can be coalesced.
The earlyTied flag (bit 6 of operand flags byte +3) is set during instruction emission when the emitter knows that consecutive extract results target adjacent sub-registers of a contiguous super-register. When detected, the pass marks the COPY's def operand with the isTied bit, creating a chain of tied constraints:
// Without earlyTied (4 independent COPYs, coalescer must work harder):
%dst0 = COPY %src.sub0
%dst1 = COPY %src.sub1
%dst2 = COPY %src.sub2
%dst3 = COPY %src.sub3
// With earlyTied (COPYs carry tie hints, coalescer has direct information):
%dst0 = COPY %src.sub0 // first pair: no tie
%dst1 = COPY %src.sub1 [tied to %dst0.succ] // isTied bit set
%dst2 = COPY %src.sub2 [tied to %dst1.succ] // isTied bit set
%dst3 = COPY %src.sub3 [tied to %dst2.succ] // isTied bit set
The condition is: (flagsByte3 >> 6) & 1 (earlyTied set) AND (flagsByte3 >> 4) & 1 (isTied set) AND pairIdx > 0 (not the first pair). This triple-guard prevents false positives on single-result extracts and on the first component which has no predecessor to tie to.
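The triple guard reduces to a small predicate. A minimal sketch, assuming the flag positions recovered above:

```python
def should_set_tied(flags_byte3: int, pair_idx: int) -> bool:
    """Triple guard for the earlyTied optimization: bit 6 (earlyTied) and
    bit 4 (isTied) of flags byte +3 must both be set, and the pair must not
    be the first one (the first component has no predecessor to tie to)."""
    early_tied = (flags_byte3 >> 6) & 1
    is_tied = (flags_byte3 >> 4) & 1
    return bool(early_tied and is_tied) and pair_idx > 0
```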
LiveVariables Update Chain
Every COPY produced by the decomposition triggers a six-function update sequence. This is deeper than upstream LLVM's TwoAddress LiveVariables handling and suggests NVIDIA's downstream register allocator (the greedy RA at sub_1E5B110) is particularly sensitive to stale liveness:
| Step | Function | Purpose |
|---|---|---|
| 1 | sub_1DBF6C0 | Entry: transfer liveness from old MI to new COPY |
| 2 | sub_1DBA290 | createNewVarInfo: allocate VarInfo for the COPY's def register |
| 3 | sub_1DBB110 | initVarInfo: initialize the VarInfo's kill list, def list, and alive-block bitvector |
| 4 | sub_1DB3C70 | findKill: scan the current block to locate where srcReg is killed |
| 5 | sub_1DB4410 | addKill / removeKill: move the kill point from the original MI to the new COPY (srcReg now dies at the COPY, not at the original EXTRACT_SUBREG) |
| 6 | sub_1DB8610 | addNewBlock: update block-level liveness bitvectors if srcReg is live-in to this block from a predecessor |
For a v4f32 decomposition, this executes 24 function calls (6 per component times 4 components). For a wmma.mma producing 8 fragments, it is 48 calls. The cost is quadratic in the worst case because findKill scans from the block start, but in practice the kill is always close to the insertion point.
Multi-Result Producers on NVPTX
The EXTRACT_SUBREG decomposition path fires for all NVPTX operations that produce more than one register result. These originate in the intrinsic lowering pass (sub_33A64B0 and friends in the 0x33A cluster) and flow through SelectionDAG ISel and InstrEmitter before reaching TwoAddress.
Texture and Surface Loads
The texture bulk handler sub_33A4350 covers 50 intrinsic IDs (0x5D through 0x8D). A tex.1d.v4.f32 intrinsic produces an SDNode with value type list {f32, f32, f32, f32, chain} via sub_1D25C30 (getVTList). InstrEmitter converts this into a single MachineInstr with 8 operands (4 def/use pairs), which TwoAddress decomposes into 4 COPYs.
Surface read/write handlers at sub_33A3180 (IDs 0x8E--0x90) and the scatter/gather handler at case 0xA2 follow the same pattern with variable result widths.
WMMA and MMA Operations
The mega-handler sub_33A64B0 services 95 intrinsic IDs covering all wmma/mma variants across sm70+. A wmma.mma.sync on sm70 with fp16 accumulation produces 8 f16x2 fragments; on sm80 with tf32 it produces 4 f32 fragments. The sm90+ wgmma handler at sub_33AC8F0 (IDs 0x183--0x191) can produce up to 16 register fragments for large matrix shapes.
Each fragment becomes one operand pair in the EXTRACT_SUBREG pseudo. The TwoAddress pass decomposes a 16-fragment wgmma result into 16 individual COPYs, each with full LiveVariables update. This is the most expensive decomposition path in the entire pass.
TMA and Async Copy
TMA bulk operations (sub_33AD3D0, IDs 0x179--0x17C) and async copy operations (sub_33ADA20, IDs 0x17F--0x182) produce 2-result nodes (data + completion token). These are simpler decompositions with only 2 COPY instructions.
Inline Assembly Tied Operands
CUDA inline assembly with "+r" read-write constraints is the third category that exercises the TwoAddress pass on NVPTX. The tied operand pipeline spans three compilation stages:
Stage 1: EDG Constraint Construction (sub_1286D80 path)
The EDG frontend's inline asm codegen (analyzed in p2-B07-inline-asm-codegen.txt) detects tied operands when the operand descriptor byte at offset +24 equals 3. It constructs the constraint string by:
- Emitting the input value via sub_1286D80
- Appending * for indirect operands
- Appending the tied operand index as a decimal number to the constraint string
If the type size is a power of 2 and 64 bits or less, it may insert a bitcast to a matching integer type. GCC-style matching-digit constraints in input position are explicitly rejected with "tied input/output operands not supported!".
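Under that description, the constraint-string construction reduces to the following sketch (illustrative only; the real EDG code emits into its own IL buffers):

```python
def tied_constraint(tied_index: int, indirect: bool = False) -> str:
    """Build a matching constraint for a tied input operand: an optional
    '*' prefix for indirect operands, then the tied output's index
    rendered as a decimal number."""
    return ("*" if indirect else "") + str(tied_index)
```

For a "+r" output at position 0, the generated input-side constraint would be "0"; an indirect operand tied to slot 2 would become "*2".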
Stage 2: DAG-Level Tied Resolution (sub_2079C70)
SelectionDAGBuilder::visitInlineAsm (sub_2079C70, 83KB) uses:
- sub_20B4290: hasTiedOperand() -- checks if the tied index is not -1
- sub_20B42B0: getTiedOperand() -- returns the tied index
- sub_2045250: resolveTiedOperand() -- creates the DAG-level constraint
The error string "inline asm not supported yet: don't know how to handle tied indirect register inputs" guards against the unsupported case of tied operands on memory-indirect inline asm operands.
Stage 3: TwoAddress COPY Insertion
After ISel, the tied operand from inline asm appears as a regular tied constraint in the MachineInstr operand list. The TwoAddress pass processes it through the standard collectTiedOperands / processTiedPairs path. For "+r" constraints this typically produces a single COPY before the INLINEASM instruction.
processTiedPairs Detail (sub_1F50270)
This 63KB / 2,209-line function is the heavyweight tied-operand resolver. It is called from the main loop whenever collectTiedOperands finds constraints that the fast path (tryInstructionTransform) could not resolve.
processTiedPairs(MI, tiedPairs, distance):
for each (srcIdx, dstIdx) in tiedPairs:
srcReg = MI.getOperand(srcIdx).getReg()
dstReg = MI.getOperand(dstIdx).getReg()
if srcReg == dstReg:
continue // constraint already satisfied
// === ATTEMPT COMMUTATION (OptLevel != None) ===
if canCommute(MI):
// isProfitableToCommute walks up to MaxDataFlowEdge (default 3)
// dataflow edges from srcReg and dstReg, comparing distances
// in DistanceMap to determine if commuting reduces copies
if isProfitableToCommute(MI, srcIdx, dstIdx, distance):
TII->commuteInstruction(MI)
if MI.getOperand(srcIdx).getReg() == MI.getOperand(dstIdx).getReg():
continue // resolved by commutation
// === ATTEMPT RESCHEDULING (twoaddr-reschedule = true) ===
if twoAddrReschedule:
// Try to move MI below the kill of srcReg
if rescheduleMIBelowKill(MI, srcIdx, dstIdx, distance):
continue // resolved by rescheduling
// Try to move the kill of srcReg above MI
if rescheduleKillAboveMI(MI, srcIdx, dstIdx, distance):
continue // resolved by rescheduling
// === ATTEMPT 3-ADDRESS CONVERSION ===
// On NVPTX, convertToThreeAddress always returns null (dead code)
if TII->convertToThreeAddress(MI, LIS):
continue // resolved by conversion (never happens on NVPTX)
// === INSERT COPY (last resort) ===
newCOPY = BuildMI(MBB, MI, DL, TII.get(COPY), dstReg).addReg(srcReg)
// Extract subreg index from original operand
subregIdx = (MI.getOperand(srcIdx).flags >> 8) & 0xFFF
if subregIdx != 0:
newCOPY.getOperand(1).setSubReg(subregIdx)
// Insert into DistanceMap with incremented counter
// Walk predecessor chain to find scheduling unit
DistanceMap[newCOPY] = ++distance
DistanceMap[MI] = ++distance
// Rewrite srcReg to dstReg in original MI
MI.getOperand(srcIdx).setReg(dstReg) // sub_1E310D0
// Update SrcEqClassMap: map srcReg -> dstReg
SrcEqClassMap.insert(srcReg, dstReg) // sub_1F4E3A0
// === LIVEVARIABLES UPDATE ===
if LV:
varInfo = LV.getVarInfo(dstReg) // sub_1DC1550
if varInfo not found:
varInfo = LV.createNewVarInfo(dstReg) // sub_1DBA290
LV.initVarInfo(varInfo) // sub_1DBB110
// Transfer kill info: srcReg kill moves from MI to newCOPY
killInfo = varInfo.findKill(MBB) // sub_1DB3C70
varInfo.addKill(newCOPY, flags) // sub_1DB4410
// Update block-level liveness
varInfo.addNewBlock(MBB, position) // sub_1DB8610
// === OPTIMIZATION REMARK ===
if ORE and commutationWasAttempted: // v384 flag
sub_1DCC790(ORE, srcReg) // lookup remark data
sub_1F4C640(remarkData) // filter remark
sub_1DCBB50(ORE) // push
if newCOPY.isInsideBundle():
walk to bundle head
if sub_1E1AFE0(bundleHead):
sub_1DCC370(ORE, remarkNode) // append to list
// === REGISTER CLASS TIGHTENING ===
// sub_1E69410(SubtargetInfo, dstReg, regClass, 0)
// constrainRegClass on the destination register to the intersection
// of the current class and the class required by the tied operand
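The resolution order in the pseudocode above is a strict priority chain. A hedged sketch of just the control flow, with the resolvers passed in as callables (names are illustrative):

```python
def resolve_tied_pair(mi, resolvers):
    """Try each resolver (commutation, rescheduling, 3-address conversion)
    in priority order; only if all fail does the pass fall through to
    explicit COPY insertion."""
    for name, attempt in resolvers:
        if attempt(mi):
            return name          # constraint eliminated without a COPY
    return "insert_copy"         # last resort
```

On NVPTX the "convert" resolver always fails, so in practice the chain is commute, reschedule, then COPY.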
INSERT_SUBREG Rewrite (lines 2386--2396)
After all tied pairs are processed for an INSERT_SUBREG instruction (opcode 8), the pass converts it into a plain COPY:
if MI.getOpcode() == INSERT_SUBREG:
// Propagate subreg encoding from operand[3] into operand[0]
subregBits = MI.getOperand(3).getSubRegIdx()
MI.getOperand(0).setSubReg(subregBits)
// Copy tie flag from operand[1] into operand[0]
MI.getOperand(0).setTied(MI.getOperand(1).isTied())
// Remove operands 3 and 1 (in reverse order to preserve indices)
MI.RemoveOperand(3) // sub_1E16C90(MI, 3)
MI.RemoveOperand(1) // sub_1E16C90(MI, 1)
// Rewrite opcode descriptor to COPY
MI.setDesc(TII.get(COPY)) // descriptor at TII + 960
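The removal order matters: deleting operand 3 before operand 1 keeps the not-yet-removed index stable. A generic sketch of that pattern:

```python
def remove_operands(operands, indices):
    """Delete operand slots highest-index-first so earlier deletions do not
    shift the positions of slots still to be removed (the pass removes
    operand 3, then operand 1)."""
    for i in sorted(indices, reverse=True):
        del operands[i]
    return operands
```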
Copy-Equivalence Classes
The pass builds two maps (SrcEqClassMap at offset +552, DstEqClassMap at +584) that track transitive copy chains. When it encounters COPY, REG_SEQUENCE, or INSERT_SUBREG instructions, it records the source-to-destination register mapping. The helper collectRegCopies (sub_1F4E620, 357 lines) walks use-def chains to build transitivity: if A -> B -> C via COPYs, then A maps directly to C. These maps are consumed by the downstream RegisterCoalescer to improve copy elimination.
The collectRegCopies algorithm:
collectRegCopies(startReg):
chain = SmallVector()
reg = startReg
while true:
if not MRI.hasOneUse(reg): // sub_1E69E00
break
defMI = MRI.getVRegDef(reg)
if defMI.getOpcode() not in {COPY, REG_SEQUENCE, INSERT_SUBREG}:
break
nextReg = defMI.getOperand(1).getReg()
chain.push(reg)
reg = nextReg
// Process chain in reverse: build transitivity
for i in reverse(range(len(chain) - 1)):
SrcEqClassMap.insert(chain[i], chain[i+1]) // sub_1F4E3A0
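The transitivity this builds can be sketched as a chain walk. A minimal, cycle-guarded version (standalone illustration, not the binary's data structures):

```python
def copy_chain_end(copy_link, start):
    """Follow copy links (one register -> the register it is copied to/from)
    until the chain ends, so A -> B -> C collapses to a direct A -> C
    relation for the downstream coalescer."""
    reg, seen = start, {start}
    while reg in copy_link and copy_link[reg] not in seen:
        reg = copy_link[reg]
        seen.add(reg)
    return reg
```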
Data Structures
TiedOperandMap (stack-allocated SmallDenseMap<unsigned, SmallVector<pair<unsigned,unsigned>, 4>> with 4 inline entries):
| Offset in entry | Type | Field |
|---|---|---|
| +0 | int32 | Key (virtual register number; -1 = empty, -2 = tombstone) |
| +8 | ptr | Pair list pointer (points to +24 for inline storage) |
| +16 | int32 | Pair list size |
| +20 | int32 | Pair list capacity |
| +24 | int64[4] | Inline pair storage (each qword packs srcIdx and dstIdx) |
Entry stride: 56 bytes. Hash function: 37 * key, linear probing, load factor 3/4. Total inline size: 224 bytes on stack.
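The probing scheme can be illustrated with a few lines of standalone code, assuming the recovered parameters (hash = 37 * key, power-of-2 table, linear probing, sentinels -1 and -2):

```python
EMPTY, TOMBSTONE = -1, -2

def find_slot(keys, key):
    """Open-addressing probe in the recovered TiedOperandMap style:
    start at (37 * key) masked to the power-of-2 table size, then probe
    linearly until the key or a reusable sentinel slot is found."""
    mask = len(keys) - 1
    idx = (37 * key) & mask
    while keys[idx] not in (key, EMPTY, TOMBSTONE):
        idx = (idx + 1) & mask
    return idx
```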
DistanceMap (DenseMap<MachineInstr*, unsigned> at pass object offsets +312..+336): maps each MI to its sequential position within the current block. Hash: (ptr >> 4) ^ (ptr >> 9). Used by tryInstructionTransform and processTiedPairs for rescheduling decisions and commutation profitability evaluation.
Pass Object Layout (selected fields):
| Offset | Type | Field |
|---|---|---|
| +232 | MachineFunction* | Current function |
| +240 | MachineRegisterInfo* | MRI |
| +248 | TargetInstrInfo* | TII |
| +256 | TargetRegisterInfo* | TRI |
| +264 | ptr | InstrItineraryData* or TargetSubtargetInfo* |
| +272 | OptimizationRemarkEmitter* | ORE (NVIDIA addition) |
| +280 | LiveVariables* | LV |
| +288 | LiveIntervals* | LIS (via SlotIndexes at +160) |
| +296 | int | Effective optimization level |
| +304 | MachineBasicBlock* | Current MBB |
| +312..+336 | DenseMap | DistanceMap |
| +344..+376 | SmallPtrSet | Processed set |
| +448..+476 | SmallPtrSet | Second set (reprocessing) |
| +552..+576 | DenseMap | SrcEqClassMap |
| +584..+608 | DenseMap | DstEqClassMap |
Tied Operand Scanning (Lines 1183--1413)
The collectTiedOperands logic iterates all operands of an instruction checking for tied constraints. The inner loop (at STEP 7 in the raw analysis) contains a special-case direct resolution path:
for opIdx in 0..numOps-1:
// Skip defs, already-tied, and operands with no subreg class
if operand.isDef(): continue // byte +0 != 0
if operand.isTied(): continue // bit 4 of byte +3
if operand.subregClass == 0: continue // bits 4-11 of word +2
tiedIdx = MI.findTiedOperandIdx(opIdx) // sub_1E16AB0
srcReg = operand[opIdx].getReg()
dstReg = operand[tiedIdx].getReg()
if srcReg == dstReg:
continue // already satisfied
// SPECIAL CASE: direct resolution without COPY
if operand.isTied(secondary) AND def.subregClass == 0:
if dstReg < 0: // physical register
regClass = sub_1F3AD60(MRI, instrDesc, opIdx, TII, MF)
if regClass:
MRI.constrainRegClass(dstReg, regClass) // sub_1E69410
operand.setReg(dstReg) // sub_1E310D0
operand.clearSubregBits() // *operand &= 0xFFF000FF
// Constraint resolved: use now points to same reg as def
continue
// NORMAL: add to TiedOperandMap
TiedOperandMap[srcReg].push({opIdx, tiedIdx}) // packed as qword
The special-case path at the isTied(secondary) check (bit 0 of byte +4) handles the case where the operand carries a secondary tie flag from instruction emission and the def side has no subreg class constraint. In this case the pass can directly rewrite the use register to match the def without inserting a COPY, and clears the subreg bits with the mask 0xFFF000FF.
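The mask arithmetic is worth spelling out: 0xFFF000FF keeps bits 0-7 and 20-31 while zeroing bits 8-19, the 12-bit subreg index field. A standalone sketch:

```python
SUBREG_CLEAR_MASK = 0xFFF000FF  # keeps bits 0-7 and 20-31, zeroes bits 8-19

def clear_subreg_bits(flags: int) -> int:
    """Apply the direct-resolution path's mask: the subreg index field
    (bits 8-19) is cleared, every other flag bit is preserved."""
    return flags & SUBREG_CLEAR_MASK
```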
NVIDIA Modifications
The pass is structurally stock LLVM -- the libNVVM build at sub_F4EA80 is byte-for-byte identical in structure, confirming shared source. The NVIDIA delta consists of four additions:
- Extended EXTRACT_SUBREG handling (lines 821--994 of the decompilation). Standard LLVM handles single EXTRACT_SUBREG; the NVPTX version handles multi-result instructions with multiple extract chains via stride-2 operand iteration. This is required for texture/surface loads returning v4f32, wmma/mma producing multi-register fragments, and similar multi-result NVPTX intrinsics. The earlyTied optimization (checking bits 4 and 6 of operand flags byte +3) is unique to this extension and provides direct coalescing hints for contiguous sub-register sequences.
- Deeper LiveVariables maintenance (lines 1791--2064). When a COPY is inserted, the pass creates new VarInfo entries (sub_1DBA290), initializes them (sub_1DBB110), updates kill info (sub_1DB3C70/sub_1DB4410), and maintains block-level liveness (sub_1DB8610). This six-function chain executes per COPY, not per instruction. For a 16-fragment wgmma result, this produces 96 function calls for liveness maintenance alone.
- OptimizationRemarkEmitter integration (lines 2207--2258). The pass reports cases where tied-operand constraints forced extra COPY insertions, providing performance diagnostic information. This is absent in upstream LLVM's TwoAddress pass. The ORE pointer is stored at pass object offset +272 and acquired via analysis lookup of unk_4FC4534. The five-function chain (sub_1DCCCA0 through sub_1DCC370) handles remark creation, filtering, and bundle-aware emission.
- optnone/fast-compile gate (sub_1636880). When the function has optnone or when NVIDIA's fast-compile mode is active, the effective optimization level is forced to 0. This disables commutation, 3-address conversion, and rescheduling attempts in tryInstructionTransform (which returns false immediately when OptLevel == None), making the pass a pure COPY-insertion pass with no optimization.
Knobs
| Knob | Default | Effect |
|---|---|---|
| twoaddr-reschedule | true | Enable/disable instruction rescheduling to coalesce copies. When true, the pass attempts to move instructions up or down within the block to avoid needing a COPY. |
| dataflow-edge-limit | 3 | Maximum number of dataflow edges to traverse when evaluating the profitability of commuting operands in isProfitableToCommute(). Higher values allow deeper analysis at compile-time cost. |
Both knobs are registered in constructor ctor_337 (found in the sweep at 0x4F0000--0x51FFFF). They are standard upstream LLVM options with no NVIDIA-specific modifications to their defaults.
The optnone/fast-compile gate is not a knob per se but has the effect of disabling all optimization paths in the pass, equivalent to setting both knobs to their most conservative values.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| Pass registration (name + ID) | sub_1F4D900 | small | Sets "Two-Address instruction pass" and "twoaddressinstruction" |
| Constructor | sub_1F4D9F0 | small | |
| Helper: rescheduleMIBelowKill support | sub_1F4CC10 | -- | Called by sub_1F4EF20 |
| Helper: rescheduleKillAboveMI support | sub_1F4D060 | -- | Called by sub_1F4EF20 |
| SmallPtrSet::contains(MI*) | sub_1F4DD40 | 67 lines | Processed set membership check |
| SmallDenseMap::clear() | sub_1F4DE20 | 180 lines | TiedOperandMap cleanup, frees heap-allocated pair lists |
| DenseMap<int,int>::insert | sub_1F4E3A0 | 166 lines | EqClassMap insertion, hash = 37 * key |
| collectRegCopies | sub_1F4E620 | 357 lines | Walks COPY chains to build transitive equivalence classes |
| DenseMap<ptr,int>::insert | sub_1F4EC70 | 164 lines | DistanceMap insertion, hash = (ptr>>4) ^ (ptr>>9) |
| tryInstructionTransform | sub_1F4EF20 | 28KB / 1,127 lines | Core tied-operand rewriter: commutation, 3-addr, COPY. Recursive (22 xrefs). |
| processTiedPairs | sub_1F50270 | 63KB / 2,209 lines | Full pipeline: commute, convert, COPY insertion, LV/LI update |
| SmallDenseMap::grow | sub_1F53020 | 312 lines | TiedOperandMap rehash, 56-byte entry stride |
| runOnMachineFunction | sub_1F53550 | 79KB / 2,470 lines | Pass entry point |
| Helper: find matching superclass | sub_1F3AD60 | -- | Finds register class for tied physical reg constraints |
| Helper: implicit tied operands | sub_1F4C460 | -- | Checks if MI has implicit tied operand pairs |
| Helper: filter/emit remark | sub_1F4C640 | -- | ORE filtering for copy-insertion diagnostics |
| LiveVariables::createNewVarInfo | sub_1DBA290 | -- | Allocates VarInfo for new register |
| LiveVariables::initVarInfo | sub_1DBB110 | -- | Initializes kill/def lists and alive bitvector |
| VarInfo::findKill | sub_1DB3C70 | -- | Scans block for register kill point |
| VarInfo::addKill / removeKill | sub_1DB4410 | -- | Updates kill tracking |
| VarInfo::addNewBlock | sub_1DB8610 | -- | Updates block-level liveness bitvectors |
| LiveVariables::HandlePhysRegDef | sub_1DBF6C0 | -- | Transfer liveness from old MI to new COPY |
| ORE::emit (copy remark) | sub_1DCCCA0 | -- | Emits optimization remark for COPY insertion |
| ORE::lookup | sub_1DCC790 | -- | Looks up remark data for register |
| ORE::push | sub_1DCBB50 | -- | Pushes remark to output |
| ORE::appendToList | sub_1DCC370 | -- | Appends remark (bundle-aware) |
| MachineFunction::verify | sub_1E926D0 | -- | Called with "After two-address instruction pass" |
| isOptNone / fast-compile check | sub_1636880 | -- | Forces OptLevel = 0 when active |
Binary Size Note
The 79KB runOnMachineFunction plus 63KB processTiedPairs plus 28KB tryInstructionTransform total approximately 170KB of machine code. Upstream LLVM source for the entire pass is approximately 2,000 lines of C++. The binary bloat is almost entirely explained by aggressive inlining: every DenseMap::insert, DenseMap::find, DenseMap::clear, SmallPtrSet::insert, and SmallPtrSet::find operation is fully expanded inline with all template specialization, sentinel initialization, grow/rehash, and power-of-2 computation logic. This accounts for roughly 40% of the binary. The remaining expansion comes from the COPY-creation path (operand setup, flag manipulation, list splicing) being duplicated for each opcode-specific branch rather than factored into a shared helper.
Differences from Upstream LLVM
| Aspect | Upstream LLVM | CICC v13.0 |
|---|---|---|
| Primary purpose | Convert 3-address to 2-address form for physical register constraints (x86 tied operands) | Largely a formality on NVPTX (PTX is 3-address); primary role is eliminating REG_SEQUENCE/INSERT_SUBREG and building copy-equivalence maps |
| EXTRACT_SUBREG handling | Standard sub-register extraction for CPU multi-result instructions | Extended decomposition for multi-register NVPTX results: texture loads, tensor core operations (WMMA/MMA), and warp-level collectives |
| LiveVariables maintenance | Standard liveness tracking | Deeper LiveVariables maintenance with explicit VarInfo allocation/init (sub_1DBA290/sub_1DBB110) for new registers created during decomposition |
| ORE integration | Basic or absent remark emission for copies | Full OptimizationRemarkEmitter integration for COPY insertion diagnostics (sub_1DCCCA0/sub_1DCC790/sub_1DCBB50) |
| Binary size | ~2,000 lines of C++ source | 170 KB of machine code (79 KB runOnMachineFunction + 63 KB processTiedPairs + 28 KB tryInstructionTransform); bloat from aggressive DenseMap inlining |
| optnone/fast-compile gate | Standard OptLevel check | NVIDIA optnone / fast-compile check (sub_1636880) forces OptLevel = 0 for fast-compile kernels |
Cross-References
- Register Coalescing -- runs immediately after TwoAddress; consumes the SrcEqClassMap/DstEqClassMap built here
- Register Allocation -- the downstream consumer that requires tied operands to be resolved
- SelectionDAG -- produces the EXTRACT_SUBREG/INSERT_SUBREG/REG_SEQUENCE pseudo-instructions that this pass eliminates
- Instruction Emitter -- sub_2EDDF20 creates multi-result EXTRACT_SUBREG chains from SDNode output
- MMA Code Generation -- WMMA/MMA intrinsics producing multi-register results that require decomposition
- ISel Patterns -- instruction selection creates the tied operand constraints
- Instruction Scheduling -- runs before TwoAddress in the pre-RA scheduling slot
- Pipeline & Ordering -- full pass ordering context
- CLI Flags -- optnone and fast-compile mode
- LLVM Knobs -- twoaddr-reschedule, dataflow-edge-limit
- Hash Infrastructure -- DenseMap and SmallDenseMap internals used throughout
- Diagnostics -- OptimizationRemarkEmitter system
Instruction Scheduling
Prerequisites: Familiarity with Register Allocation, NVPTX register classes, and the codegen pipeline. Understanding of the GPU execution model (warp scheduling, latency hiding) is essential.
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
Upstream source: llvm/lib/CodeGen/MachineScheduler.cpp (ScheduleDAGMILive), llvm/lib/CodeGen/MachinePipeliner.cpp (Swing Modulo Scheduler) (LLVM 20.0.0). The MRPA incremental pressure tracker and Texture Group Merge pass are NVIDIA-only with no upstream equivalent.
CICC v13.0 implements three distinct scheduling subsystems: MRPA (Machine Register Pressure Analysis) for incremental pressure tracking during MCSE, a Swing Modulo Scheduling pipeliner for loop bodies, and ScheduleDAGMILive for post-RA instruction ordering. All three maintain per-register-class pressure arrays but differ in granularity and update frequency. A texture group merge pass (sub_2DDE8C0) acts as a scheduling-adjacent optimization that groups texture load instructions for hardware coalescing.
| MRPA incremental tracker | sub_2E5A4E0 (primary), sub_1E00370 (backend variant) |
| MachinePipeliner (SMS) | sub_3563190 |
| ScheduleDAGMILive | sub_355F610 |
| Instruction selection heuristic | sub_3557A10 |
| Texture group merge | sub_2DDE8C0 |
| Scheduling mode switch | sub_21668D0 (post-RA), sub_2165850 (pre-RA) |
MRPA: Incremental Register Pressure Tracking
MRPA (Machine Register Pressure Analysis) provides incremental register pressure tracking for the Machine Common Subexpression Elimination (MCSE) pass. Rather than recomputing pressure from scratch after each instruction move or elimination, MRPA applies delta updates to maintain a running pressure state.
The primary implementation lives at sub_2E5A4E0 (48KB), with a backend variant at sub_1E00370 (78KB). Both use DenseMap hash tables for per-instruction pressure data with the hash function (ptr >> 9) ^ (ptr >> 4), empty sentinel -8, tombstone sentinel -16, minimum 64 buckets, and power-of-two sizing. The sub_1E00370 backend variant calls the pressure computation core at sub_1DF7390 (8 call sites) and sub_1DFB9D0 (6 call sites), plus pressure set queries via sub_1E1C690 / sub_1E15D60.
The MRPA pressure cluster spans the address range 0x1DF0000--0x1E0FFFF:
| Function | Role |
|---|---|
| sub_1DF3D00 | Scheduler support (lowest address) |
| sub_1DF4120 | Scheduler support |
| sub_1DF4FB0 | Scheduler support |
| sub_1DF5810 | Machine function pass (pressure-aware scheduling) |
| sub_1DF7390 | Pressure computation core (called 8x from sub_1E00370) |
| sub_1DF76E0 | Register liveness query |
| sub_1DF7A80 | Code motion feasibility check |
| sub_1DF81C0 | Pressure computation core |
| sub_1DF9E90 | Schedule optimization pass |
| sub_1DFB810 | DenseMap (64-bit value variant) |
| sub_1DFB9D0 | DenseMap (32-bit value variant, called 6x) |
| sub_1E00370 | MRPA entry -- backend variant |
Incremental Update Flow
The incremental update is the core algorithm. Rather than performing a full O(n) pressure recomputation after every MCSE transform, it maintains a running pressure state through delta operations. The pseudocode below is reconstructed from sub_2E5A4E0:
function mrpa_incremental_update(context, basicBlock):
// Phase 1: Build worklist via DFS
visited = DenseSet() // v292--v295
worklist = []
dfs_push(worklist, basicBlock, visited) // standard DFS seed
while worklist is not empty:
bb = worklist.pop()
// Phase 2: Create instruction tracking entries
tracking = context.densemap[+80..+104] // DenseMap at context offsets
for each instr in bb.instructions:
tracking.insert(instr, PressureEntry{})
// Phase 3: Filter schedulable instructions
if not sub_2E501D0(instr): // schedulability predicate
continue
// Phase 4: Scan operands (40-byte entries)
for i in range(instr.num_operands): // iterated at v69/v70
operand = instr.operand[i] // 40-byte stride
if not operand.isVirtualRegister():
continue
// Phase 5: Virtual register operand processing
old_reg = sub_2EBEF70(operand) // find existing rename mapping
reg_info = sub_2EBEE10(operand) // query register class, constraints
new_reg = sub_2EBE820(operand) // attempt rename if profitable
if new_reg != old_reg:
sub_2EBF120(old_reg) // free old register
// Phase 6: Register class constraint validation
sub_reg_list = sub_E922F0(reg_info) // sub-register list for class
for each sub_reg in sub_reg_list:
validate_class_constraint(sub_reg, context.class_limits)
// Phase 7: Pressure feasibility check
bb_pressure = context.per_bb_data[bb] // at v279[36]
if not sub_2E4F9C0(bb_pressure): // exceeds class limits?
// Rename was unprofitable -- roll back
context.rename_count_fail++ // *((_DWORD*)v254 + 17)
sub_2E88E20(instr) // erase unprofitable instruction
else:
context.rename_count_success++ // *((_DWORD*)v254 + 16)
The key insight is that steps 5--7 form a speculative rename-then-validate loop: MRPA tentatively renames a virtual register, checks whether the rename reduces pressure below the class limit, and rolls back if it does not. The rename counts at *((_DWORD*)v254 + 16) (success) and *((_DWORD*)v254 + 17) (failure) provide a diagnostic ratio of how often speculative renames succeed.
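The rename-then-validate loop condenses into a small runnable sketch. This illustrates the control flow only -- the field names (pressure_delta, class_limit) are invented for the example, and the real implementation tracks pressure per register class rather than as a single scalar:

```python
def speculative_rename(instrs, pressure, class_limit):
    """Tentatively apply each candidate rename and keep it only if the
    resulting pressure stays under the register-class limit. Mirrors the
    success/failure counters kept at *((_DWORD*)v254 + 16/17)."""
    stats = {"success": 0, "fail": 0}
    for instr in instrs:
        trial = pressure + instr["pressure_delta"]  # speculative state
        if trial > class_limit:
            stats["fail"] += 1       # roll back: pressure unchanged
        else:
            pressure = trial         # commit the rename
            stats["success"] += 1
    return pressure, stats
```

The success/fail ratio in stats is the same diagnostic signal the binary exposes: a low success rate means speculative renames rarely pay off for the current pressure limits.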
Register Liveness Queries
Register liveness (sub_1DF76E0) checks whether a register is live in an instruction range [a3, a4] using _bittest on register class bitmaps. A compressed alias table at context offset +240 stores sub-register overlap information in 24-byte entries containing alias counts and alias data offsets.
The alias table structure:
| Offset | Size | Content |
|---|---|---|
| +0 | 8 | Sub-table pointer |
| +8 | 56 | Alias data block |
| +8..+10 (per entry) | 2 | Alias count (uint16) |
| +10.. | variable | Alias register IDs (2 bytes each) |
Sub-register overlap is resolved through an incremental alias walk: for each register in the query range, the alias table is consulted to expand the register into its physical sub-registers, and each sub-register is tested against the liveness bitmap.
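A minimal sketch of the alias-walk query, with a plain dict (register ID to its aliasing physical sub-registers) standing in for the packed 24-byte alias-table entries:

```python
def is_live(reg, live_bitmap, alias_table):
    """Expand reg via the alias table, then test each physical
    sub-register against the liveness bitmap (cf. the _bittest
    loop in sub_1DF76E0). Registers absent from the table are
    treated as aliasing only themselves."""
    for sub in alias_table.get(reg, [reg]):
        if (live_bitmap >> sub) & 1:
            return True
    return False
```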
Code Motion Feasibility
Code motion feasibility (sub_1DF7A80) validates whether an instruction can be moved between basic blocks:
- Check the single-predecessor relationship between source and destination BBs.
- Validate against the allocation bitmask at allocator offset +38.
- Walk an instruction window bounded by offset +296 (configurable window size).
- Count conflicting operands within the window.
- Track affected registers in an rb-tree set (offsets 56--88) with node structure [left(16), right(24), value(32)].
An instruction is movable only if the conflicting operand count within the window is zero and the allocation bitmask permits the move.
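The feasibility test reduces to two checks: bitmask permission and a zero conflict count in the window. A simplified sketch, with Python sets standing in for the rb-tree register tracking and an integer for the allocation bitmask:

```python
def can_move(instr_regs, window_regs, alloc_mask):
    """Movable only if the allocation bitmask permits every register of
    the instruction AND no instruction in the bounded window touches one
    of those registers (cf. sub_1DF7A80)."""
    if any(not (alloc_mask >> r) & 1 for r in instr_regs):
        return False                       # bitmask forbids the move
    conflicts = sum(1 for regs in window_regs if regs & instr_regs)
    return conflicts == 0                  # zero-conflict requirement
```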
MRPA Verification
A debug-only verification path checks incremental update correctness against full recomputation. The trigger path in sub_2E5A4E0 (decompiled lines 1702--1708):
if ( *(_BYTE *)(v7 + 40) // [1] context enable flag -- always ON during MCSE
&& (_BYTE)qword_501F8A8 // [2] verify-update-mcse -- user must enable
&& (_BYTE)qword_501F988 // [3] incremental-update-mcse -- default ON
&& !sub_2E59B70( // [4] full recomputation DISAGREES
*(_QWORD*)(v7+48),
qword_501F7C8, ...) )
{
sub_C64ED0("Incorrect RP info from incremental MRPA update\n", 1u);
}
All four conditions must hold simultaneously:
- Context enable flag (v7 + 40) is set -- always true during MCSE.
- verify-update-mcse is ON -- user must explicitly enable this debug knob.
- incremental-update-mcse is ON -- default is ON.
- sub_2E59B70 returns false -- full recomputation disagrees with the incremental state.
When all conditions hold, the error "Incorrect RP info from incremental MRPA update" fires via sub_C64ED0 (LLVM's report_fatal_error). The print-verify knob controls whether detailed per-register-class mismatch data is printed.
The backend variant (sub_1E00370, decompiled lines 2416--2420) uses byte_4FC6020 as its guard flag, calls sub_1DFF720 for verification, and falls back to byte_4FC62C0 (a cached result) if verification is disabled.
| Knob | Default | Description |
|---|---|---|
| incremental-update-mcse | true | Incrementally update register pressure analysis |
| verify-update-mcse | false | Verify incremental update by full RP analysis |
| print-verify | false | Print problematic RP info if verification failed |
To trigger verification: cicc -Xcuda -verify-update-mcse input.cu. NVIDIA keeps this check off by default since the full rescan is O(n) and expensive.
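The four-condition gate amounts to one guarded comparison. In this sketch, tracker is the incremental pressure state and recompute_full is the O(n) rescan; the parameter names mirror the decompiled flags, but the function shape is invented for illustration:

```python
def verify_incremental(tracker, recompute_full, ctx_enabled,
                       verify_update_mcse, incremental_update_mcse):
    """Fire only when the context is enabled, both knobs are on, and a
    full recomputation disagrees with the incremental state -- the same
    short-circuit chain as the decompiled check in sub_2E5A4E0."""
    if (ctx_enabled and verify_update_mcse and incremental_update_mcse
            and recompute_full() != tracker):
        raise RuntimeError("Incorrect RP info from incremental MRPA update")
```

Because the conditions short-circuit, the expensive recompute_full() never runs unless verify-update-mcse is explicitly enabled -- which is why the knob is safe to leave registered in release builds.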
MachinePipeliner: Swing Modulo Scheduling
Complexity. Let N = number of instructions in the loop body and E = number of dependency edges in the DDG.
- DDG construction: O(N + E).
- RecMII (computeRecMII): finds the maximum cycle ratio via enumeration of elementary circuits in the DDG -- worst-case exponential, but bounded in practice by small loop sizes (N < 100) and sparse dependency graphs.
- ResMII: O(N) (sum of resource vectors).
- ASAP/ALAP: O(N + E) each (topological traversals).
- II search: probes at most pipeliner-ii-search-range (default 10) candidate IIs. For each II, node placement is O(N * II) -- each of N nodes probes up to II cycle slots.
The total scheduling cost is O((N + E) + R * N * II_max) where R = search range. The pipeliner-max-stages (default 3) and pipeliner-max-mii (default 27) knobs provide additional constant-factor bounds. For MRPA, the incremental pressure update is O(1) per instruction move (delta update), compared to O(N) for a full recomputation -- this is the key efficiency gain over a naive approach.
The MachinePipeliner (sub_3563190, ~2030 decompiled lines, ~58KB) implements Swing Modulo Scheduling (SMS) for software pipelining of loop bodies. It overlaps iterations of a loop body to improve throughput on pipelined hardware by interleaving instructions from different iterations. The upstream LLVM equivalent is SwingSchedulerDAG::schedule().
Pass discovery: the pipeliner walks an analysis array at this+3456 (offset 3456) looking for vtable unk_4F86530 (the MachinePipeliner analysis pass), then extracts the SwingSchedulerDAG context at offset +176.
Phase 1: Initialization and DDG Construction
The setup chain builds the data dependence graph and computes MII lower bounds:
| Step | Function | Description |
|---|---|---|
| 1 | sub_2F97F60 | initializeDAG -- build data dependence graph (DDG) over the single-BB loop body |
| 2 | sub_3559990 | computeNodeLatencies -- fill latency fields per SUnit from the target scheduling model |
| 3 | sub_3542B20 | addDependencies -- add register/memory/order dependency edges to the DDG |
| 4 | sub_2F90200 | updateRegPressure -- compute initial register pressure state for the loop body |
| 5 | sub_354CBB0 | computeRecMII -- find the maximum cycle length of any recurrence in the DDG |
| 6 | sub_35449F0 | computeResMII -- compute ceil(total_resource_usage / functional_unit_count) |
The context object SwingSchedulerDAG occupies approximately 4100 bytes:
| Offset | Field |
|---|---|
| +32 | MachineFunction* |
| +48..56 | BB range (iterated at 256-byte stride) |
| +944 | DenseMap: pre-existing ordering constraints |
| +3456 | Analysis pass vector |
| +3472 | MII (int32) |
| +3480 | schedulingSucceeded (bool) |
| +3488 | DiagnosticsEngine / remark context |
| +3520 | TargetSubtargetInfo* |
| +3944..3952 | DDG node storage (vector) |
| +4016..4072 | Recurrence DenseMap (24-byte entries) |
Phase 2: MII Computation and II Search
MII computation combines two lower bounds:
- RecMII (Recurrence MII): the longest cycle in the DDG, computed by sub_354CBB0. Each recurrence (loop-carried dependency cycle) constrains the minimum II because the cycle must fit within one iteration interval. If pipeliner-ignore-recmii is set, RecMII is forced to zero so only resource constraints matter.
- ResMII (Resource MII): ceil(sum of resource usage across all instructions / number of available functional units), computed by sub_35449F0. This reflects the throughput bottleneck of the hardware.
function compute_MII():
recMII = sub_354CBB0() // max recurrence length
resMII = sub_35449F0() // resource throughput limit
if pipeliner-ignore-recmii: // qword_503E888
recMII = 0
MII = sub_3542AB0(resMII, recMII) // max(resMII, recMII)
sub_3542AE0() // store MII at this+3472
return MII
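The same combination can be written as a runnable sketch. This uses the textbook SMS formulation, in which each recurrence contributes ceil(total latency / total iteration distance) -- the binary's computeRecMII may simplify this when distances are 1. The inputs (cycles, total_resource_usage, num_units) are illustrative stand-ins, not the binary's actual data structures:

```python
import math

def compute_mii(cycles, total_resource_usage, num_units, ignore_recmii=False):
    """MII = max(ResMII, RecMII). cycles is a list of
    (edge_latencies, iteration_distances) pairs, one per recurrence."""
    rec_mii = 0
    if not ignore_recmii:                       # pipeliner-ignore-recmii
        for latencies, distances in cycles:
            rec_mii = max(rec_mii, math.ceil(sum(latencies) / sum(distances)))
    res_mii = math.ceil(total_resource_usage / num_units)  # throughput bound
    return max(res_mii, rec_mii)
```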
The II search algorithm starts at MII and probes upward:
function ii_search(MII):
max_ii = MII + pipeliner-ii-search-range // default: MII + 10
if pipeliner-force-ii != 0: // qword_503EB80
return try_schedule(pipeliner-force-ii) // skip search entirely
for II = MII to max_ii:
// 1. Compute ASAP/ALAP at this II
asap = compute_ASAP(DDG, II) // sub_354BFF0 -> v369
alap = compute_ALAP(DDG, II) // sub_354BFF0 -> v373
// 2. Place all nodes into II-wide modulo reservation table
success = place_nodes(asap, alap, II) // sub_354C3A0
if not success:
continue // try next II
// 3. Compute stage count
numStages = (lastCycle - firstCycle) / II // (v84 - v80) / v88
// 4. Validate stage count
if numStages > pipeliner-max-stages: // default 3
continue
// 5. Register pressure check (if enabled)
if pipeliner-register-pressure: // qword_503E2C0
if not verify_pressure(II, pipeliner-register-pressure-margin):
continue // sub_355C7C0
return (II, schedule)
return FAILURE // "Unable to find schedule"
The pipeliner-force-ii knob (default 0) bypasses the search entirely and forces a specific II value. This is useful for testing or when the compiler team knows the optimal II for a specific loop shape.
Phase 3: ASAP/ALAP Computation
ASAP (As Soon As Possible) and ALAP (As Late As Possible) define the scheduling window for each instruction at a given II:
ASAP computation (sub_354BFF0, first invocation producing v369): traverses the DDG in topological order. For each node, ASAP = max over all predecessors of (predecessor.ASAP + edge.latency). The root nodes (no predecessors) have ASAP = 0. This gives the earliest cycle each instruction can execute without violating data dependencies.
ALAP computation (sub_354BFF0, second invocation producing v373): traverses the DDG in reverse topological order. For each node, ALAP = min over all successors of (successor.ALAP - edge.latency). Leaf nodes (no successors) have ALAP = II - 1 (or the schedule length bound). This gives the latest cycle an instruction can execute.
The scheduling window for instruction i is [ASAP(i), ALAP(i)]. Instructions with narrow windows (ASAP close to ALAP) are more constrained and are typically scheduled first by the node ordering heuristic.
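Both traversals reduce to one relaxation pass each over the dependency edges. A compact sketch, assuming the edge list is already in topological order (which the DDG walk guarantees):

```python
def compute_asap_alap(nodes, edges, sched_len):
    """edges: list of (pred, succ, latency) in topological order.
    ASAP relaxes forward from the roots; ALAP relaxes backward from
    the leaves. The window for node n is [asap[n], alap[n]]."""
    asap = {n: 0 for n in nodes}                  # roots start at cycle 0
    for p, s, lat in edges:                       # forward pass
        asap[s] = max(asap[s], asap[p] + lat)
    alap = {n: sched_len - 1 for n in nodes}      # leaves end at the bound
    for p, s, lat in reversed(edges):             # backward pass
        alap[p] = min(alap[p], alap[s] - lat)
    return asap, alap
```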
Phase 4: Node Placement
Node placement (sub_354C3A0) attempts to assign each instruction to a specific cycle in the modulo reservation table (MRT). The MRT has II columns (one per cycle in the initiation interval) and tracks resource usage per cycle.
The placement algorithm follows the Swing Modulo Scheduling strategy:
- Node ordering (sub_35630A0): nodes are prioritized by a combination of critical-path depth, recurrence membership, and scheduling freedom (ALAP - ASAP). Nodes in tight recurrences and on the critical path are placed first.
- Direction selection: for each node, the scheduler decides whether to place it "forward" (from ASAP toward ALAP) or "backward" (from ALAP toward ASAP) based on its dependency relationships. The "swing" refers to alternating direction between predecessor-constrained and successor-constrained nodes.
- Cycle probing: starting from the preferred direction, the scheduler tries each cycle in the node's [ASAP, ALAP] window. At each candidate cycle, it checks resource availability in the MRT (the cycle modulo II must have sufficient functional unit capacity) and verifies that all dependency constraints remain satisfied.
- Conflict resolution: if no cycle in the window is feasible, the placement fails for this II and the search continues with II+1.
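Cycle probing against the MRT reduces to a modulo-indexed capacity check. A one-resource-class sketch (the real MRT tracks a vector of functional-unit counts per slot, and probing direction alternates per the swing heuristic):

```python
def place_in_mrt(window, ii, mrt, capacity):
    """Probe each cycle in [asap, alap]; the slot cycle % II must have a
    free functional unit. Returns the chosen cycle, or None on placement
    failure (which triggers a retry at II+1)."""
    asap, alap = window
    for cycle in range(asap, alap + 1):
        slot = cycle % ii                  # modulo reservation table index
        if mrt[slot] < capacity:
            mrt[slot] += 1                 # reserve the functional unit
            return cycle
    return None
```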
Phase 5: Kernel Generation
After a valid schedule is found, the pipeliner builds the kernel, prolog, and epilog. The numStages value ((lastCycle - firstCycle) / II) determines how many iterations overlap.
function build_kernel(schedule, II, numStages):
// Build instruction-to-stage and instruction-to-cycle DenseMaps
instrToStage = DenseMap<SUnit*, int>() // v317/v318/v319
instrToCycle = DenseMap<SUnit*, int>() // v320/v321/v322
// DenseMap config: hash=(key>>9)^(key>>4), empty=-4096, tombstone=-8192
for stage in range(numStages):
for each SUnit in schedule.stage_bundle(stage):
instrToStage[SUnit] = stage
instrToCycle[SUnit] = SUnit.assigned_cycle
// Cross-reference recurrence edges with stage assignments
if this+4064 (recurrence count) != 0:
for each recurrence_edge in this+4056:
edge.stage = instrToStage[edge.instruction]
// Build per-recurrence analysis DenseMap (24-byte entries)
// Select codegen backend (priority order):
if pipeliner-annotate-for-testing: // testing mode: annotate only
sub_359AD80(schedule)
return
if pipeliner-experimental-cg: // peeling code generator
if numStages == 0:
sub_35A5710() // trivial kernel (no overlap)
else:
sub_35A93B0() // experimental peeling CG
sub_3598EB0() // finalize prolog/epilog
return
if pipeliner-mve-cg: // MVE code generator (DEFAULT)
if numStages == 0 and target_supports_mve():
sub_35A7730() // MVE compatibility check
sub_35A76E0() // MVE code generator
return
// else fall through to experimental CG
// Default fallthrough: experimental CG path
The codegen backend priority is: (1) pipeliner-annotate-for-testing for test infrastructure, (2) pipeliner-experimental-cg for peeling-based generation, (3) pipeliner-mve-cg (default enabled) for the MVE (Modulo Variable Expansion) code generator. The MVE path is gated on numStages == 0 and a target callback at **(this+3520)+72 returning non-default (i.e., not sub_2FDC510).
The SBO (Small Buffer Optimization) pattern is used for nodeInfo arrays: v416 = v418 (inline buffer of 704 bytes = 8 nodes x 88 bytes). When the loop body exceeds 8 instructions, sub_35498F0 sorts and possibly heap-allocates.
Error Conditions
| Condition | Diagnostic | Severity |
|---|---|---|
| MII == 0 | "Invalid Minimal Initiation Interval: 0" | 0x15 (missed) |
| MII > pipeliner-max-mii | "Minimal Initiation Interval too large: MII > SwpMaxMii" | 0x15 (missed) |
| Scheduling failure | "Unable to find schedule" | 0x15 (missed) |
| numStages == 0 | "No need to pipeline - no overlapped iterations in schedule." | 0x15 (missed) |
| numStages > pipeliner-max-stages | "Too many stages in schedule: numStages > SwpMaxStages" | 0x15 (missed) |
| Success | "Pipelined succesfully!" [sic] | 0x13 (passed) |
The typo "succesfully" (single 's') is preserved from upstream LLVM.
Pipeliner Knobs
| Knob | Global | Default | Description |
|---|---|---|---|
| enable-pipeliner | unk_503EE20 | true | Master switch for SMS |
| enable-pipeliner-opt-size | qword_503ED40 | false | Enable SWP at -Os |
| pipeliner-max-mii | qword_503ECE8 | 27 | Maximum allowed MII |
| pipeliner-force-ii | qword_503EB80 | 0 | Force specific II (0 = auto) |
| pipeliner-max-stages | qword_503EB28 | 3 | Maximum pipeline stages |
| pipeliner-prune-deps | qword_503E9C0 | true | Prune deps between unrelated Phi nodes |
| pipeliner-prune-loop-carried | qword_503E8E0 | true | Prune loop-carried order deps |
| pipeliner-ignore-recmii | qword_503E888 | false | Ignore RecMII (hidden knob) |
| pipeliner-show-mask | qword_503E720 | false | Debug: show scheduling mask |
| pipeliner-dbg-res | qword_503E640 | false | Debug: resource usage |
| pipeliner-annotate-for-testing | qword_503E5E8 | false | Annotate instead of codegen |
| pipeliner-experimental-cg | qword_503E508 | false | Use peeling code generator |
| pipeliner-ii-search-range | qword_503E3A0 | 10 | Range to search for II |
| pipeliner-register-pressure | qword_503E2C0 | false | Consider register pressure |
| pipeliner-register-pressure-margin | qword_503E1E0 | 5 | Margin % for reg pressure limit |
| pipeliner-mve-cg | unk_503E100 | true | Use MVE code generator |
| pipeliner-enable-copytophi | qword_503E020 | true | Enable CopyToPhi DAG Mutation |
| pipeliner-force-issue-width | qword_503DF40 | 0 | Force issue width (0 = auto) |
All registered in ctor_676_0_0x5a3430.c.
MachinePipeliner Function Map
| Function | Identity |
|---|---|
| sub_3563190 | Top-level SMS orchestrator (SwingSchedulerDAG::schedule) |
| sub_2F97F60 | initializeDAG -- build DDG |
| sub_3559990 | computeNodeLatencies |
| sub_3542B20 | addDependencies -- register/memory/order edges |
| sub_2F90200 | updateRegPressure |
| sub_354CBB0 | computeRecMII |
| sub_35449F0 | computeResMII |
| sub_3542AB0 | setMII = max(ResMII, RecMII) |
| sub_3542AE0 | validateMII / store at +3472 |
| sub_3556270 | collectNodeInfo -- gather 88-byte per-node records |
| sub_35476E0 | initNodeOrder -- compute scheduling order |
| sub_35523F0 | computeSchedule -- build SUnit ordering |
| sub_35546F0 | orderDependences -- topological sort |
| sub_3543340 | computeStart -- ASAP/ALAP times |
| sub_35630A0 | normalizeSchedule -- adjust cycle numbering |
| sub_35568E0 | scheduleNodes -- core SMS placement |
| sub_35433F0 | adjustSchedule -- post-adjustment |
| sub_3557A10 | computeFinalSchedule -- finalize stage/cycle |
| sub_354A760 | buildStageMap -- iteration-to-stage mapping |
| sub_355F610 | schedule() -- II search loop (2351 lines) |
| sub_354BE50 | getScheduleForStage |
| sub_35498F0 | sortNodeInfo (for >8 nodes) |
| sub_359AD80 | annotateForTesting |
| sub_35A5710 | generateTrivialKernel |
| sub_35A93B0 | experimentalPeelingCG |
| sub_3598EB0 | finalizeExperimentalKernel |
| sub_35A76E0 | mveCG -- MVE code generator |
| sub_35A7730 | mveCompatCheck |
ScheduleDAGMILive: Post-RA Instruction Ordering
ScheduleDAGMILive (sub_355F610, 64KB) is the post-RA machine instruction scheduler. It takes the pipeliner's output (or standalone scheduling regions) and determines the final instruction order while respecting register pressure limits.
Data structures:
- SUnit (Scheduling Unit): 88 bytes per instruction, consistent across both the pipeliner and ScheduleDAGMILive.
- Instruction-to-node hash map: 632-byte entries per instruction. The unusually large entry size suggests extensive caching of per-instruction metadata (RP deltas, latency info, dependency edges) to avoid recomputation.
- RP tracking structure: 112 bytes, with per-register-class pressure arrays at offsets 32--48 (current) and 56--72 (limits).
The scheduling flow:
- Initialize RP tracking via sub_3551AB0 (if pipeliner-register-pressure is set).
- Set per-class pressure defaults via sub_2F60A40.
- Walk the BB instruction list and build the instruction-to-node hash map.
- Compute ASAP (earliest cycle) via sub_354BFF0 -> v369.
- Compute ALAP (latest cycle) via sub_354BFF0 -> v373.
- Place instructions via sub_354C3A0 (returns success/failure).
- Calculate stage count: (lastCycle - firstCycle) / II = (v84 - v80) / v88.
- Verify placement via sub_355C7C0.
- Build stage descriptors via sub_355D7E0 (80 bytes per stage, 10 QWORDs each).
Instruction Selection Heuristic
The instruction selection heuristic (sub_3557A10, 47KB) determines which instruction to schedule next from the ready set. It implements a multi-level priority scheme operating on 88-byte SUnit entries:
Level 1 -- Latency/Depth priority (SUnit offset +240): instructions deeper in the dependency graph are scheduled first. Depth is measured as the longest path from the instruction to a sink node in the DDG. This ensures that critical-path instructions are placed early, preventing them from becoming bottlenecks. Latency recomputation occurs via sub_2F8F5D0 during priority comparison to account for any scheduling decisions already made.
Level 2 -- Target priority table (context a1+3944): a table of 16-byte entries, each containing:
| Offset | Size | Field |
|---|---|---|
| +0 | 4 | start -- first cycle of priority window |
| +4 | 4 | end -- last cycle of priority window |
| +8 | 4 | priority -- target-assigned priority value |
| +12 | 4 | window_width -- scheduling window size |
The target (NVPTX backend) populates this table to express hardware-specific ordering preferences -- for example, prioritizing memory operations that can be overlapped with computation, or ensuring that warp-synchronous instructions are scheduled in specific relative positions. Instructions that fall within a priority window with a higher priority value are selected first.
Level 3 -- Schedule window width: when levels 1 and 2 are tied, the instruction with the narrower scheduling window (ALAP - ASAP) is preferred. Narrower windows mean fewer legal placement options, so these instructions should be placed before more flexible ones to avoid creating conflicts.
The ready queue is managed by sub_3553D90. Pattern matching on ready instructions proceeds through sub_35540D0 (applicability check) and sub_35543E0 (pattern application), with validation via sub_3546B80. A hash table at a1+3976 maps instructions to schedule nodes for O(1) lookup during priority comparison.
function select_next_instruction(ready_set):
best = null
for each candidate in ready_set:
if best is null:
best = candidate
continue
// Level 1: depth comparison
if candidate.depth > best.depth: // offset +240
best = candidate
continue
if candidate.depth < best.depth:
continue
// Level 2: target priority table
cand_prio = lookup_target_priority(candidate, priority_table)
best_prio = lookup_target_priority(best, priority_table)
if cand_prio > best_prio:
best = candidate
continue
if cand_prio < best_prio:
continue
// Level 3: prefer narrower window
cand_width = candidate.ALAP - candidate.ASAP
best_width = best.ALAP - best.ASAP
if cand_width < best_width:
best = candidate
return best
Texture Group Merge
The Texture Group Merge pass (sub_2DDE8C0, 74KB, 2382 decompiled lines) groups texture load instructions that access related memory locations, enabling the hardware texture unit to coalesce them into fewer requests. This is an NVIDIA-specific pass not present in upstream LLVM.
Fibonacci Hashing
The pass uses Fibonacci hashing for candidate bucketing:
hash = (ptr * 0xBF58476D1CE4E5B9) >> shift
The constant 0xBF58476D1CE4E5B9 is a 64-bit multiplicative hash constant, best known as the first mixing multiplier of the splitmix64 finalizer. (The classic Fibonacci multiplier, floor(2^64 / phi) where phi = (1 + sqrt(5)) / 2 is the golden ratio, is 0x9E3779B97F4A7C15; CICC's constant plays the same role here.) Multiplicative hashing gives near-uniform distribution for pointer-based keys because the multiplication diffuses the low, alignment-biased pointer bits across the whole word before the shift retains only the high bits. Related multiplicative constants appear in the Linux kernel's hash_64() (GOLDEN_RATIO_64 in include/linux/hash.h) and in LLVM's FoldingSet and DenseMap internals. Within CICC, the same 0xBF58476D1CE4E5B9 constant also appears in:
- SCEV expression uniquing (sub_DC2B70)
- the OpenMP SPMD region hash table
The shift parameter controls how many high bits are retained, effectively determining the number of hash buckets as 2^(64 - shift). For a 1024-bucket table, shift would be 54.
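The bucketing step is small enough to demonstrate directly; this sketch reproduces the multiply-shift with Python integers masked to 64 bits:

```python
FIB_MULT = 0xBF58476D1CE4E5B9  # CICC's multiplicative hash constant

def fib_bucket(ptr, shift):
    """Multiply modulo 2**64 and keep the high bits; the bucket count
    is 2**(64 - shift), so shift=54 yields a 1024-bucket table."""
    return ((ptr * FIB_MULT) & 0xFFFFFFFFFFFFFFFF) >> shift
```

Because the multiplier is odd, the multiplication is a bijection on 64-bit values, so aligned pointers (which share their low bits) still scatter across buckets.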
Algorithm Detail
- Walk the BB instruction list.
- For each instruction, call sub_2DDC600 (candidate identification) to determine if it is a texture load eligible for merging.
- Hash the candidate's key pointer using Fibonacci hashing to assign it to a bucket.
- Insert the candidate into the group table.
Group table entries are 56 bytes (7 QWORDs):
| Offset | Size | Content |
|---|---|---|
| +0 | 8 | Key pointer (texture base address or descriptor) |
| +8 | 8 | Data pointer (to member array) |
| +16 | 4 | Member count |
| +20 | 4 | Member capacity |
| +24 | 32 | Reserved / padding |
Group members are 32 bytes each:
| Offset | Size | Content |
|---|---|---|
| +0 | 8 | MachineInstr* -- the texture load instruction |
| +8 | 8 | Symbol -- the texture symbol reference |
| +16 | 8 | Debug info -- source location |
| +24 | 8 | Scope info -- DWARF scope |
Generated group names carry a .Tgm (Texture Group Merge) suffix via sub_2241490. This suffix appears in debug output and internal symbol tables.
4-Callback Framework
The pass operates through a general instruction grouper framework (sub_3147BA0) that supports multiple types of instruction grouping through a common callback interface. Four callbacks are registered for texture group merge:
| # | Callback | Function | Purpose |
|---|---|---|---|
| 1 | Candidate identification | sub_2DDC600 | Examines each MachineInstr and returns true if it is a texture load eligible for grouping. Checks opcode, address space (texture memory), and operand constraints. |
| 2 | Group formation | sub_2DDBF40 | After candidates are identified and hashed into buckets, this callback decides which candidates within a bucket should form a group. It checks address proximity, common base registers, and compatible access patterns. |
| 3 | Merge execution | sub_2DDB3F0 | Applies the actual merge transformation. Replaces individual texture loads with a single grouped load instruction, rewrites operands, and updates dependency edges. |
| 4 | Cleanup | sub_2DDB400 | Frees temporary data structures (group tables, member arrays, hash buckets) after merging is complete. |
Additional helper functions in the texture group merge:
| Function | Role |
|---|---|
| sub_2DDD850 | Node insertion into group table |
| sub_2DDDD70 | Resize/grow scheduling data |
| sub_2DDD530 | Scheduling iteration over groups |
| sub_2DDDAB0 | Node analysis (profitability check) |
| sub_2DDB710 | Data dependency edge creation |
| sub_2DDE490 | Grouping operation (merge groups) |
| sub_2DDBC50 | Constraint application |
| sub_2DDBBA0 | Constraint application (secondary) |
| sub_2DDBA80 | Finalize group (seal and emit) |
The grouper framework is designed to be reusable: by registering different callback tuples, the same framework can group surface loads, shared memory accesses, or other coalescing-friendly instruction patterns.
Scheduling Mode: The usedessa Knob
The usedessa knob (dword_4FD26A0, default 2) controls the scheduling pass pipeline configuration despite its name suggesting deSSA (de-Static Single Assignment) method selection. Pre-RA scheduling dispatches through sub_2165850; post-RA through sub_21668D0.
Mode 1 (simple): Pre-RA scheduling is skipped entirely. Post-RA runs only unk_4FCE24C (the post-RA scheduler). This minimal configuration is useful for debugging or when scheduling is harmful to performance.
Mode 2 (full, default): Pre-RA scheduling runs unk_4FC8A0C. Post-RA scheduling runs three passes sequentially:
1. unk_4FC8A0C -- pre-RA pass (disabled/noop in post-RA context).
2. unk_4FCE24C -- post-RA scheduler.
3. unk_4FC9D8C -- extra scheduling pass.
After scheduling completes, the framework prints "After Machine Scheduling", optionally runs sub_21F9D90, then runs unk_4FCAC8C and prints "After StackSlotColoring".
The "disabled" passes in mode 2 are registered but gated internally, allowing the framework to maintain a uniform pass list while selectively activating passes based on the current compilation phase.
Cross-Cutting Observations
Register pressure tracking appears in three distinct places within the scheduling infrastructure, each serving a different consumer:
| Tracker | Consumer | Update Frequency |
|---|---|---|
| MRPA incremental (sub_2E5A4E0) | MCSE decisions | Per instruction move/elimination |
| ScheduleDAGMILive (sub_355F610) | Scheduling decisions | Per scheduling region |
| MachinePipeliner stage tracking | II feasibility | Per pipeline stage |
All three maintain per-register-class pressure arrays but with different granularities. The MRPA tracker uses incremental delta updates for efficiency; the scheduler computes ASAP/ALAP bounds per region; the pipeliner tracks pressure per modulo stage.
The DenseMap hash function (ptr >> 9) ^ (ptr >> 4) is shared across both the 32-bit value variant (sub_1DFB9D0) and 64-bit value variant (sub_1DFB810), indicating a common template instantiation pattern consistent with LLVM's DenseMap<K, V> template.
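The recovered hash is easy to reproduce. The sketch below applies it to a power-of-two bucket mask in the way DenseMap does; the hashed address and the table capacity are illustrative values, not constants recovered from the binary:

```c
#include <assert.h>
#include <stdint.h>

/* Pointer hash shared by sub_1DFB9D0 (32-bit value) and sub_1DFB810
   (64-bit value): mixes mid-order bits and discards the low alignment
   bits that are identical for all heap pointers. */
static uint64_t ptr_hash(uint64_t ptr) {
    return (ptr >> 9) ^ (ptr >> 4);
}

/* Bucket index for a power-of-two capacity table (DenseMap convention). */
static uint64_t bucket_for(uint64_t ptr, uint64_t capacity) {
    return ptr_hash(ptr) & (capacity - 1);
}
```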
Contrast with ptxas scheduling: ptxas has its own instruction scheduling subsystem with 195 knobs (including scoreboard-aware scheduling via the AdvancedSB* family, SchedDisableAll, SchedForceReverseOrder, and the GemmPipeliner* family of 8 knobs for matrix multiply detection and pipelining). CICC's scheduling operates at the MachineInstr level before PTX emission; ptxas re-schedules at the SASS level after PTX assembly. The two scheduling layers are independent but complementary.
What Upstream LLVM Gets Wrong for GPU
Upstream LLVM's instruction scheduling framework was designed for CPU cores with out-of-order execution, branch prediction, and deep reorder buffers. On a GPU SM, these hardware features do not exist:
- Upstream assumes out-of-order hardware will hide scheduling mistakes. Modern CPUs have 200+ entry reorder buffers that dynamically reorder instructions, making compiler scheduling a second-order optimization. GPU SMs execute instructions in-order within each warp -- every scheduling decision is final. A poorly ordered instruction stream on GPU means stalls that no hardware can recover from.
- Upstream optimizes for pipeline hazards and port pressure. CPU schedulers model execution port contention (e.g., port 0 vs. port 1 on Intel), dispatch group rules, and pipeline bubble avoidance. GPU scheduling targets register pressure minimization (nvptx-sched4reg) because the SM's warp scheduler handles instruction-level parallelism through warp interleaving, not through instruction reordering within a single thread.
- Upstream assumes a single scheduling pass produces the final order. On CPU, LLVM's ScheduleDAGMILive emits the final instruction sequence. On NVPTX, cicc's scheduling is the first of two layers -- ptxas re-schedules the entire program at the SASS level with its own 195-knob subsystem (including scoreboard-aware scheduling via the AdvancedSB* family). CICC's scheduler optimizes for ptxas consumption, not for direct hardware execution.
- Upstream has no concept of texture instruction grouping. CPU scheduling never considers grouping memory operations for hardware coalescing units. NVIDIA adds a dedicated Texture Group Merge pass (sub_2DDE8C0, 74KB) that groups texture load instructions by base address for the hardware texture unit -- an entirely GPU-specific optimization absent from upstream.
- Upstream does not track register pressure incrementally during CSE. Upstream LLVM recomputes register pressure from scratch after each Machine CSE transform. NVIDIA's MRPA subsystem (sub_2E5A4E0, 48KB) maintains running pressure state through delta updates, because on GPU the pressure-to-occupancy relationship makes every CSE decision a potential occupancy cliff crossing that must be evaluated cheaply.
Differences from Upstream LLVM
| Aspect | Upstream LLVM | CICC v13.0 |
|---|---|---|
| Scheduling subsystems | ScheduleDAGMILive + optional MachinePipeliner; no incremental pressure tracker | Three distinct subsystems: MRPA incremental tracker, Swing Modulo Scheduler, ScheduleDAGMILive; plus texture group merge pass |
| MRPA (incremental pressure) | Not present; pressure recomputed from scratch after each CSE transform | sub_2E5A4E0 (48 KB) + backend variant sub_1E00370 (78 KB) maintain running pressure state through delta operations during MCSE |
| Texture group merge | No concept of texture instruction grouping | Dedicated pass (sub_2DDE8C0) groups texture load instructions for hardware coalescing; scheduling-adjacent optimization absent from upstream |
| Scheduling target | Optimize for hardware pipeline hazards and port pressure | Optimize the MachineInstr stream for ptxas consumption; focus on register pressure reduction (nvptx-sched4reg) rather than hardware pipeline timing |
| Two-level scheduling | Single scheduling pass produces final instruction order | CICC scheduling is first layer; ptxas re-schedules at SASS level with its own 195-knob subsystem |
| Register pressure model | Per-register-class pressure sets from TRI | Same model but with GPU occupancy awareness; pressure arrays used to detect occupancy cliff crossings |
| Scheduling mode switch | Configured at pipeline construction time | Runtime mode switch between pre-RA (sub_2165850) and post-RA (sub_21668D0) with different heuristic weights |
ptxas Interaction
cicc's instruction scheduling operates at the MachineInstr level and produces a PTX instruction order that is not final. ptxas re-schedules the entire program at the SASS level using its own 195-knob scheduling subsystem, including scoreboard-aware scheduling (AdvancedSB* family), the GemmPipeliner* family for matrix multiply detection and software pipelining, and SchedForceReverseOrder for debugging. cicc's scheduler therefore optimizes for ptxas consumption rather than direct hardware execution: its primary goal is minimizing register pressure (nvptx-sched4reg) so that ptxas starts from a low-pressure baseline. The two scheduling layers are independent but complementary -- cicc controls the virtual register count visible to ptxas, and ptxas maps the resulting instruction stream onto the SM's hardware pipeline with full knowledge of scoreboard latencies and functional unit availability.
LiveRangeCalc
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
LLVM version note: Based on LLVM 17.x LiveRangeCalc.cpp (the page's own diff table cites LLVM 17.x as baseline). NVIDIA adds dual-bitvector GP/predicate tracking, a small-function bypass (instruction count <= 15), an enlarged 296-byte segment structure with inlined SmallVectors, and a 4/5 active-block fraction not present in any upstream version.
LiveRangeCalc is the low-level engine inside LLVM's CodeGen that turns def/use information into live intervals -- contiguous [SlotIndex, SlotIndex) segments describing when each virtual register holds a value. It sits between the SlotIndexes numbering pass and the LiveIntervals analysis, performing the actual iterative dataflow computation that propagates liveness backward through the CFG and inserts PHI-def value numbers at merge points. In CICC v13.0 the implementation at sub_2FC4FC0 is structurally based on upstream LLVM's LiveRangeCalc::extend / calculateValues but carries several NVIDIA-specific modifications: a dual-bitvector tracking scheme that separates general-purpose and predicate register liveness, a small-function bypass that skips the full dataflow for trivial kernels, and an enlarged per-segment structure (296 bytes) that inlines four separate SmallVector buffers to avoid heap allocations on the hot path.
| Main entry | sub_2FC4FC0 (12,900 bytes, 78KB decompiled) |
| Stack frame | 504 bytes (0x1F8) |
| Callers | sub_2FC8470 (LiveIntervals::computeRegUnitRange), sub_2FC8230 (createDeadDef/addSegment), self-recursive |
| SlotIndexes pass | sub_1F10BF0 (11KB), registered as "slotindexes" / "Slot index numbering" |
| LiveIntervals analysis | pipeline entry "live-intervals" (analysis ID unk_4F96DB4) |
| Address range | 0x2FBF390 -- 0x2FC8470 (full LiveRangeCalc cluster) |
| Returns | bool -- whether any live range was extended |
SlotIndex Infrastructure
Before LiveRangeCalc can operate, every MachineInstr must have a SlotIndex -- a monotonically increasing integer that encodes both the instruction's position and a sub-slot discriminator (early-clobber, register, dead, etc.). The SlotIndexes pass at sub_1F10BF0 walks the MachineFunction and assigns these numbers. CICC's implementation matches upstream LLVM: each MachineBasicBlock owns a contiguous range [StartIdx, EndIdx), and the mapping from SlotIndex back to MachineBasicBlock* is maintained in a sorted array that supports binary search.
The sentinel values found in the binary confirm standard LLVM DenseMap usage:
| Sentinel | Value | Meaning |
|---|---|---|
| Empty key | 0xFFFFFFFFFFFFF000 | Slot has never been occupied |
| Tombstone | 0xFFFFFFFFFFFFE000 | Slot was occupied, then erased |
These appear throughout the segment hash table, the pending-def table, and the VNInfo chain, always as DenseMap<SlotIndex, ...> sentinels.
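A minimal sketch of how these two sentinels drive an open-addressed probe: a lookup terminates at an empty slot (definite miss) but must skip over tombstones, otherwise entries inserted after an erasure would become unreachable. The table layout below is illustrative, not the 296-byte segment entry:

```c
#include <assert.h>
#include <stdint.h>

#define EMPTY_KEY     0xFFFFFFFFFFFFF000ULL  /* slot never occupied */
#define TOMBSTONE_KEY 0xFFFFFFFFFFFFE000ULL  /* slot occupied, then erased */

/* Linear probe over a power-of-two table of keys. Returns the slot
   index on hit, -1 on miss. Tombstoned slots are probed past. */
static int lookup(const uint64_t *keys, int capacity, uint64_t key) {
    int idx = (int)(key & (uint64_t)(capacity - 1));
    for (int probes = 0; probes < capacity; probes++) {
        if (keys[idx] == key)       return idx;   /* hit */
        if (keys[idx] == EMPTY_KEY) return -1;    /* definite miss */
        idx = (idx + 1) & (capacity - 1);         /* skip tombstones/mismatches */
    }
    return -1;
}

/* Demo table: key 13 hashes to slot 1, probes past slot 2's tombstone,
   and stops at the empty slot 0 after checking slot 3. */
static uint64_t demo_table[4] = { EMPTY_KEY, 5, TOMBSTONE_KEY, 7 };
```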
Segment Structure Layout
Each live range segment in CICC is 296 bytes (0x128), substantially larger than upstream's LiveRange::Segment (which is 24 bytes). The inflation comes from four inlined SmallVector buffers that avoid separate heap allocations for the common case:
Segment (296 bytes / 0x128):
+0x00 u64 status / SlotIndex start (sentinel if free)
+0x08 ptr endpoint buffer (or inline at +0x18)
+0x18 [16] inline endpoint buffer
+0x28 additional metadata (segment flags, subrange info)
+0x50 ptr register mask buffer (or inline at +0x60)
+0x60 [56] inline register mask buffer
+0x98 ptr kill-set buffer (or inline at +0xA8)
+0xA8 [48] inline kill-set buffer
+0xD8 u32 kill count
+0xE0 ptr use-def chain buffer (or inline at +0xF0)
+0xF0 [48] inline use-def chain buffer
+0x120 u32 total instruction count covered
Each pointer field follows the LLVM SmallVector convention: if the pointer equals the address of the inline buffer immediately following it, the data lives inline; otherwise it points to a heap allocation. During cleanup (Phase 1 of the algorithm), each segment's four buffers are freed individually before the segment is marked with the empty sentinel.
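The inline-or-heap test described above reduces to a single pointer comparison. This is a sketch of the convention with invented field names mirroring the +0x08/+0x18 endpoint pair, not the recovered structure:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* LLVM SmallVector convention: the data pointer either aims at the
   inline buffer that immediately follows it (no heap allocation) or
   at a separately malloc'd block. */
typedef struct {
    uint64_t *data;          /* analogous to the +0x08 pointer field */
    uint64_t  inline_buf[2]; /* analogous to the +0x18 inline buffer */
} SmallBuf;

static bool is_inline(const SmallBuf *b) {
    return b->data == b->inline_buf;
}

/* Cleanup mirrors Phase 1: free only if the data spilled to the heap,
   then reset to the inline buffer. */
static void small_buf_free(SmallBuf *b) {
    if (!is_inline(b))
        free(b->data);
    b->data = b->inline_buf;
}

/* Exercise both cases: start inline, spill to heap, clean up. */
static bool demo_inline_then_heap(void) {
    SmallBuf b;
    b.data = b.inline_buf;                       /* inline: nothing to free */
    bool was_inline = is_inline(&b);
    b.data = malloc(4 * sizeof(uint64_t));       /* spilled to heap */
    bool spilled = !is_inline(&b);
    small_buf_free(&b);                          /* frees heap, resets inline */
    return was_inline && spilled && is_inline(&b);
}
```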
VNInfo Structure
Value numbers are tracked via 120-byte (0x78) VNInfo nodes, allocated from a bump-pointer allocator at [this+0x4A0]:
VNInfo (120 bytes / 0x78):
+0x00 ptr endpoint buffer (inline at +0x10)
+0x08 u64 capacity (initial: 0x200000000 = inline cap 2)
+0x10 [48] inline endpoint buffer
+0x40 ptr kill-set buffer (inline at +0x50)
+0x48 u64 capacity for kill-set
+0x60 ptr sub-chain pointer (phi resolution)
+0x68 ptr sub-chain pointer 2
+0x70 u32 block number
+0x74 u32 value number (initially unassigned)
The allocator is a classic bump allocator: a cursor at [this+0x4A0] advances by 0x10 per allocation, checked against capacity at [this+0x448]. When the arena fills, a slow-path reallocation grows the backing store. Deallocation chains through sub_2FBF390, which walks sub-chains and calls free with size 0x38 (56 bytes) per intermediate node and 0x78 (120 bytes) for the VNInfo itself.
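The fast path of such a bump allocator is a cursor increment plus a capacity check. The sketch below illustrates that fast path with invented names; the slow-path regrow is stubbed out, and this is not the recovered implementation:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bump allocator sketch: cursor (cf. [this+0x4A0]) advances by a fixed
   0x10 per allocation and is checked against capacity (cf. [this+0x448]). */
typedef struct {
    uint8_t *arena;
    size_t   cursor;
    size_t   capacity;
} BumpAlloc;

static void *bump_alloc(BumpAlloc *a) {
    if (a->cursor + 0x10 > a->capacity)
        return NULL;            /* the real code regrows the arena here */
    void *p = a->arena + a->cursor;
    a->cursor += 0x10;          /* advance by 0x10 per allocation */
    return p;
}

/* Two consecutive allocations land exactly 0x10 bytes apart. */
static ptrdiff_t demo_two_allocs(void) {
    static uint8_t mem[0x40];
    BumpAlloc a = { mem, 0, sizeof mem };
    uint8_t *p1 = bump_alloc(&a);
    uint8_t *p2 = bump_alloc(&a);
    return p2 - p1;
}
```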
Algorithm
The computation in sub_2FC4FC0 proceeds in eight phases. It is self-recursive: when iterative refinement discovers new work, the function calls itself to converge.
Phase 1 -- Initialization and Cleanup (0x2FC4FC0 -- 0x2FC50C2)
Links the SlotIndex base ([rdi] = [rsi+0x30]), increments the iteration counter at [this+0x10], and walks the existing segment table (stride 0x128) freeing stale entries. Segments marked with the empty sentinel (0xFFFFFFFFFFFFF000) are skipped; tombstoned entries (0xFFFFFFFFFFFFE000) and live entries both have their four internal buffers freed and are then marked empty.
The cleanup loop at 0x2FC5040--0x2FC50AE iterates with stride 0x128 over the segment array beginning at rbx. For each entry it checks [rbx+0x00] against both sentinels. If the entry is live or tombstoned, it frees four inlined SmallVector buffers in reverse allocation order:
- [rbx+0xE0] -- use-def chain buffer (freed if pointer differs from inline region at rbx+0xF0).
- [rbx+0x98] -- kill-set buffer (freed if pointer differs from inline region at rbx+0xA8).
- [rbx+0x50] -- register mask buffer (freed if pointer differs from inline region at rbx+0x60).
- [rbx+0x08] -- segment endpoint buffer (freed if pointer differs from inline region at rbx+0x18).
After freeing, the entry is stamped with the empty sentinel: mov qword [rbx], 0xFFFFFFFFFFFFF000. The old segment count stored at [rdi+0x20] is loaded into r15d at entry and used to bound the cleanup iteration.
Phase 2 -- Auxiliary Table Cleanup (0x2FC50C2 -- 0x2FC52A3)
Resets the old segment count, increments the auxiliary sequence counter, and walks three secondary tables:
- Pending-def table at [this+0x40] (16-byte stride): cleared with empty sentinels.
- VNInfo chain at [this+0xA0]: walked back-to-front, freeing each node through sub_2E0AFD0 (getRegInfo) and sub_2FBF390. The walk reads the count from [r13+0xA8], loads each entry at [r12-8], and decrements r12. For each VNInfo: frees sub-chains via sub_2FBF390 (size 0x38 = 56 bytes per intermediate node), then frees the VNInfo itself (size 0x78 = 120 bytes) via j_j___libc_free_0.
- Auxiliary tables at offsets 0x130 (48-byte stride) and 0x480 (16-byte stride): freed/resized via sub_C7D6A0 (realloc).
- Checks [r13+0x458] for additional pending work from a previous iteration.
Phase 3 -- Block Count and Threshold Check (0x2FC52A3 -- 0x2FC53F4)
Computes the active block count from the MBB array: active = (total_blocks * 4/5) - dead_block_count. The * 4/5 fraction is computed via the classic imul 0xCCCCCCCD trick for unsigned division by 5 on x86. If the result is zero, the function returns immediately.
The precise x86 idiom:
mov rax, [rdx+10h]
sub rax, [rdx+8] ; pointer diff on MBB array
sar rax, 3 ; divide by sizeof(pointer) = 8
imul eax, 0xCCCCCCCD ; unsigned multiply by magic constant
shr eax, 2 ; result = total_blocks * 4 / 5 (rounded down)
sub eax, [rdx+20h] ; subtract dead_block_count
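The decompiled listing abbreviates the full 64-bit product, but the underlying reciprocal identity can be checked directly: 0xCCCCCCCD = ceil(2^34 / 5), so the high part of the 64-bit product implements exact unsigned division by 5 for any 32-bit input. Function names below are ours:

```c
#include <assert.h>
#include <stdint.h>

/* Exact n / 5 for any 32-bit n: multiply by the rounded-up fixed-point
   reciprocal ceil(2^34 / 5) = 0xCCCCCCCD and shift the product right 34. */
static uint32_t div5(uint32_t n) {
    return (uint32_t)(((uint64_t)n * 0xCCCCCCCDULL) >> 34);
}

/* The Phase 3 quantity total_blocks * 4 / 5, using the same reciprocal:
   since 4/5 = 4 * (1/5), shifting by 32 instead of 34 folds in the *4. */
static uint32_t four_fifths(uint32_t n) {
    return (uint32_t)(((uint64_t)n * 0xCCCCCCCDULL) >> 32);
}
```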
Two bitvectors are allocated on the stack for the live-in set. Initial inline capacity is 8 words (512 registers); if the block count exceeds 8, SmallVector::grow at sub_C8D5F0 expands them. The pre-allocated capacity at [r13+0xAC] is also checked; if insufficient, sub_2FC1040 (grow per-block segment table) is called.
Small-function bypass: If the total instruction count is 15 or fewer, OR the block count is 1 or fewer, OR the global flag qword_5025F68 is set (-Ofast-compile mode [LOW confidence] -- the flag triggers a compile-time shortcut consistent with a fast-compile option, but no string or CLI mapping for this global has been recovered; it could also be a debug-only override or an internal tuning knob), the function skips the full dataflow and returns early. This is an NVIDIA addition not present in upstream LLVM -- it avoids the quadratic cost of bitvector dataflow on trivial kernel bodies where liveness is obvious from local analysis alone.
Phase 4 -- Per-Block Segment Allocation (0x2FC538D -- 0x2FC55E7)
Calls sub_2FC1A70 (ensureCapacity) to prepare per-block storage, then loops over all non-dead blocks summing instruction counts. For each block:
- Allocates a 120-byte VNInfo via the bump allocator (sub_22077B0). If allocation fails, jumps to the error path at 0x2FC7E1C.
- Initializes inline buffers with capacity markers (0x200000000 -- encodes inline capacity 2 in the high 32 bits with size 0 in the low 32 bits, the standard LLVM SmallVector representation).
- Sets [vn+0x00] = pointer to inline endpoint buffer (rax+0x10), [vn+0x40] = pointer to inline kill-set buffer (rax+0x50).
- Clears sub-chain pointers: [vn+0x60] = 0, [vn+0x68] = 0.
- Records the block number at [vn+0x70] = ebx and clears the value number [vn+0x74] = 0.
- Advances the bump-pointer allocator at [r14+0x4A0] by 0x10 to allocate a "pending use" object. The allocator checks against capacity at [r14+0x448] and falls back to a slow-path reallocation when the arena fills.
- Inserts the VNInfo into the [this+0xA0] vector (grows if needed via sub_C7D6A0).
- Registers the block number in the [this+0xC0] map (grows if needed).
- Frees the old VNInfo if it was a placeholder from a previous iteration.
Phase 5 -- Liveness Propagation via Bitvector Dataflow (0x2FC5656 -- 0x2FC5CC6)
This is the core computation -- a standard backward-dataflow fixed-point iteration, operating on 64-bit word bitvectors. It implements the classic liveness equation:
LiveIn(B) = (LiveOut(B) \ Kill(B)) | Def(B)
LiveOut(B) = Union over all successors S of LiveIn(S)
The iteration continues until no bitvector word changes across a complete pass over all pending blocks. The changed flag (var_1B0 on the stack) is cleared at the top of each outer iteration and set whenever any bitvector word is modified.
Detailed dataflow pseudocode
// Phase 5 reconstructed from sub_2FC4FC0 at 0x2FC5656--0x2FC5CC6
//
// State:
// segment_table[] -- hash table, stride 0x128, keyed by block ID
// .gp_bv (+0x98) -- general-purpose register bitvector (live set)
// .pred_bv (+0xE0) -- predicate register bitvector (live set)
// .kill_set(+0xA8) -- inline kill-set buffer
// .kill_cnt(+0xD8) -- number of killed registers
// .def_bv (+0x08) -- def-set bitvector
// worklist -- pending blocks at [r13+0x50]
// bv_words -- number of 64-bit words = ceil(num_regs / 64)
// changed -- var_1B0 on stack
fn liveness_propagation(this: &mut LiveRangeCalc) -> bool {
let bv_words: usize = (this.num_regs + 63) / 64;
loop {
let mut changed: bool = false;
for block in this.worklist.iter() {
// --- Step 1: Hash lookup for block's segment entry ---
// Hash function: h = ((block.id >> 4) ^ (block.id >> 9))
// & (capacity - 1)
// Linear probing until key match or empty sentinel
let entry = this.segment_table.lookup(block.id);
// --- Step 2: Accumulate kill bitvector from kill set ---
// The kill set at entry.kill_set contains register IDs
// that are killed (last-use) within this block.
// For each killed register, look up its own segment entry
// and OR its kill bitvector into a local accumulator.
let mut kill_accum: [u64; bv_words] = [0; bv_words];
for i in 0..entry.kill_cnt {
let killed_reg = entry.kill_set[i];
let kill_entry = this.segment_table.lookup(killed_reg);
// x86: OR [kill_accum + rdx*8], [kill_entry.kill_bv + rdx*8]
for w in 0..bv_words {
kill_accum[w] |= kill_entry.gp_bv[w];
}
}
// --- Step 3: Compute live_in for general-purpose registers ---
// Standard backward dataflow: live_in = (live_out & ~kills) | defs
// live_out is the current content of entry.gp_bv (propagated
// from successors in previous iterations or initialization)
let mut src: [u64; bv_words];
for w in 0..bv_words {
// x86: rax = NOT [kill_accum + w*8]
// rax = AND rax, [entry.gp_bv + w*8] -- live_out & ~kills
// rax = OR rax, [entry.def_bv + w*8] -- | defs
src[w] = (entry.gp_bv[w] & !kill_accum[w]) | entry.def_bv[w];
}
// Boundary mask: clear unused high bits in last word
// x86: ecx = num_regs & 63
// shl rdx, cl; not rdx; and [src + (bv_words-1)*8], rdx
if this.num_regs % 64 != 0 {
let tail_bits = this.num_regs % 64;
let mask = (1u64 << tail_bits) - 1;
src[bv_words - 1] &= mask;
}
// --- Step 4: Interference check against allocated set ---
// Compares computed live_in against the segment's "allocated"
// bitvector at +0x98. Any bit set in src but NOT in allocated
// indicates a new live register that extends the range.
// x86 at 0x2FC5B86:
// rax = NOT [entry.gp_bv + rdx*8] -- ~allocated
// rax = AND rax, [src + rdx*8] -- new bits
// test rax, rax / jnz -> extend
for w in 0..bv_words {
let new_bits = src[w] & !entry.gp_bv[w];
if new_bits != 0 {
entry.gp_bv[w] |= src[w]; // extend coverage
changed = true;
}
}
// --- Step 5: Repeat identically for predicate register bv ---
// The predicate bitvector at entry offset +0xE0 is processed
// with exactly the same kill-accumulate / dataflow / interference
// sequence. Predicate registers (%p0, %p1, ...) occupy a
// physically separate register file in NVPTX hardware, so they
// get their own independent bitvector to avoid inflating the
// interference graph of the main register namespace.
// [identical loop over pred_bv words omitted for brevity]
} // end for each block
if !changed {
break; // Fixed point reached
}
// Otherwise: var_1B0 was set to 1, loop back to top
}
}
Convergence criteria
The fixed-point iteration terminates when a complete pass over all pending blocks produces no change to any bitvector word. Formally, convergence is guaranteed because:
- Monotonicity. Each bitvector word can only gain bits (the |= operation in the interference-check step is monotone). Bits are never cleared during the iteration.
- Finite lattice. The bitvector domain is a finite lattice of height num_regs. Each word can change at most 64 times (once per bit), so the total number of changes across all words and all blocks is bounded by N * W * 64 where N = block count and W = bitvector width in words.
- Worst-case iterations. In practice, the iteration converges in O(D) passes where D = maximum loop nesting depth of the CFG. Each pass propagates liveness information one level deeper through nested loops. The theoretical worst case is N iterations for a pathological CFG with a chain of N blocks each feeding into the next, but CUDA kernels rarely exhibit such structure.
The changed flag (var_1B0) is a single byte on the stack. It is zeroed with mov byte [rbp+var_1B0], 0 at the top of each outer iteration and set with mov byte [rbp+var_1B0], 1 whenever the interference check finds new bits. The outer do { ... } while (changed) loop tests this byte at 0x2FC5CC0 with cmp byte [rbp+var_1B0], 0; jne back to the loop head at 0x2FC5656.
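The monotone fixed point can be exercised on a toy CFG. This miniature is our own construction (a 3-block chain with invented def/kill sets, one bitvector word per block), applying the same live_in = (live_out & ~kill) | def equation and changed-flag loop structure:

```c
#include <assert.h>
#include <stdint.h>

enum { NBLOCKS = 3 };   /* straight-line chain B0 -> B1 -> B2 */

/* Backward dataflow to a fixed point, one 64-bit word per block:
   defs seed liveness backward, kills terminate it. */
static void solve(const uint64_t *def, const uint64_t *kill,
                  uint64_t *live_in, uint64_t *live_out) {
    int changed = 1;
    while (changed) {                /* cf. the var_1B0 outer loop */
        changed = 0;
        for (int b = NBLOCKS - 1; b >= 0; b--) {
            /* live_out(B) = union of successors' live_in */
            uint64_t out = (b + 1 < NBLOCKS) ? live_in[b + 1] : 0;
            uint64_t in  = (out & ~kill[b]) | def[b];
            if (in != live_in[b] || out != live_out[b]) {
                live_in[b]  = in;    /* monotone: bits only gained */
                live_out[b] = out;
                changed = 1;
            }
        }
    }
}

/* r0 = bit 0, r1 = bit 1. B2 generates r0; B1 kills r0 and generates
   r1; B0 kills r1. So only r1 (bit 1) is live into B1. */
static uint64_t demo_live_in_of_b1(void) {
    uint64_t def[NBLOCKS]  = { 0, 2, 1 };
    uint64_t kill[NBLOCKS] = { 2, 1, 0 };
    uint64_t in[NBLOCKS] = {0}, out[NBLOCKS] = {0};
    solve(def, kill, in, out);
    return in[1];
}
```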
Kill and Def computation
The kill and def sets are not computed inside sub_2FC4FC0 itself. They are pre-populated by callers before invoking the dataflow engine:
- Kill set (+0xA8 inline buffer, count at +0xD8): Populated by sub_2FC8470 (LiveIntervals::computeRegUnitRange), which walks each MachineBasicBlock's instruction list. A register is added to the kill set when an instruction has a use operand that is the last use before the next def (or end of block). The kill set is stored as a flat array of register IDs, not a bitvector -- the dataflow loop then expands it into a bitvector accumulator by looking up each killed register in the hash table.
- Def set (+0x08 endpoint buffer): Populated by the same caller. A register is added when a MachineInstr defines it (operand flag isDef). For NVPTX, since all registers are virtual, every def creates a fresh value number. The def set is stored as a bitvector where bit i is set if virtual register i is defined in the block.
- Initial live-out (+0x98 for GP, +0xE0 for predicate): Initialized to the empty set for all blocks. The dataflow iteration propagates liveness backward: when a use is found in a successor block with no preceding def, the register becomes live-out in the current block. The first iteration seeds liveness from the use/def information; subsequent iterations propagate it through the CFG.
This separation means the hash table must be fully populated with per-block kill and def information before sub_2FC4FC0 enters Phase 5. The hash table at sub_2FC0880 supports insert, lookup, and resize operations with open addressing.
Bitvector word-at-a-time implementation
All bitvector operations operate on 64-bit words with standard x86-64 bitwise instructions:
| Operation | x86 pattern | Semantics |
|---|---|---|
| Union (OR) | or [rdx+rax*8], rcx | bv[w] \|= src[w] |
| Difference (AND-NOT) | mov rax, [rsi+rdx*8]; not rax; and rax, [rdi+rdx*8] | new = src[w] & ~allocated[w] |
| Boundary mask | mov ecx, count_mod_64; mov rdx, -1; shl rdx, cl; not rdx; and [ptr+last_word], rdx | Clear unused high bits |
| Zero test | test rax, rax; jnz target | Any bit set? |
The boundary mask is critical for correctness: without it, garbage bits in the padding region of the last word would create phantom interference. The mask is computed once per iteration entry and applied after every live-in computation. The instruction sequence shl rdx, cl; not rdx creates a mask with count % 64 low bits set and the rest cleared.
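In C the shl/not idiom is a one-liner; the sketch below reproduces it (valid, as in the binary, only for count % 64 in 1..63 -- the pseudocode guards the count % 64 == 0 case separately):

```c
#include <assert.h>
#include <stdint.h>

/* x86 idiom: rdx = -1; shl rdx, cl; not rdx
   -> mask with the low (count % 64) bits set, used to clear garbage
   bits in the padding region of the last bitvector word. */
static uint64_t tail_mask(unsigned count_mod_64) {
    /* caller guarantees 1 <= count_mod_64 <= 63 (shift by 64 is UB) */
    return ~(~0ULL << count_mod_64);
}
```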
Hash table for segment lookup
The segment hash table (sub_2FC0880) uses the standard DenseMap infrastructure with LLVM-layer sentinels (-4096 / -8192) and an entry stride of 0x128 (296 bytes), matching the full segment structure size. See Hash Table and Collection Infrastructure for the hash function, probing, and growth policy.
During the dataflow iteration, each block requires two hash lookups per killed register (one for the block entry, one for each killed register's entry), so the total hash table traffic per iteration is O(N * K_max) where K_max is the maximum kill-set size across all blocks. Since NVPTX virtual register counts are typically in the hundreds (and the post-allocation register count is bounded by -maxreg, default 70), the hash table remains small and cache-friendly.
Phase 6 -- PHI Value Resolution (0x2FC5ED8 -- 0x2FC5F95)
After the dataflow converges, resolves PHI-def values at block boundaries. For each block, walks the predecessor chain at [block+0x30] and calls sub_2FBF8B0 (resolvePhiValue / findReachingDef) with four arguments: the LiveRangeCalc*, predecessor MBB, current bitvector, and a stack-allocated phi resolution buffer. This is the same algorithm as upstream LiveRangeCalc::updateSSA -- it propagates live-out values down the dominator tree and inserts PHI-def VNInfo nodes where multiple values reach a merge point.
The var_181 byte is initialized to 0 before each block as a "phi_resolved" flag. If sub_2FBF8B0 returns true, control jumps to 0x2FC710C for phi merge handling -- this path allocates a new VNInfo, links it into the sub-chain at [vn+0x60]/[vn+0x68], and updates the block's value number at [vn+0x74]. The temporary phi resolution buffer is freed after each block regardless of the outcome.
Phase 7 -- Segment Endpoint Fixup (0x2FC5FA8 -- 0x2FC6021)
For each word in the destination bitvector that has bits set (masked with 0xFFFFFFFFFFFFFFF8 to skip low tag bits), looks up the block's SlotIndex via [r14+0x18] shifted and indexed into the SlotIndex table at [rcx+0x98], retrieves the segment's use-def chain at [rdi+0x40], and calls sub_2E0F080 (addSegment / extendInBlock) to materialize the [start, end) segment in the LiveRange object. After processing all pending blocks, advances to the next MBB in the linked list via [r14+8], continuing until hitting the sentinel at [rbp+var_1F0].
Phase 8 -- Finalization and Return (0x2FC5974 -- 0x2FC59E6)
If no interference was found across all iterations, frees pending blocks from the [this+0x4A8] array (via sub_2E88E20), sets the pending count to zero ([r13+0x4B0] = 0), frees any dynamically-allocated bitvectors, and returns bool indicating whether any live range was extended. The return value is derived from var_1F0 = (count != 0).
Dual Bitvector Tracking
The most significant NVIDIA-specific modification is maintaining two independent bitvectors per segment:
| Offset | Register class | Purpose |
|---|---|---|
| +0x98 | General-purpose registers | %r, %rd, %f, %fd, %h, %fh liveness |
| +0xE0 | Predicate registers | %p liveness |
Both bitvectors are processed by identical code paths in Phase 5, but independently -- kills in one class do not affect the other. This separation reflects NVPTX's hardware architecture where predicate registers occupy a physically separate register file from data registers. Upstream LLVM's LiveRangeCalc handles all register classes through a single unified mechanism; CICC's split avoids interference-graph inflation by keeping the small predicate namespace out of the main bitvector.
The two bitvectors are processed sequentially within the same iteration body (not in separate passes). For each pending block, the general-purpose bitvector at +0x98 is processed first, then the predicate bitvector at +0xE0 is processed with structurally identical code. The changed flag is shared between both -- a change in either bitvector triggers another iteration of the outer loop. This means the predicate register dataflow rides for free on the same convergence pass, and the two bitvectors converge simultaneously.
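The shared-flag scheme means neither class converges "early" on its own: a change to either vector re-runs the whole body. A single-word sketch of that structure (names ours, the real code loops over W words per class):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* GP (+0x98) and predicate (+0xE0) live sets, one word each here. */
typedef struct {
    uint64_t gp;
    uint64_t pred;
} LiveSets;

/* OR new bits into one class's vector; report whether anything grew
   (the interference check: src & ~allocated). */
static bool extend(uint64_t *bv, uint64_t src) {
    uint64_t new_bits = src & ~*bv;
    *bv |= src;
    return new_bits != 0;
}

/* Both classes update inside the same iteration body; one shared
   changed flag drives the outer loop. Returns the pass count. */
static int converge(LiveSets *s, uint64_t gp_src, uint64_t pred_src) {
    int passes = 0;
    bool changed = true;
    while (changed) {
        changed = false;
        changed |= extend(&s->gp, gp_src);
        changed |= extend(&s->pred, pred_src);
        passes++;
    }
    return passes;
}

/* One pass to absorb the new bits, one quiet pass to confirm. */
static int demo_passes(void) {
    LiveSets s = { 0, 0 };
    return converge(&s, 0x5, 0x1);
}
```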
The register coalescer at sub_34A46B0 also maintains a bitvector-per-block structure (a 12,336-byte stack buffer v90[12336] at offset 0x270 used as a bitmap for tracking live-through blocks during range rebuild after coalescing). That coalescer bitvector feeds updated information back into the LiveRangeCalc segment table when live intervals are modified by register coalescing.
Differences from Upstream LLVM
CICC v13.0's LiveRangeCalc diverges from upstream LLVM LiveRangeCalc (as of LLVM 17.x) in these specific ways:
- Dual bitvector tracking. Upstream uses a single mechanism for all register classes. CICC splits GP and predicate into independent bitvectors to exploit the physical separation in NVPTX hardware.
- Small-function bypass. The instruction-count threshold of 15 and the block-count threshold of 1 are NVIDIA additions. Upstream always runs the full dataflow. This optimization is significant because CUDA kernels frequently contain tiny __device__ helper functions that are inlined by the optimizer.
- Global fast-compile flag. The qword_5025F68 check that bypasses the entire dataflow loop has no upstream equivalent. It is likely tied to the -Ofast-compile or -O0 optimization level in cicc.
- Enlarged segment structure. Upstream's LiveRange::Segment is 24 bytes (start SlotIndex, end SlotIndex, VNInfo pointer). CICC's segment is 296 bytes (0x128), inlining four SmallVector buffers to avoid heap allocations on the hot path. This is a performance optimization for the common case where segments have small kill sets and few endpoints.
- Active-block fraction. The * 4/5 computation in Phase 3 (via imul 0xCCCCCCCD) to determine the active block count is not present in upstream. Upstream counts all blocks equally. CICC discounts approximately 20% of blocks, likely accounting for unreachable or dead blocks that StructurizeCFG may have created but not yet eliminated.
- PhysReg parameter always zero. Upstream's findReachingDefs takes a Register PhysReg parameter for physical register interference. Since NVPTX has no physical registers (all registers are virtual and hardware-mapped at launch time), this parameter is always Register() (zero). The binary confirms: sub_2E0FDD0 (isAllocatable) is called but its return value never gates segment creation.
GPU-Specific Considerations
Virtual-only register file. NVPTX has no physical registers in the LLVM sense -- all registers are virtual (%r0, %f0, %p0, ...) and the hardware thread scheduler maps them at launch time. This means LiveRangeCalc never needs to handle physical register liveness, live-in lists for calling conventions, or register unit interference. The PhysReg parameter in upstream's findReachingDefs is always Register() (zero). The binary confirms this: sub_2E0FDD0 (isAllocatable / reserved register check) is called but its return value is never used to gate segment creation.
Pressure-driven analysis. The live intervals produced by LiveRangeCalc feed directly into the greedy register allocator's interference cache (at selectOrSplit offset +648). Since NVPTX allocation is pressure-driven rather than assignment-driven, the intervals primarily serve to detect which virtual registers are simultaneously live, not to assign physical registers. The total count of simultaneously-live intervals at any program point determines the register pressure, which the allocator compares against the -maxreg limit (default 70).
Small-kernel bypass. The threshold check in Phase 3 (instruction count <= 15 OR block count <= 1) is absent from upstream LLVM. CUDA kernels frequently contain tiny helper device functions that are inlined into the caller; computing full dataflow liveness for a 10-instruction single-block function is pure overhead. The bypass returns immediately, letting the register allocator fall back to local analysis.
Configuration
| Knob | Default | Effect |
|---|---|---|
| early-live-intervals | false | Runs LiveIntervals analysis earlier in the pipeline, before the standard scheduling pass |
| join-liveintervals | true | Master enable for register coalescing over live intervals |
| qword_5025F68 (global flag) | 0 | When nonzero (likely -Ofast-compile), skips the full dataflow loop entirely |
The instruction-count threshold of 15 and the block-count threshold of 1 are hardcoded constants, not configurable via LLVM cl::opt flags.
LiveRangeCalc Object Layout
The LiveRangeCalc object (this pointer passed in rdi) is reconstructed from register offsets observed throughout sub_2FC4FC0:
LiveRangeCalc (approx 0x4C0 bytes):
+0x00 ptr SlotIndex base (set from [rsi+0x30] in Phase 1)
+0x08 ptr VNInfo* / MBB* parameter (set from rsi in Phase 1)
+0x10 u32 iteration counter (incremented each call)
+0x14 u32 (padding / alignment)
+0x20 u32 old segment count (r15d loaded in Phase 1)
+0x30 u32 auxiliary sequence counter (incremented in Phase 2)
+0x40 ptr pending-def table (16-byte stride)
+0x50 ptr worklist (pending blocks array)
+0xA0 ptr VNInfo chain (vector of VNInfo*)
+0xA8 u64 VNInfo chain count
+0xAC u32 pre-allocated capacity for per-block segment table
+0xC0 ptr block-number-to-VNInfo map
+0x130 ptr auxiliary table (48-byte stride)
+0x440 ptr bump allocator arena base
+0x448 u64 bump allocator capacity
+0x458 ptr additional pending work (checked in Phase 2)
+0x480 ptr secondary auxiliary table (16-byte stride)
+0x4A0 ptr bump allocator cursor (advances by 0x10 per allocation)
+0x4A8 ptr pending-blocks array (freed in Phase 8)
+0x4B0 u64 pending block count (zeroed in Phase 8)
Complexity
- Per iteration: O(N * W) where N = number of basic blocks and W = bitvector width in words (ceil(num_regs / 64)). Both GP and predicate bitvectors are processed per iteration, so the actual cost is O(N * (W_gp + W_pred)), but since predicate register counts are small (typically < 64, fitting in a single word), the predicate contribution is O(N).
- Kill-set expansion per iteration: O(N * K_max * W) where K_max = maximum kill-set size per block. For each of the N blocks, up to K_max hash lookups and W-word OR operations are performed.
- Convergence: Typically O(D) iterations where D = maximum loop nesting depth. The monotonicity of the OR-based bitvector union guarantees termination. Worst case is O(N) iterations for a pathological single-predecessor chain, but CUDA kernels (especially after StructurizeCFG) have bounded nesting depth.
- Total: O(N * W * D) for the core liveness computation, plus O(N * K_max * W * D) for kill-set expansion.
- Hash table operations: O(1) amortized per lookup. Load factor is maintained below 75% by the DenseMap rehash policy.
- Memory: O(N * W) for bitvectors + O(S * 296) for the segment table where S = number of live segments + O(V * 120) for VNInfo nodes where V = number of value numbers.
- Phase 1 cleanup: O(S_old) where S_old = segment count from previous iteration. Each segment requires checking four buffer pointers and potentially freeing four allocations.
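The O(N * W * D) bound above falls out of the standard OR-based backward dataflow shape. As a minimal sketch (not CICC code -- block ids and register sets here are invented for the example), the fixpoint over 64-bit-word bitvectors looks like:

```python
W = 64  # bits per bitvector word

def liveness(blocks, succs, use, defs, nregs):
    """blocks: list of block ids; succs: id -> successor ids;
    use/defs: id -> set of register numbers. Returns live-in bitvectors,
    one list of (nregs+63)//64 words per block."""
    words = (nregs + W - 1) // W

    def to_bv(s):
        bv = [0] * words
        for r in s:
            bv[r // W] |= 1 << (r % W)
        return bv

    live_in = {b: [0] * words for b in blocks}
    use_bv = {b: to_bv(use[b]) for b in blocks}
    def_bv = {b: to_bv(defs[b]) for b in blocks}
    changed = True
    while changed:                     # typically O(D) iterations
        changed = False
        for b in reversed(blocks):     # reverse order speeds convergence
            live_out = [0] * words
            for s in succs[b]:
                for i in range(words):
                    live_out[i] |= live_in[s][i]   # monotone OR union
            new_in = [use_bv[b][i] | (live_out[i] & ~def_bv[b][i])
                      for i in range(words)]
            if new_in != live_in[b]:
                live_in[b] = new_in
                changed = True
    return live_in
```

The OR union only ever sets bits, which is the monotonicity argument for termination cited above: each bitvector can change at most nregs times.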
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| LiveRangeCalc::extend / calculateValues -- main entry, self-recursive (12,900 bytes, 78KB decompiled) | sub_2FC4FC0 | -- | -- |
| LiveIntervals::computeRegUnitRange (caller, populates kill/def sets) | sub_2FC8470 | -- | -- |
| LiveIntervals::createDeadDef / addSegment (caller) | sub_2FC8230 | -- | -- |
| ensureCapacity / resetLiveRanges (per-block storage preparation) | sub_2FC1A70 | -- | -- |
| grow per-block segment table (called when [r13+0xAC] insufficient) | sub_2FC1040 | -- | -- |
| interval building helper (called from sub_2FC1040) | sub_2FC1190 | -- | -- |
| hash table operations: insert/lookup/resize with open addressing | sub_2FC0880 | -- | -- |
| segment creation / initialization (296-byte struct setup) | sub_2FC0040 | -- | -- |
| resolvePhiValue / findReachingDef (PHI resolution, 4 args) | sub_2FBF8B0 | -- | -- |
| free VNInfo chain (frees 0x38-byte intermediate nodes, 0x78-byte VNInfo) | sub_2FBF390 | -- | -- |
| segment merge / extend (interference update) | sub_2FBFCC0 | -- | -- |
| live range query | sub_2FC3C20 | -- | -- |
| live range intersection test | sub_2FC3A50 | -- | -- |
| getRegInfo / MachineRegisterInfo query | sub_2E0AFD0 | -- | -- |
| isAllocatable / reserved register check (return value unused in NVPTX) | sub_2E0FDD0 | -- | -- |
| addSegment / extendInBlock (materializes [start, end) segments) | sub_2E0F080 | -- | -- |
| MachineFunction helper | sub_2E76F70 | -- | -- |
| eraseFromParent (MachineInstr deletion, used in Phase 8 cleanup) | sub_2E88E20 | -- | -- |
| register property check (called with flags 0x80000, 0x100000) | sub_2E88A90 | -- | -- |
| operator new (VNInfo allocation, 120 bytes) | sub_22077B0 | -- | -- |
| SlotIndexes::runOnMachineFunction (11KB) | sub_1F10BF0 | -- | -- |
| SlotIndexes pass registration ("slotindexes" / "Slot index numbering") | sub_1F10320 | -- | -- |
| SlotIndexes insertion / repair (13KB) | sub_1F112A0 | -- | -- |
| SlotIndex validity check (string: "invalid") | sub_1F10810 | -- | -- |
| computeLiveIntervals (RA integration, called from greedy RA init) | sub_2F54D60 | -- | -- |
| SmallVector::grow (bitvector expansion when block count > 8) | sub_C8D5F0 | -- | -- |
| realloc (SmallVector resize / auxiliary table resize) | sub_C7D6A0 | -- | -- |
| malloc (new allocation) | sub_C7D670 | -- | -- |
Cross-References
- Register Allocation -- consumes live intervals to drive the pressure-based greedy allocator
- Register Coalescing -- merges live ranges of copy-connected virtual registers; runs before RA, feeds updated intervals back through LiveRangeCalc
- Instruction Scheduling -- the SlotIndexes numbering assigned here is consumed during post-RA scheduling for latency-aware reordering
- SelectionDAG -- produces the initial MachineInstr stream that SlotIndexes numbers
Register Coalescing
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
Register coalescing in CICC v13.0 eliminates redundant copy instructions by merging the live ranges of their source and destination virtual registers. NVPTX's unlimited virtual register model (PTX has no fixed physical register file) changes the purpose of coalescing compared to CPU targets: rather than reducing physical register pressure to avoid spills, the goal is strictly copy elimination -- fewer mov instructions in the emitted PTX, which in turn gives ptxas a cleaner input with fewer live-range constraints to resolve during its own physical allocation. CICC runs two coalescing passes in sequence: the standard LLVM RegisterCoalescer at sub_2F71140 (which handles generic COPY pseudo-instructions) and a separate NVPTX-specific coalescer rooted at sub_34AF4A0 (which handles NVPTX copy instruction families in the opcode 440--503 range that the generic pass does not recognize). This page documents both, with emphasis on the NVPTX-specific pass where the bulk of the proprietary logic resides.
| Standard LLVM RegisterCoalescer | sub_2F71140 (80KB, 2,190 lines) |
| NVPTX coalescing driver | sub_34AF4A0 (67KB, 2,373 lines) |
| Per-instruction coalesce attempt | sub_34AE060 (28KB) |
| Interference check | sub_34AA450 (11.5KB) |
| Block-level coalescing | sub_34BAAF0 (31.7KB) |
| Live-out / weight computation | sub_34B7280 (22KB) |
| Interval tree (red-black BST) | sub_34A0610 (14.7KB) |
| Range rebuild after merge | sub_34A46B0 (13KB) |
| Opcode -> copy-type mapping | sub_3494EA0 (12.7KB) |
| Operand type classification table | byte_444C4A0 (16-byte entries) |
| Address range | 0x3494EA0 -- 0x34BF740 |
| Pass parameters | (pass_obj*, func_info*, MF*, copy_limit, coalesce_limit) |
| Pass ordering | After TwoAddressInstruction, before greedy RA |
Why Coalescing Matters on a Virtual-Register Target
On CPU targets, coalescing reduces register pressure by allowing two virtual registers to share one physical register, potentially preventing a spill. On NVPTX the motivation is different. PTX is a virtual ISA with typed, unlimited registers (%r0, %r1, ... for 32-bit integers; %f0, %f1, ... for 32-bit floats). The "physical" allocation is deferred entirely to ptxas, which maps virtual registers to the hardware register file at kernel launch time based on occupancy targets. CICC's coalescing therefore serves three purposes:
- Copy elimination. Every mov instruction that survives into emitted PTX is dead weight -- it costs an issue slot and extends the live range of both source and destination. Coalescing removes these by unifying src and dst into a single virtual register.
- Reduced register name count. Even though PTX registers are virtual, ptxas must solve a graph-coloring problem on them. Fewer distinct register names (after coalescing merges equivalents) give ptxas a smaller interference graph and faster compilation.
- Cleaner SSA destruction. PHI elimination during the transition from SSA form to machine code inserts copies at every PHI edge. Many of these are immediately coalesceable because the PHI operand's live range does not extend past the copy point. The coalescer cleans up the mechanical output of PHI lowering.
Copies that the coalescer processes arise from three sources: PHI elimination copies, ABI/calling-convention .param register copies for kernel call boundaries, and sub-register operations (EXTRACT_SUBREG, INSERT_SUBREG).
Standard LLVM RegisterCoalescer (sub_2F71140)
CICC includes the stock LLVM RegisterCoalescer at sub_2F71140, registered as pass "register-coalescer" with debug output markers "Before register coalescing" / "After register coalescing". This pass handles the generic COPY pseudo-instruction (LLVM's TargetOpcode::COPY) using the standard worklist-driven algorithm from upstream.
The key LLVM knobs that apply to this instance:
| Knob | Default | Effect |
|---|---|---|
| join-liveintervals | true | Master enable for copy coalescing |
| join-splitedges | subtarget | Coalesce copies on split critical edges |
| join-globalcopies | subtarget | Coalesce copies that span basic blocks |
| terminal-rule | true | Apply the terminal rule (copies at block ends) |
| verify-coalescing | false | Verify MachineInstrs before and after coalescing |
| late-remat-update-threshold | 100 | Batch live-interval updates when a def has many copy uses |
| large-interval-size-threshold | 100 | Intervals with more valnos than this are "large" |
| large-interval-freq-threshold | 256 | Stop coalescing a large interval after this many joins |
The standard pass operates on COPY pseudo-instructions only. It does not understand NVPTX-specific move instruction families (opcodes 440--503), which is why the NVPTX-specific pass exists.
NVPTX-Specific Coalescer (sub_34AF4A0)
The proprietary coalescer at sub_34AF4A0 runs after the standard RegisterCoalescer and targets NVPTX copy instruction families that the generic pass skips. It operates on the MachineFunction representation and accepts two limit parameters beyond the standard pass/MF arguments: copy_limit (maximum number of copy instructions to consider) and coalesce_limit (maximum number of successful merges before bailing out). These are compile-time budget controls that prevent quadratic behavior on large functions.
Opcode Classification
The function sub_3494EA0 contains a giant switch statement mapping NVPTX instruction opcodes (range 1--0x12) to copy families in the 440--503 opcode range. Each family represents a distinct copy semantic:
- Opcodes 440--443: Type-preserving moves within a single register class (i32-to-i32, f32-to-f32, etc.). These map from internal opcodes 12, 13, 15 in the operand-type classification table.
- Opcodes 444--503: Cross-class moves, paired/wide register moves (128-bit pairs for tensor core paths), and ABI-related .param copies.
The return value is an __m128i pair encoding both the copy semantics and the register class constraints, which subsequent stages use to decide whether coalescing is legal.
Operand-type classification happens via sub_34961A0, which reads operands and classifies them through a lookup table at byte_444C4A0. Each entry in this table is 16 bytes:
struct OperandTypeEntry {
uint8_t type_code; // +0: 12=i32, 13=i64, 15=f32, etc.
uint8_t size_class; // +1: size in register-width units
uint8_t register_bank; // +2: bank identifier
uint8_t constraint_flags; // +3: bit 0x10 = participates in coalescing
uint8_t reserved[12]; // +4: padding/future use
};
The constraint flag at offset +3 (mask 0x10) gates whether the operand participates in coalescing at all. Operands with this bit cleared are excluded from the worklist.
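Decoding one 16-byte entry and testing the participation bit can be sketched as follows (the raw bytes below are fabricated for illustration; only the field layout and the 0x10 mask come from the recovered struct):

```python
import struct

COALESCE_FLAG = 0x10  # bit in constraint_flags (+3) gating worklist entry

def parse_entry(raw16):
    """Decode a 16-byte OperandTypeEntry from byte_444C4A0."""
    type_code, size_class, bank, flags = struct.unpack_from("<4B", raw16, 0)
    return {
        "type_code": type_code,          # 12=i32, 13=i64, 15=f32, ...
        "size_class": size_class,        # size in register-width units
        "register_bank": bank,
        "coalesceable": bool(flags & COALESCE_FLAG),
    }

# Fabricated example entry: i32, one register wide, bank 0, flag set
entry = parse_entry(bytes([12, 1, 0, 0x10]) + bytes(12))
```

An operand whose entry lacks the 0x10 bit would simply never be pushed onto the coalescing worklist.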
Register Class Constraints
Coalescing is constrained to same-class merges. The NVPTX register classes are completely disjoint -- an Int32Regs (%r) register cannot coalesce with a Float32Regs (%f) register even though both are 32 bits wide. This is a consequence of PTX's typed register model: .reg .b32 %r0 and .reg .f32 %f0 are distinct storage locations from ptxas's perspective. The complete register class table and coalescing constraint flags are in Register Classes. All eight primary classes are same-class-only; Int128Regs is excluded from the coalescing worklist entirely (constraint flag cleared).
Cross-class copies (e.g., bitcasting an i32 to f32) use distinct cross-class copy opcodes (see the copy opcode table) and are never eliminated by the coalescer -- they must survive as explicit instructions in PTX.
Sub-Register Handling
NVPTX has a flat register file with no sub-register structure in the CPU sense. There are no %eax/%ax/%al hierarchies. The exception is wide register pairs: 128-bit values used by tensor core operations are represented as pairs of 64-bit registers. sub_3497B40 handles paired-register decomposition, and when coalescing the low half of a pair, the high half inherits corresponding constraints. The coalesce candidate record (248 bytes) stores sub-operand arrays at offset +16 (4 entries of 32 bytes each, inline SBO) specifically for tracking these pair relationships.
Coalescing Algorithm
The NVPTX coalescer follows the standard LLVM pattern of worklist-driven interval joining but uses proprietary data structures throughout.
Phase 1: Initialization (lines 494--617)
Load TargetInstrInfo, TargetRegisterInfo, and TargetSubtargetInfo from the MachineFunction vtables. Initialize approximately 15 open-addressing hash maps, 2 min-heaps, 3 interval trees (red-black BSTs), and 2 linked lists. The stack frame is approximately 4.5KB. Walk all basic blocks, filter virtual-register operands via sub_2DADC00 (the isVirtualRegister check), and collect copy instructions into the worklist hash.
Phase 2: Block-Level Scanning (lines 618--857)
For each basic block, walk instructions and identify NVPTX copy instructions (opcode field at instruction offset +68 equals 14 or 15). For each copy:
- Validate source type via sub_B10CD0 (extract register class).
- Check physical register constraints (vestigial on NVPTX but present in the code).
- Build a coalesce pair via sub_34A70E0, creating a 248-byte candidate record.
Track live-through registers per block using bitvectors.
Phase 3: Interference Graph Construction (lines 858--998)
Build the interval tree via sub_2DACB60 and sub_C8CD80. Cross-compare forward and backward interval lists via sub_2E564A0. Flatten into indexed format via sub_2E507D0. The result is a set of live intervals indexed by register number, stored in a red-black BST where each node is 448 bytes (0x1C0).
Phase 4: Worklist-Driven Coalescing (lines 1040--2092)
This is the core loop. Candidates are extracted from a min-heap ordered by register number (lowest first -- a standard LLVM heuristic that processes defs before uses in reverse postorder).
function CoalesceWorklistDriven(heap, intervals, hash_map):
while heap is not empty:
candidate = heap.extract_min()
src_interval = lookup(hash_map, candidate.src_key)
dst_interval = lookup(hash_map, candidate.dst_key)
// Same-class check
if register_class(src_interval) != register_class(dst_interval):
continue
// Interference check
if CheckInterference(src_interval, dst_interval) != 0:
push candidate to secondary_heap
continue
// Pre-coalesce validation
if not ValidateCopy(candidate):
push candidate to secondary_heap
continue
// Execute the merge
merged = MergeIntervals(src_interval, dst_interval)
RewriteOperands(candidate.copy_instr, merged)
UpdateHashMap(hash_map, merged)
// Verify and rebuild
VerifyMergedInterval(merged)
RebuildRanges(merged)
// Double-buffer swap: retry with secondary heap
swap(heap, secondary_heap)
if secondary_heap was non-empty:
repeat from top
The double-buffer swap (lines 2073--2093) alternates between two heaps (v373 and v376). After exhausting one worklist, the pass swaps and retries -- implementing the LLVM-style "iterate until convergence" pattern where an earlier merge may resolve interference that blocked a later merge.
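The retry structure can be modeled in a few lines (a sketch, not CICC code -- the `try_merge` callback stands in for the interference check, validation, and merge steps):

```python
import heapq

def coalesce_until_convergence(candidates, try_merge):
    """candidates: list of (reg_number, pair) tuples; try_merge(pair) -> bool.
    Blocked candidates go to a secondary heap; after one heap drains, the
    two are swapped and the pass retries until a full sweep merges nothing."""
    heap, secondary = list(candidates), []
    heapq.heapify(heap)
    merged = []
    while heap:
        progress = False
        while heap:
            key, pair = heapq.heappop(heap)   # lowest register number first
            if try_merge(pair):
                merged.append(pair)
                progress = True
            else:
                heapq.heappush(secondary, (key, pair))
        heap, secondary = secondary, heap     # double-buffer swap
        if not progress:                      # converged: nothing unblocked
            break
    return merged

# Example: merging ("a", "b") resolves the interference blocking ("b", "c")
_done = set()
def _try(pair):
    if pair == ("b", "c") and ("a", "b") not in _done:
        return False
    _done.add(pair)
    return True

merged_order = coalesce_until_convergence(
    [(2, ("b", "c")), (5, ("a", "b"))], _try)
```

The `progress` flag is what bounds the loop: each full sweep either merges at least one candidate (strictly shrinking the remaining work) or terminates.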
Phase 5: Code Patching (lines 2095--2144)
For each coalesced pair, rewrite instruction operands:
- sub_349D6E0 -- look up the merged interval's representative register.
- sub_349FA50 -- find the instruction position.
- sub_2E31040 -- patch the operand's register field.
- Fix linked-list pointers using the ptr & 0xFFFFFFFFFFFFFFF8 mask (the low 3 bits encode tags on MachineOperand pointers: 0 = normal, 3 = tied operand, 4 = implicit operand).
Phase 6: Cleanup (lines 2145--2371)
Destroy interval trees (sub_349E8A0), perform final range rebuild (sub_34A46B0), finalize coalescing metadata (sub_34A2530), commit merged intervals (sub_34AA090), and deallocate all hash maps, heaps, and trees (16+ free calls).
Interference Check (sub_34AA450)
The interference check is the critical decision point. Given two intervals (identified by their register keys), it determines whether merging them would create a conflict -- that is, whether both registers are simultaneously live at any program point.
function CheckInterference(interval_A, interval_B) -> {0 = safe, 1 = interfering}:
for each instruction I in interval_A.instruction_vector:
if I is in the "already-coalesced" set:
continue
reg_class = extract_register_class(I)
dst_interval = lookup(reg_to_interval_hash, I.dst_reg)
if dst_interval overlaps with interval_B:
return 1 // interfering
return 0 // safe to coalesce
The "already-coalesced" set is an open-addressing hash map (pointer keys, hash (key >> 9) ^ (key >> 4), sentinels -4096/-8192). The sentinel check at a3+8 (a flag byte) determines whether the set uses inline or heap storage (small-buffer optimization for sets under approximately 8 entries).
Since NVPTX has no physical register file, "interference" here means purely that two virtual register live ranges overlap at a program point. On CPU targets this would also involve physical register conflict checks, but on NVPTX that dimension is absent.
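With interference reduced to live-range overlap, the core test is whether two sorted lists of half-open [start, end) segments intersect anywhere. A linear merge-walk (a sketch under that representation, not recovered code) decides this in O(len(a) + len(b)):

```python
def intervals_overlap(a, b):
    """a, b: sorted lists of half-open (start, end) segments.
    Returns True iff any segment of a intersects any segment of b."""
    i = j = 0
    while i < len(a) and j < len(b):
        s = max(a[i][0], b[j][0])
        e = min(a[i][1], b[j][1])
        if s < e:                  # nonempty intersection => interference
            return True
        if a[i][1] <= b[j][1]:     # advance whichever segment ends first
            i += 1
        else:
            j += 1
    return False
```

Note that half-open segments make back-to-back ranges (one ending exactly where the other starts) non-interfering, which is precisely the case a coalesced copy produces.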
Priority and Weight System
The coalescing priority determines the order in which candidates are processed when the min-heap's register-number ordering produces ties.
Weight computation (sub_34B7280):
weight = instruction_count + spill_weight[offset+240] + use_count[offset+252]
The flag at offset+254 & 1 guards weight computation: if set, the interval was pre-weighted by an earlier pass and the coalescer uses the existing weight rather than recomputing.
Higher weight means higher coalescing priority. The overall ordering is:
- Primary key: register number (min-heap, lowest first).
- Secondary key: weight (higher breaks ties in favor of more-used registers).
Block frequency integration: The pass reads a boolean from TargetPassConfig (via sub_35DDE70 at *(_QWORD*)(pass[4]+256)+856) that controls whether block frequency data influences priority. When enabled, copies in hot blocks receive higher priority, biasing the coalescer toward eliminating copies on the critical execution path.
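The two-level ordering can be expressed as a composite sort key (a sketch mirroring the prose; the field names and sample numbers are invented, and the weight formula is the one given above):

```python
def weight(instruction_count, spill_weight, use_count):
    # weight = instruction_count + spill_weight[+240] + use_count[+252]
    return instruction_count + spill_weight + use_count

def heap_key(reg_number, w):
    # Min-heap semantics: lowest register number first; negate the weight
    # so that higher weight wins ties between equal register numbers.
    return (reg_number, -w)

cands = [
    (7, weight(4, 1, 2)),    # reg 7, weight 7
    (7, weight(10, 3, 5)),   # reg 7, weight 18 -- should beat the above
    (3, weight(1, 0, 0)),    # reg 3 -- lowest register number, goes first
]
order = sorted(cands, key=lambda c: heap_key(*c))
```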
Data Structures
Hash Maps
All hash maps use the standard DenseMap open-addressing infrastructure described in Hash Table and Collection Infrastructure. Two sentinel variants appear in this pass:
| Variant | Key Type | Sentinel pair |
|---|---|---|
| Integer-key | int32_t | -1 / -2 (hash: key * 37) |
| Pointer-key | int64_t | -4096 / -8192 (hash: (key >> 9) ^ (key >> 4)) |
Growth policy: next_power_of_2(2 * old_capacity - 1), minimum 64 entries.
Allocator: sub_C7D670(size, alignment=8) / sub_C7D6A0(ptr, size, alignment=8) -- CICC's aligned malloc/free wrappers.
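A toy map following the documented policy (sentinels, multiplicative hash, 3/4 load trigger, power-of-two growth) behaves like this -- a sketch for illustration; deletion and tombstone reuse are omitted:

```python
EMPTY, TOMBSTONE = -1, -2   # integer-key sentinel pair from the table above

def next_pow2(n):
    return 1 << (n - 1).bit_length()

class DenseIntMap:
    def __init__(self, capacity=64):
        self.cap = max(64, next_pow2(capacity))   # minimum 64 entries
        self.keys = [EMPTY] * self.cap
        self.vals = [None] * self.cap
        self.count = 0

    def _slot(self, key):
        i = (key * 37) % self.cap                 # integer-key hash
        while self.keys[i] not in (EMPTY, key):
            i = (i + 1) % self.cap                # linear probe
        return i

    def insert(self, key, val):
        if 4 * (self.count + 1) >= 3 * self.cap:  # 75% load trigger
            self._grow()
        i = self._slot(key)
        if self.keys[i] == EMPTY:
            self.count += 1
        self.keys[i], self.vals[i] = key, val

    def lookup(self, key):
        i = self._slot(key)
        return self.vals[i] if self.keys[i] == key else None

    def _grow(self):
        old = [(k, v) for k, v in zip(self.keys, self.vals) if k >= 0]
        self.cap = next_pow2(2 * self.cap - 1)    # documented growth policy
        self.keys = [EMPTY] * self.cap
        self.vals = [None] * self.cap
        self.count = len(old)
        for k, v in old:
            i = self._slot(k)
            self.keys[i], self.vals[i] = k, v
```

Starting at capacity 64, inserting 100 keys grows the table twice (64 -> 128 -> 256): the 48th insert trips 4*48 >= 192 and the 96th trips 4*96 >= 384.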
Interval Tree (Red-Black BST)
Managed by sub_34A0610. Each node is 448 bytes (0x1C0):
| Offset | Size | Field |
|---|---|---|
| +0 | 24 | Tree links (left, right, parent pointers) |
| +32 | 8 | Interval key (register/slot encoding) |
| +64 | 8 | Instruction vector pointer |
| +72 | 4 | Instruction count |
| +192 | 16 | Debug name (SBO: inline if len <= 15) |
| +200 | 4 | Sub-operand count |
| +224 | 4 | Instruction opcode |
| +240 | 2 | Priority/weight (uint16) |
Comparator: sub_34A0190 (compares interval start positions). Rebalancing: sub_34A0330. The tree maintains count (a2[5]) and cached min/max (a2[3]/a2[4]).
Coalesce Candidate Record (248 bytes)
Built by sub_349AB40 for each potential coalescing opportunity:
| Offset | Size | Field |
|---|---|---|
| +0 | 8 | Source interval key |
| +8 | 8 | Destination interval key |
| +16 | 128 | Sub-operand array (SBO, 4 entries x 32 bytes) |
| +64 | 112 | Type-constraint array (SBO, 2 entries x 56 bytes) |
| +192 | 32 | Debug name (SBO string) |
| +224 | 4 | Opcode classification (1--6: copy, subreg, extract, ...) |
| +232 | 4 | Copy source register |
| +240 | 2 | Priority (default: 1) |
MachineOperand Pointer Encoding
Throughout the coalescing code, MachineOperand pointers use low-bit tagging (8-byte alignment guarantees 3 unused low bits):
| Tag (ptr & 7) | Meaning |
|---|---|
| 0 | Normal operand |
| 3 | Tied operand (requires special coalescing -- both operands must map to same register) |
| 4 | Implicit operand (flag bit at operand offset +44, bit 3) |
The code consistently masks with & 0xFFFFFFFFFFFFFFF8 before dereferencing and checks (ptr & 7) == 3 or (ptr & 4) != 0 for branching decisions.
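The tag/untag arithmetic is straightforward to model (addresses below are arbitrary example values; only the masks and tag meanings come from the recovered code):

```python
TAG_MASK = 0x7
PTR_MASK = 0xFFFFFFFFFFFFFFF8   # the mask used before every dereference

def tag(ptr, t):
    """Pack a 3-bit tag into an 8-byte-aligned pointer's low bits."""
    assert ptr % 8 == 0 and 0 <= t <= 7
    return ptr | t

def untag(tagged):
    return tagged & PTR_MASK

def is_tied(tagged):            # (ptr & 7) == 3 branch in the binary
    return (tagged & TAG_MASK) == 3

def is_implicit(tagged):        # (ptr & 4) != 0 branch in the binary
    return (tagged & 0x4) != 0
```

Because malloc'd MachineOperand storage is at least 8-byte aligned, the low 3 bits are guaranteed zero in any real pointer, so tagging is lossless.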
CSSA Coalescing (PHI-Specific)
Separate from the two coalescing passes above, CICC includes a CSSA (Conventional SSA) coalescing stage controlled by the cssa-coalesce knob (constructor at ctor_705, address 0x5BD430). This pass operates at the SSA level rather than the machine level, coalescing PHI operands before PHI elimination to reduce the number of copies that PHI lowering generates. Associated knobs:
| Knob | Effect |
|---|---|
| cssa-coalesce | Enable/disable PHI operand coalescing |
| cssa-verbosity | Verbosity level for CSSA debug output |
| dump-before-cssa | Dump IR before CSSA coalescing |
| usedessa | Select deSSA method (alternative to CSSA) |
Knobs and Thresholds Summary
| Knob | Source | Default | Effect |
|---|---|---|---|
| join-liveintervals | LLVM | true | Master enable for standard RegisterCoalescer |
| join-splitedges | LLVM | subtarget | Coalesce on split critical edges |
| join-globalcopies | LLVM | subtarget | Coalesce cross-block copies |
| terminal-rule | LLVM | true | Terminal rule for block-end copies |
| verify-coalescing | LLVM | false | Pre/post verification |
| late-remat-update-threshold | LLVM | 100 | Batch remat update threshold |
| large-interval-size-threshold | LLVM | 100 | Large interval valno threshold |
| large-interval-freq-threshold | LLVM | 256 | Large interval coalesce limit |
| twoaddr-reschedule | LLVM | -- | Coalesce copies by rescheduling in TwoAddress |
| copy_limit | NVPTX | runtime | Max copies to consider in NVPTX pass |
| coalesce_limit | NVPTX | runtime | Max merges before bailout in NVPTX pass |
| cssa-coalesce | NVPTX | -- | PHI operand coalescing |
| cssa-verbosity | NVPTX | -- | CSSA debug verbosity |
| block frequency flag | NVPTX | config | Weight copies by block hotness |
The copy_limit and coalesce_limit parameters are passed into sub_34AF4A0 at call time (not static cl::opt knobs). Their values come from the pass pipeline configuration and serve as compile-time budget caps to avoid quadratic worst-case behavior on functions with thousands of copies.
Impact on ptxas
The quality of CICC's coalescing directly affects ptxas's register allocation phase:
- Fewer virtual registers means a smaller interference graph for ptxas to color, reducing its compilation time.
- Eliminated copies reduce instruction count, giving ptxas's scheduler more freedom and fewer false dependencies.
- Preserved type invariants (no cross-class coalescing) ensure ptxas never encounters type-inconsistent register usage, which would require additional conversion instructions.
- Wide register pair tracking ensures tensor core instruction patterns remain intact -- ptxas expects specific register pair relationships for mma and wmma instructions.
A pathological case is over-aggressive coalescing that creates very long live ranges spanning many basic blocks. On NVPTX this does not cause spills (there is no physical register file to spill from), but it can increase ptxas's reported register usage, reducing occupancy. The coalesce_limit parameter and the large-interval frequency threshold exist partly to avoid this scenario.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| Main NVPTX coalescing driver | sub_34AF4A0 | 67KB | -- |
| Per-instruction coalesce attempt | sub_34AE060 | 28KB | -- |
| Pre-coalesce validation (opcode 14/15 check) | sub_34AB5C0 | 16KB | -- |
| Post-coalesce update (rewrite def-use chains) | sub_34AC810 | 19KB | -- |
| Constrained-copy validation variant | sub_34AD8B0 | 8.5KB | -- |
| Interference check | sub_34AA450 | 11.5KB | -- |
| Range rebuild (bitvector v90[12336]) | sub_34A46B0 | 13KB | -- |
| Interval equivalence verify | sub_34A2770 | 7.3KB | -- |
| Interval tree insert/rebalance (RB-tree) | sub_34A0610 | 14.7KB | -- |
| Register-to-interval hash lookup | sub_34A3910 | 2.7KB | -- |
| Build worklist from BB operand scan | sub_34A3D10 | 5KB | -- |
| Build worklist from instruction iteration | sub_34A41A0 | 4.8KB | -- |
| Block-level coalescing driver | sub_34BAAF0 | 31.7KB | -- |
| Live-out analysis + weight computation | sub_34B7280 | 22KB | -- |
| Per-register interference build | sub_34B6620 | 17.7KB | -- |
| Operand-type classification | sub_34961A0 | 26.6KB | -- |
| Register-pair decomposition | sub_3497B40 | 16.5KB | -- |
| Opcode -> copy-type mapping (switch) | sub_3494EA0 | 12.7KB | -- |
| Build coalesce candidate list | sub_349AB40 | 24.5KB | -- |
| Merged-interval representative lookup | sub_349D6E0 | -- | -- |
| Instruction position lookup/creation | sub_349FA50 | 7.1KB | -- |
| Interval tree destructor (variant A) | sub_349E330 | 4KB | -- |
| Interval tree destructor (variant B) | sub_349E500 | 4KB | -- |
| Interval tree destructor (variant C) | sub_349E6D0 | 4KB | -- |
| Interval tree destructor (variant D) | sub_349E8A0 | 4KB | -- |
| Interval info populate from instruction | sub_349F140 | 4.7KB | -- |
| Interval structure reset | sub_349F740 | 4KB | -- |
| Generic map cleanup (callback sub_349D600) | sub_34A2010 | -- | -- |
| Finalize coalescing metadata | sub_34A2530 | -- | -- |
| Commit merged intervals | sub_34AA090 | -- | -- |
| Secondary coalesce commit | sub_34A9A60 | -- | -- |
| Register info initializer | sub_35065A0 | -- | -- |
| Standard LLVM RegisterCoalescer | sub_2F71140 | 80KB | -- |
| RegisterCoalescer::getPassName | sub_2F60C50 | -- | -- |
Differences from Upstream LLVM
| Aspect | Upstream LLVM | CICC v13.0 |
|---|---|---|
| Number of passes | Single RegisterCoalescer pass handling COPY pseudo-instructions | Two passes in sequence: stock LLVM RegisterCoalescer (sub_2F71140) + NVPTX-specific coalescer (sub_34AF4A0) |
| Opcode coverage | Handles only TargetOpcode::COPY (generic copy pseudo) | NVPTX pass handles NVPTX copy instruction families in opcode range 440--503 that the generic pass does not recognize |
| Coalescing goal | Reduce physical register pressure to prevent spills | Strictly copy elimination (PTX has unlimited virtual registers); goal is fewer mov instructions in emitted PTX and smaller interference graphs for ptxas |
| Interference check | Standard LiveIntervals query | Custom interference check (sub_34AA450, 11.5 KB) with interval tree (red-black BST at sub_34A0610) for NVPTX register classes |
| Block-level coalescing | Part of the unified worklist | Separate block-level coalescing pass (sub_34BAAF0, 31.7 KB) processes copies within each block before cross-block coalescing |
| Operand classification | Generic operand handling | Custom operand type classification table (byte_444C4A0, 16-byte entries) maps NVPTX opcode families to copy semantics |
| Pass parameters | Standard runOnMachineFunction with no limits | Parameterized with explicit (copy_limit, coalesce_limit) bounds for compile-time control on large kernels |
Cross-References
- Register Allocation -- the greedy allocator that runs after coalescing; shares the register class table and interference hash pattern.
- Instruction Scheduling -- scheduling runs after RA and benefits from reduced copy count; MRPA pressure tracking is affected by coalescing decisions.
- LLVM Knobs -- full knob inventory including all coalescing-related flags.
- Code Generation -- pipeline ordering showing where coalescing fits relative to other machine passes.
Register Allocation
Prerequisites: Familiarity with NVPTX register classes, the GPU execution model (especially occupancy and register pressure), and Live Range Calculation. Understanding of Register Coalescing is helpful.
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
Upstream source: llvm/lib/CodeGen/RegAllocGreedy.cpp, llvm/lib/CodeGen/SplitKit.cpp, llvm/lib/CodeGen/RegisterCoalescer.cpp, llvm/lib/CodeGen/LiveRangeEdit.cpp (LLVM 20.0.0). NVPTX register class definitions: llvm/lib/Target/NVPTX/NVPTXRegisterInfo.td.
LLVM version note: CICC v13.0 ships two complete copies of RAGreedy (legacy PM at 0x1EC0400, new PM at 0x2F4C2E0). The new PM variant matches the LLVM 20 RAGreedyPass interface. The PriorityAdvisor/EvictionAdvisor infrastructure matches LLVM 15+ patterns. All NVPTX-specific behavior (pressure-driven allocation, -maxreg ceiling, occupancy-aware rematerialization) is layered on top of stock RAGreedy via TTI hooks and custom knobs.
NVPTX register allocation in CICC v13.0 operates under a fundamentally different model from CPU targets. PTX has no fixed physical register file -- registers are virtual (%r0, %r1, %f0, ...) and the hardware scheduler maps them to physical resources at launch time. The "physical register" concept in LLVM's greedy allocator maps to register pressure constraints rather than actual hardware registers, making the allocator pressure-driven rather than assignment-driven. The primary constraint is the -maxreg limit (default 70), which bounds total live registers across all classes to control occupancy on the SM.
| Greedy RA driver | sub_2F5A640 (466 lines) |
| selectOrSplit core | sub_2F49070 (82KB, 2,314 lines) |
| Live range splitting | sub_2F2D9F0 (93KB, 2,339 lines) |
| Register coalescing | sub_2F71140 (80KB, 2,190 lines) |
| Register info init (new) | sub_30590F0 |
| Register info init (old) | sub_2163AB0 |
| Allocation failure handler | sub_2F418E0 |
Dual Greedy RA Instances
CICC contains two complete copies of the Greedy Register Allocator infrastructure, corresponding to the legacy and new LLVM pass managers:
- Instance A (legacy, 0x1EC0400 region): registered through the old pass manager pipeline.
- Instance B (new, 0x2F4C2E0 region): registered through sub_2F504C0 as the factory function.
Both are registered under the pass name "Greedy Register Allocator" via RAGreedyPass (sub_2342890). The selectOrSplit entry point at sub_2F4BAF0 is a thin wrapper that redirects to sub_2F49070(this + 200, ...). A separate entry at sub_2F4BB00 handles the spill-or-split path with SplitEditor integration.
NVPTX Register Classes
CICC defines nine register classes plus one internal-only class. The complete register class table -- vtable addresses, PTX type suffixes, prefixes, encoded IDs, copy opcodes, and coalescing constraints -- is in Register Classes.
The classes are completely disjoint -- there is no cross-class interference. Each type lives in its own namespace: integer 32-bit values occupy %r registers, 32-bit floats occupy %f registers, and so on. Copy instructions are class-specific, with both same-class and cross-class opcodes dispatched by sub_2162350 (see the copy opcode table).
Greedy selectOrSplit -- Detailed Algorithm
Complexity. Let V = number of virtual registers, R = number of register units, and I = total MachineInstr count. The main allocation loop processes V virtual registers in priority order. For each VReg, selectOrSplit performs: (1) operand scanning in O(operands) with 40-byte stride, (2) interference scanning (scanInterference) in O(R) via the RegAllocMatrix, (3) assignment or eviction attempts in O(R) per candidate. The tryLastChanceRecoloring path is bounded by lcr-max-depth (default 5) and lcr-max-interf (default 8), giving O(8^5) = O(32768) per VReg in the absolute worst case -- though this path is rarely taken. Live range splitting (splitAroundRegion, 93KB) iterates segments in O(S) where S = number of live range segments, with interference analysis per segment in O(R). Overall: O(V * R) for the common case, O(V * R + V * 8^D) when last-chance recoloring is exercised at depth D. The interference cache's open-addressing hash map with 37 * reg provides O(1) amortized lookups. Spill cost computation (setupSpillCosts) is O(V * I_avg) where I_avg is average instructions per VReg's live range. On NVPTX, the completely disjoint register classes mean cross-class interference is zero, reducing the effective R to the per-class register count.
The core allocation algorithm (sub_2F49070, 82KB, 2,314 decompiled lines) follows LLVM's standard RAGreedy::selectOrSplit structure with NVPTX-specific adaptations for pressure-driven allocation. The following pseudocode is reconstructed from the decompiled binary and covers the key phases visible in the new-pass-manager instance.
Initialization (lines 381--484)
fn selectOrSplit(this: &mut RAGreedyState, VirtReg: &LiveInterval) -> PhysReg {
let TRI = this.TargetRegisterInfo;
let NumRegs = TRI[+44]; // total reg unit count
// --- RegUnitStates: per-register-unit state array ---
// Stored at this+1112, 4 bytes per unit.
// Values: 0 = free, 1 = interfering, 2 = reserved
this.RegUnitStates = alloc_zeroed(NumRegs * 4); // at this+1112
// --- Live-through bitvector ---
// Stored at this+736, one bit per register unit,
// packed into 64-bit words. Set bits mark units
// live across the entire interval.
let bv_words = (NumRegs + 63) / 64;
this.LiveThrough = alloc_zeroed(bv_words * 8); // at this+736
// --- Interference cache ---
// Open-addressing hash map at this+648/656/664.
// Key = register number (unsigned 32-bit)
// Hash = 37 * reg (mod table_capacity)
// Empty = 0xFFFFFFFF (-1), Tombstone = 0xFFFFFFFE (-2)
// Growth: when 4*(count+1) >= 3*capacity, double & rehash.
this.IntfCache.buckets = alloc_sentinel(initial_cap); // this+648
this.IntfCache.count = 0; // this+656
this.IntfCache.capacity = initial_cap; // this+664
...
The RegUnitStates array is the central per-unit bookkeeping structure for the entire allocation of a single virtual register. Each 4-byte slot tracks whether that register unit is free, already interfering with the current live range, or reserved by the target. The array is zeroed at the start of every selectOrSplit invocation and released at cleanup (lines 2192--2313).
The interference cache at this+648 is distinct from LLVM's standard InterferenceCache (allocated at 0x2C0 bytes via sub_2FB0E40 during driver setup). This per-invocation cache is a lightweight open-addressing map used to deduplicate interference queries within a single selectOrSplit call. The hash function 37 * reg is a small Knuth-style multiplicative hash chosen for speed over distribution quality -- adequate because register numbers are small consecutive integers.
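The cache's behavior can be modeled in a few lines. The sketch below assumes the recovered constants (the 37 * reg hash, 0xFFFFFFFF empty and 0xFFFFFFFE tombstone sentinels, growth when 4*(count+1) >= 3*capacity) plus linear probing, which the decompilation suggests but does not prove:

```python
# Hypothetical model of the per-invocation interference cache at this+648.
EMPTY, TOMBSTONE = 0xFFFFFFFF, 0xFFFFFFFE

class IntfCache:
    def __init__(self, capacity=16):
        self.buckets = [EMPTY] * capacity
        self.count = 0

    def _probe(self, reg):
        cap = len(self.buckets)
        i = (37 * reg) % cap          # Knuth-style multiplicative hash
        first_tomb = None
        while True:
            slot = self.buckets[i]
            if slot == reg:
                return i, True        # already present
            if slot == EMPTY:
                return (first_tomb if first_tomb is not None else i), False
            if slot == TOMBSTONE and first_tomb is None:
                first_tomb = i        # reuse first tombstone on insert
            i = (i + 1) % cap         # linear probing (assumed)

    def insert(self, reg):
        # grow & rehash when the load factor would reach 75%
        if 4 * (self.count + 1) >= 3 * len(self.buckets):
            live = [r for r in self.buckets if r < TOMBSTONE]
            self.buckets = [EMPTY] * (2 * len(self.buckets))
            self.count = 0
            for r in live:
                self.insert(r)
        i, present = self._probe(reg)
        if not present:
            self.buckets[i] = reg
            self.count += 1

    def contains(self, reg):
        return self._probe(reg)[1]
```

Because register numbers are small consecutive integers, the weak 37 * reg hash scatters them adequately even under linear probing.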
Operand Scanning (lines 690--1468)
The function walks every MachineOperand attached to the live range's segment list. Operands are stored in a flat array with a 40-byte stride per entry. The type byte at offset +0 of each operand classifies it:
| Type Byte | Meaning | Action |
|---|---|---|
| 0 | Virtual register | Check copyable/tied flags; record in VReg worklist |
| 12 | Register mask (call clobber) | Store pointer in regmask list at this+1176/1184 |
| other | Physical register | Mark in reserved bitvector; update RegUnitStates |
For each operand:
for op in VirtReg.operands(stride=40):
match op.type_byte:
0 => // virtual register
if op.reg & 0x80000000: // negative = virtual
check_copyable(op);
check_tied(op);
update needsRecoloringFlag; // v321
else:
mark_reserved(op.reg, RegUnitStates);
update hasPhysicalAssignment; // v323
12 => // regmask
append(this.regmask_list, op);
_ =>
mark_reserved(op.reg, RegUnitStates);
The 40-byte operand stride is wider than upstream LLVM's MachineOperand (typically 32 bytes) because CICC embeds an additional 8-byte field for NVPTX-specific metadata (likely the register class tag and a flags word). The scanning loop at line 690 uses v321 (needsRecoloringFlag) and v323 (hasPhysicalAssignment) as accumulator flags that gate later phases: if no virtual registers need work, the function returns early.
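A minimal model of this dispatch over a flat byte buffer, using the 40-byte stride and type byte at +0 from the recovered layout; the 32-bit register field at +4 is a hypothetical offset for illustration, not a recovered fact:

```python
import struct

STRIDE = 40  # recovered operand stride (upstream LLVM uses 32)

def scan_operands(buf: bytes):
    """Classify each operand entry by its type byte, as in the scanning loop."""
    vregs, regmask_offsets, phys = [], [], []
    for off in range(0, len(buf), STRIDE):
        type_byte = buf[off]
        reg = struct.unpack_from("<I", buf, off + 4)[0]  # assumed offset
        if type_byte == 0 and reg & 0x80000000:
            vregs.append(reg & 0x7FFFFFFF)      # virtual: high bit set
        elif type_byte == 12:
            regmask_offsets.append(off)         # call-clobber regmask
        else:
            phys.append(reg)                    # physical: mark reserved
    return vregs, regmask_offsets, phys
```

The two accumulator flags (v321/v323) correspond here to `vregs` and `phys` being non-empty; an early return fires when `vregs` is empty.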
Interference Processing via sub_2F43DC0 (lines 714--955)
After operand scanning, the allocator calls sub_2F43DC0 (scanInterference) to populate the interference cache:
scanInterference(this, VirtReg, &IntfCache);
// IntfCache now contains register units that conflict.
// Iterate the conflict list at this+1128/1136:
for conflict in IntfCache.entries():
if conflict.is_constrained:
// Tied operand or early-clobber -- must try eviction
result = tryEviction(conflict); // sub_2F48CE0
else:
// Normal overlap -- try simple direct assignment first
result = tryAssign(conflict); // sub_2F47B00
if result.success:
record_assignment(result.phys_reg);
break;
// else: continue to next candidate
sub_2F43DC0 is the interference scanner. It walks the RegAllocMatrix (set up by sub_3501A90 during driver init) to find live range overlaps. For each physical register unit that overlaps the current virtual register's live range, it inserts an entry into the interference cache using the 37 * reg hash. The scanner distinguishes between two conflict types:
- Constrained conflicts (tied operands, early-clobber, regmask kills) -- these route to sub_2F48CE0 (tryEviction), which attempts to evict the conflicting virtual register from its current assignment if the eviction cost is lower than the current candidate's spill weight.
- Normal conflicts -- these route to sub_2F47B00 (tryAssign), which attempts a simple recoloring without eviction.
Additional helper functions participate in this phase:
| Function | Role |
|---|---|
| sub_2F47200 | processConstrainedCopies -- handles operands where a COPY forced a specific register |
| sub_2F46530 | tryLastChanceRecoloring -- last-resort recoloring bounded by lcr-max-depth (default 5) and lcr-max-interf (default 8) |
| sub_2F46EE0 | rehashInterferenceTable -- grows/rehashes when load factor exceeds 75% |
| sub_2F424E0 | updateInterferenceCache -- inserts a newly discovered conflict |
| sub_2F42840 | markRegReserved -- marks a physical register as reserved in RegUnitStates |
The tryLastChanceRecoloring path (sub_2F46530) is the most expensive fallback. It recursively attempts to reassign conflicting registers, up to lcr-max-depth levels deep and considering at most lcr-max-interf conflicting live ranges at each level. The exhaustive-register-search flag bypasses both cutoffs, trading compile time for allocation quality.
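The bounded recursion can be sketched as follows. Only the two cutoffs mirror recovered defaults; `candidates` and `interferers` are hypothetical callbacks standing in for the allocation-order and RegAllocMatrix queries, and the rollback-on-failure structure is the generic last-chance-recoloring shape, not a line-by-line reconstruction:

```python
LCR_MAX_DEPTH = 5    # lcr-max-depth default
LCR_MAX_INTERF = 8   # lcr-max-interf default

def try_recolor(vreg, assignment, candidates, interferers, depth=0):
    """Recursively reassign conflicting vregs, bounded by depth/interference."""
    if depth >= LCR_MAX_DEPTH:
        return False
    for phys in candidates(vreg):
        conflicts = interferers(vreg, phys, assignment)
        if len(conflicts) > LCR_MAX_INTERF:
            continue                     # too many interferences: skip candidate
        saved = dict(assignment)
        assignment[vreg] = phys
        # every conflicting vreg must itself be recolorable one level deeper
        if all(try_recolor(c, assignment, candidates, interferers, depth + 1)
               for c in conflicts):
            return True
        assignment.clear()
        assignment.update(saved)         # roll back the failed attempt
    return False
```

The exhaustive-register-search flag would correspond to lifting both cutoffs, which makes the worst case exponential in the full conflict count rather than 8^5.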
Copy Coalescing Hints -- Kinds 20 and 21 (lines 1060--1163)
During operand scanning, the allocator identifies COPY-like instructions by checking the operand kind field. Two kind values trigger coalescing hint recording:
for op in VirtReg.operands(stride=40):
match op.kind:
20 => // direct COPY hint
record_hint(op.source_reg, op.dest_reg);
21 => // parent-chain COPY hint
if op.flags[+44] & (1 << 4): // "has parent" flag
// Walk up the parent live range chain
let parent = op.parent_LR;
while parent != null:
recordCoalescingHint(parent); // sub_2F41240
parent = parent.parent;
// Coalescing opportunities tracked at this+832/840
Kind 20 represents a simple register-to-register COPY where the source and destination should ideally receive the same physical register. Kind 21 is more complex: it indicates a COPY from a split sub-range that has a parent live range. The has parent flag at byte +44 bit 4 triggers a chain walk via sub_2F41240 (recordCoalescingHint), which records each parent in a coalescing hint list at this+832/840. The hint list is later consumed by sub_2F434D0 (collectHintInfo) during the allocation priority computation, biasing the allocator toward assigning the same physical register to the entire chain.
This is standard LLVM coalescing hint infrastructure, but on NVPTX it interacts with the complete class separation: hints only apply within a single register class, since cross-class coalescing is impossible.
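The kind-21 parent-chain walk is a plain linked-list traversal. A minimal sketch, where the `parent` link is a hypothetical attribute standing in for the recovered parent-live-range pointer:

```python
class LiveRange:
    """Toy live range node; only the parent link matters here."""
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

def record_chain_hints(lr, hints):
    """Walk up the parent chain, recording one hint per ancestor
    (mirrors the sub_2F41240 loop for kind-21 operands)."""
    node = lr.parent
    while node is not None:
        hints.append(node.name)
        node = node.parent
    return hints
```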
Virtual Register Assignment (lines 1005--1368)
After interference processing and copy hint collection, the function enters the main assignment loop:
for vreg in unassigned_vregs:
// Check the live-through bitvector at this+736
if is_live_through(vreg, this.LiveThrough):
// This vreg is live across the entire region -- expensive
result = tryLastChanceRecoloring(vreg); // sub_2F46530
else:
result = tryAssignFromHints(vreg);
if result.success:
recordAssignment(result); // sub_2F42240
refresh_operand_list(); // re-scan
else:
// Allocation failed for this vreg -- proceed to splitting
add_to_split_worklist(vreg);
The live-through bitvector at this+736 is the key data structure for this phase. A set bit indicates that the register unit is live from the beginning to the end of the current region, making it the hardest case for the allocator because there is no gap in which to insert a split point. These live-through ranges go directly to last-chance recoloring.
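The bitvector layout described above -- one bit per register unit, packed into 64-bit words, (NumRegs + 63) / 64 words total -- reduces to shift-and-mask arithmetic:

```python
class LiveThroughBV:
    """Sketch of the live-through bitvector at this+736:
    one bit per register unit, packed into 64-bit words."""
    def __init__(self, num_units):
        self.words = [0] * ((num_units + 63) // 64)

    def set(self, unit):
        self.words[unit >> 6] |= 1 << (unit & 63)

    def test(self, unit):
        return (self.words[unit >> 6] >> (unit & 63)) & 1 == 1
```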
Cleanup (lines 2192--2313)
The function releases the RegUnitStates array, clears the interference cache, frees the live-through bitvector, and returns 1 on success (physical register assigned) or 0 on failure (must spill).
Live Range Splitting -- Detailed Algorithm
The splitting engine (sub_2F2D9F0, 93KB, 2,339 lines) implements RAGreedy::splitAroundRegion with SplitAnalysis and SplitEditor integration. This is the largest single function in the register allocation cluster.
Segment Enumeration (40-byte stride, gap/sub-range flags)
The splitting engine iterates the live range's segment linked list using the same 40-byte stride as the operand scanner. Two flag bits in the segment header control splitting decisions:
| Flag | Location | Meaning |
|---|---|---|
| Gap flag | bit 2 of byte[0] | Segment has a gap before it (potential split point) |
| Sub-range flag | bit 3 of byte[44] | Segment is a sub-range of a larger interval |
fn splitAroundRegion(this: &mut SplitEditor, MF: &MachineFunction) {
let SubTarget = MF.vtable[+128];
let TRI = SubTarget.vtable[+200];
// Per-region loop -- worklist at this+320
for region in this.worklist:
// (a) Hash table init -- 16-byte entries per tracked register
clear_and_resize(this.region_hash, initial_cap=16);
// (b) Segment enumeration
let seg = region.first_segment;
while seg != null:
let is_gap = (seg[0] >> 2) & 1; // bit 2 of byte[0]
let is_subrange = (seg[44] >> 3) & 1; // bit 3 of byte[44]
if is_gap:
// Potential split point -- record in visit set
record_gap(seg, this.visit_set); // sub_C8CC70
if is_subrange:
// Chain through sub-ranges
process_subranges(seg);
seg = seg.next; // stride = 40 bytes
The gap flag is the primary signal for split point selection. When the allocator detects a gap between two live segments, it can insert a split there without introducing a new spill -- the value is simply not live during the gap, so the split editor can create two separate live ranges that each get a different physical register. The sub-range flag indicates that the segment belongs to a sub-register lane (e.g., the low half of an Int64Regs value), which requires special handling to avoid breaking the lane structure.
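The two flag tests reduce to single-bit extractions at the recovered offsets:

```python
def segment_flags(seg: bytes):
    """Extract the split-control flags from one segment entry:
    gap flag = bit 2 of byte[0], sub-range flag = bit 3 of byte[44]."""
    is_gap = bool((seg[0] >> 2) & 1)
    is_subrange = bool((seg[44] >> 3) & 1)
    return is_gap, is_subrange
```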
Copy Hint Detection and Local Splitting
For COPY instructions (kind values 68 and 0), the splitter extracts register pairs and builds a conflict set:
// (c) Copy hint detection
for inst in region.instructions:
if inst.kind == 68 || inst.kind == 0: // COPY variants
let (src, dst) = extract_reg_pair(inst); // operands at +32, stride 40, reg at +8
conflict_set.insert(src);
conflict_set.insert(dst);
// Try local split first
if tryLocalSplit(conflict_set): // sub_2F2A2A0
// Success -- materialize the new segments
materializeSplitSegment(); // sub_2FDF330
continue;
sub_2F2A2A0 (tryLocalSplit) attempts a low-cost split within a single basic block. On success, sub_2FDF330 inserts the new split segments into the live interval data structure. The result entries from a local split use a 24-byte stride, where byte +16 is a quality flag and dwords at +8/+12 are the start/end positions of the split segment.
Interference Analysis for Non-COPY Segments
For non-COPY segments, the splitting engine performs interference analysis using regmasks:
// (d) Interference analysis (lines 785-914)
for seg in region.non_copy_segments:
for op in seg.operands:
if op.is_def && op.flags[+3] & (1 << 4):
check_def_interference(op); // sub_2F28E80
// Regmask check -- type 12 operands
if op.type_byte == 12:
for entry in region_hash:
if bittest(op.mask_data[+24], entry.reg):
// Register killed by mask -- tombstone it
tombstone(entry); // set to -2
The _bittest operation on regmask data at offset +24 identifies which registers are killed by call clobber masks. Killed entries are tombstoned in the tracking hash table (sentinel value -2), removing them from further consideration.
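A sketch of the kill-and-tombstone step, taking the recovered logic at face value (bit set in the mask data means the register is killed) and eliding the +24 offset by passing the mask words directly; the tracked-entry map is a stand-in for the 16-byte-entry region hash:

```python
TOMBSTONE = 0xFFFFFFFE  # -2 sentinel in the tracking hash table

def apply_regmask(mask_words, tracked):
    """Tombstone every tracked register whose bit is set in the
    call-clobber mask (_bittest on 32-bit words)."""
    killed = []
    for reg in list(tracked):
        word, bit = reg >> 5, reg & 31
        if (mask_words[word] >> bit) & 1:
            tracked[reg] = TOMBSTONE   # removed from further consideration
            killed.append(reg)
    return killed
```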
Coalescing and Reassignment Dispatch
The splitting engine dispatches through vtable offsets for coalescing:
// (e) Coalescing / reassignment (lines 917-999)
if vtable[1064](this, region): // tryReassign
markRegUsed(result_reg); // sub_2E88E20
goto DONE;
if vtable[1072](this, region): // canRecolorVirtReg
markRegUsed(result_reg); // sub_2E88E20
goto DONE;
// Also try alternative local split via vtable[480]
vtable[480](this, region, &SmallVectorArgs);
The vtable-indirect calls at offsets [1064] and [1072] correspond to tryReassign and canRecolorVirtReg in upstream LLVM. The offset [480] call is a fallback local split strategy. On success, sub_2E88E20 (markRegUsed) updates the allocation state.
Register Pressure and the -maxreg Constraint
The real allocation constraint on NVPTX is not register scarcity but register pressure -- higher per-thread register usage reduces occupancy, directly impacting throughput through fewer warps available for latency hiding. The -maxreg CLI flag (parsed at sub_900130, stored at compilation context offset +1192) caps the total live register count. Duplicate -maxreg definitions produce the error: "libnvvm : error: -maxreg defined more than once" (sub_9624D0).
Concrete Occupancy Examples
The occupancy formula and cliff table are documented in the GPU Execution Model. Here the relevant values are shown for the -maxreg settings that the allocator targets:
| -maxreg | Regs/Warp | Warps (SM 8.0) | Occupancy | Warps (SM 9.0) | Occupancy |
|---|---|---|---|---|---|
| 32 | 1,024 | 64 | 100% | 48 | 100% |
| 64 | 2,048 | 32 | 50% | 32 | 67% |
| 96 | 3,072 | 21 | 33% | 21 | 44% |
| 128 | 4,096 | 16 | 25% | 16 | 33% |
| 192 | 6,144 | 10 | 16% | 10 | 21% |
| 255 | 8,160 | 8 | 13% | 8 | 17% |
The -maxreg flag sets the ceiling, and the remat infrastructure aggressively reduces pressure below the nearest cliff to avoid losing an entire warp slot.
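The table's arithmetic reduces to a register-file division. The sketch below reproduces both columns under the parameters the table appears to assume (65,536 registers per SM; warp ceilings of 64 for SM 8.0 and 48 for SM 9.0); these hardware constants are illustrative assumptions, not values recovered from the binary:

```python
def occupancy(maxreg, regfile=65536, max_warps=64, warp_size=32):
    """Warps resident per SM and occupancy % for a given -maxreg ceiling."""
    regs_per_warp = maxreg * warp_size
    warps = min(max_warps, regfile // regs_per_warp)
    pct = int(100 * warps / max_warps + 0.5)   # round half up, as the table does
    return warps, pct
```

The cliff structure is visible directly: occupancy drops in integer warp steps, so a one-register reduction only pays off when it crosses a regs_per_warp boundary.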
The remat-for-occ knob (default 120) encodes an occupancy target. When set, the IR-level rematerialization pass (sub_1CE7DD0) calls sub_1C01730 to compute an occupancy-based register target. The heuristic applies a scale factor: if the computed occupancy level exceeds 4, it multiplies the target by 3/2 (effectively allowing more registers when occupancy is already high). If the result still exceeds the ceiling, it applies target = 2*target/3 as a tighter bound.
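That two-step adjustment can be sketched as below. How the base target and occupancy level are derived inside sub_1C01730 is not fully recovered, so both are plain parameters here; integer division mirrors typical compiled arithmetic:

```python
def adjust_remat_target(base_target, occ_level, ceiling):
    """Occupancy-based register target heuristic, per the recovered logic:
    scale by 3/2 above occupancy level 4, then tighten with 2/3 if the
    scaled target overshoots the ceiling."""
    target = base_target
    if occ_level > 4:
        target = target * 3 // 2   # allow more registers at high occupancy
    if target > ceiling:
        target = 2 * target // 3   # tighter bound back toward the ceiling
    return target
```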
ptxas Register Allocation Knobs
In addition to cicc's LLVM-side allocator, ptxas has its own register allocation stage with 72+ dedicated knobs. These are independent of the LLVM greedy allocator and operate on the ptxas-internal IR after PTX parsing:
| ptxas Knob | Description |
|---|---|
| RegAllocRematEnable | Enable ptxas-level rematerialization |
| RegAllocEnableOptimizedRemat | Use optimized remat algorithm |
| RegAllocSpillForceXBlockHoistRefill | Force cross-block spill hoist/refill |
| RegAllocSpillValidateDebug | Validate spill code in debug builds |
| RegAllocDebugConflictDetails | Print conflict details during allocation |
| RegAllocPrintDetails | Print allocation decisions |
| RegAllocPerfDiffBackoff | Back off allocation when perf difference is small |
| RegAllocPerfDiffBackoffBegin/End | Range for perf backoff |
| CTAReconfigMaxRegAlloc | Max registers for CTA reconfiguration |
| MaxRegsForMaxWarp | Register ceiling for maximum warp occupancy |
| RegTgtSelHigherWarpCntHeur | Heuristic favoring higher warp count |
| RegTgtSelLowerWarpCntHeur | Heuristic favoring lower warp count |
| CommonCrossBlockRegLimit | Cross-block register usage limit |
| DisableHMMARegAllocWar | Disable HMMA register allocation workaround |
These ptxas knobs are accessed via nvcc -Xptxas "--knob KnobName=Value". The MaxRegsForMaxWarp and RegTgtSel* knobs directly implement the occupancy-aware allocation strategy at the ptxas level, complementing cicc's -maxreg ceiling.
NVIDIA Rematerialization Knobs (cicc)
NVIDIA provides an extensive set of custom rematerialization knobs to reduce pressure below the target threshold:
| Knob | Default | Description |
|---|---|---|
| nv-remat-default-max-reg | 70 | Default maximum register target |
| nv-remat-max-times | 10 | Max rematerialization iterations |
| nv-remat-block-single-cost | 10 | Single live pull-in cost limit |
| nv-remat-block-max-cost | 100 | Max clone cost for reducing one live |
| nv-remat-block-loop-cost-factor | 20 | Loop body cost scaling factor |
| nv-remat-block-liveout-min-percentage | 70 | Minimum live-out percentage for block remat |
| nv-remat-block-map-size-limit | 6 | Map size limit for block-level remat |
| nv-remat-block-load-cost | 10 | Load cost in Remat Machine Block |
| nv-remat-threshold-for-spec-reg | 20 | Threshold for special register remat |
| load-remat | (flag) | Enable load rematerialization |
| no-mi-remat | (flag) | Disable MI remat for specific functions |
The greedy allocator itself has additional tuning knobs:
| Knob | Default | Description |
|---|---|---|
| split-spill-mode | 1 | 0=default, 1=size, 2=speed |
| lcr-max-depth | 5 | Last chance recoloring max depth |
| lcr-max-interf | 8 | Last chance recoloring max interferences |
| exhaustive-register-search | (flag) | Bypass LCR depth/interference cutoffs |
| enable-deferred-spilling | (flag) | Defer spill code to end of allocation |
| grow-region-complexity-budget | 10000 | growRegion() edge budget |
| split-threshold-for-reg-with-hint | 75 | Split threshold percentage |
Additional rematerialization knobs registered separately include do-remat (default 3), remat-maxreg-ceiling (default 0), remat-single-cost-limit (default 6000), remat-loop-trip (default 20), and remat-for-occ (default 120, targeting higher occupancy).
Spill Cost Computation
Spill costs are computed during driver initialization by sub_2FAD5E0 (step 5 of the driver sequence), which calculates VirtRegAuxInfo spill weights for every virtual register before the main allocation loop begins. The spill weight determines priority in the allocation queue and eviction decisions.
On NVPTX, "spilling" is a misnomer because PTX has no stack spill in the traditional CPU sense -- a spilled value either gets rematerialized (re-computed from inputs) or written to local memory (per-thread DRAM-backed memory, orders of magnitude slower than registers). The cost model therefore heavily penalizes local memory spills and strongly favors rematerialization.
The PriorityAdvisor (looked up via global dword_5023AC8) determines the order in which virtual registers enter the allocation queue. The EvictionAdvisor (looked up via dword_5023BA8) determines when to evict a lower-priority register to make room for a higher-priority one. Both advisors are initialized via vtable [24] calls during driver setup and can be customized via the regalloc-evict and regalloc-priority analysis passes registered in the pipeline parser.
Allocation Failure Handler (sub_2F418E0) -- Three Error Paths
When physical register assignment fails (sub_2F418E0), three error paths exist:
Path 1: Empty Allocation Order
"no registers from class available to allocate"
The register class has zero allocatable registers. This can happen for the internal-only class (off_4A026E0) if the target configuration excludes all environment registers. Diagnostic emitted via sub_B6EB20 (DiagnosticHandler).
Path 2: All Registers Occupied
"ran out of registers during register allocation"
The allocation order exists but all registers are occupied/interfering. This fires when the eviction/split pipeline exhausts all options -- the sequence is: tryAssign -> tryEviction -> tryLastChanceRecoloring -> trySplit -> fail. Uses sub_B2BE50 for source location, sub_B157E0 for DebugLoc, and sub_B158E0 for diagnostic formatting.
Path 3: Inline Assembly Overflow
"inline assembly requires more registers than available"
Special handling for inline asm operands (kind values 1--2 at offset +68). Inline assembly can specify explicit register constraints that consume all available registers in a class, leaving nothing for surrounding code.
FailedRegAlloc Flag
All three paths set the FailedRegAlloc flag (bit 10 in MachineFunction properties, sub_2E78A80). This flag allows downstream passes to handle the failure gracefully rather than crashing. Passes that check this flag can skip optimization or emit degraded but correct code.
The RAGreedy Driver
The top-level driver (sub_2F5A640) orchestrates the full allocation pass:
1. Store MachineFunction at a1[96], retrieve SubTarget (vtable +128).
2. Optional debug dump: "Before greedy register allocator".
3. sub_35B4B20 -- calculate register class info.
4. sub_2F55040 -- check if any virtual registers need allocation.
5. sub_2FAD5E0 -- set up spill costs.
6. sub_2F54D60 -- compute live intervals.
7. Query vtable +328 for getRegPressureSetLimit (stored at a1[3633]).
8. Look up EvictionAdvisor (dword_5023BA8) and PriorityAdvisor (dword_5023AC8) via std::map lookups.
9. Initialize advisors via vtable [24].
10. Allocate InterferenceCache (0x2C0 bytes, sub_2FB0E40).
11. Allocate SplitAnalysis (0x738 bytes, sub_2FB1ED0).
12. sub_3501A90 -- set up RegAllocMatrix.
13. Initialize PhysRegEntries array (32 entries, 144-byte stride).
14. sub_2F55730 -- reset priority queue.
15. sub_35B5380 -- seed queue from virtual registers.
16. sub_2F58C00 -- main allocation loop.
17. Optional debug dump: "Before post optimization".
18. Post-allocation optimization via vtable [24].
19. sub_2F5A580, sub_2F50510 -- finalize.
Differences from Upstream LLVM
The following table summarizes where CICC's register allocator diverges from upstream LLVM 20.0.0 RAGreedy:
| Aspect | Upstream LLVM 20 | CICC v13.0 |
|---|---|---|
| Primary constraint | Fixed physical register set (CPU ISA-defined) | Pressure ceiling via -maxreg; no fixed physical registers |
| Register classes | Often overlapping (e.g., GR32 is a subset of GR64 on x86) | 9 completely disjoint classes; no cross-class interference |
| Spill destination | Stack frame (cheap, L1/L2 latency) | Local memory (DRAM-backed, 100x+ latency) or rematerialization |
| Rematerialization | LLVM built-in MachineInstr::isRematerializable() | Massive custom infrastructure: 11+ nv-remat-* knobs, separate IR-level remat pass (sub_1CE7DD0), iterative pressure reduction loop |
| Occupancy awareness | None -- CPU has no occupancy concept | remat-for-occ (default 120) drives occupancy-targeted register reduction; MaxRegsForMaxWarp ptxas knob |
| Interference cache hash | Standard LLVM DenseMap with (ptr >> 4) ^ (ptr >> 9) | Custom open-addressing map with 37 * reg hash, -1/-2 sentinels |
| Operand stride | 32 bytes (MachineOperand size) | 40 bytes (8-byte NVPTX extension for class tag + flags) |
| Dual pass manager | Single implementation used by both old and new PM | Two complete copies: Instance A at 0x1EC0400, Instance B at 0x2F4C2E0 |
| Register encoding | LLVM MCRegister (16-bit class + index) | 32-bit: 4-bit class tag in [31:28], 28-bit index in [27:0] |
| Spill weight formula | length / (spill_cost * block_freq) | Same formula, but cost model penalizes local memory heavily; rematerialization candidates get near-zero weight |
| Last-chance recoloring | Same knobs, but rarely critical | Frequently exercised due to tight -maxreg ceilings; exhaustive-register-search flag more relevant |
| Post-RA remat | Minimal | ptxas performs a second register allocation with its own 72+ knobs (RegAllocRematEnable, etc.) |
| Splitting strategy | Region-based splitting (splitAroundRegion) | Same algorithm, but gap flag (bit 2) and sub-range flag (bit 3) in 40-byte segment entries use NVPTX-specific encoding |
| Callee-saved registers | CSR-first-time-cost matters for ABI compliance | NVPTX has no callee-saved convention; regalloc-csr-first-time-cost is effectively dead code |
| Debug strings | "Before greedy register allocator" | Same string, but emitted conditionally on unk_503FCFD (a debug flag at a fixed BSS address) |
What Upstream LLVM Gets Wrong for GPU
Upstream LLVM's register allocation framework was designed for CPU targets where the register file is a fixed, small, physically-interfering resource. Every core assumption breaks on NVPTX:
- Upstream assumes spills are cheap (L1/L2 latency). On x86/AArch64, a spill is a store to the stack frame backed by L1 cache (3-5 cycles). On GPU, a "spill" writes to local memory backed by device DRAM at 200-800 cycle latency. This 40-160x penalty makes rematerialization nearly always preferable to spilling, which is why NVIDIA ships 11+ custom nv-remat-* knobs and an iterative remat loop that has no upstream equivalent.
- Upstream assumes a fixed physical register set with cross-class interference. CPU ISAs have a static register file (e.g., 16 GPRs on x86-64) where GR32 is a sub-register of GR64 and allocating one constrains the other. NVPTX has no fixed register count and its nine register classes are completely disjoint -- allocating %r5 (Int32Regs) never conflicts with %f5 (Float32Regs). The entire interference-graph framework is solving the wrong problem.
- Upstream has no concept of occupancy. CPU register allocation never reduces parallelism -- a function uses N registers and that is the end of the story. On GPU, every additional register per thread can cross an occupancy cliff, losing an entire warp group and halving throughput. The allocator must minimize pressure to a target, not just avoid running out of registers.
- Upstream assumes one allocation pass produces the final assignment. On CPU, LLVM's greedy RA emits final machine code. On NVPTX, cicc's allocator emits PTX with virtual registers bounded by -maxreg, and then ptxas performs an entirely separate second allocation pass with its own 72+ knobs to map virtual PTX registers to hardware resources. The LLVM allocator is half the pipeline, not the whole thing.
- Upstream's callee-saved register convention is irrelevant. CPU ABIs define callee-saved sets (e.g., rbx, rbp on SysV x86-64) that the allocator must respect. NVPTX has no callee-saved convention at all -- there is no hardware call stack for registers. The regalloc-csr-first-time-cost knob is dead code on this target.
Common Pitfalls
These are mistakes a reimplementor is likely to make when building a register allocator for an NVPTX-like GPU target.
1. Treating register allocation as an assignment problem instead of a pressure problem. On CPU targets, the allocator must map N virtual registers to K physical registers, and the problem is coloring a fixed interference graph. On NVPTX, there is no fixed physical register file -- PTX registers are virtual and unlimited. The real constraint is the -maxreg ceiling, which controls occupancy. A reimplementation that tries to assign physical registers will produce correct but meaningless output; the correct approach is to minimize peak live register count below the -maxreg threshold, and let ptxas handle the final hardware mapping.
2. Ignoring occupancy cliffs when setting the register target. Going from 64 to 65 registers per thread crosses an occupancy cliff that drops the number of active warps on SM 8.0 from 32 (50% occupancy) to 21 (33%). A reimplementation that treats the register ceiling as a hard binary constraint (under = good, over = bad) will miss the fact that reducing from 65 to 64 registers is worth enormous effort (roughly a 50% gain in available warps), while reducing from 63 to 62 is nearly worthless. The remat-for-occ knob (default 120) exists specifically to drive rematerialization toward the nearest cliff boundary, not just toward the ceiling.
3. Using CPU-calibrated spill costs. On x86, a spill is a store to L1-cached stack memory at 3-5 cycle latency. On GPU, a "spill" writes to per-thread local memory backed by device DRAM at 200-800 cycle latency -- a 40-160x penalty. A reimplementation that uses upstream LLVM's default spill cost formula without recalibrating for GPU memory latency will spill aggressively when it should rematerialize. NVIDIA's 11+ nv-remat-* knobs and the iterative rematerialization loop exist because rematerialization is almost always cheaper than spilling on GPU.
4. Assuming cross-class register interference exists. NVPTX's nine register classes are completely disjoint: Int32Regs (%r) never conflicts with Float32Regs (%f), Int64Regs (%rd) never conflicts with Float64Regs (%fd), and so on. A reimplementation that builds a global interference graph spanning all classes will waste significant compile time computing interference relationships that are always empty. The correct approach is per-class allocation with independent pressure tracking.
5. Forgetting that cicc's allocation is only half the pipeline. The LLVM greedy allocator in cicc emits PTX with virtual registers bounded by -maxreg. Then ptxas performs an entirely separate second allocation pass with its own 72+ knobs to map virtual PTX registers to hardware resources. A reimplementation that tries to produce final hardware register assignments at the LLVM level is solving the wrong problem -- the output should be well-pressure-managed virtual registers, not hardware assignments.
Diagnostic Strings
Diagnostic strings recovered from the register allocation binary region (p2c.5-01-register-alloc.txt) and the rematerialization passes (p2b.2-01-remat-ir.txt, p2b.2-02-remat-machine.txt).
Allocation Failure Diagnostics
| String | Source | Category | Trigger |
|---|---|---|---|
"no registers from class available to allocate" | sub_2F418E0 path 1 | Error | Register class has zero allocatable registers; emitted via sub_B6EB20 (DiagnosticHandler) |
"ran out of registers during register allocation" | sub_2F418E0 path 2 | Error | All registers occupied/interfering after tryAssign -> tryEviction -> tryLastChanceRecoloring -> trySplit exhausted |
"inline assembly requires more registers than available" | sub_2F418E0 path 3 | Error | Inline asm explicit register constraints consume all available registers in a class |
"libnvvm : error: -maxreg defined more than once" | sub_9624D0 | Error | Duplicate -maxreg CLI flag definitions |
Debug/Trace Diagnostics
| String | Source | Category | Trigger |
|---|---|---|---|
"Before greedy register allocator" | sub_2F5A640 step 2 | Debug | Conditional on unk_503FCFD debug flag |
"Before post optimization" | sub_2F5A640 step 17 | Debug | Post-allocation debug dump |
"Before register coalescing" | sub_2F60C50 | Debug | Register coalescer debug dump |
"After register coalescing" | sub_2F60C50 | Debug | Register coalescer debug dump |
Rematerialization Diagnostics (nv-remat-block)
| String | Source | Category | Trigger |
|---|---|---|---|
"Skip machine-instruction rematerialization on <name>" | sub_1CE7DD0 region | Debug | Function name matches no-mi-remat skip list |
"Max-Live-Function(<num_blocks>) = <max_live>" | remat-block step 10 | Debug | Reports maximum live register count across all blocks |
"live-out = <count>" | remat-block step 7 | Debug | Per-block live-out register count |
"Pullable: <count>" | remat-block step 5 | Debug | Number of pullable (rematerializable) instructions |
"Total Pullable before considering cost: <count>" | remat-block step 8 | Debug | Total pullable candidates before cost filtering |
"Really Final Pull-in: <count> (<total_cost>)" | remat-block step 11 | Debug | Final rematerialization candidate count and total cost |
"After pre-check, <N> good candidates, <M> given second-chance" | remat two-phase selection | Debug | Two-phase candidate selection with second-chance |
"ADD <N> candidates from second-chance" | remat two-phase selection | Debug | Candidates recovered from second-chance pass |
"\treplaced" | remat code emission | Debug | Rematerialized instruction replacement confirmation |
Pass Registration Strings
| String | Source |
|---|---|
"Greedy Register Allocator" | Pass name for both Instance A (0x1EC0400) and Instance B (0x2F4C2E0) |
"Register Coalescer" | sub_2F60C50 pass registration |
"nv-remat-block" | ctor_361_0 at 0x5108E0 -- machine-level remat pass registration |
"Legacy IR Remat" | sub_1CE7DD0 region -- IR-level remat pass display name |
"nvvmrematerialize" | IR-level remat pass pipeline ID |
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| RAGreedy::runOnMachineFunction | sub_2F5A640 | -- | Top-level driver (466 lines) |
| RAGreedy::selectOrSplit | sub_2F49070 | -- | Core allocator (82KB, 2,314 lines) |
| selectOrSplit thunk | sub_2F4BAF0 | -- | Redirects to sub_2F49070(this+200) |
| selectOrSplit + SplitEditor | sub_2F4BB00 | -- | Spill-or-split path |
| SplitEditor::splitAroundRegion | sub_2F2D9F0 | -- | Live range splitting (93KB) |
| tryLocalSplit | sub_2F2A2A0 | -- | Local split within single BB |
| materializeSplitSegment | sub_2FDF330 | -- | Insert split segments |
| scanInterference | sub_2F43DC0 | -- | Populate interference cache |
| tryAssign | sub_2F47B00 | -- | Simple assignment path |
| tryEviction | sub_2F48CE0 | -- | Evict conflicting VReg |
| tryLastChanceRecoloring | sub_2F46530 | -- | Recursive recoloring fallback |
| processConstrainedCopies | sub_2F47200 | -- | Handle tied-operand COPYs |
| rehashInterferenceTable | sub_2F46EE0 | -- | Interference cache rehash |
| rehashCoalescingTable | sub_2F46A90 | -- | Coalescing hint table rehash |
| markRegReserved | sub_2F42840 | -- | Mark unit as reserved |
| recordAssignment | sub_2F42240 | -- | Record successful assignment |
| updateInterferenceCache | sub_2F424E0 | -- | Insert conflict entry |
| recordCoalescingHint | sub_2F41240 | -- | Record parent-chain hint |
| collectHintInfo | sub_2F434D0 | -- | Gather all hints for priority |
| assignRegFromClass | sub_2F418E0 | -- | Allocation failure handler |
| hasVRegsToAllocate | sub_2F55040 | -- | Pre-flight check |
| computeLiveIntervals | sub_2F54D60 | -- | Build live interval data |
| resetPriorityQueue | sub_2F55730 | -- | Clear and re-init queue |
| mainAllocationLoop | sub_2F58C00 | -- | Per-VReg dispatch loop |
| finalize | sub_2F50510 | -- | Post-allocation cleanup |
| setupSpillCosts | sub_2FAD5E0 | -- | Compute VirtRegAuxInfo weights |
| InterferenceCache::init | sub_2FB0E40 | -- | Allocate 0x2C0-byte cache |
| SplitAnalysis::init | sub_2FB1ED0 | -- | Allocate 0x738-byte analysis |
| setupRegAllocMatrix | sub_3501A90 | -- | Build the global interference matrix |
| calculateRegClassInfo | sub_35B4B20 | -- | Pre-compute class sizes/orders |
| seedQueueFromVRegs | sub_35B5380 | -- | Initial queue population |
| RegisterCoalescer::runOnMachineFunction | sub_2F71140 | -- | Register coalescing (80KB) |
| printMachineProperties | sub_2E78A80 | -- | Includes FailedRegAlloc flag |
| encodeVirtualReg | sub_21583D0 | -- | `CLASS_BITS \| ...` VReg encoding |
| emitCopyInstruction | sub_2162350 | -- | Class-specific copy opcodes |
Reimplementation Checklist
- Pressure-driven allocation model. Replace the standard assignment-to-physical-registers model with a pressure-tracking model: PTX registers are virtual, so the allocator must track and bound total live register count per class against the `-maxreg` ceiling (default 70) rather than assigning to a finite physical register set.
- Nine disjoint register classes. Define the nine NVPTX register classes (Int1Regs, Int16Regs, Int32Regs, Int64Regs, Float32Regs, Float64Regs, Int16HalfRegs, Int32HalfRegs, Int128Regs) with complete cross-class disjointness -- no interference between classes, class-specific copy opcodes, and per-class pressure tracking.
- Greedy selectOrSplit with NVPTX adaptations. Implement the core allocation loop: per-unit RegUnitStates array (free/interfering/reserved), interference cache with `37 * reghash` hashing, 40-byte-stride operand scanning, copy coalescing hints (kinds 20/21), and live-through bitvector for detecting worst-case live ranges.
- Live range splitting with SplitKit. Implement `splitAroundRegion` (93KB equivalent): identify split points at block boundaries and within blocks, create sub-ranges with new virtual registers, insert copies at split points, and update the interference cache.
- Eviction and last-chance recoloring. Implement `tryEviction` (compare spill weights to decide whether evicting a conflicting VReg is cheaper) and `tryLastChanceRecoloring` (recursive reassignment bounded by `lcr-max-depth=5` and `lcr-max-interf=8`).
- Occupancy-aware spill cost computation. Weight spill costs by occupancy impact: spills to local memory (device DRAM, 200--800 cycle latency) must account for the GPU-specific penalty, and the register ceiling must respect occupancy cliff boundaries.
- Dual pass manager instances. Register the allocator for both legacy and new pass managers, ensuring both instances share the same NVPTX-specific hooks (custom rematerialization interaction, pressure-driven priority queues, `maxreg` enforcement).
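To make the first checklist item concrete, here is a minimal sketch of a per-class pressure tracker bounded by a `-maxreg`-style ceiling. The class names and the default of 70 come from this page; the `PressureTracker` type and its methods are hypothetical, not recovered from the binary.

```cpp
#include <array>
#include <cstdint>

// Hypothetical sketch: bound live-register pressure per class against a
// -maxreg-style ceiling instead of assigning to physical registers.
enum RegClass { Int1, Int16, Int32, Int64, Float32, Float64,
                Int16Half, Int32Half, Int128, NumClasses };

struct PressureTracker {
    std::array<std::uint32_t, NumClasses> live{}; // current live count per class
    std::uint32_t maxreg = 70;                    // -maxreg default per this page

    // Returns false when one more live register of this class would exceed
    // the ceiling -- the caller must then spill, split, or rematerialize.
    bool tryAddLive(RegClass rc) {
        if (live[rc] + 1 > maxreg) return false;
        ++live[rc];
        return true;
    }
    void endLive(RegClass rc) { if (live[rc]) --live[rc]; }
};
```

Because the nine classes are fully disjoint, hitting the ceiling in `Int32Regs` says nothing about `Float32Regs` -- each class carries its own counter.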
Architectural Uniqueness
NVPTX's register allocation differs from all other LLVM targets in several fundamental ways:
- Unlimited virtual registers: PTX has no fixed register count. The allocator manages pressure, not assignment to a finite set of physical registers.
- Complete class separation: The nine register classes are fully disjoint. An `Int32Regs` allocation never conflicts with a `Float32Regs` allocation.
- Pressure as the primary constraint: The `-maxreg` ceiling and NVIDIA's custom rematerialization infrastructure (`nv-remat-*` knobs) exist specifically to control occupancy, which has no equivalent in CPU register allocation.
- Two-stage allocation: cicc performs LLVM greedy RA to emit PTX with virtual registers bounded by `-maxreg`, then ptxas performs a second allocation pass with its own 72+ knobs to map virtual PTX registers to hardware resources.
- Dual implementation: Two complete RA copies exist (old at `0x1E*`--`0x1F*`, new at `0x2F*`--`0x35*`), one per pass manager generation.
ptxas Interaction
Register allocation in cicc is the first of two allocation stages. cicc's greedy RA assigns virtual PTX registers (%r0, %f3, etc.) bounded by the -maxreg ceiling to control occupancy, but these are not hardware registers -- they are symbolic names in the PTX text. ptxas then performs its own complete register allocation pass, mapping cicc's virtual registers onto the SM's physical register file (e.g., 255 32-bit registers per thread on SM 80+). ptxas has 72+ RA-related knobs (RegAllocScheme, DynamicRegAlloc, RegUsageOpt, etc.) and may split, coalesce, or spill registers differently than cicc anticipated. The -maxreg value cicc enforces serves as a hint to ptxas about the desired occupancy target, but ptxas makes the final hardware binding decision.
PrologEpilogInserter & Frame Layout
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
NVIDIA GPUs have no hardware stack pointer. There is no push, no pop, no %rsp — the entire concept of a "stack frame" is a compiler fiction. When a CUDA kernel needs local storage (spill slots, alloca, local arrays), cicc allocates a byte array called __local_depot in PTX .local address space and computes all offsets at compile time. The PrologEpilogInserter (PEI) pass is responsible for this: it takes abstract MachineFrameInfo frame indices produced by register allocation and earlier lowering, assigns concrete byte offsets within the depot, emits the two-instruction prologue that sets up the %SP/%SPL pseudo-registers, and rewrites every frame-index operand in the MachineFunction to [%SP + offset] form. At 68 KB and ~2,400 decompiled lines, cicc's PEI is a heavily modified monolith — the upstream open-source NVPTX backend replaces LLVM's standard PEI with a stripped-down 280-line NVPTXPrologEpilogPass that handles only offset calculation and frame-index elimination. cicc restores and extends nearly all of the standard PEI's functionality: callee-saved register handling, register scavenging, bitmap-based frame packing, categorized layout ordering, and a stack-size diagnostic system.
| Property | Value |
|---|---|
| Binary address | sub_35B1110 (0x35B1110) |
| Binary size | 68,332 bytes (~2,388 decompiled lines) |
| Pass identity | PrologEpilogInserter::runOnMachineFunction |
| Pass position | Post-register-allocation, before NVPTXPeephole |
| Stack frame | 0x490 bytes of local state (~400 variables) |
| Upstream equivalent | NVPTXPrologEpilogPass (280 lines) + NVPTXFrameLowering (101 lines) |
| Key strings | "warn-stack-size", "stack frame size" |
| Knobs | warn-stack-size (function attribute), nvptx-short-ptr, nv-disable-mem2reg |
The GPU "Stack" Model
__local_depot: The Frame Array
Every PTX function that needs local storage declares a .local byte array:
.local .align 16 .b8 __local_depot0[256];
This is the entire "stack frame." The alignment value is the maximum alignment of any object in the frame. The size is the total frame size computed by PEI. The suffix number (0, 1, ...) is the function index within the module.
There is no call stack in the CPU sense. GPU threads have a fixed local memory allocation (typically 512 KB per thread on modern architectures). The .local directive reserves a region within this per-thread memory. Recursive functions and dynamic allocations are legal in PTX but the driver/ptxas resolves their addresses — cicc only needs to produce a statically-sized depot for each function's fixed-size locals.
%SP and %SPL: The Two Frame Pseudo-Registers
PTX declares two pseudo-register pairs for frame access:
.reg .b64 %SP; // generic address space pointer to the frame
.reg .b64 %SPL; // local address space (AS 5) pointer to the frame
In 32-bit mode these are .reg .b32. The distinction exists because NVIDIA GPUs use address space qualification:
- `%SPL` (Stack Pointer Local) — points directly into the `.local` address space (PTX address space 5). Loads/stores using `%SPL` emit `ld.local`/`st.local` instructions, which ptxas can optimize for the L1 cache local-memory path. This is the efficient pointer.
- `%SP` (Stack Pointer) — a generic address space pointer obtained by converting `%SPL` via `cvta.local`. Loads/stores using `%SP` go through generic address resolution, which adds a TLB lookup to determine the address space at runtime. This is slower but required when the address escapes to code that expects generic pointers (e.g., passing a local variable's address to a called function).
The prologue sequence is:
mov.u64 %SPL, __local_depot0; // MOV_DEPOT_ADDR_64
cvta.local.u64 %SP, %SPL; // cvta_local_64
The cvta.local (Convert Address) instruction is the key: it takes a .local pointer and produces the equivalent generic-space pointer. When nvptx-short-ptr is enabled, %SPL is 32 bits (sufficient for the per-thread local memory window, always < 4 GB) while %SP may still be 64 bits on 64-bit targets.
Upstream's NVPTXFrameLowering::emitPrologue implements this directly. It checks MachineRegisterInfo::use_empty for each register — if %SP has no uses, it skips the cvta.local; if %SPL has no uses, it skips the mov.depot. The NVPTXPeephole pass runs immediately after PEI and rewrites LEA_ADDRi64 %VRFrame64, offset followed by cvta_to_local_64 into LEA_ADDRi64 %VRFrameLocal64, offset, eliminating the generic-to-local conversion when the address stays in local space.
Frame Index Resolution
During instruction selection and register allocation, local memory references use abstract frame indices: %stack.0, %stack.1, etc. Each maps to a MachineFrameInfo frame object with a size, alignment, and (after PEI) a byte offset.
Frame-index elimination in upstream is simple — NVPTXRegisterInfo::eliminateFrameIndex replaces the frame-index operand with VRFrame (which prints as %SP) and sets the immediate offset:
MI.getOperand(FIOperandNum).ChangeToRegister(getFrameRegister(MF), false);
MI.getOperand(FIOperandNum + 1).ChangeToImmediate(Offset);
The VRDepot physical register (prints as %Depot internally) serves as the canonical frame base in getFrameIndexReference. For debug info, %Depot is remapped to %SP since cuda-gdb resolves stack frames via the generic pointer.
Frame Layout Algorithm
cicc's PEI executes in ten sequential phases within a single monolithic function. The algorithm is significantly more sophisticated than upstream's linear scan.
Phase 1–2: Setup and Callee-Saved Registers (lines 443–566)
Retrieves the TargetFrameLowering and TargetRegisterInfo from the MachineFunction's subtarget. If callee-saved registers exist (determined by vtable(FrameLowering, +480)), allocates a 0xA8-byte callee-save info structure at PEI state offset +200 containing two inline SmallVectors for register indices.
On GPU targets, callee-saved registers are unusual — PTX functions use a fully virtual register file, so there is no hardware register saving in the CPU sense. However, cicc models device-function calling conventions that may require preserving certain virtual registers across calls, and this mechanism handles that.
Phase 3: Fixed Object Collection (lines 567–730)
Initializes a chunk table (deque-like structure) with -4096 sentinel values. Collects prolog/epilog insertion points from the PEI state arrays at offsets +216 (prolog points, count at +224) and +264 (epilog points, count at +272).
When callee-saves exist and optimization level is not 20 (a special threshold), manually inserts save/restore instructions:
- Simple saves: `storeRegToStackSlot(MBB, MI, reg, kill=1, FI, RC, TRI)`
- Compound saves: handles sub-register decomposition via `sub_2F26260` when `byte+9 == 1` in the callee-save info.
Phase 4: Offset Assignment — The Core Layout Engine (lines 733–1070)
This is the heart of PEI. It assigns byte offsets within __local_depot to every frame object.
MachineFrameInfo layout:
StackDirection: 1 = grows-negative (toward lower addresses)
0 = grows-positive (toward higher addresses)
LocalFrameSize: initial offset base
NumFixedObjects: count of pre-positioned objects
MaxAlignment: tracks largest alignment seen
Fixed objects are laid out first. Each frame object is a 40-byte record:
| Offset | Type | Field |
|---|---|---|
| +0 | i64 | Byte offset (written by PEI) |
| +8 | i64 | Object size in bytes |
| +16 | u8 | Alignment (log2) |
| +20 | u8 | isDead flag |
| +32 | u8 | isSpillSlot flag |
| +36 | u8 | Category (0–3) |
The alignment formula appears ~20 times throughout the pass:
// Round up 'value' to next multiple of (1 << align_log2):
aligned = -(1 << align_log2) & (value + (1 << align_log2) - 1);
// Equivalent to: aligned = (value + mask) & ~mask where mask = (1<<n) - 1
For grows-negative direction, offsets are stored as negative values; for grows-positive, they accumulate upward.
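The two spellings of the round-up formula are bit-for-bit identical; a small self-contained check (illustrative only, function names are not from the binary):

```cpp
#include <cstdint>

// Round 'value' up to the next multiple of (1 << align_log2).
// Negation form, as it appears ~20 times in the decompiled pass:
std::uint64_t alignUpNeg(std::uint64_t value, unsigned align_log2) {
    std::uint64_t a = 1ull << align_log2;
    return (-a) & (value + a - 1);     // -(1<<n) & (v + (1<<n) - 1)
}

// Conventional mask form, the textbook equivalent:
std::uint64_t alignUpMask(std::uint64_t value, unsigned align_log2) {
    std::uint64_t mask = (1ull << align_log2) - 1;
    return (value + mask) & ~mask;     // (v + mask) & ~mask
}
```

For example, aligning 13 to a 16-byte boundary (`align_log2 = 4`) yields 16 under both forms, and values already aligned are returned unchanged.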
Callee-saved region is laid out next, iterating frame indices in range [PEI+208 .. PEI+212]. Each CSR object gets an aligned offset using the same formula.
Separate stack area: if MachineFrameInfo+665 flag is set, NVIDIA supports a physically separate stack region with its own alignment at +664 and total size at +656. This likely corresponds to a distinct .local segment for shared-memory scratch or ABI-reserved zones.
Phase 5: Categorized Local Variable Layout (lines 1060–1600)
This is cicc's most significant divergence from upstream PEI. Objects are classified into three priority buckets by a category byte at frame-object offset +36:
| Category | Bucket | Typical contents | Layout order |
|---|---|---|---|
| 3 | v427 | Vector/tensor spills (high alignment) | First |
| 2 | v419 | Medium-aligned objects | Second |
| 1 | v412 | General locals | Third |
| 0 | — | Skip (already placed or dead) | — |
Each bucket is processed by sub_35B0830 which assigns aligned offsets. The ordering minimizes alignment waste: laying out large-alignment objects first avoids padding gaps.
Objects are skipped if:
- They are spill slots in a separate stack area
- They fall within the callee-saved index range
- Their size is -1 (sentinel for dynamic-size objects)
- They are the frame-pointer object
- They are dead
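The category-ordered placement described above can be sketched as follows. This is a simplified model under stated assumptions: the bucket order (3 → 2 → 1) and the 40-byte record's size/alignment/category fields come from this page, while the `FrameObj` struct and `layoutCategorized` function are illustrative names, not recovered symbols.

```cpp
#include <cstdint>
#include <vector>

// Illustrative 3-bucket categorized layout: objects are placed in
// category order 3 -> 2 -> 1, each aligned to its own requirement,
// so high-alignment objects land first and padding waste shrinks.
struct FrameObj {
    std::uint64_t size;
    unsigned align_log2;
    int category;                  // 3 = vector spills, 2 = medium, 1 = general
    std::uint64_t offset = 0;      // written by layout, like record field +0
};

std::uint64_t layoutCategorized(std::vector<FrameObj>& objs) {
    std::uint64_t cur = 0;
    for (int cat = 3; cat >= 1; --cat) {        // bucket order per the table
        for (auto& o : objs) {
            if (o.category != cat) continue;
            std::uint64_t a = 1ull << o.align_log2;
            cur = (cur + a - 1) & ~(a - 1);      // align up
            o.offset = cur;
            cur += o.size;
        }
    }
    return cur;                                  // frame size before final rounding
}
```

Laying out a 16-byte-aligned spill, then an 8-byte-aligned object, then a 4-byte local in that order produces a gap-free 28-byte frame, whereas the reverse order would leave alignment padding between objects.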
Bitmap-Based Packing — When register count is nonzero and canUseStackBitmap returns true (frame size <= 0x7FFFFFFF), cicc builds a bitset representing every byte of the frame:
// Bitmap size in qwords:
bitmap_size = (frame_size + 63) >> 6;
// Mark all bytes as free (bits set to 1)
// Then clear bits for fixed objects and CSR objects
for each placed_object:
clear bits [offset .. offset + size)
For each unassigned general object, the algorithm scans the bitmap using tzcnt (trailing zero count) to find contiguous runs of set bits that match the object's size and alignment:
for each unassigned_obj in v412:
candidate = tzcnt_scan(bitmap, obj.size);
if (candidate != NOT_FOUND):
// Verify alignment
if aligned(candidate, obj.alignment):
// Verify all bits available (inner loop)
if all_bits_set(bitmap, candidate, candidate + obj.size):
assign_offset(obj, candidate);
clear_bits(bitmap, candidate, candidate + obj.size);
continue;
// Fallback: linear allocation at end of frame
offset = align(running_offset);
assign_offset(obj, offset);
running_offset += obj.size;
This is substantially more aggressive than both upstream LLVM PEI (which does a single linear pass) and the upstream NVPTX PrologEpilogPass (which has no packing at all). It enables reuse of "holes" left by fixed objects, callee-saves, and dead objects.
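A runnable rendering of the bitmap pseudocode above, under stated assumptions: one bit per frame byte, bit set = free, and a linear aligned scan standing in for the binary's tzcnt-driven candidate search. All type and function names here are illustrative.

```cpp
#include <cstdint>
#include <vector>

// Sketch of bitmap hole-finding: the frame is modeled as a bitset with
// one bit per byte (set = free). findHole looks for a contiguous run of
// free bytes of the requested size at the requested alignment; -1 means
// no hole, triggering the linear fallback at the end of the frame.
struct FrameBitmap {
    std::vector<std::uint64_t> bits;
    std::uint64_t size;
    explicit FrameBitmap(std::uint64_t frameSize)
        : bits((frameSize + 63) >> 6, ~0ull), size(frameSize) {}
    bool isFree(std::uint64_t i) const {
        return (bits[i >> 6] >> (i & 63)) & 1;
    }
    void clearRange(std::uint64_t b, std::uint64_t e) {   // mark bytes used
        for (std::uint64_t i = b; i < e; ++i)
            bits[i >> 6] &= ~(1ull << (i & 63));
    }
    std::int64_t findHole(std::uint64_t objSize, std::uint64_t align) const {
        for (std::uint64_t c = 0; c + objSize <= size; c += align) {
            bool ok = true;
            for (std::uint64_t i = c; i < c + objSize; ++i)
                if (!isFree(i)) { ok = false; break; }     // verify all bits set
            if (ok) return (std::int64_t)c;
        }
        return -1;
    }
};
```

Clearing the ranges occupied by fixed objects and callee-saves, then calling `findHole` per unassigned object, reproduces the hole-reuse behavior: a dead object's freed bytes become a legal home for a later, smaller object.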
Phase 6: Final Alignment and Frame Size (lines 1688–1795)
After all objects are laid out:
- If `targetHandlesStackFrameRounding` returns true, skip to finalization.
- Add `MaxCallFrameSize` to the running offset if the function adjusts the stack.
- Choose alignment: `StackAlign` (from `TFI.getStackAlign()`) for functions with calls or alloca, or `TransientStackAlign` for leaf functions. The subtarget stores these at `FrameLowering[12]` and `[13]` respectively.
- Round up: `final = align(running_offset, max(StackAlign, MaxAlignment))`.
- If alignment changed the total and direction is grows-negative, shift all callee-save offsets by the delta to maintain correct relative positions.
- Write `FrameInfo.StackSize = final_offset - initial_offset`.
This value becomes the SIZE in .local .align ALIGN .b8 __local_depotN[SIZE].
Phase 7: Prologue/Epilogue Insertion (lines 1803–1872)
Executed when optimization level is not at threshold 20. For each prolog insertion point, calls emitPrologue(MF, MBB) via RegisterInfo vtable at +96. For each epilog point, calls emitEpilogue(MF, MBB) at +104.
Post-fixup via sub_35AC7B0, then a second pass over prolog points for insertPrologueSaveCode (vtable +152, if not a null stub).
Architecture-specific extension: checks (*(Module+2) >> 4) & 0x3FF == 0xB (SM arch code 11). When matched, calls an additional prolog handler at vtable +176. This likely targets an early or internal SM variant.
Phase 8–9: Frame Index Elimination (lines 1873–2268)
Two strategies selected by vtable(FrameLowering, +616):
Forward elimination (Path A): walks each MBB's instruction list forward. For each instruction, checks the opcode against FRAME_SETUP and FRAME_DESTROY pseudos — these adjust the SP offset tracker. For other instructions, scans operands for type-5 (FrameIndex), then calls sub_35ABF20 to attempt elimination or falls back to the target-specific handler.
Backward elimination (Path B): same logic but iterates instructions in reverse order. Handles FRAME_SETUP/FRAME_DESTROY with different SP adjustment accumulation.
This dual-path approach is unique to cicc — upstream NVPTX PrologEpilogPass only does a single backward walk. The forward path may be needed for instructions where the SP adjustment at a given point depends on preceding pseudo-ops.
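The forward path's bookkeeping can be sketched as a running SP-delta tracker. This is a hypothetical model: the `Op` tags, `Instr` shape, and `resolveForward` function are illustrative, standing in for the FRAME_SETUP/FRAME_DESTROY pseudo handling and the `sub_35ABF20` elimination call described above.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch of the forward-elimination walk: FRAME_SETUP and
// FRAME_DESTROY pseudos adjust a running SP delta, and each frame-index
// use resolves to (object offset + current delta).
enum Op { FRAME_SETUP, FRAME_DESTROY, USE_FI, OTHER };
struct Instr { Op op; std::int64_t imm; };  // imm: adjustment or object offset

std::vector<std::int64_t> resolveForward(const std::vector<Instr>& mbb) {
    std::vector<std::int64_t> resolved;
    std::int64_t spDelta = 0;
    for (const auto& mi : mbb) {
        switch (mi.op) {
        case FRAME_SETUP:   spDelta += mi.imm; break;              // stack grows
        case FRAME_DESTROY: spDelta -= mi.imm; break;              // stack shrinks
        case USE_FI:        resolved.push_back(mi.imm + spDelta); break;
        default:            break;
        }
    }
    return resolved;
}
```

The backward path would walk the same list in reverse with the opposite accumulation sense; a frame-index use between a setup/destroy pair resolves to a different effective offset than one outside it, which is exactly why the point-in-time delta matters.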
Phase 10: Diagnostics and Cleanup (lines 2270–2388)
Stack size warning: default threshold is 0xFFFFFFFF (4 GB, effectively disabled). If the function has a "warn-stack-size" attribute, it parses the value via strtoul(str, &end, 10). When the total frame size (plus optional regspill area at MF+86*wordsize if opt-level flag 55 is set) exceeds the threshold, emits a "stack frame size" diagnostic.
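The threshold check reduces to a few lines; a hedged sketch (the `strtoul(str, &end, 10)` call and the 0xFFFFFFFF default are from the decompilation, the wrapper function itself is illustrative):

```cpp
#include <cstdint>
#include <cstdlib>

// Sketch of the Phase 10 check: parse the "warn-stack-size" attribute
// value with strtoul and compare total frame size against it. The
// default threshold 0xFFFFFFFF effectively disables the warning.
bool exceedsStackWarnThreshold(std::uint64_t frameSize, const char* attrValue) {
    unsigned long threshold = 0xFFFFFFFFul;          // default: disabled
    if (attrValue) {
        char* end = nullptr;
        unsigned long v = std::strtoul(attrValue, &end, 10);
        if (end && end != attrValue) threshold = v;  // attribute parsed OK
    }
    return frameSize > threshold;  // true => emit "stack frame size" diagnostic
}
```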
Stack annotation: if annotation output is enabled (checked via sub_B6EA50/sub_B6F970), formats and writes stack-size metadata to the analysis output for the NVVM container.
Cleanup frees the 0xA8 callee-save info structure, resets prolog/epilog point counts, resets frame metadata, and walks the chunk table to free non-inline instruction arrays.
Dynamic Stack Allocation (alloca)
PTX supports alloca semantics at the LLVM IR level — the alloca instruction lowers to a local memory reservation. However, truly dynamic-sized allocations (variable-length arrays, runtime alloca(N)) are constrained:
- `MachineFrameInfo.hasVarSizedObjects` (flag at +36) tracks whether the function contains VLA-style allocations.
- When present, PEI selects `StackAlign` (the full stack alignment) rather than `TransientStackAlign` for final frame rounding.
- ptxas ultimately resolves dynamic allocations at JIT time, not cicc. cicc's role is to set up the frame pointer correctly so that dynamic objects can be addressed relative to it.
- The `FramePointerIndex` (at `MachineFrameInfo+68`) is laid out last among general objects, ensuring the frame pointer anchors the top of the fixed frame with dynamic objects growing beyond it.
For fixed-size allocas, SROA (Scalar Replacement of Aggregates) typically promotes them to SSA registers before PEI ever runs. When SROA succeeds for all allocas, MachineFrameInfo has no stack objects and PEI emits no __local_depot at all — the function runs entirely in registers.
Spill Slots
Register spills are the primary consumer of __local_depot space. When the register allocator cannot fit a virtual register's live range into the available physical registers, it creates a spill slot — a frame object marked with isSpillSlot = 1 (byte at frame-object +32).
Spill-slot frame objects are created during register allocation. PEI does not create them; it only assigns their offsets. In cicc, spill slots interact with the categorized layout:
- Spill slots in a separate stack area (when `hasSeparateStackArea` is set) are excluded from the general layout and handled in Phase 4's separate-area processing.
- Remaining spill slots are classified into categories 1–3 based on their alignment requirements and register class — vector register spills (e.g., 128-bit `%rq` registers) end up in category 3, scalar spills in category 1.
After PEI assigns offsets, the spill loads/stores reference [%SP + offset] or [%SPL + offset] directly. The post-PEI NVPTXPeephole pass optimizes these: when a LEA_ADDRi64 %VRFrame64, offset feeds directly into cvta_to_local_64, the peephole collapses this to LEA_ADDRi64 %VRFrameLocal64, offset, saving the generic address conversion.
Interaction with SROA
SROA runs early in the optimization pipeline (see SROA) and aggressively promotes alloca instructions to SSA values. For many GPU kernels — especially those that avoid taking addresses of locals — SROA eliminates all allocas, resulting in an empty MachineFrameInfo. In this case:
- PEI's frame size computes to 0.
- The PTX emitter (`sub_2158E80`) checks `FrameInfo.StackSize`; if zero, it emits no `.local` directive and no `%SP`/`%SPL` declarations.
- The function runs entirely in the virtual register file — the ideal case for GPU performance.
When SROA cannot promote (address-taken locals, aggregates too large for SROA's threshold controlled by sroa-size-limit, or when sroa-skip-mem2reg is set), PEI becomes essential. Additionally, cicc has a custom MI Mem2Reg pass (nv-disable-mem2reg controls it) that runs post-register-allocation and promotes MachineIR local-memory accesses back to registers — effectively a second chance at eliminating __local_depot usage after regalloc.
Comparison with Upstream
| Aspect | Upstream NVPTXPrologEpilogPass | cicc sub_35B1110 |
|---|---|---|
| Size | 280 lines | ~2,400 lines |
| Callee-saved regs | Not handled | Full save/restore infrastructure |
| Register scavenging | Not used | Both forward and backward paths |
| Layout algorithm | Single linear pass over all objects | Categorized 3-bucket layout + bitmap packing |
| Frame packing | None — objects placed sequentially | tzcnt-accelerated bitmap hole-finding |
| Stack direction | Supports both, simple | Supports both, with per-direction callee-save adjustment |
| Diagnostics | None | warn-stack-size attribute + annotation output |
| Separate stack area | Not supported | Full support (flag at MFI+665) |
| Arch-specific prolog | None | SM arch code 0xB extension |
| Optimization gating | None | opt-level 20 skips prolog/epilog emission |
| Frame-index elimination | Single backward walk | Dual forward/backward strategies |
The upstream pass explicitly disables LLVM's standard PrologEpilogCodeInserterID and replaces it. cicc's version is closer to the full standard LLVM PEI but with GPU-specific extensions — it re-enables callee-saved handling, register scavenging, and the frame-rounding logic that upstream strips out.
Configuration
| Knob | Type | Default | Effect |
|---|---|---|---|
| `warn-stack-size` | Function attribute (string→int) | 0xFFFFFFFF (disabled) | Emit diagnostic when frame size exceeds threshold |
| `nvptx-short-ptr` | cl::opt<bool> | false | Use 32-bit pointers for local/const/shared address spaces; affects %SPL width |
| `nv-disable-mem2reg` | cl::opt<bool> | false | Disable post-regalloc MI Mem2Reg pass (more objects remain for PEI to lay out) |
| `sroa-size-limit` | cl::opt<int> | (varies) | Max aggregate size SROA will promote; larger values reduce PEI workload |
| Opt-level flag 20 | Internal | — | Skips prolog/epilog instruction emission and callee-save handling |
| Opt-level flag 55 | Internal | — | Includes regspill area in stack-size diagnostic total |
| `FrameLowering[12]` | Subtarget | arch-dependent | Stack alignment for functions with calls/alloca |
| `FrameLowering[13]` | Subtarget | arch-dependent | Stack alignment for leaf functions (TransientStackAlign) |
Key Data Structures
MachineFrameInfo (at MachineFunction+48)
Offset Type Field
+8 ptr Objects array base pointer (40-byte records)
+16 ptr Objects array end pointer
+32 i32 NumFixedObjects
+36 u8 hasVarSizedObjects
+48 i64 StackSize ← WRITTEN by PEI
+64 u8 MaxAlignment (log2)
+65 u8 hasCalls / needsStackAlignment
+68 i32 FramePointerIndex (-1 if none)
+80 i64 MaxCallFrameSize (-1 if unknown)
+96 ptr Separate-area array base
+104 ptr Separate-area array end
+120 u8 hasCalleeSaves ← SET by PEI
+128 ptr Extra-area array pointer
+136 i64 Extra-area count
+656 i64 Separate area total size
+664 u8 Separate area alignment
+665 u8 hasSeparateStackArea flag
PEI State (pass object, offset from a1)
Offset Type Field
+8 ptr Analysis list (tagged analysis pointers)
+200 ptr Callee-save info (0xA8-byte struct, or null)
+208 u32 First CSR frame index
+212 u32 Last CSR frame index
+216 ptr Prolog insertion points array
+224 u32 Prolog point count
+264 ptr Epilog insertion points array
+272 u32 Epilog point count
+312 u8 hasReservedCallFrame flag
+313 u8 requiresRegisterScavenging flag
+320 ptr Stack-size annotation analysis pointer
Frame Object Record (40 bytes each)
Offset Type Field
+0 i64 Byte offset in __local_depot (assigned by PEI)
+8 i64 Object size in bytes
+16 u8 Alignment (log2 encoding)
+20 u8 isDead flag
+32 u8 isSpillSlot flag
+36 u8 Category: 0=skip, 1=general, 2=medium, 3=large
Diagnostic Strings
| String | When emitted |
|---|---|
"warn-stack-size" | Function attribute name — read and parsed as an integer threshold |
"stack frame size" | Diagnostic message when total frame size exceeds the warn-stack-size threshold |
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| PrologEpilogInserter::runOnMachineFunction — main entry (68 KB) | sub_35B1110 | -- | -- |
| PEI pre-setup: initialize frame object tracking | sub_35AC440 | -- | -- |
| Record frame object into chunk table | sub_35AFAD0 | -- | -- |
| Determine CSR frame index range (writes PEI+208, +212) | sub_35AEEB0 | -- | -- |
| Post-save fixup | sub_35AE230 | -- | -- |
| Insert restore instructions at epilog points | sub_35ADBC0 | -- | -- |
| Assign offsets to categorized frame object bucket | sub_35B0830 | -- | -- |
| Push frame object index into categorized bucket | sub_35B0B10 | -- | -- |
| Post-prolog/epilog fixup | sub_35AC7B0 | -- | -- |
| Try to eliminate a single frame index operand | sub_35ABF20 | -- | -- |
| Initialize register scavenger for a MBB | sub_35C5BD0 | -- | -- |
| Advance register scavenger | sub_35C5C00 | -- | -- |
| Post-scavenging callee-save cleanup | sub_35C6D20 | -- | -- |
| Format stack-size annotation | sub_35AE7D0 | -- | -- |
| PTX emitter: emitFunctionFrameSetup() (__local_depot + %SP/%SPL) | sub_2158E80 | -- | -- |
| Local depot helper | sub_214C040 | -- | -- |
| Local depot helper | sub_2154370 | -- | -- |
| Collect callee-saved registers | sub_2E77EA0 | -- | -- |
| Get register class for physical register | sub_2FF6500 | -- | -- |
| Build sub-register decomposition list | sub_2F26260 | -- | -- |
| Insert compound save instruction | sub_2E8EAD0 | -- | -- |
| Check optimization level flag | sub_B2D610 | -- | -- |
| Check function attribute existence | sub_B2D620 | -- | -- |
| Get function attribute value | sub_B2D7E0 | -- | -- |
| Build stack-size diagnostic message | sub_B15960 | -- | -- |
Differences from Upstream LLVM
| Aspect | Upstream LLVM (NVPTX open-source) | CICC v13.0 |
|---|---|---|
| Implementation | Stripped-down NVPTXPrologEpilogPass (~280 lines); handles only offset calculation and frame-index elimination | Full 68 KB PEI monolith with callee-saved register handling, register scavenging, bitmap-based frame packing, categorized layout ordering |
| Stack concept | No hardware stack; minimal __local_depot offset assignment | Same __local_depot model but with full-featured offset assignment: categorized frame objects, alignment-based bucketing, dead frame object elimination |
| Callee-saved registers | Skipped entirely (no function calls in typical kernels) | Restored: full callee-saved register scan, compound save/restore instruction insertion for non-inlined device function calls |
| Register scavenging | Absent | Included: sub_35C5BD0/sub_35C5C00 initialize and advance a register scavenger per MBB for emergency spill resolution |
| Frame packing | Sequential offset assignment | Bitmap-based packing with categorized buckets; objects sorted by alignment to minimize padding waste |
| Stack-size diagnostics | No diagnostic system | Annotation system (sub_35AE7D0) formats stack-size remarks; integrates with -Rpass-analysis for occupancy tuning |
| Prologue emission | Two-instruction %SP/%SPL setup | Same two-instruction prologue (sub_2158E80) but with additional __local_depot sizing logic for complex frame layouts |
Cross-References
- Register Allocation — creates spill slots that PEI lays out; the number and alignment of spills directly determines frame size.
- Register Coalescing — reduces register pressure, which reduces spills, which reduces frame size.
- SROA — SROA eliminates allocas before they reach MachineIR; when fully successful, PEI has nothing to do.
- AsmPrinter & PTX Body Emission — `sub_2158E80` emits the `.local` directive and `%SP`/`%SPL` declarations that PEI computed.
- Instruction Scheduling — runs before PEI; scheduling decisions affect register pressure and thus spill count.
- Pipeline & Ordering — PEI runs post-regalloc, followed immediately by NVPTXPeephole for `%VRFrame` to `%VRFrameLocal` optimization.
BranchFolding & TailMerge
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
LLVM version note: Based on LLVM 20.0.0 `BranchFolding.cpp`. The critical divergence is that cicc removes the `requiresStructuredCFG()` gate that upstream uses to disable tail merging for GPU targets, and compensates with a reserved-register merge safety check not present in any upstream version.
BranchFolding is LLVM's post-register-allocation CFG optimizer. It runs after block placement and performs three transformations in a fixed-point loop: tail merging (extracting identical instruction tails from multiple blocks into a shared block), branch optimization (eliminating redundant or unreachable branches, merging single-predecessor blocks into predecessors), and common-code hoisting (lifting identical instructions from successors into a shared predecessor). In cicc v13.0 the pass lives at sub_2F336B0 (the OptimizeBlock / TailMergeBlocks core, 11,347 bytes) with pass entry at sub_2F36310. The NVPTX version carries one critical divergence from upstream LLVM: tail merging is not disabled by requiresStructuredCFG(). Instead, cicc keeps tail merging enabled but gates individual merge decisions on a reserved-register check that prevents merging when NVPTX special registers (%tid.x, %ntid.x, etc.) cross the merge boundary.
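The core of tail merging is finding the longest identical instruction suffix shared by candidate blocks and merging only when it meets the `-tail-merge-size` minimum (default 3, per the Key Facts table). A minimal sketch, with instructions modeled as strings for illustration (the function names here are hypothetical, not cicc symbols):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Measure the longest identical instruction suffix of two blocks --
// the candidate "shared tail" that tail merging would extract into a
// common successor block.
std::size_t commonTailLen(const std::vector<std::string>& a,
                          const std::vector<std::string>& b) {
    std::size_t n = 0;
    while (n < a.size() && n < b.size() &&
           a[a.size() - 1 - n] == b[b.size() - 1 - n])
        ++n;
    return n;
}

// Merge only when the shared tail meets the -tail-merge-size minimum
// (default 3 instructions per this page).
bool shouldTailMerge(const std::vector<std::string>& a,
                     const std::vector<std::string>& b,
                     std::size_t minTail = 3) {
    return commonTailLen(a, b) >= minTail;
}
```

In cicc, this length check is only the first gate; the NVPTX-specific reserved-register check described below can still veto a merge whose tail is long enough.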
Key Facts
| Property | Value |
|---|---|
| Core function | sub_2F336B0 (OptimizeBlock / TailMergeBlocks) |
| Function size | 11,347 bytes (792-byte stack frame) |
| Pass entry point | sub_2F36310 (iterates all MBBs) |
| Pass ID (upstream) | "branch-folder" / BranchFolderPassID |
| Pipeline position | After register allocation, after block placement |
| Disable knob | -disable-branch-fold (global at qword_5022CC8) |
| Tail-merge gate | enable-tail-merge (tri-state: unset/true/false) |
| Tail-merge threshold | -tail-merge-threshold (default 150) |
| Minimum tail length | -tail-merge-size (default 3 instructions) |
| Knob constructor | ctor_346 |
| Required property | NoPHIs -- SSA phi nodes must already be eliminated |
Upstream vs. NVPTX Behavior
In stock LLVM, BranchFolderPass::run checks requiresStructuredCFG() on the TargetMachine and, if true, disables tail merging entirely:
bool EnableTailMerge = !MF.getTarget().requiresStructuredCFG()
&& PassConfig->getEnableTailMerge();
NVPTX returns true from requiresStructuredCFG(), so upstream LLVM would completely suppress tail merging for GPU targets. cicc removes this gate. The binary evidence is the vtable check at 0x2F337A3 (cmp rax, offset sub_2DAC790), which verifies that the NVPTXInstrInfo vtable supports analyzeBranch -- if it does, tail merging proceeds. The structured-CFG check is absent. This makes sense: StructurizeCFG has already run by this point and guaranteed reducible control flow; tail merging two blocks that share a common successor preserves reducibility because it only introduces a new unconditional branch to the merged tail, which does not create irreducible cycles.
However, cicc compensates with three safety mechanisms that upstream does not need:
- Reserved-register check. At 0x2F3427B, the pass calls sub_2E88A90 with flag 0x200 (isReservedReg) on every register live across the proposed merge boundary. NVPTX special registers (%tid.x, %ntid.x, %ctaid.x, etc.) are reserved and cannot be live-in to a newly created shared tail block because their values are implicitly defined by the hardware. If any reserved register is detected, the merge is rejected. See the Reserved-Register Safety Mechanism section below for full detail.
- Priority ordering for conditional branches. The pattern or ecx, 2 at 0x2F33B1C assigns priority >= 2 to conditional branch terminators and lower priority to unconditional branches. This ensures unconditional-branch tails are merged first, because those merges never alter branch conditions and are always safe within structured CFG. Conditional tail merges are attempted only after unconditional ones are exhausted.
- NVPTXInstrInfo vtable validation. The vtable check at 0x2F337A3 (cmp rax, offset sub_2DAC790) verifies that the TargetInstrInfo object supports analyzeBranch before any merge is attempted. This is a guard against running the pass on a MachineFunction whose InstrInfo does not implement branch analysis -- a scenario that cannot occur in the normal NVPTX pipeline but could if the pass were invoked from an unexpected context. The check loads the vtable pointer from [TII], compares against the known NVPTXInstrInfo vtable base, and short-circuits to "no merge" if the match fails.
Algorithm
The pass entry sub_2F36310 calls OptimizeFunction, which runs a fixed-point loop:
OptimizeFunction(MF):
repeat:
changed = TailMergeBlocks(MF)
changed |= OptimizeBranches(MF)
changed |= HoistCommonCode(MF)
until !changed
// clean up dead jump tables
TailMergeBlocks
TailMergeBlocks operates in two phases.
Phase A -- return/exit blocks. Collect all blocks with no successors (return blocks, noreturn calls) into MergePotentials, capped at tail-merge-threshold (150). Hash each block's tail via sub_2F26260 (HashEndOfMBB), which computes HashMachineInstr on the last non-debug instruction. If two or more candidates share a hash, call TryTailMergeBlocks to attempt the merge.
Phase B -- multi-predecessor blocks. For each block IBB with >= 2 predecessors, collect the predecessors into MergePotentials. For each predecessor PBB:
- Skip self-loops (PBB == IBB), EH-pad successors, and inline-asm-br blocks.
- Call AnalyzeBranch (sub_2E09D00) on PBB. If PBB conditionally branches to IBB, reverse the condition so the unconditional fall-through to IBB is removed, leaving only the conditional branch to the "other" target. This normalization enables tail comparison.
- Hash the tail of the normalized PBB and push it into MergePotentials.
Then call TryTailMergeBlocks(IBB, PredBB, MinCommonTailLength):
TryTailMergeBlocks(SuccBB, PredBB, MinTail):
sort MergePotentials by hash
for each group of candidates sharing a hash:
for each pair (MBB1, MBB2) in the group:
tail_len = ComputeCommonTailLength(MBB1, MBB2)
if tail_len >= MinTail:
// check reserved-register constraint (NVPTX addition)
for each reg live across merge point:
if hasProperty(reg, 0x200): // isReservedReg
reject merge; continue
// perform the merge
create new MBB "CommonTail"
splice tail instructions from MBB1 into CommonTail
ReplaceTailWithBranchTo(MBB2, CommonTail)
UpdateTerminator on both blocks
update live-ins for CommonTail
merged = true
return merged
ComputeCommonTailLength walks backwards from both block ends, comparing instructions via isIdenticalTo. It skips debug and CFI instructions. Inline asm is never merged (hard-coded rejection in upstream). The cicc binary performs this comparison at 0x2F33B0F--0x2F33BDD, extracting opcode from [ptr+18h] and comparing sub-fields via sar/and arithmetic on the instruction encoding.
HashEndOfMBB -- sub_2F26260
The hash function at sub_2F26260 computes a 32-bit hash of a block's tail for fast merge-candidate matching. The algorithm:
HashEndOfMBB(MBB):
iter = MBB.rbegin() // last instruction
// skip debug instructions
while iter != MBB.rend() && iter.isDebugInstr():
iter++
if iter == MBB.rend():
return 0 // empty block (or all-debug)
// skip terminator branches -- hash the last non-branch
while iter != MBB.rend() && iter.isTerminator():
iter++
if iter == MBB.rend():
return 0 // block contains only terminators
return HashMachineInstr(*iter)
HashMachineInstr (at sub_2E89C70) hashes the instruction's opcode, number of operands, and the first two operands' register/immediate values. It does not hash memory operands or metadata -- this is intentional, because the hash is only used to bucket candidates for pairwise comparison. False collisions are resolved by the subsequent ComputeCommonTailLength call. The hash uses a simple multiply-and-add scheme:
HashMachineInstr(MI):
h = MI.getOpcode()
h = h * 37 + MI.getNumOperands()
if MI.getNumOperands() >= 1:
h = h * 37 + hashOperand(MI.getOperand(0))
if MI.getNumOperands() >= 2:
h = h * 37 + hashOperand(MI.getOperand(1))
return h
The * 37 constant is standard LLVM hashing (the same multiplier used in DenseMapInfo). The hash is deliberately coarse -- it accepts false positives (two different instructions hashing to the same value) but never produces false negatives (two identical instructions hashing differently), which is the correct tradeoff for a merge-candidate filter.
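The bucketing behavior described above can be simulated in a few lines of Python. This is a sketch, not the binary's exact operand hashing: instructions are modeled as tuples, and hash_operand is a simplified stand-in for the real per-operand hash.

```python
# Sketch of the coarse tail-hash bucketing described above.
# Instructions: (opcode, operands, is_debug, is_terminator) tuples, last = block end.

def hash_operand(op):
    # Simplified stand-in: operands modeled as small ints (regs / immediates).
    return op

def hash_machine_instr(opcode, operands):
    h = opcode
    h = h * 37 + len(operands)
    if len(operands) >= 1:
        h = h * 37 + hash_operand(operands[0])
    if len(operands) >= 2:
        h = h * 37 + hash_operand(operands[1])
    return h

def hash_end_of_mbb(block):
    # Walk backwards, skipping debug instructions and terminators,
    # and hash the last "real" instruction (0 if none exists).
    for opcode, operands, is_debug, is_term in reversed(block):
        if is_debug or is_term:
            continue
        return hash_machine_instr(opcode, operands)
    return 0

# Blocks whose last non-debug, non-terminator instruction is identical land in
# the same bucket; pairwise comparison later resolves any false positives.
a = [(10, [1, 2], False, False), (99, [], False, True)]   # add r1,r2 ; bra
b = [(10, [1, 2], False, False), (99, [], False, True)]
c = [(11, [1, 2], False, False), (99, [], False, True)]   # different opcode
print(hash_end_of_mbb(a) == hash_end_of_mbb(b))  # True  -> merge candidates
print(hash_end_of_mbb(a) == hash_end_of_mbb(c))  # False -> different buckets
```

Note how the hash never produces a false negative: identical tails always hash equal, which is the only property the candidate filter needs.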
ComputeCommonTailLength -- Detailed Binary Walkthrough
The comparison loop at 0x2F33B0F proceeds as follows:
ComputeCommonTailLength(MBB1, MBB2):
iter1 = MBB1.rbegin() // walk backwards from end
iter2 = MBB2.rbegin()
count = 0
// skip debug instructions at tails
skip_debug(iter1, MBB1)
skip_debug(iter2, MBB2)
while iter1 != MBB1.rend() && iter2 != MBB2.rend():
MI1 = *iter1
MI2 = *iter2
// extract opcode from [MI + 0x18]
opc1 = *(uint32_t*)(MI1 + 0x18)
opc2 = *(uint32_t*)(MI2 + 0x18)
// reject if either is inline asm (opcode check)
if is_inline_asm(opc1) || is_inline_asm(opc2):
break
// reject if either is a CFI pseudo-instruction
if is_cfi(opc1) || is_cfi(opc2):
skip to next non-CFI; continue
// full comparison: opcode, operand count, each operand
if !isIdenticalTo(MI1, MI2):
break
count++
iter1++; iter2++
skip_debug(iter1, MBB1)
skip_debug(iter2, MBB2)
return count
The isIdenticalTo comparison at the binary level extracts fields from the MachineInstr layout:
- [MI + 0x18]: opcode (32-bit)
- [MI + 0x08]: operand list pointer
- [MI + 0x10]: operand count (16-bit at +0x10, flags at +0x12)
- Each operand at stride 40 bytes: [operand + 0x00] = type tag, [operand + 0x08] = register/immediate value
Two instructions are identical if and only if: same opcode, same number of operands, and for each operand pair: same type tag and same value. Memory operands (MachineMemOperand) are not compared -- two loads from different memory locations with the same register operands will compare as identical. This is correct for tail merging because if the instructions are in the tail of two blocks that reach the same successor, their memory operands must be equivalent by construction (they access the same state at the same program point).
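The backward common-tail scan can be sketched in Python. This is a simplified model under stated assumptions: instructions are (opcode, operands) tuples, tuple equality stands in for isIdenticalTo, the INLINE_ASM sentinel is hypothetical, and the debug/CFI skipping described above is omitted for brevity.

```python
# Sketch of the backward common-tail scan described above.

INLINE_ASM = -1  # hypothetical sentinel opcode standing in for inline asm

def is_identical(mi1, mi2):
    # Tuple equality models: same opcode, same operand count, and matching
    # (type tag, value) per operand. Memory operands are deliberately not
    # part of the comparison (see the discussion above).
    return mi1 == mi2

def compute_common_tail_length(block1, block2):
    i, j, count = len(block1) - 1, len(block2) - 1, 0
    while i >= 0 and j >= 0:
        mi1, mi2 = block1[i], block2[j]
        if mi1[0] == INLINE_ASM or mi2[0] == INLINE_ASM:
            break                      # inline asm is never merged
        if not is_identical(mi1, mi2):
            break                      # tails diverge here
        count += 1
        i -= 1
        j -= 1
    return count

b1 = [(1, (5,)), (7, (2, 3)), (8, (4,))]
b2 = [(2, (6,)), (7, (2, 3)), (8, (4,))]
print(compute_common_tail_length(b1, b2))  # 2 -- shared 2-instruction tail
```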
Merge Candidate Ordering and the Priority System
The or ecx, 2 pattern at 0x2F33B1C implements a priority-based ordering within hash groups. When building the MergePotentials list, each entry is annotated with a priority value:
| Priority | Condition | Meaning |
|---|---|---|
| 0 | Block ends with unconditional branch only | Safest merge -- no condition changes needed |
| 1 | Block ends with fallthrough (no explicit branch) | Safe -- may need a branch inserted |
| 2+ | Block ends with conditional branch | Riskier -- merge may require condition reversal |
The sort at TryTailMergeBlocks sorts first by hash (grouping candidates), then within each hash group by priority (ascending). This ensures that the O(K^2) pairwise comparison within each hash group tries unconditional-only pairs first. If a merge succeeds for a low-priority (safe) pair, the modified block may no longer be a candidate for a higher-priority (conditional) pair, reducing the number of conditional merges attempted.
On NVPTX, this ordering is particularly important because conditional branch reversal (sub_2E09D00 AnalyzeBranch + condition inversion) can alter the fall-through layout. In a structured CFG, the fall-through direction often corresponds to the "then" path of an if-then-else, and reversing the condition flips which path falls through. While this does not change correctness, it can change the reconvergence point's distance from the branch, affecting I-cache locality. By preferring unconditional-only merges, the pass minimizes layout disruption.
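The two-key sort can be illustrated with a small Python sketch. The block names, hash values, and priority assignments here are invented examples; only the ordering rule (hash first, then ascending priority within a hash group) comes from the analysis above.

```python
# Sketch of the (hash, priority) candidate ordering described above.
# Priorities follow the table: 0 = unconditional-only, 1 = fallthrough,
# 2 = conditional terminator.

candidates = [
    ("bb3", 0xAB, 2),   # (name, tail hash, priority) -- illustrative values
    ("bb1", 0xAB, 0),
    ("bb4", 0xCD, 1),
    ("bb2", 0xAB, 1),
]

# Sort by hash first (grouping candidates), then by ascending priority within
# each group, so the O(K^2) pairwise scan tries the safest pairs first.
candidates.sort(key=lambda c: (c[1], c[2]))
print([name for name, _, _ in candidates])  # ['bb1', 'bb2', 'bb3', 'bb4']
```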
OptimizeBranches
OptimizeBranches (sub_2F36310 inner loop) walks every MBB and calls OptimizeBlock to perform local branch simplifications:
- Empty-block elimination. If MBB contains only debug instructions, redirect all predecessors to the fallthrough successor.
- Unconditional-to-same-target folding. If the previous block's conditional and unconditional branches both target the same block, replace with a single unconditional branch (or fallthrough).
- Single-predecessor merge. If MBB has exactly one predecessor and that predecessor falls through unconditionally, splice MBB's instructions into the predecessor and remove MBB.
- Redundant branch removal. If the previous block branches only to MBB (the natural fallthrough), remove the branch entirely.
- Condition reversal. If the previous block conditionally branches to MBB on true and somewhere else on false, reverse the condition to create a fallthrough.
- Tail-block relocation. If MBB has no successors (return/noreturn) and the predecessor could fall through to the next block instead, move MBB to the end of the function and reverse the predecessor's condition.
Each transformation triggers goto ReoptimizeBlock to re-analyze the modified block. Dead blocks (no predecessors after optimization) are removed via sub_2E790D0 (RemoveBlock).
HoistCommonCode
For each block with exactly two successors, if both successors begin with identical instructions, hoist those instructions into the predecessor. This is the inverse of tail merging -- it reduces code size when two divergent paths start with the same setup sequence. The EnableHoistCommonCode flag (always true in cicc) controls this phase.
Reserved-Register Safety Mechanism
This section documents the NVIDIA-specific reserved-register check that gates tail merging in cicc. This mechanism has no equivalent in upstream LLVM because upstream disables tail merging entirely for structured-CFG targets.
Why Reserved Registers Cannot Cross Merge Boundaries
NVPTX "special registers" (%tid.x, %ntid.x, %ctaid.x, %nctaid.x, %laneid, %warpid, and the SM 90+ cluster registers) are not stored in the virtual register file. They are hardware-defined, read-only values whose definitions are implicit -- there is no MachineInstr that defines %tid.x. Instead, these registers appear as implicit uses on instructions that read thread/block/grid coordinates.
When tail merging creates a new shared tail block CommonTail, LLVM's infrastructure computes the live-in set for CommonTail from the union of live-outs of the merged predecessors. For a normal virtual register, this is safe: the register has a concrete definition (a MachineInstr somewhere in the function), and the live-in annotation tells the downstream passes that the value is available at block entry.
For a reserved register, there is no concrete definition. The value is implicitly available at every point in the function -- it is defined by the hardware thread context, not by any instruction. Creating a new block with a reserved register in its live-in set is semantically meaningless but causes three concrete problems:
- LiveIntervals confusion. The LiveIntervals analysis (already computed by this point) has no interval for reserved registers. Adding a live-in for a reserved register to CommonTail would require creating a new LiveInterval that spans from CommonTail's entry to the last use within CommonTail. But reserved registers do not participate in LiveIntervals -- they are excluded during interval construction at sub_2F5A640. The resulting inconsistency triggers assertions in debug builds and can silently corrupt the interference matrix in release builds.
- Register pressure miscounting. The greedy register allocator tracks pressure per register class. Reserved registers belong to the internal-only class at off_4A026E0 (the "!Special!" class documented in Register Classes). This class has no encoded ID, no PTX declaration, and is excluded from pressure accounting. If a reserved register appeared as a live-in, the pressure tracker would attempt to look up its class and fail -- or worse, miscount it against one of the nine real classes.
- Emission failure. During PTX emission, sub_21583D0 (the register encoding function) maps each register to its 4-bit class tag via vtable comparison. Reserved registers use the off_4A026E0 vtable, which triggers the fatal "Bad register class" error. A reserved register in a live-in set could propagate to a point where the emitter attempts to declare it, causing an unconditional abort.
The hasProperty Check -- sub_2E88A90
sub_2E88A90 is a multi-purpose property query function used across several subsystems in cicc:
| Call site | Flag | Meaning |
|---|---|---|
| BranchFolding (0x2F3427B) | 0x200 | isReservedReg -- register is a hardware-defined special register |
| StructurizeCFG (sub_2E88A90 in structurize) | 0x80000 / 0x100000 | Uniformity/divergence classification |
| InstrEmitter (sub_2E88A90 in emitter) | 0x1000000000 (bit 36) | NVPTX-specific implicit-use flag |
The function takes three arguments:
sub_2E88A90(context_ptr, register_or_operand, flag_mask) -> bool
For the BranchFolding call at 0x2F3427B, the calling convention is:
; rdi = TargetRegisterInfo* (from MachineFunction->getSubtarget().getRegisterInfo())
; esi = register ID (physical register number from live-in set)
; edx = 0x200 (isReservedReg flag)
; returns: al = 1 if reserved, 0 if not
The function internally indexes into a per-register property table at [TRI + 0x58]. This table is initialized during NVPTXRegisterInfo construction (sub_2163AB0 for legacy PM, sub_30590F0 for new PM) and contains one entry per physical register. Each entry is a 64-bit bitmask of properties. The 0x200 bit (bit 9) is set for every register in the NVPTX special/environment register set.
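The table-indexed bitmask query can be modeled in Python. This is a sketch under stated assumptions: the register numbering and table contents are illustrative, not the binary's actual layout; only the "one 64-bit bitmask per physical register, test the requested bit" mechanism comes from the analysis above.

```python
# Sketch of the per-register property-table query (sub_2E88A90 model).

IS_RESERVED = 0x200     # bit 9, as documented above
IS_UNIFORM  = 0x80000   # bit 19, used by the StructurizeCFG call site

# Hypothetical numbering: reg 0 models %tid.x (reserved), reg 1 a normal reg.
property_table = {
    0: IS_RESERVED,
    1: 0,
}

def has_property(reg, flag_mask):
    """Index the per-register bitmask table and test the requested bit(s)."""
    return bool(property_table.get(reg, 0) & flag_mask)

print(has_property(0, IS_RESERVED))  # True  -> merge would be rejected
print(has_property(1, IS_RESERVED))  # False -> merge may proceed
```

The single-function, multi-flag design matches how one property table serves BranchFolding, StructurizeCFG, and the InstrEmitter with different mask arguments.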
Which Registers Are Marked Reserved (flag 0x200)
The following registers have bit 9 (0x200) set in the property table and will cause a merge rejection if live across the merge boundary:
| Register Group | PTX Names | Emission Function | Count |
|---|---|---|---|
| Thread ID | %tid.x, %tid.y, %tid.z | sub_21E86B0 (opcodes 0x26--0x28) | 3 |
| Block dimensions | %ntid.x, %ntid.y, %ntid.z | sub_21E86B0 (opcodes 0x29--0x2B) | 3 |
| Block ID | %ctaid.x, %ctaid.y, %ctaid.z | sub_21E86B0 (opcodes 0x2C--0x2E) | 3 |
| Grid dimensions | %nctaid.x, %nctaid.y, %nctaid.z | sub_21E86B0 (opcodes 0x2F--0x31) | 3 |
| Warp/lane ID | %warpid, %laneid | sub_21E86B0 (opcodes 0x5E--0x5F, via sub_3958DA0) | 2 |
| Cluster registers (SM 90+) | %cluster_ctarank, %cluster_nctarank, %cluster_ctaid.{x,y,z}, %cluster_nctaid.{x,y,z}, %clusterid.{x,y,z}, %nclusterid.{x,y,z}, %is_explicit_cluster | sub_21E9060 (values 0--14) | 15 |
| Stack pointer | %SP, %SPL | inline in frame setup | 2 |
| Environment regs | ENVREG0--ENVREG31 | internal (not emitted to PTX) | 32 |
Total: 63 reserved registers. These correspond to the physical register set in NVPTX -- recall that NVPTX has no general-purpose physical registers, so the only physical registers are the special hardware-defined ones plus the stack pointer pair.
The environment registers (ENVREG0--ENVREG31) are used internally by the CUDA runtime to pass kernel arguments and configuration data. They are read-only from the kernel's perspective and never appear explicitly in emitted PTX. Their presence in the reserved set is a safety measure against internal IR manipulations that might introduce them as explicit operands.
The Check in Context: Full Merge Decision Sequence
The reserved-register check is the third of four gates in the merge decision path. The complete sequence at 0x2F33B0F--0x2F34300 is:
MergeDecision(MBB1, MBB2, MinTail):
// Gate 1: Instruction comparison
tail_len = ComputeCommonTailLength(MBB1, MBB2)
if tail_len < MinTail:
return REJECT
// Gate 2: Branch analysis feasibility
ok = AnalyzeBranch(MBB1, ...)
if !ok:
return REJECT // unanalyzable terminator (inline asm, etc.)
// Gate 3: Reserved-register check (NVPTX-specific)
for each reg in LiveIns(MBB1[split_point:]) ∪ LiveIns(MBB2[split_point:]):
if sub_2E88A90(TRI, reg, 0x200):
return REJECT // reserved register crosses merge boundary
// Gate 4: Profitability (code size)
overhead = 1 // one branch instruction to CommonTail
if MBB1 needs UpdateTerminator:
overhead += 1
if tail_len <= overhead:
return REJECT // no net code-size reduction
return ACCEPT
Gate 3 iterates every register that would be live-in to the proposed CommonTail block. The live-in set is computed by walking the tail instructions backwards and collecting register uses that have no definition within the tail. If any register in this set has the 0x200 property, the entire merge is rejected -- there is no fallback or partial merge.
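The four-gate sequence can be condensed into a Python sketch. The assumptions are simplified: tail_len and the live-across set arrive as precomputed inputs rather than being derived from real MachineIR, and analyze_branch_ok stands in for the AnalyzeBranch call.

```python
# Sketch of the four-gate merge decision described above.

def merge_decision(tail_len, min_tail, analyze_branch_ok,
                   live_across_boundary, is_reserved, needs_terminator_update):
    if tail_len < min_tail:
        return "REJECT: tail too short"            # Gate 1
    if not analyze_branch_ok:
        return "REJECT: unanalyzable terminator"   # Gate 2
    for reg in live_across_boundary:
        if is_reserved(reg):
            return "REJECT: reserved register"     # Gate 3 (NVPTX-specific)
    overhead = 1 + (1 if needs_terminator_update else 0)
    if tail_len <= overhead:
        return "REJECT: no code-size win"          # Gate 4
    return "ACCEPT"

reserved = {"%tid.x", "%ntid.x", "%ctaid.x"}.__contains__
print(merge_decision(4, 3, True, ["%r1"], reserved, False))     # ACCEPT
print(merge_decision(4, 3, True, ["%tid.x"], reserved, False))  # REJECT: reserved register
```

Note that Gate 3 is all-or-nothing, exactly as described: a single reserved register rejects the whole merge with no partial fallback.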
Interaction with computeLiveIns -- sub_2E16F10
After a merge is accepted and the CommonTail block is created, sub_2E16F10 (computeLiveIns) populates the new block's live-in set. This function must agree with the pre-merge reserved-register check: if the check passed (no reserved registers), then computeLiveIns will produce a live-in set containing only virtual registers and non-reserved physical registers. The function at sub_2E16F10 performs its own filtering:
computeLiveIns(CommonTail):
for each reg in upward_exposed_uses(CommonTail):
if isReserved(reg):
continue // redundant safety -- already filtered by Gate 3
addLiveIn(CommonTail, reg)
The double-check (once in the merge decision, once in computeLiveIns) is a defense-in-depth pattern. The merge decision check prevents the merge from happening at all; the computeLiveIns filter prevents a reserved register from entering the live-in set even if the merge decision check were somehow bypassed (e.g., by a future code change that added a new merge path).
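The redundant filter can be sketched as follows. The register names and the precomputed upward-exposed-use set are illustrative assumptions; the mechanism (skip reserved registers while building the new block's live-ins) follows the pseudocode above.

```python
# Sketch of computeLiveIns with the defense-in-depth reserved-register filter.

RESERVED = {"%tid.x", "%ntid.x", "%ctaid.x", "%SP", "%SPL"}  # illustrative subset

def compute_live_ins(upward_exposed_uses):
    live_ins = []
    for reg in upward_exposed_uses:
        if reg in RESERVED:
            continue  # redundant safety -- Gate 3 should already have rejected this
        live_ins.append(reg)
    return live_ins

# Even if a reserved register somehow slipped past the merge decision,
# it never reaches the new block's live-in set.
print(compute_live_ins(["%r1", "%tid.x", "%r2"]))  # ['%r1', '%r2']
```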
GPU-Specific Considerations
Tail Merging and Warp Divergence
Tail merging on GPU does not interact with warp divergence in the way that branch duplication does. When two blocks A and B both end with the same instruction sequence and share a common successor C, merging the tails into a shared CommonTail block that falls through to C does not change which warps execute which instructions. Every warp that previously executed the tail of A now executes the same instructions in CommonTail; similarly for B. The branch from A (or B) to CommonTail is unconditional and therefore non-divergent by definition.
However, there is one subtle interaction: if A and B are the two sides of a divergent branch, and the tail merge creates CommonTail between them and their common successor C, the reconvergence point may shift. Previously, warps reconverged at C's entry. After the merge, warps reconverge at CommonTail's entry -- which is equivalent but changes the block numbering. StructurizeCFG has already inserted any necessary reconvergence tokens before BranchFolding runs, and those tokens are block-relative. The UpdateTerminator call at sub_2FAD510 and the ReplaceUsesOfBlockWith call at sub_2E0E0B0 update all references, so the reconvergence semantics are preserved.
Code Size vs. Instruction Cache
On GPU, the primary motivation for tail merging is code size reduction, which translates directly to reduced instruction cache pressure. NVIDIA GPUs have small instruction caches per SM partition (32--128 KB depending on architecture generation). Tail merging reduces the number of unique instructions the I-cache must hold.
The tail-merge-size default of 3 reflects the GPU's branch cost: one bra instruction to redirect flow to CommonTail, plus one additional instruction if the predecessor's terminator needs rewriting. With a minimum tail length of 3, the merge always saves at least one instruction's worth of I-cache footprint. On a GPU where each instruction occupies 8--16 bytes (PTX instructions vary in encoding width, but ptxas expands them to fixed-width SASS), a 3-instruction merge saves 24--48 bytes of I-cache per merge site.
The tail-merge-threshold of 150 is generous compared to upstream LLVM's default (also 150 in upstream, but upstream disables the entire mechanism for GPU targets). In practice, GPU kernels rarely have blocks with 150+ predecessors -- the threshold exists primarily to prevent pathological compile times on machine-generated code with massive switch tables.
Structured CFG Preservation Proof
The claim that tail merging preserves structured (reducible) control flow deserves a rigorous argument, since this is the justification for NVIDIA removing the requiresStructuredCFG() gate.
Claim: If the input CFG is reducible, then the CFG after tail merging is also reducible.
Proof sketch: Tail merging performs one operation: it takes two blocks A and B that share a common tail instruction sequence, creates a new block T containing the tail, and replaces the tail portions of A and B with unconditional branches to T. The successors of A and B in the tail (which were the same for both, by construction) become successors of T instead.
Consider the back-edge structure. In a reducible CFG, every cycle has a single entry point (the loop header). Tail merging cannot create a new cycle because:
- T is a new block with no incoming edges except from A and B.
- T's outgoing edges are a subset of the original outgoing edges of A and B's tails.
- No edge into T can form a back-edge of an existing cycle unless A or B was already a back-edge target, in which case the cycle's entry point was A or B, not T.
- The only new edges are A->T and B->T (unconditional). These cannot create a new cycle because T does not dominate A or B (it was just created).
Therefore, no new irreducible cycle is introduced. The disable-nvptx-require-structured-cfg knob (at qword_5022CC8 in NVPTXTargetMachine) provides a backdoor to disable the structured-CFG requirement entirely, but it is false by default and should never be set in production.
Interaction with EH and Cleanup Pads
NVPTX does not support C++ exceptions in the traditional sense -- there is no stack unwinding on GPU. However, cicc does handle cleanup semantics for CUDA cooperative groups and destructor calls. The branch folding pass skips blocks that are EH landing pads (isEHPad() check at the start of OptimizeBlock). On NVPTX, this check is typically a no-op because no blocks are marked as EH pads, but the check remains active because the same binary serves both CUDA and non-CUDA compilation paths.
Interaction with Convergence Control Tokens
On SM 90+ (Hopper and later), cicc emits convergence control pseudo-instructions (bra.convergent, .pragma "convergent") that are consumed by ptxas to guide reconvergence behavior. These pseudo-instructions are MachineInstrs with specific opcodes that BranchFolding must not merge or reorder. The isIdenticalTo comparison in ComputeCommonTailLength considers opcode, operands, and flags, so two convergence control instructions with different target blocks will not compare as identical and will naturally terminate the common-tail scan. This prevents the tail merger from accidentally merging convergence annotations that belong to different reconvergence points.
Data Structures
The MBBInfo structure passed via rdi to sub_2F336B0:
| Offset | Type | Field |
|---|---|---|
| +0x00 | MachineFunction* | Parent function / block list head |
| +0x08 | MachineBasicBlock* | Fallthrough candidate block |
| +0x10 | BranchAnalysisResult* | Cached result from AnalyzeBranch |
| +0x28 | DenseMap<uint, list> | Hash-to-candidate-list merge table |
The pass allocates a 792-byte stack frame holding:
| Stack variable | Purpose |
|---|---|
| var_2E0 | merge_count (number of merges performed) |
| var_309 | modified flag |
| var_30A | should_try_fold flag (initialized to 1) |
| var_224 | Hash table allocated flag |
| var_1E4 | Operand table allocated flag |
Configuration
| Knob | Type | Default | Effect |
|---|---|---|---|
| disable-branch-fold | bool | false | Skips the entire pass |
| enable-tail-merge | tri-state | unset (uses target default) | Force-enable or disable tail merging |
| tail-merge-threshold | unsigned | 150 | Max predecessors considered per merge round; caps MergePotentials size |
| tail-merge-size | unsigned | 3 | Minimum common tail length (in instructions) to justify a merge |
| branch-fold-placement | bool | true | Enables branch folding within MachineBlockPlacement (separate invocation) |
| ifcvt-branch-fold | bool | true | Enables branch folding within the if-converter pass |
The tail-merge-threshold of 150 exists purely as a compile-time throttle. For a block with N predecessors, the pass performs O(N^2) pairwise comparisons within each hash group. Setting the threshold to 0 effectively disables tail merging entirely while keeping branch optimization active.
The tail-merge-size of 3 is the break-even point: creating a new shared block plus a branch instruction costs roughly 2 instructions of overhead, so merging fewer than 3 common instructions produces no net code-size reduction.
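The break-even arithmetic can be made concrete with a small sketch. The cost model here is an assumption consistent with the text (one redirect branch per merged block, plus optional terminator rewrites), not a recovered formula from the binary.

```python
# Sketch of the tail-merge break-even arithmetic described above.

def net_instructions_saved(tail_len, merged_blocks=2, rewrites=0):
    saved = tail_len * (merged_blocks - 1)   # duplicate tail copies removed
    cost = merged_blocks + rewrites          # one branch into CommonTail each
    return saved - cost

print(net_instructions_saved(2))  # 0  -- below tail-merge-size: no net win
print(net_instructions_saved(3))  # 1  -- default minimum: first net saving
```

With more than two merged predecessors the savings grow linearly while the per-block branch cost stays at one, which is why the pass bothers collecting up to 150 candidates per round.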
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| BranchFolder::OptimizeFunction | sub_2F36310 | -- | Pass entry; fixed-point loop over TailMerge + OptimizeBranches + HoistCommonCode |
| BranchFolder::OptimizeBlock / inner logic | sub_2F336B0 | 11,347B | Per-block optimization + tail merge core (792-byte stack frame) |
| HashEndOfMBB | sub_2F26260 | -- | Tail hash computation; hashes last non-debug non-terminator instruction |
| isBranchFoldable | sub_2F31250 | -- | Checks if operand represents a foldable branch target |
| Merge candidate map lookup | sub_2F33020 | -- | Hash table lookup in MergePotentials DenseMap |
| TryTailMergeBlocks | sub_2E2B9F0 | -- | Attempts merge across candidate set; calls Gates 1--4 |
| AnalyzeBranch | sub_2E09D00 | -- | NVPTXInstrInfo branch analysis: type, targets, conditions |
| RemoveBranch | sub_2E0C3B0 | -- | Removes terminator branch instructions from MBB |
| InsertBranch | sub_2E0F080 | -- | Inserts new branch instruction to redirect flow |
| ReplaceTailWithBranchTo | sub_2E0A600 | -- | Splices tail into shared block, inserts unconditional redirect |
| ReplaceUsesOfBlockWith | sub_2E0E0B0 | -- | Updates phi nodes and predecessor lists after merge |
| getBlockNumbered | sub_2E192D0 | -- | MBB number to pointer lookup |
| UpdateTerminator | sub_2FAD510 | -- | Fixes terminators after CFG modification |
| RemoveBlock | sub_2E790D0 | -- | Removes dead MBB from function; updates predecessor/successor lists |
| computeLiveIns | sub_2E16F10 | -- | Updates live-in register sets for merged block; filters reserved registers |
| getVRegDef | sub_2EBEE10 | -- | Virtual register definition lookup |
| hasProperty(flag) | sub_2E88A90 | -- | Multi-purpose register/operand property query (flag 0x200 = reserved, 0x80000 = uniform, 0x100000 = divergent) |
| HashMachineInstr | sub_2E89C70 | -- | Instruction hash for merge candidate bucketing (* 37 multiply-and-add scheme) |
| SpliceBlock | sub_2E31080 | -- | Unlinks MBB from doubly-linked list |
| NVPTXInstrInfo vtable | sub_2DAC790 | -- | Vtable base checked at 0x2F337A3 to validate InstrInfo supports analyzeBranch |
| Dynamic special register resolver | sub_3958DA0 | -- | Resolves opcodes 0x5E/0x5F to %warpid/%laneid |
| Special register emission | sub_21E86B0 | -- | Emits %tid, %ctaid, %ntid, %nctaid (opcodes 0x26--0x31) |
| Cluster register emission (SM 90+) | sub_21E9060 | -- | Emits 15 cluster registers (%cluster_ctarank, %clusterid, etc.) |
Interaction with StructurizeCFG
StructurizeCFG runs during the IR-level pipeline (before SelectionDAG), while BranchFolding runs after register allocation at the machine level. By the time BranchFolding executes, all control flow is already structured and reducible. The key interaction:
- StructurizeCFG may insert "Flow" blocks that serve as reconvergence points. These are often empty or contain only an unconditional branch. BranchFolding's empty-block elimination (step 1 of OptimizeBranches) can remove these if they have become redundant after code generation.
- Tail merging never introduces irreducible control flow because it only adds unconditional branches to a new shared tail block. The new block post-dominates the merged tails, preserving reducibility.
- The branch-fold-placement knob controls a separate invocation of branch folding logic embedded within MachineBlockPlacement. That invocation runs before the standalone BranchFolding pass and performs a limited subset of the same transformations during layout decisions.
Complexity
The hash-based matching makes the typical case efficient. For N blocks and average predecessor count M, the overall complexity is O(N * M) for hash computation, plus O(K^2 * T) for pairwise comparison within hash groups, where K is the number of blocks sharing a hash and T is the common tail length. The tail-merge-threshold caps K at 150. The recursive self-call pattern (the pass re-invokes itself when a merge creates new opportunities) means worst-case is O(N^2) iterations, but this is rare in practice -- most functions converge in 2-3 iterations.
Differences from Upstream LLVM
| Aspect | Upstream LLVM 20 | cicc v13.0 |
|---|---|---|
| Tail merge for structured-CFG targets | Disabled (requiresStructuredCFG() returns true -> tail merge off) | Enabled -- structured-CFG gate removed |
| Reserved-register merge gate | Not present (unnecessary -- tail merge disabled for GPU) | Gate 3: sub_2E88A90 with flag 0x200 rejects merges when special registers are live across the boundary |
| Priority ordering | Candidates sorted by hash only | Additional priority sort within hash groups: unconditional branches first (priority 0), then conditional (priority 2+) |
| NVPTXInstrInfo vtable check | Not present | cmp rax, offset sub_2DAC790 at 0x2F337A3 validates InstrInfo before merge attempts |
| computeLiveIns filtering | No reserved-register filter | Double-filters reserved registers (once at merge decision, once at live-in computation) |
| Convergence control awareness | Not present (no convergence tokens in upstream) | isIdenticalTo naturally prevents merging convergence pseudo-instructions with different targets |
| MachineInstr stride | 32-byte operand stride | 40-byte operand stride (extra 8 bytes for NVPTX-specific metadata) |
| Upstream source | llvm/lib/CodeGen/BranchFolding.cpp | Binary at 0x2F336B0--0x2F36310 range |
Cross-References
- Block Placement -- runs before BranchFolding; its branch-fold-placement knob triggers inline branch folding during layout.
- StructurizeCFG -- guarantees structured control flow before BranchFolding runs; inserts Flow blocks that BranchFolding may later eliminate. Uses the same sub_2E88A90 for divergence queries.
- Register Allocation -- BranchFolding requires the NoPHIs property, meaning it runs post-regalloc in the NVPTX pipeline. The greedy RA at sub_2F5A640 excludes reserved registers from pressure tracking.
- Instruction Scheduling -- scheduling runs after BranchFolding; the final CFG shape from branch folding determines scheduling regions.
- Register Classes -- documents the internal-only off_4A026E0 class ("!Special!") that holds reserved/environment registers. The register encoding function sub_21583D0 fatally aborts on this class.
- PTX Emission -- special register emission functions sub_21E86B0 and sub_21E9060 that handle the 63 reserved registers.
- NVPTX Target Infrastructure -- the disable-nvptx-require-structured-cfg knob that controls the structured-CFG requirement.
- Machine-Level Passes -- pipeline context showing BranchFolding's position after register allocation and before instruction scheduling.
- InstrEmitter -- another consumer of sub_2E88A90 that uses flag bit 36 for NVPTX-specific implicit-use detection.
MachineBlockPlacement for GPU
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
MachineBlockPlacement decides the physical ordering of basic blocks in a MachineFunction. On CPU, it is primarily an I-cache optimization. On GPU, block ordering has deeper consequences: PTX is a structured ISA where every taken branch stalls the SM instruction fetch pipeline, warp divergence must reconverge at post-dominators, and instruction cache capacity is measured in tens of kilobytes per SM partition. cicc carries two separate instances of this pass -- a stock LLVM copy for internal use and an NVPTX-pipeline copy at sub_3521FF0 that participates in GPU-specific analysis. The NVPTX instance queries a divergence flag on the MachineFunction to decide whether tail duplication is profitable, and adds an alternative layout proposal path (sub_34BEDF0 / sub_34C7080) that is absent from upstream LLVM.
Key Facts
| Property | Value |
|---|---|
| Entry point | sub_3521FF0 (82 KB decompiled, 2435 lines) |
| Pass name | "Branch Probability Basic Block Placement" |
| Pass ID | "block-placement" |
| Registration (NVPTX) | sub_350FE30 (pass), sub_350FEE0 (stats) |
| Registration (generic) | sub_1DE8060 (pass), sub_1DE8500 (stats) |
| Stats pass ID | "block-placement-stats", callback sub_3517680 |
| Knob constructor | ctor_671_0 at 0x5A0470 |
| Required analyses | MachineBlockFrequencyInfo, MachineBranchProbabilityInfo, MachinePostDominatorTree, MachineLoopInfo, TargetPassConfig |
Why Block Placement Matters on GPU
Three properties of GPU execution make block ordering non-trivial.
Instruction fetch pipeline. GPU SMs fetch instructions sequentially. A taken branch introduces a fetch bubble -- the warp scheduler cannot issue from the new target until the instruction cache services the request. Every fall-through edge is free; every taken branch costs at least one cycle of fetch latency. The misfetch-cost (default 1) and jump-inst-cost (default 1) knobs model this cost. Maximizing fall-through sequences directly reduces warp stall cycles at branch points.
Instruction cache pressure. GPU instruction caches are small (typically 32-128 KB per SM partition). Code duplication through tail-dup increases I-cache working set. The tail-dup-placement-penalty (default 2%) penalizes code copies that improve fall-through at the expense of I-cache pressure. The ext-TSP model, when enabled, explicitly optimizes for I-cache utilization by modeling forward/backward reference distances.
Warp divergence. When a branch is divergent (different lanes take different paths), all paths must execute serially, and the warp reconverges at the post-dominator. Block ordering cannot eliminate the divergence cost, but it determines which side of the branch falls through vs. takes a jump. The divergence flag at MF+8+688 bit 0 gates whether tail duplication is even attempted: duplicating a tail block that sits below a divergent branch wastes code size because divergent warps execute both paths regardless of which one falls through.
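The fetch-cost model described above can be sketched as a simple layout scoring function. This is a hedged illustration of the misfetch-cost / jump-inst-cost accounting, not recovered code; block names and frequencies are invented:

```python
# Sketch: every CFG edge that is not a fall-through to the next block in
# layout order pays misfetch-cost + jump-inst-cost (both default 1).
MISFETCH_COST = 1
JUMP_INST_COST = 1

def layout_cost(order, edges, freq):
    """order: block ids in layout order; edges: (src, dst); freq: edge -> count."""
    next_in_layout = {order[i]: order[i + 1] for i in range(len(order) - 1)}
    cost = 0
    for (src, dst) in edges:
        if next_in_layout.get(src) != dst:  # taken branch: fetch bubble
            cost += freq[(src, dst)] * (MISFETCH_COST + JUMP_INST_COST)
    return cost

# Diamond CFG: A -> B (hot), A -> C (cold), both rejoin at D
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
freq = {("A", "B"): 90, ("A", "C"): 10, ("B", "D"): 90, ("C", "D"): 10}
print(layout_cost(["A", "B", "D", "C"], edges, freq))  # 40: hot path falls through
print(layout_cost(["A", "C", "D", "B"], edges, freq))  # 360: hot path takes branches
```

The 9x cost difference between the two orderings is exactly what maximizing fall-through sequences buys.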
Pass Object Layout
The pass object at a1 is populated during runOnMachineFunction:
| Offset | Type | Content |
|---|---|---|
| +488 | ptr | Loop chain working data (cleared by sub_35142F0) |
| +520 | MachineFunction* | Current function being processed |
| +528 | ptr | MachineBlockFrequencyInfo* (adjusted +169 from raw analysis pointer) |
| +536 | ptr | MachineBranchProbabilityInfo* (40-byte struct at +200) |
| +544 | ptr | MachinePostDominatorTree* (+200) |
| +552 | u64 | Working state (cleared to 0) |
| +560 | ptr | TargetInstrInfo* (nullptr if default vtable) |
| +568 | ptr | TargetRegisterInfo* (nullptr if default vtable) |
| +576 | ptr | TailDuplicator* (from unk_50209DC analysis, +200) |
| +584 | ptr | MachineLoopInfo* |
| +592 | ptr | TargetPassConfig* |
| +600 | inline | Chain-builder state (initialized by sub_2FD5DC0) |
| +776 | u64 | Profile-derived hot threshold |
| +784 | i32 | Tail-dup threshold (2 or 4) |
| +788 | bool | Profile count was explicitly provided |
| +792 | ptr | Bump allocator base (for chain node allocation) |
| +800 | u64 | Bump allocator capacity |
| +872 | u64 | Bump allocator total allocation counter |
| +888 | struct | Chain-map (BB-to-chain DenseMap, queried via sub_3515040) |
Chain nodes are 64 bytes each, allocated from the bump allocator:
struct ChainNode { // 64 bytes
MachineBasicBlock** bb_array; // +0: pointer to BB array (initially +16)
uint32_t count; // +8: number of BBs in chain
uint32_t capacity; // +12: capacity (initial: 1)
MachineBasicBlock* inline_bb; // +16: inline storage for single-BB chain
uint8_t padding[24]; // +24: space for up to 3 more inline BBs
void* chain_map; // +48: pointer to parent chain-map
uint64_t flags; // +56: chain flags
};
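The inline-then-heap growth pattern of the chain node's BB array can be modeled as below. This is a sketch of the behavior described (inline capacity 1, migration to a separate allocation on the second append); the class and method names are invented:

```python
# Sketch: ChainNode's bb_array initially points at the node's own inline
# slot (+16); appending a second BB triggers a heap allocation (the role
# sub_C7D6A0 plays in the binary) and the array migrates off the node.
class ChainNode:
    def __init__(self, bb):
        self.inline = [bb]        # models the +16 inline storage
        self.bbs = self.inline    # bb_array initially -> inline storage
        self.capacity = 1

    def append(self, bb):
        if len(self.bbs) == self.capacity:
            self.capacity *= 2
            if self.bbs is self.inline:       # first growth: copy out of inline
                self.bbs = list(self.inline)  # models the heap allocation
        self.bbs.append(bb)

node = ChainNode("bb0")
node.append("bb1")
print(node.bbs, node.bbs is node.inline)  # ['bb0', 'bb1'] False
```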
Algorithm Overview
The entry point sub_3521FF0 dispatches to one of two layout algorithms: the standard chain-based placement, or the ext-TSP layout when explicitly enabled. The overall flow:
runOnMachineFunction(MF):
if MF.empty(): return 0
// Fetch analyses
MBFI = getAnalysis<MachineBlockFrequencyInfo>()
MBPI = getAnalysis<MachineBranchProbabilityInfo>()
MPDT = getAnalysis<MachinePostDominatorTree>()
MLI = getAnalysis<MachineLoopInfo>()
TPC = getAnalysis<TargetPassConfig>()
TII = MF.getSubtarget().getInstrInfo()
TRI = MF.getSubtarget().getRegisterInfo()
// Compute tail-dup threshold
threshold = computeTailDupThreshold(optLevel, TII)
// Decide layout algorithm
if enable-ext-tsp-block-placement AND MF.size() fits:
applyExtTsp(MF)
else:
buildChains(MF) // sub_3521900
tailDupPlacement(MF) // sub_35185B0 (if enabled + not divergent)
tryAlternativeLayout(MF) // sub_34BEDF0 + sub_34C7080 (NVIDIA addition)
// Post-placement
optimizeBranches() // flip branches for fall-through
alignBlocks() // sub_3516980
cleanup()
return 1
Chain-Based Placement (Standard Path)
sub_3521900 (buildChains) is the workhorse. It operates in four steps.
Step 1 -- Initial Chain Construction
For every BB in the MachineFunction (iterated via the doubly-linked intrusive list from MF+328 to sentinel MF+320), the builder:
- Allocates a 64-byte chain node from the bump allocator at pass+792. The node is initialized with count=1, capacity=1, the inline BB pointer set to the current BB, and the chain-map pointer set to pass+888.
- Inserts the BB-to-chain mapping into the chain-map via sub_3515040 (DenseMap insert with pointer hash ((ptr >> 9) ^ (ptr >> 4)) & (bucket_count - 1)).
- Attempts to extend the chain forward: calls TII->analyzeBranch() (vtable+344) on the current BB. If analyzable and a fall-through successor exists, calls sub_2E32580 to verify the successor is valid for chaining (not already claimed by a different chain, not a landing pad, not the function entry if it would create a cycle). If valid, the successor is appended to the chain's BB array (growing from inline storage to heap allocation via sub_C7D6A0 when needed), and the walk continues from the successor.
The result is a set of maximal fall-through chains -- each chain represents a sequence of BBs where every transition is a fall-through edge according to analyzeBranch.
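Step 1 can be condensed into the following sketch, where `fallthrough(bb)` stands in for the combined analyzeBranch + sub_2E32580 validity check and returns the valid fall-through successor or None:

```python
# Sketch of Step 1: partition blocks into maximal fall-through chains.
# A successor already claimed by another chain terminates the walk.
def build_chains(blocks, fallthrough):
    claimed, chains = set(), []
    for bb in blocks:
        if bb in claimed:
            continue
        chain = [bb]
        claimed.add(bb)
        succ = fallthrough(bb)
        while succ is not None and succ not in claimed:
            chain.append(succ)
            claimed.add(succ)
            succ = fallthrough(succ)
        chains.append(chain)
    return chains

# D's fall-through (B) is already claimed, so D forms a singleton chain
ft = {"A": "B", "B": "C", "C": None, "D": "B"}
print(build_chains(["A", "B", "C", "D"], ft.get))  # [['A', 'B', 'C'], ['D']]
```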
Step 2 -- Loop Chain Merging
Read the MachineLoopInfo structure at pass+584. Iterate loops from innermost outward. For each loop, call sub_351EBB0 (buildLoopChains), which:
- Identifies all chains that contain BBs belonging to the loop.
- Merges these chains into a single loop chain, ordering them to maximize fall-through within the loop body.
- Applies loop rotation via sub_351C710 (rotateLoop) to place the exiting block at the bottom, making the back-edge a fall-through and the exit a taken branch (or the reverse, whichever minimizes cost according to the profile data).
Cold blocks within the loop (where loop_freq / block_freq > loop-to-cold-block-ratio) are ejected from the loop chain and will be placed at the function's end during the commit step.
Step 3 -- Global Successor Ordering
Call sub_35157A0 (selectBestSuccessor) for each BB to find the globally best successor chain ordering. The selection considers:
- Edge probability from sub_2E441D0 (getEdgeProbability)
- Whether the successor is already the fall-through (free) or would require a taken branch (cost = misfetch-cost + jump-inst-cost)
- Whether chaining the successor would break an existing profitable chain connection
Then sub_351D700 (buildChainForBlock) performs a greedy walk from the function entry, building the top-level chain by repeatedly selecting the best unchained successor and appending it.
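A hedged sketch of the selection criteria: edge probability, minus a penalty when the candidate is not already the fall-through. The 0.1 penalty constant is invented for illustration (standing in for misfetch-cost + jump-inst-cost scaled into probability space), and the "breaks an existing profitable chain" check is omitted:

```python
# Sketch of selectBestSuccessor-style scoring; names and the penalty
# constant are assumptions, not recovered values.
def select_best_successor(bb, succs, prob, fallthrough, chained):
    best, best_score = None, float("-inf")
    for s in succs[bb]:
        if s in chained:
            continue  # already placed in another chain
        score = prob[(bb, s)]
        if fallthrough.get(bb) != s:
            score -= 0.1  # hypothetical stand-in for the taken-branch cost
        if score > best_score:
            best, best_score = s, score
    return best

succs = {"A": ["B", "C"]}
prob = {("A", "B"): 0.6, ("A", "C"): 0.4}
print(select_best_successor("A", succs, prob, {"A": "C"}, set()))   # B
print(select_best_successor("A", succs, prob, {"A": "C"}, {"B"}))   # C
```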
Step 4 -- Commit
Walk the final chain's BB array and splice each BB into position using intrusive-list pointer swaps on the MachineFunction's BB list (pointer updates at BB+0 and BB+8 -- the prev/next pointers of the doubly-linked list).
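The commit step's intrusive-list surgery can be sketched as plain doubly-linked pointer swaps (modeling the prev/next pointers at BB+0 and BB+8; node names are invented):

```python
# Sketch: splicing a BB into its final position on a sentinel-terminated
# circular doubly-linked list, as the commit step does on the MF's BB list.
class Node:
    def __init__(self, name):
        self.name, self.prev, self.next = name, None, None

def unlink(n):
    n.prev.next, n.next.prev = n.next, n.prev

def insert_after(pos, n):
    n.prev, n.next = pos, pos.next
    pos.next.prev = n
    pos.next = n

# Sentinel-terminated list: S <-> A <-> B <-> C <-> back to S
s, a, b, c = Node("S"), Node("A"), Node("B"), Node("C")
seq = [s, a, b, c]
for i, n in enumerate(seq):
    n.next, n.prev = seq[(i + 1) % 4], seq[(i - 1) % 4]

unlink(c)            # chosen layout moves C ahead of B
insert_after(a, c)

cur, out = s.next, []
while cur is not s:
    out.append(cur.name)
    cur = cur.next
print(out)  # ['A', 'C', 'B']
```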
Ext-TSP Layout (Optional Path)
When enable-ext-tsp-block-placement is true (default: false), the pass uses the Extended Travelling Salesman Problem formulation from LLVM's CodeLayout.h. This is a profile-guided model that explicitly optimizes I-cache utilization by penalizing backward references and rewarding fall-through edges.
The ext-TSP path builds a BB index hash-map using LLVM's DenseMap pattern (hash: (ptr >> 9) ^ (ptr >> 4), 75% load factor), computes block frequencies and edge weights, then runs three solver functions:
| Function | Role |
|---|---|
| sub_29BAF70 | calcExtTspScore() -- score the original layout |
| sub_29BAC40 | calcExtTspScore() -- score the alternative layout |
| sub_29BB2B0 | computeExtTspLayout() -- reorder chains by ext-TSP objective |
The pass compares original vs. reordered cost and commits the better ordering via sub_3519A10 (applyBlockOrder). Additional ext-TSP tuning knobs (registered in ctor_492 at 0x5545a0):
| Knob | Description |
|---|---|
| ext-tsp-forward-weight-cond / uncond | Weight for conditional/unconditional forward jumps |
| ext-tsp-backward-weight-cond / uncond | Weight for conditional/unconditional backward jumps |
| ext-tsp-fallthrough-weight-cond / uncond | Weight for fall-through edges |
| ext-tsp-forward-distance / backward-distance | Distance thresholds for cache modeling |
| ext-tsp-max-chain-size | Maximum chain size for ext-TSP merging |
| ext-tsp-chain-split-threshold | Threshold for splitting chains |
| ext-tsp-max-merge-density-ratio | Density ratio cap for chain merges |
| ext-tsp-apply-without-profile | Run ext-TSP even without PGO data |
| cdsort-cache-entries / cache-size | CDSort cache model parameters |
| cdsort-max-chain-size | CDSort chain size limit |
| cdsort-distance-power / frequency-scale | CDSort cost model tuning |
NVIDIA-Specific Modifications
Divergence-Gated Tail Duplication
The most significant GPU-specific behavior is the divergence check before tail duplication. Before invoking sub_35185B0 (tailDupPlacement), the pass reads MF+8+688 bit 0 -- a flag set by earlier divergence analysis passes indicating the function contains warp-divergent branches. When this bit is set, tail duplication is skipped entirely.
The rationale: tail duplication creates an additional copy of a basic block to convert a diamond-shaped CFG into a straight-line fall-through. On CPU, this eliminates a taken branch on the hot path. On GPU with divergent branches, both sides of the diamond execute regardless (the warp mask simply toggles), so duplicating the tail block doubles code size for zero fall-through benefit. The divergence flag is a conservative gate -- it disables tail-dup for the entire function, not per-branch.
Alternative Layout Proposal Algorithm
When the standard chain-based path is selected (not ext-TSP), and the function has more than 3 basic blocks with profile data and is not marked divergent, the pass runs a complete alternative layout evaluation through a pipeline absent from upstream LLVM. This is one of cicc's most significant code-layout additions.
Activation Gate
if (byte_503C568 is set AND MF.size() > 3):
evaluator = sub_34BEDF0(state, profile_flag, MBFI, TII, MBPI)
changed = sub_34C7080(evaluator, MF, chain_data, ...)
if changed:
commit(evaluator_layout)
The gate variable byte_503C568 corresponds to the branch-fold-placement knob (default true). When branch-fold-placement is active and the function has enough basic blocks to justify the extra analysis cost, the alternative path fires.
State Object Initialization -- sub_34BEDF0 (321 bytes)
sub_34BEDF0 is a constructor that initializes a 0x100-byte evaluator state object. It takes six arguments: (rdi=state, rsi=profile_available, rdx=?, rcx=MBFI*, r8=TII*, r9=MBPI*). The initialization zeroes the majority of the structure and sets up internal storage pointers:
struct LayoutEvaluatorState { // 0x100 bytes, initialized by sub_34BEDF0
void* bb_array_ptr; // +0x00: BB ordering array (initially null)
uint64_t bb_array_size; // +0x08: count
uint64_t bb_array_cap; // +0x10: capacity
uint64_t iteration_count; // +0x18: cleared to 0
void* inline_storage_ptr; // +0x20: points to +0x38 (inline array)
uint64_t initial_capacity; // +0x28: set to 2
uint32_t current_count; // +0x30: set to 0
uint8_t is_fresh; // +0x34: set to 1 (first-run flag)
uint8_t padding[3]; // +0x35
uint8_t inline_array[72]; // +0x38: inline storage for small chains
uint8_t profile_available; // +0x80: bit 0 = profile flag
uint8_t force_mode; // +0x81: set from qword_503AD08 knob
uint8_t divergence_aware; // +0x82: set from dl argument
uint8_t needs_reconverge; // +0x83: cleared to 0, set during evaluation
uint32_t bb_limit; // +0x84: from stack argument (BB count cap)
// +0x88..+0xA8: five qword slots, all zeroed
void* bb_ptr_array; // +0xB0: points to +0xC8 (inline)
uint64_t bb_ptr_array_pad; // +0xB8: cleared
uint64_t bb_ptr_array_cap; // +0xC0: set to 8
uint8_t bb_ptr_inline[24]; // +0xC8: inline BB pointer storage
uint64_t total_cost; // +0xD8: cleared
uint32_t cost_flags; // +0xE0: cleared
void* mbfi_ptr; // +0xE8: MachineBlockFrequencyInfo*
void* tii_ptr; // +0xF0: TargetInstrInfo*
void* mbpi_ptr; // +0xF8: MachineBranchProbabilityInfo*
};
The force_mode field at offset +0x81 is set based on the global qword_503AD08. When this global equals 0, the force mode takes the profile_available argument. When it equals 1, force mode is unconditionally set to 1 (always evaluate). Any other value causes a straight return (skip evaluation). This provides a three-way override: 0=auto, 1=always, other=never.
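The three-way override reduces to the following sketch (function name invented; the semantics are as described above):

```python
# Sketch of the qword_503AD08 override: 0 = auto (follow the profile flag),
# 1 = always evaluate, any other value = skip evaluation (straight return).
def resolve_force_mode(global_knob, profile_available):
    if global_knob == 0:
        return profile_available   # auto
    if global_knob == 1:
        return True                # always evaluate
    return None                    # never: constructor returns early

print(resolve_force_mode(0, False))  # False
print(resolve_force_mode(1, False))  # True
print(resolve_force_mode(2, True))   # None
```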
Dispatch Wrapper -- sub_34C7080 (17 bytes)
sub_34C7080 is a thin guard:
// sub_34C7080(rdi=evaluator, rsi=MF, rdx=chain_data, rcx=..., r8=..., r9=changed_flag)
if (rdx == NULL) return 0; // no chain data -> nothing to evaluate
return sub_34C6AF0(rdi, rsi, rdx, rcx, r8, (bool)r9);
The NULL check on rdx (the chain-data pointer) provides a fast exit when the chain builder produced no intermediate state worth re-evaluating.
Core Layout Evaluator -- sub_34C6AF0 (1419 bytes)
sub_34C6AF0 is the real body of the alternative layout evaluator. It operates on the evaluator state object (from sub_34BEDF0) and the MachineFunction, performing a complete re-evaluation of the chain-based layout against a different cost model. The algorithm proceeds in six steps:
Step 1 -- Iteration counter and hash table reset.
Increment the iteration count at state+0x18. If the hash table at state+0x20 is not fresh (byte at state+0x34 is 0), compute a minimum table size as max(32, 4 * (capacity - count)), and if the current table is undersized, fill it with 0xFF sentinels via memset. This hash table tracks which BBs have been visited during the current evaluation pass.
Step 2 -- State initialization from MachineFunction.
Clear the running cost accumulator at state+0x2C..+0x30. Read the first BB from the MachineFunction's chain data. Store the chain data pointer, the iteration limit from state+0x84, and the analysis pointers (MBPI at state+0x98, TII at state+0xA0) into the evaluator's working slots.
Read a subtarget field at MF->getSubtarget() + 0x220 and subtract 0x2A (decimal 42). This produces an SM-generation index (sm_70=0, sm_75=1, sm_80=2, ..., sm_90=6, sm_100=16 under this encoding). The index selects which cost table row is used for the fetch-penalty model.
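The index computation is trivial, but the raw field values it implies are worth spelling out. The raw bytes below are back-derived from the recovered mapping (an RE inference, not directly observed):

```python
# Sketch: SM-generation index = raw subtarget field - 0x2A.
def sm_index(raw_subtarget_field):
    return raw_subtarget_field - 0x2A

# Raw values implied by the recovered mapping (assumed, for illustration):
assert sm_index(0x2A) == 0    # sm_70
assert sm_index(0x2B) == 1    # sm_75
assert sm_index(0x2C) == 2    # sm_80
assert sm_index(0x30) == 6    # sm_90
assert sm_index(0x3A) == 16   # sm_100
```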
Step 3 -- Divergence-aware block scanning.
For the first BB in the chain, check bit 2 of the flags at BB[0]+0x158. If set, dispatch to TII->vtable+0x210 (which is compared against sub_2FF52D0 -- the default stub). If the target overrides this vtable slot, call the override with the MachineFunction to determine if the block needs special handling. When the default is in use, set state+0x83 (needs_reconverge) to 1 unconditionally. This appears to be an NVPTX check for whether the block is in a reconvergence region where layout ordering has correctness implications, not just performance.
Step 4 -- Main evaluation loop.
Call sub_34BA1B0 to snapshot the current chain state into a temporary structure on the stack. Then enter the main loop:
while (true):
status = sub_34C4890(state, MF) // advance to next BB in evaluation order
changed_bit = (state->profile_available XOR 1) OR status
if changed_bit == 0:
// Reached the evaluation boundary without changes
if (sm_index <= 1): // sm_70 or sm_75
check qword_503AA68 knob // additional gate for older archs
if set: call sub_34C0690(state, loop) for each loop in MF
if state->divergence_aware:
call sub_34C56D0(state, loop) for each loop in MF
break if no further changes
// A change was proposed
call sub_34C2D70(state, MF) // apply the proposed reordering step
accumulate changed flags
if (sm_index <= 1): // sm_70/sm_75
check qword_503AA68 knob
if set: call sub_34C0690 for each loop
if state->divergence_aware:
call sub_34C56D0 for each loop
if changed_this_iteration:
continue loop
sub_34C4890 advances through the MachineFunction's basic blocks in frequency-priority order, proposing a reordering when a higher-frequency successor is not the current fall-through. sub_34C2D70 performs the actual chain manipulation to implement the proposed swap.
Step 5 -- Loop-level re-evaluation.
The calls to sub_34C56D0 (5137 bytes, called from sub_34C6AF0 via the loop-iteration path at 0x34C6E90) perform loop-level cost re-evaluation. This function:
- Walks the MachineFunction's loop tree (from MF+0x148, the MachineLoopInfo block list)
- For each loop, evaluates whether the proposed layout improves or degrades the loop body's fall-through density
- Calls sub_34C0EE0 for block-level cost queries
- Calls sub_34BE7F0 for chain adjacency analysis
- Queries sub_2E88AF0 (divergence analysis) and sub_2E88FE0 for convergence properties
- Uses sub_2FDC710 / sub_2FDC700 for target-specific cost overrides via the TII vtable
- Calls sub_3509790 for reconvergence point identification
sub_34C0690 (called on the sm_70/sm_75 path gated by qword_503AA68) is a lighter variant that omits the divergence-aware sub-evaluations, appropriate for older SM architectures where divergence reconvergence is handled differently.
Step 6 -- Final cost comparison and bitvector scan.
After the evaluation loop terminates, build a bitvector tracking which BBs changed position. The bitvector uses 64-bit words with word index = bb_index >> 6 and bit position = bb_index & 63. Walk the MachineFunction's loop tree blocks (MF+0x148 linked list):
- For each block in the loop, walk the instruction list starting at BB+0x20
- For each instruction, mask the opcode with 0xFFFFFF and compute opcode * 5 as a stride
- If the instruction byte at offset 0 is 0x08 (a branch instruction), set the corresponding bit in the bitvector

Then scan the bitvector against the evaluator's proposed ordering to detect any BB that would need to move. If at least one BB is displaced, set the return flag.
On the final cost-comparison path (at 0x34C6FD3), the evaluator reads TII->vtable+0x5D8 and compares against sub_2FDC810. If the target overrides this slot, the override is called to provide a final accept/reject decision. Otherwise, a default threshold of 3 is used: the proposed layout is accepted only if the cost reduction exceeds the acceptance threshold. The stat-based knobs at dword_503AAC8 and qword_503AB48 provide tuning for the threshold lookup via the sub_C52410/sub_C959E0 statistics infrastructure.
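The bitvector indexing scheme above is the standard 64-bit-word layout; a minimal sketch:

```python
# Sketch of the Step 6 bitvector: word index = bb_index >> 6,
# bit position = bb_index & 63.
def bv_set(words, idx):
    words[idx >> 6] |= 1 << (idx & 63)

def bv_test(words, idx):
    return (words[idx >> 6] >> (idx & 63)) & 1

words = [0, 0]  # covers 128 block indices
for i in (3, 63, 64, 100):
    bv_set(words, i)
print(bv_test(words, 63), bv_test(words, 64), bv_test(words, 65))  # 1 1 0
```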
Shared Infrastructure with Register Allocation
A surprising discovery: sub_34BEDF0 and sub_34C7080/sub_34C6AF0 are also called from sub_34ED530 (RegAllocGreedy, 91KB) via sub_34F1190. The register allocator uses the same layout evaluator to assess whether a spill-induced block split would degrade code layout quality. This sharing means the cost model is consistent between register allocation decisions and post-RA block placement, preventing the two passes from working at cross purposes. The evaluator state is separate per invocation (stack-allocated), so there is no state leakage between the two callers.
SM-Generation-Dependent Behavior
The SM index computation ((MF->getSubtarget()+0x220) - 0x2A) creates generation-dependent behavior:
| SM Generation | Index | Loop Evaluator | Divergence Sub-Eval |
|---|---|---|---|
| sm_70 (Volta) | 0 | sub_34C0690 if qword_503AA68 | Only if divergence flag |
| sm_75 (Turing) | 1 | sub_34C0690 if qword_503AA68 | Only if divergence flag |
| sm_80+ (Ampere+) | 2+ | Skipped (only sub_34C56D0) | Always if divergence flag |
This split reflects the hardware difference: Volta and Turing use a stack-based reconvergence mechanism that benefits from the lighter sub_34C0690 analysis, while Ampere and later use the uniform warp scheduler where the more thorough sub_34C56D0 evaluation is worthwhile.
Dual Pass Registration
The binary contains two complete instances of MachineBlockPlacement:
| Instance | Registration | Purpose |
|---|---|---|
| sub_350FE30 (NVPTX) | NVPTX backend pipeline | GPU-specific analysis results, divergence-aware |
| sub_1DE8060 (generic) | Default LLVM pipeline | Standard pass for any non-GPU path |
Having a separate NVPTX instance allows NVIDIA to control pass ordering independently. The NVPTX version is inserted at a specific point in the backend pipeline where divergence analysis results are available.
Target Tail-Dup Threshold Override
The tail-dup threshold (how many instructions a tail block can have before duplication is rejected) is determined by a multi-level decision:
default_threshold = 2 // tail-dup-placement-threshold
aggressive_threshold = 4 // tail-dup-placement-aggressive-threshold
if TII->getTailDupThreshold(optLevel) overrides: // vtable+1488
threshold = TII_override // NVPTX can take full control
elif optLevel > 2 (-O3):
threshold = aggressive_threshold // 4
else:
threshold = default_threshold // 2
The default stub at sub_2FDC800 returns 2 * ((optLevel > 2) + 1), i.e., 2 at -O2 and 4 at -O3. If NVPTX's TargetInstrInfo overrides this (the pass explicitly checks whether the vtable slot points to sub_2FDC800), the override takes full control. This allows the NVPTX backend to set a different tail-dup aggressiveness based on SM generation or kernel properties.
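The default stub's arithmetic is compact enough to reproduce directly:

```python
# Sketch of sub_2FDC800: 2 * ((optLevel > 2) + 1), i.e. 2 below -O3, 4 at -O3.
def default_tail_dup_threshold(opt_level):
    return 2 * ((opt_level > 2) + 1)

print(default_tail_dup_threshold(2))  # 2  (-O2)
print(default_tail_dup_threshold(3))  # 4  (-O3)
```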
Loop Rotation and Header Placement
Loop rotation (sub_351C710, called from buildLoopChains) determines whether the loop header is placed at the top or bottom of the loop chain. The goal is to place the exiting block at the bottom so the back-edge is a fall-through and the exit is a taken branch (or vice versa, whichever is more profitable).
Two rotation strategies exist:
Basic rotation (default): Place the exiting block last. Skip rotation if the header already has a viable fall-through from outside the loop, unless the exit edge frequency exceeds the fall-through frequency. This avoids introducing an unnecessary branch at loop entry.
Profile-guided rotation (precise-rotation-cost): Enumerate all possible rotations, compute fall-through cost for each (missed fall-through from loop entry, missed fall-throughs at exit points, missed back-edge fall-through), and select the rotation with minimum total cost. Controlled by two knobs:
- precise-rotation-cost (default false): enable the profile-guided rotation cost model
- force-precise-rotation-cost (default false): force it even without good profile data
For GPU kernels where loops are the dominant compute pattern, correct loop rotation determines whether the loop body executes as a straight fall-through sequence or requires a taken back-edge branch every iteration. Since the misfetch-cost is low (default 1), the benefit is modest per iteration but accumulates over millions of iterations typical in GPU compute.
Hot/Cold Splitting
cicc does not perform function-level hot/cold splitting. This is expected: GPU kernels are designed for all threads in a warp to execute the same path. There is no equivalent of a CPU "cold" exception handler that should be placed far from hot code. The loop-to-cold-block-ratio knob (default 5) does enable outlining individual cold blocks from loop chains -- moving them to the end of the function -- but this is intra-function block reordering, not function splitting.
The knob force-loop-cold-block (default false) forces cold block outlining from loops regardless of the frequency ratio. When loop_freq / block_freq > loop-to-cold-block-ratio, the block is moved out of the loop chain to reduce the loop body's I-cache footprint.
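The outlining decision combines the ratio test with the force knob; a minimal sketch under the defaults stated above:

```python
# Sketch: a loop block is outlined when loop_freq / block_freq exceeds
# loop-to-cold-block-ratio (default 5), or always when force-loop-cold-block.
def should_outline(loop_freq, block_freq, ratio=5, force=False):
    return force or loop_freq / block_freq > ratio

print(should_outline(1000, 300))              # False: block runs often enough
print(should_outline(1000, 100))              # True: 10x colder than the loop
print(should_outline(1000, 900, force=True))  # True: forced by the knob
```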
Post-Placement Passes
After layout is committed, two post-processing steps run:
Branch optimization. Walk the final BB ordering. For each analyzable branch with profile info, check whether reversing the branch direction would improve fall-through. Call TII->reverseBranchCondition() (vtable+880) to flip the condition, then update the branch targets via vtable+360/368. This is controlled by sub_2EE6AD0 which checks profitability by comparing edge costs with sub_2E441D0 (getEdgeProbability).
Block alignment (sub_3516980). Walk each BB and set alignment based on block frequency, loop depth, and whether the block is a fall-through target. Controlled by:
- align-all-blocks (default 0): force log2 alignment on every block
- align-all-nofallthru-blocks (default 0): force alignment on blocks without fall-through predecessors
- max-bytes-for-alignment (default 0): cap padding bytes
On GPU, block alignment is generally not useful -- PTX does not expose alignment constraints on basic blocks, and the hardware instruction fetch unit does not benefit from aligned block boundaries the way a CPU I-cache line does.
Configuration Knobs
All knobs are LLVM-standard with stock defaults. The NVIDIA delta is behavioral, not configurational.
| Knob | Type | Default | Effect |
|---|---|---|---|
| disable-block-placement | bool | false | Disable the pass entirely |
| enable-block-placement-stats | bool | false | Collect placement statistics |
| tail-dup-placement | bool | true | Enable tail duplication during placement |
| tail-dup-placement-threshold | int | 2 | Max instructions for tail-dup candidate |
| tail-dup-placement-aggressive-threshold | int | 4 | Aggressive threshold at -O3 |
| tail-dup-placement-penalty | int | 2 | I-cache pressure penalty (percent) |
| tail-dup-profile-percent-threshold | int | 50 | Min hot-count percentage for profile-guided tail-dup |
| triangle-chain-count | int | 2 | Consecutive triangles before triangle heuristic activates |
| branch-fold-placement | bool | true | Fold branches during placement |
| misfetch-cost | int | 1 | Taken-branch fetch penalty |
| jump-inst-cost | int | 1 | Cost of a jump instruction |
| block-placement-exit-block-bias | int | 0 | Frequency percentage for loop exit replacement |
| loop-to-cold-block-ratio | int | 5 | Ratio threshold for cold block outlining |
| force-loop-cold-block | bool | false | Force outlining cold blocks from loops |
| precise-rotation-cost | bool | false | Profile-guided loop rotation cost |
| force-precise-rotation-cost | bool | false | Force precise rotation cost |
| align-all-blocks | int | 0 | Force block alignment (log2) |
| align-all-nofallthru-blocks | int | 0 | Force alignment on non-fall-through blocks |
| max-bytes-for-alignment | int | 0 | Max padding for alignment |
| enable-ext-tsp-block-placement | bool | false | Enable ext-TSP layout algorithm |
| ext-tsp-block-placement-max-blocks | int | -1 | Max BB count for ext-TSP (unlimited) |
| apply-ext-tsp-for-size | bool | false | Use ext-TSP for code size optimization |
| renumber-blocks-before-view | bool | false | Renumber BBs before dot-graph output |
DenseMap Implementation Pattern
The pass uses LLVM's DenseMap for BB-to-chain and BB-to-index lookups. The open-addressing hash-map pattern appears 20+ times in the decompiled code:
// Hash function for pointer keys
size_t hash = ((ptr >> 9) ^ (ptr >> 4)) & (bucket_count - 1);
// Probing: linear with increment counter
// Empty sentinel: 0xFFFFFFFFFFFFF000 (-4096)
// Deleted sentinel: 0xFFFFFFFFFFFFE000 (-8192)
// Rehash trigger: 4 * (count + 1) >= 3 * bucket_count (75% load)
// Rehash function: sub_2E3E470(map, new_capacity)
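The pattern above can be turned into a small executable model. This is a sketch of the described behavior (hash, sentinels, probing with an increment counter, 75% load-factor rehash), not the decompiled code; deletion handling is simplified since the placement maps never erase entries mid-pass:

```python
# Model of the DenseMap pattern: open addressing over pointer keys.
EMPTY = -4096 & 0xFFFFFFFFFFFFFFFF    # 0xFFFFFFFFFFFFF000
DELETED = -8192 & 0xFFFFFFFFFFFFFFFF  # 0xFFFFFFFFFFFFE000

def ptr_hash(ptr, bucket_count):
    return ((ptr >> 9) ^ (ptr >> 4)) & (bucket_count - 1)

class DenseMapModel:
    def __init__(self, bucket_count=32):
        self.keys = [EMPTY] * bucket_count
        self.vals = [None] * bucket_count
        self.count = 0

    def insert(self, key, val):
        if 4 * (self.count + 1) >= 3 * len(self.keys):  # 75% load: rehash
            self._grow()
        i, step = ptr_hash(key, len(self.keys)), 1
        while self.keys[i] not in (EMPTY, DELETED, key):
            i = (i + step) & (len(self.keys) - 1)  # probe with increment counter
            step += 1
        if self.keys[i] != key:
            self.count += 1
        self.keys[i], self.vals[i] = key, val

    def get(self, key):
        i, step = ptr_hash(key, len(self.keys)), 1
        while self.keys[i] != EMPTY:
            if self.keys[i] == key:
                return self.vals[i]
            i = (i + step) & (len(self.keys) - 1)
            step += 1
        return None

    def _grow(self):
        old = [(k, v) for k, v in zip(self.keys, self.vals)
               if k not in (EMPTY, DELETED)]
        self.keys = [EMPTY] * (len(self.keys) * 2)
        self.vals = [None] * len(self.keys)
        self.count = 0
        for k, v in old:
            self.insert(k, v)

m = DenseMapModel()
m.insert(0x7F0000001000, "chain0")
m.insert(0x7F0000002000, "chain1")
print(m.get(0x7F0000001000), m.get(0x7F0000003000))  # chain0 None
```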
GPU-Specific Placement Considerations
Why This Pass Matters More on GPU Than on CPU
On CPU, MachineBlockPlacement is primarily an I-cache optimization -- placing hot blocks contiguously reduces cache misses. On GPU, the stakes are higher for three reasons:
- No branch prediction. GPU SMs do not speculate. Every taken branch is a guaranteed fetch stall. The ratio of taken branches to fall-throughs directly translates to warp scheduler utilization. Optimal block placement can eliminate 10-30% of fetch bubbles in branch-heavy kernels.
- Instruction cache is tiny and shared. A single SM partition has 32-128 KB of instruction cache shared across all active warps. Code duplication (tail-dup, loop unrolling) competes with warp occupancy for this shared resource. The tail-dup-placement-penalty (2%) is conservative -- on kernels with high warp counts, even small code size increases can cause I-cache thrashing.
- Reconvergence is layout-sensitive. On architectures before Ampere (sm_70, sm_75), the stack-based reconvergence mechanism depends on the post-dominator being reachable from both sides of a divergent branch. Block placement that separates a post-dominator from its divergent predecessors can increase the live warp state, consuming scarce convergence stack entries. The alternative layout evaluator's sub_34C0690 path specifically addresses this by evaluating reconvergence distance.
Structured Control Flow Constraint
Unlike CPU backends where block placement has complete freedom, the NVPTX backend runs StructurizeCFG before MachineBlockPlacement. This means:
- All irreducible control flow has already been eliminated
- Structured regions (loops, if-then-else diamonds) are contiguous in the CFG
- Block placement cannot violate structured region boundaries without re-structurizing
This constraint actually simplifies placement in some cases (fewer valid orderings to consider) but eliminates certain profitable reorderings that would be legal on CPU (e.g., outlining a cold exception handler to a distant location that breaks region contiguity).
Interaction with PTX Emission
The final block ordering directly determines which branches in the PTX output are bra instructions (taken) vs. fall-throughs (implicit). The AsmPrinter (see AsmPrinter) emits bra only for non-fall-through edges. Since ptxas performs its own block scheduling on the PTX input, the cicc block ordering serves as a strong hint rather than a final answer -- but ptxas generally respects the input ordering for blocks within the same structured region.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| runOnMachineFunction | sub_3521FF0 | 82 KB | Entry point |
| buildChains | sub_3521900 | -- | Initial chain construction |
| tailDupPlacement | sub_35185B0 | -- | Tail-dup-aware chain merging |
| applyBlockOrder | sub_3519A10 | -- | Commit final BB ordering to MF |
| alignBlocks | sub_3516980 | -- | Post-placement alignment |
| buildLoopChains | sub_351EBB0 | -- | Loop-aware chain merging |
| buildChainForBlock | sub_351D700 | -- | Greedy successor chain walk |
| selectBestSuccessor | sub_35157A0 | -- | Pick best fall-through successor |
| chainLookup | sub_3515040 | -- | DenseMap BB-to-chain lookup |
| rotateLoop | sub_351C710 | -- | Loop rotation heuristic |
| mergeTails | sub_351A710 | -- | Chain tail merge logic |
| lowerChain | sub_35161F0 | -- | Final lowering of chain to BB list |
| (helper) | sub_3515CB0 | -- | Chain cost model evaluation |
| (helper) | sub_3515280 | -- | Chain building iteration |
| (helper) | sub_3516000 | -- | Chain length query |
| (NVIDIA addition) | sub_34BEDF0 | -- | Layout evaluator state constructor (321 bytes) |
| (NVIDIA addition) | sub_34C7080 | -- | Layout evaluator dispatch wrapper (17 bytes, guards sub_34C6AF0) |
| (NVIDIA addition) | sub_34C6AF0 | -- | Core layout evaluator body (1419 bytes, SM-aware) |
| (NVIDIA addition) | sub_34C4890 | -- | Frequency-priority BB advancement |
| (NVIDIA addition) | sub_34C2D70 | -- | Chain swap application |
| (NVIDIA addition) | sub_34C56D0 | -- | Loop-level cost re-evaluation (5137 bytes, divergence-aware) |
| (NVIDIA addition) | sub_34C0690 | -- | Lightweight loop evaluator (sm_70/sm_75 path) |
| (NVIDIA addition) | sub_34BA1B0 | -- | Chain state snapshot |
| (NVIDIA addition) | sub_34C0EE0 | -- | Block-level cost query |
| (NVIDIA addition) | sub_34BE7F0 | -- | Chain adjacency analysis |
| (NVPTX) | sub_350FE30 | -- | Pass registration |
| (NVPTX) | sub_350FEE0 | -- | Stats pass registration |
| (generic) | sub_1DE8060 | -- | Generic LLVM pass registration |
| (generic) | sub_1DE8500 | -- | Generic LLVM stats registration |
| cleanup | sub_3511770 | -- | Chain-map teardown |
| cleanup | sub_35142F0 | -- | Loop chain data teardown |
| cleanup | sub_3510940 | -- | Bump allocator teardown |
| calcExtTspScore | sub_29BAF70 | -- | Ext-TSP score (original layout) |
| calcExtTspScore | sub_29BAC40 | -- | Ext-TSP score (alternative layout) |
| computeExtTspLayout | sub_29BB2B0 | -- | Ext-TSP chain reordering solver |
| (helper) | sub_2EE6520 | -- | Ext-TSP enable decision |
| (helper) | sub_2EE6AD0 | -- | Branch redirect profitability check |
| getEdgeProbability | sub_2E441D0 | -- | Edge probability query |
| (default stub) | sub_2FDC800 | -- | Default getTailDupThreshold implementation |
| (default stub) | sub_2FF52D0 | -- | Default reconvergence-region query |
| (default stub) | sub_2FDC810 | -- | Default layout-accept threshold query |
Differences from Upstream LLVM
| Aspect | Upstream LLVM | CICC v13.0 |
|---|---|---|
| Pass instances | Single MachineBlockPlacement per pipeline | Two instances: stock LLVM copy + NVPTX-pipeline copy at sub_3521FF0 |
| Divergence awareness | No divergence concept; layout optimizes for I-cache locality | Queries warp divergence flag on MachineFunction; divergent branches affect tail duplication profitability |
| Alternative layout proposal | Absent; single layout path only | Additional proposal path (sub_34BEDF0 / sub_34C7080) evaluates alternative orderings with SM-aware cost |
| Tail duplication threshold | TailDupPlacementThreshold (default 2) | GPU-specific threshold via vtable query (sub_2FDC810); controlled by reconvergence-region analysis |
| Loop cost evaluation | Frequency-weighted chain cost | Divergence-aware loop cost re-evaluation (sub_34C56D0, 5137 bytes) considers warp reconvergence overhead |
| Ext-TSP scoring | Standard profile-guided layout scoring | Same Ext-TSP solver but gated by NVPTX-specific enable decision (sub_2EE6520) |
| Structured CFG constraint | No structured CFG requirement (targets like x86 have arbitrary CFG) | Must preserve structured regions from StructurizeCFG; contiguous structured blocks cannot be interleaved |
Cross-References
- StructurizeCFG -- runs before block placement; produces the structured CFG that constrains which block orderings are legal. Structured regions must remain contiguous.
- BranchFolding -- runs after placement; performs tail merging and branch folding on the committed layout. See sub_2F336B0.
- Instruction Scheduling -- block ordering affects scheduling windows. Post-placement scheduling operates within the committed layout.
- Register Allocation -- register pressure is affected by block ordering through live range extent.
- AsmPrinter -- emits PTX from the final block ordering, generating bra instructions for taken branches and fall-through for sequential blocks.
MachineOutliner for GPU
The MachineOutliner in CICC v13.0 is the stock LLVM MachineOutliner pass, compiled into the binary at two address ranges: a candidate-finder at sub_3539E80 and a core outlining engine at sub_3537010, totaling approximately 136KB of combined code. A second instance at sub_1E3D600 (62KB) appears in the MIR infrastructure region (0x1E20000--0x1E3FFFF) containing the same diagnostic strings ("NotOutliningCheaper", "OutliningBenefit", etc.) and [MEDIUM confidence] likely represents the runOnModule entry point that delegates to the two primary functions. The runOnModule identification is based on the function's address being in the MIR infrastructure region and its diagnostic string overlap with the primary outliner; it could alternatively be a separate pass-manager wrapper or a legacy code path. The pass extracts repeated MachineInstr sequences across all functions in a module, factors them into shared OUTLINED_FUNCTION_* stubs, and replaces the original sequences with calls. On GPU targets this is significant because code size directly affects the L1 instruction cache (L0/L1i) footprint per SM, and every instruction that survives into PTX also contributes to ptxas compilation time and register pressure during its own allocation pass.
CICC ships the pass as part of its standard LLVM codegen infrastructure, controlled by the enable-machine-outliner TargetPassConfig knob (tri-state: disable, enable, guaranteed beneficial). The binary does not override the upstream default -- meaning the outliner's activation depends on whether the NVPTX backend's TargetPassConfig::addMachineOutliner() enables it. The presence of full outliner infrastructure (pass registration at sub_35320A0, ~136KB of outliner code, the benefit-threshold knob, and the "nooutline" function-attribute check) confirms the pass is callable. The critical question is whether NVIDIA's default pipeline activates it. The evidence is ambiguous but leans toward conditionally enabled: the TargetPassConfig enum includes "guaranteed beneficial" mode, and the NVPTX-specific calling convention 95 (assigned to outlined functions when no special CC is required) would serve no purpose if the pass were dead code.
| Pass name | "Machine Function Outliner" / "machine-outliner" |
| Registration | sub_35320A0 -- stores pass ID at unk_503D78C |
| Core outlining engine | sub_3537010 (77KB, 2,185 decompiled lines) |
| Candidate finder | sub_3539E80 (59KB) |
| Second instance (MIR region) | sub_1E3D600 (62KB, 0x1E3D600) |
| Pass factory | sub_3534A50 |
| Benefit threshold knob | qword_503DAC8 = outliner-benefit-threshold (default: 1) |
| Cost mode flag | qword_503DC88 (loaded into pass state at offset +184) |
| Debug flag | qword_503D828 (verbose outliner output) |
| Options constructor | ctor_675 at 0x5A2820 (10,602 bytes) |
| NVPTX outlined-function CC | Calling convention 95 (PTX .func linkage) |
| Outlined function naming | OUTLINED_FUNCTION_{round}_{index} |
| Function attributes applied | nounwind (47), minsize (18), internal linkage |
Suffix Tree Algorithm
The outliner's core algorithm is Ukkonen's suffix tree construction, applied to a flattened sequence of MachineInstr encodings from every eligible basic block in the module. The process proceeds in three stages.
Stage 1: Instruction Mapping
sub_3508720 (buildInstrLegalityMapping) walks each MachineBasicBlock and encodes every instruction as a uint16 alphabet symbol. The encoding incorporates both the opcode and a structurally significant operand pattern, so that two instruction sequences with different register names but identical structure map to the same suffix-tree substring. The helper sub_35082F0 initializes from the MBB's scheduling info (offset +32), and sub_35085F0 populates the actual mapping.
Register-class resolution happens in a second pass via sub_3508F10 (buildRegClassMapping): sub_3508B80 builds register-class bitmask information, and sub_3508890 computes the final mapping. This two-layer encoding is critical because NVPTX has typed register classes (i32, i64, f32, f64, pred, etc.) and an outlined sequence must be valid across all call sites regardless of which specific virtual register names appear.
Instructions that cannot participate in outlining receive a special encoding: unique negative integers starting at -3 (matching upstream's IllegalInstrNumber). Each illegal instruction gets a distinct value so it acts as a suffix-tree terminator, preventing matches from spanning across them. The sentinel value 0xFFFFFFFF (-1 as uint32) in the cost array explicitly marks these.
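The encoding scheme above can be sketched in a few lines of C. This is a toy model: the real mapping hashes the opcode together with an operand pattern and resolves register classes in a second pass; here the raw opcode stands in for the legal symbol. Only the "unique negatives from -3 down" terminator scheme is taken directly from the analysis.

```c
#include <stdint.h>

/* Matches upstream's IllegalInstrNumber: illegal encodings count down from -3. */
enum { ILLEGAL_INSTR_START = -3 };

/* Encode an instruction sequence into suffix-tree alphabet symbols.
 * Legal instructions get a stable symbol (toy encoding: opcode truncated
 * to the uint16 range); each illegal instruction gets a DISTINCT negative
 * value, so no repeated substring can ever span across it. */
static void encode_sequence(const unsigned *opcodes, const int *is_legal,
                            int n, int *out) {
    int next_illegal = ILLEGAL_INSTR_START;
    for (int i = 0; i < n; ++i)
        out[i] = is_legal[i] ? (int)(opcodes[i] & 0xFFFF)
                             : next_illegal--;
}
```

Because every illegal instruction receives a fresh value, two illegal instructions never compare equal to each other, which is exactly what makes them act as suffix-tree terminators.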
Stage 2: Suffix Tree Construction and Candidate Extraction
sub_35364E0 (insertIntoSuffixTree) inserts each MBB's encoded instruction sequence into the suffix tree working set. The suffix tree identifies all repeated substrings of length >= 2. For each repeated substring with at least 2 occurrences, the pass creates a candidate group.
Function filtering happens before insertion. sub_3539E80 iterates all MachineFunctions in the module's linked list and applies three gates:
- nooutline attribute check -- sub_B2D620 tests whether the function has the "nooutline" string attribute. If present, all MBBs in that function are skipped.
- shouldOutlineFrom -- vtable dispatch at offset +1440 on the TargetInstrInfo. The NVPTX backend's implementation of this hook determines whether a given function is eligible based on target constraints.
- isFunctionSafeToOutlineFrom -- vtable dispatch at offset +1432, receiving the outliner cost mode byte from qword_503DC88. This is where target-specific safety checks (e.g., functions with special register constraints or inline assembly) can reject outlining.
Additional per-block filters: a block must contain more than one instruction, must not already be marked as outlined (byte at MBB offset +217), and must have no special flag (qword at MBB offset +224 must be zero).
Stage 3: Sorting and Pruning
After suffix-tree extraction, the candidate list is sorted using a hybrid merge sort:
- sub_3534120 -- parallel merge sort for large arrays (recursive, splits at midpoint)
- sub_3533600 -- in-place merge sort for small arrays (fallback when size < 14 pointers = 112 bytes)
- sub_3533450 -- insertion sort for very small partitions (<= 14 elements)
The sorted suffix array is then scanned by sub_3532120 (findIllegalInRange), which performs a 4-way unrolled linear scan searching for the sentinel value 0xFFFFFFFF in the integer cost array. Any candidate whose instruction range contains an illegal sentinel is pruned. The compaction loop copies valid entries forward in place and frees discarded entries' internal string buffers via _libc_free.
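The sentinel scan can be modeled as a 4-way unrolled loop with a scalar tail. This is a sketch of the recovered shape of sub_3532120, not its exact code:

```c
#include <stdint.h>

/* Scan cost[lo..hi) for the illegal-instruction sentinel 0xFFFFFFFF,
 * four elements per iteration plus a scalar tail. Returns the index of
 * the first sentinel, or -1 if the candidate range is clean. */
static int find_illegal_in_range(const uint32_t *cost, int lo, int hi) {
    int i = lo;
    for (; i + 4 <= hi; i += 4) {
        if (cost[i]     == 0xFFFFFFFFu) return i;
        if (cost[i + 1] == 0xFFFFFFFFu) return i + 1;
        if (cost[i + 2] == 0xFFFFFFFFu) return i + 2;
        if (cost[i + 3] == 0xFFFFFFFFu) return i + 3;
    }
    for (; i < hi; ++i)
        if (cost[i] == 0xFFFFFFFFu) return i;
    return -1;
}
```

Any candidate whose range returns a non-negative index here is pruned before the benefit model ever sees it.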
Benefit/Cost Model
The outliner accepts a candidate only if the net benefit exceeds the threshold. The formula:
Benefit = NumOccurrences * PerOccurrenceCost - FrameOverheadCost
Where:
- NumOccurrences = number of identical sequences found (vtable dispatch at slot 0 on the candidate)
- PerOccurrenceCost = bytes saved per replacement (effectively the cost of the call instruction that replaces the inlined sequence, dispatched via vtable slot 0, multiplied by the repeat_count at candidate offset +40)
- FrameOverheadCost = cost of the outlined function itself: the function entry/exit, the return instruction, and any callee-saved register saves (vtable dispatch at slot 8)
The decision rule:
int benefit = num_occurrences * per_call_cost - frame_overhead;
if (benefit < 0) benefit = 0;
if (benefit < outliner_benefit_threshold) continue; // skip candidate
The threshold qword_503DAC8 defaults to 1, meaning any candidate that saves at least one byte is accepted. This is identical to upstream LLVM's default and is intentionally aggressive -- the outliner relies on the cost model's accuracy rather than a conservative threshold to filter bad candidates.
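Putting the formula and decision rule together as one self-contained function (the parameter names are descriptive, not the recovered ones):

```c
/* Accept/reject rule from the benefit model: compute net byte savings,
 * saturate at zero, then compare against the outliner-benefit-threshold
 * (qword_503DAC8, default 1). Returns nonzero if the candidate is kept. */
static int candidate_accepted(int num_occurrences, int per_call_cost,
                              int frame_overhead, int threshold) {
    int benefit = num_occurrences * per_call_cost - frame_overhead;
    if (benefit < 0)
        benefit = 0;                 /* saturate: never a negative benefit */
    return benefit >= threshold;     /* benefit < threshold => skip */
}
```

With the default threshold of 1, a candidate with 3 occurrences, 4 bytes saved per call, and 10 bytes of frame overhead is accepted (net 2), while the same sequence at 2 occurrences is rejected (net saturates to 0).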
NVPTX Cost Model Considerations
The cost model is dispatched through the TargetInstrInfo vtable, meaning the NVPTX backend supplies its own getOutliningCandidateInfo, buildOutlinedFrame, and insertOutlinedCall implementations. Several factors make the GPU cost model structurally different from CPU targets:
Call overhead in PTX is expensive. A PTX .func call requires .param space declaration, parameter marshaling (each argument is copied to .param memory), the call instruction itself, and result retrieval from .param space. On CPU targets, a call instruction is a single opcode plus a return address push. On NVPTX, the overhead is proportional to the number of live values that must be passed to the outlined function. This means the FrameOverheadCost for NVPTX candidates is significantly higher than on CPU, and only sequences with many occurrences or substantial length achieve positive benefit.
No hardware call stack. PTX function calls are lowered by ptxas into something closer to inlined code with register renaming. The actual "call" may or may not involve a hardware subroutine mechanism depending on the SM architecture and ptxas optimization level. This makes the cost model somewhat speculative from CICC's perspective -- the outlined function may be re-inlined by ptxas.
Calling convention 95. When no candidate entry in a group requires a special calling convention, the outlined function is assigned CC 95 -- an NVPTX-specific calling convention not present in upstream LLVM. CC 95 maps to PTX .func linkage with internal visibility, meaning the function is private to the compilation unit and ptxas has full freedom to inline or optimize it. See Calling Convention 95 below for the complete assignment algorithm and CC comparison table.
Outlined Function Creation
When a candidate group passes the benefit threshold, sub_3537010 creates the outlined function through these steps:
Name generation. The name follows the pattern OUTLINED_FUNCTION_{round}_{index}. The round number (pass counter at state offset +188) is omitted in round 0, producing OUTLINED_FUNCTION_0, OUTLINED_FUNCTION_1, etc. for the first pass and OUTLINED_FUNCTION_2_0, OUTLINED_FUNCTION_2_1, etc. for subsequent reruns. The integer-to-string conversion uses a standard two-digit lookup table ("00010203...9899") for fast decimal formatting.
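The paired-digit formatting trick can be reproduced in a few lines of C. This is a generic implementation of the technique (two digits per table lookup instead of one per divide), not the decompiled routine:

```c
#include <stdint.h>

/* The "00010203...9899" table: entry 2*r holds the tens digit of r,
 * entry 2*r+1 the ones digit. */
static const char kDigits[] =
    "00010203040506070809101112131415161718192021222324"
    "25262728293031323334353637383940414243444546474849"
    "50515253545556575859606162636465666768697071727374"
    "75767778798081828384858687888990919293949596979899";

/* Format v into buf (decimal, NUL-terminated); returns the length. */
static int format_u32(uint32_t v, char *buf) {
    char tmp[10];
    int n = 0;
    while (v >= 100) {               /* peel two digits per division */
        uint32_t r = v % 100;
        v /= 100;
        tmp[n++] = kDigits[2 * r + 1];
        tmp[n++] = kDigits[2 * r];
    }
    if (v >= 10) {                   /* final two digits from the table */
        tmp[n++] = kDigits[2 * v + 1];
        tmp[n++] = kDigits[2 * v];
    } else {                         /* final single digit */
        tmp[n++] = (char)('0' + v);
    }
    for (int i = 0; i < n; ++i)      /* digits were produced in reverse */
        buf[i] = tmp[n - 1 - i];
    buf[n] = '\0';
    return n;
}
```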
LLVM Function creation. sub_BCB120 (getOrInsertFunction) creates or retrieves the Function in the LLVM Module. sub_BCF640 creates the function type (void return, no arguments by default). sub_B2C660 creates the corresponding MachineFunction.
Function flags. The flag word at function offset +32 is set to (existing & 0xBC00) | 0x4087. The bit pattern 0x4087 encodes internal linkage, norecurse, and nounwind. The mask 0xBC00 preserves target-dependent alignment and visibility bits. Two explicit attributes are added: nounwind (attribute ID 47) and minsize (attribute ID 18).
Register liveness. A calloc-allocated byte array (one byte per physical register, count from TargetRegisterInfo::getNumRegs() at TRI offset +16) tracks which registers are live-through versus defined-inside the outlined region. sub_35095B0 (populateOutlinedFunctionBody) walks the outlined MBB's instruction stream, checking the TargetRegisterInfo live-in bitmap (offset +48 in the subtarget). Registers not in the live-in set are inserted as phantom definitions. Super-register chains are walked via delta tables at TRI offset +56, following standard LLVM MCRegisterInfo encoding.
Outlined body. The TargetInstrInfo hook buildOutlinedFrame (vtable offset +1408) constructs the actual machine instructions in the outlined function by copying from the candidate entries. The isOutlined flag is set at MachineFunction offset +582.
Call-Site Rewriting
After creating the outlined function, the pass rewrites each call site:
- For each candidate entry, insertOutlinedCall (vtable offset +1416) is invoked with the caller's MBB, an insertion point, the outlined Function, and the candidate metadata. This returns the new call MachineInstr.
- If the outlined function has callee-saved register information (flag at candidate offset +344), the pass builds live-in/live-out register sets using red-black trees (sub_3536E40 for classification). Registers are classified as defs (implicit-def, flag 0x30000000), uses (implicit-use, flag 0x20000000), or implicitly defined. These operands are attached to the call instruction via sub_2E8F270.
- The original instruction range in the cost array is memset to 0xFF, marking it with illegal sentinels. This prevents future outlining passes (reruns) from attempting to re-outline already-outlined code.
Candidate Entry Structure
Each candidate is a 224-byte structure (56 x uint32 stride):
| Offset | Size | Field |
|---|---|---|
| +0x00 | 4 | start_index -- index into module instruction array |
| +0x04 | 4 | length -- number of instructions in sequence |
| +0x08 | 8 | call_info_ptr -- pointer to MBB or instruction range |
| +0x10 | 8 | metadata_0 |
| +0x18 | 8 | metadata_1 |
| +0x20 | 4 | num_occurrences_field |
| +0x28 | 4 | cost_field |
| +0x2C | 48 | SSO string data (via sub_3532560) |
| +0x70 | 4 | benefit_or_flags |
| +0x78 | 40 | Second SSO string field |
| +0xA0 | 1 | flag_byte_0 |
| +0xA1 | 1 | flag_byte_1 |
| +0xA8 | 4 | field_A8 |
| +0xAC | 4 | field_AC |
| +0xB0 | 4 | field_B0 |
| +0xB4 | 4 | field_B4 |
The two string fields use LLVM's small-string optimization (SSO): strings shorter than the inline buffer are stored directly in the struct; longer strings allocate on the heap. The copy function sub_3532560 handles both cases.
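A toy model of the SSO behavior sub_3532560 has to handle. The field sizes here are illustrative, not the recovered 48- and 40-byte layouts; only the inline-vs-heap dispatch on string length reflects the description above.

```c
#include <stdlib.h>
#include <string.h>

/* Simplified SSO string: short strings live in the inline buffer,
 * longer ones on the heap. */
typedef struct {
    char  *data;          /* points at inline_buf or a heap allocation */
    size_t len;
    char   inline_buf[16];
} SSOString;

static void sso_init(SSOString *s) {
    s->data = s->inline_buf;
    s->inline_buf[0] = '\0';
    s->len = 0;
}

/* Deep copy in the style described for sub_3532560: pick inline vs heap
 * storage based on the incoming length, releasing any previous heap
 * allocation first. */
static void sso_assign(SSOString *s, const char *src) {
    size_t n = strlen(src);
    if (s->data != s->inline_buf)
        free(s->data);
    s->data = (n < sizeof s->inline_buf) ? s->inline_buf
                                         : (char *)malloc(n + 1);
    memcpy(s->data, src, n + 1);
    s->len = n;
}
```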
Calling Convention 95: The NVPTX Outlined-Function CC
CICC defines calling convention 95 (0x5F) as an NVPTX-specific calling convention that does not exist in upstream LLVM. It is assigned exclusively to outlined functions and signals to both the AsmPrinter and ptxas that the function is a module-internal device helper with PTX .func linkage.
CC Assignment Algorithm
The CC assignment happens in Phase 5 of sub_3537010 (lines 838--877 of the decompilation), after the outlined MachineFunction is created and before its body is populated. The algorithm:
fn assign_outlined_cc(candidate_group, outlined_fn):
max_cc = 0
for entry in candidate_group:
cc = sub_A746B0(entry) // extract caller's CC from candidate
max_cc = max(max_cc, cc)
if max_cc > 0:
// At least one call site has a non-default CC.
// Inherit the highest CC and create a callee-saved register mask.
sub_B2BE50(outlined_fn, max_cc) // setCallingConv
sub_A77AA0(outlined_fn, max_cc) // create callee-saved mask
else:
// All call sites have default CC (0) -- typical case for
// device functions compiled from __device__ code.
// Assign the NVPTX-specific outlined-function CC.
outlined_fn.setCallingConv(95)
sub_A746B0 extracts the calling convention from each candidate entry's source MachineFunction. The "max" selection rule means that if candidates come from functions with different CCs, the outlined function inherits the most restrictive one. In practice, since the outliner only groups structurally identical MachineInstr sequences, all entries in a group typically come from functions with the same CC.
CC 95 vs Other NVPTX Calling Conventions
| CC | Decimal | PTX Linkage | Meaning |
|---|---|---|---|
| 0 | 0 | .func | Default C calling convention (non-kernel device function) |
| 42 | 0x2A | .entry | PTX kernel entry (one of two kernel CCs; used in SCEV budget bypass) |
| 43 | 0x2B | .entry | PTX kernel entry (variant; also bypasses SCEV budget) |
| 71 | 0x47 | .entry | Primary CUDA kernel CC (isKernel returns true when linkage == 0x47) |
| 95 | 0x5F | .func | NVPTX outlined-function CC -- internal, never a kernel |
CC 95 functions are emitted as .func by the AsmPrinter (sub_215A3C0). The .entry vs .func branch at lines 30--33 of the PTX header emission calls sub_1C2F070 (isKernelFunction), which checks whether the CC is one of the kernel CCs (42, 43, 71) or whether the nvvm.kernel metadata flag is set. CC 95 fails all kernel tests, so the function is always emitted as .func.
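The kernel test described for sub_1C2F070 reduces to a CC comparison plus a metadata flag. A hedged C model, with the nvvm.kernel metadata lookup collapsed into a boolean parameter:

```c
#include <stdbool.h>

enum {
    CC_KERNEL_A = 42,   /* 0x2A, .entry variant */
    CC_KERNEL_B = 43,   /* 0x2B, .entry variant */
    CC_KERNEL   = 71,   /* 0x47, primary CUDA kernel CC */
    CC_OUTLINED = 95,   /* 0x5F, outlined-function CC, always .func */
};

/* Emit-as-.entry decision: a function is a kernel if its CC is one of the
 * kernel CCs, or it carries the nvvm.kernel metadata flag. CC 95 fails
 * both tests, so outlined functions always come out as .func. */
static bool is_kernel_function(int cc, bool has_nvvm_kernel_md) {
    if (cc == CC_KERNEL_A || cc == CC_KERNEL_B || cc == CC_KERNEL)
        return true;
    return has_nvvm_kernel_md;
}
```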
What CC 95 Communicates
The CC carries three semantic signals:
- Internal linkage. CC 95 functions are never externally visible. The flag word 0x4087 applied at function offset +32 encodes internal linkage. Combined with the nounwind (47) and minsize (18) attributes, this tells the backend and ptxas that the function is private to the compilation unit.
- No .param-space calling convention overhead. Unlike CC 0 device functions, which must declare .param space for every argument and marshal values through st.param/ld.param sequences (the full sub_3040BF0 LowerCall path with DeclareParam/DeclareScalarParam nodes), CC 95 functions use a simplified call interface. The outlined function takes no explicit arguments -- live values are passed implicitly through the register state, and the TargetInstrInfo::insertOutlinedCall hook (vtable +1416) handles the call-site ABI.
- ptxas is free to inline. Because CC 95 functions are internal .func with no special ABI constraints, ptxas can and frequently does inline them back at the call site during its own optimization passes. This makes the outlining decision partially speculative from CICC's perspective -- the code size reduction measured by the benefit model may be undone by ptxas.
Callee-Saved Register Mask Interaction
When max_cc > 0 (the non-default path), sub_A77AA0 creates a callee-saved register mask for the outlined function. This mask determines which registers the outlined function must preserve across its body. For CC 95 (the max_cc == 0 path), no callee-saved mask is created. Instead, the call-site rewriting logic at Phase 11 of sub_3537010 (lines 1469--1968) builds explicit implicit-def (flag 0x30000000) and implicit-use (flag 0x20000000) operands on the call instruction using the RB-tree-based register classifier at sub_3536E40. This makes the register interface fully explicit rather than relying on a convention-defined preserved set.
launch_bounds Interaction and Cross-Kernel Outlining
The MachineOutliner operates at module scope -- it considers all functions in the module simultaneously. On NVPTX, this raises the question of whether sequences can be outlined across functions with different __launch_bounds__ annotations.
How launch_bounds Metadata Flows
The __launch_bounds__ attribute on a __global__ function flows through CICC as follows:
- EDG frontend (sub_826060): Validates __launch_bounds__ arguments. Rejects __launch_bounds__ on non-__global__ functions. Detects conflicts with __maxnreg__.
- Post-parse fixup (sub_5D0FF0): Converts __launch_bounds__ values into structured metadata.
- Kernel metadata emission (sub_B05_kernel_metadata): Stores as LLVM named metadata under nvvm.annotations:
  - nvvm.maxntid -- max threads per block (from first __launch_bounds__ argument)
  - nvvm.minctasm -- minimum CTAs per SM (from second argument, if present)
  - nvvm.maxnreg -- max registers per thread (from __maxnreg__ or third argument)
- PTX emission (sub_214DA90): Reads the metadata back and emits .maxntid, .minnctapersm, .maxnreg directives. These are emitted only for .entry functions -- the guard at step (g) of sub_215A3C0 ensures .func functions never receive these directives.
The Outlined Function Inherits Nothing
Because outlined functions are created with internal linkage, void return type, and CC 95 (.func), they are device functions -- never kernels. The function creation code in Phase 5 of sub_3537010 does not copy any metadata from source functions. Specifically:
- No nvvm.kernel flag is set.
- No nvvm.maxntid metadata is attached.
- No nvvm.maxnreg metadata is attached.
- No nvvm.minctasm metadata is attached.
- No nvvm.cluster_dim or nvvm.maxclusterrank metadata is attached.
- The isKernel check (sub_CE9220) returns false: the CC is not 0x47, there is no nvvm.kernel metadata, and there is no "kernel" entry in nvvm.annotations.
The only function-level metadata the outlined function receives is the isOutlined flag at MachineFunction offset +582 and the two attributes nounwind (47) and minsize (18).
Function Eligibility Gating
The candidate finder (sub_3539E80) applies three gates before considering a function's basic blocks for outlining:
fn is_eligible(func, cost_mode):
// Gate 1: explicit opt-out
if sub_B2D620(func, "nooutline"): // has "nooutline" attribute?
return false
// Gate 2: target hook -- "should we outline FROM this function?"
tii = get_target_instr_info(func)
if !tii.vtable[1440](func): // shouldOutlineFrom
return false
// Gate 3: target hook -- "is it SAFE to outline from this function?"
if !tii.vtable[1432](func, cost_mode): // isFunctionSafeToOutlineFrom
return false
return true
The NVPTX backend's implementation of shouldOutlineFrom (vtable +1440) and isFunctionSafeToOutlineFrom (vtable +1432) determines whether kernel functions and launch_bounds-constrained functions participate. The evidence does not contain the NVPTX-specific implementation of these hooks, so we cannot state definitively whether kernels with nvvm.maxnreg are rejected. However, the architectural implications are clear:
If the hooks permit outlining from constrained kernels, the outliner may extract a sequence shared between a maxnreg=32 kernel and a maxnreg=64 kernel into a single CC 95 .func. That .func has no register budget. When ptxas processes the maxnreg=32 kernel's call to this .func, it must either:
- Inline the call -- absorbing the outlined function's register usage into the kernel's allocation. If the outlined body fits within 32 registers, this is transparent.
- Keep the call -- allocating the outlined function's registers within the kernel's 32-register budget. If the outlined function needs more registers than available after the kernel's own allocation, ptxas will spill to local memory.
Both outcomes preserve correctness. The performance risk is that spilling may occur in a kernel that would not have spilled without outlining, because the CICC-side cost model has no visibility into ptxas's register allocation decisions.
If the hooks reject constrained kernels, the outliner only operates on unconstrained device functions (CC 0) and kernels without __launch_bounds__. This is the conservative and likely behavior, given that NVIDIA is aware of the register-pressure implications.
Per-Block Eligibility
Even within an eligible function, individual basic blocks are filtered:
| Condition | Check | Effect |
|---|---|---|
| Block has <= 1 instruction | MBB.size() <= 1 | Skipped -- too small to outline |
| Block already outlined | byte at MBB offset +217 | Skipped -- prevents re-outlining |
| Block has special flag | qword at MBB offset +224 != 0 | Skipped -- target-specific block exclusion |
The "already outlined" flag at MBB offset +217 is set by the call-site rewriting phase (Phase 11) after replacing a sequence with a call to the outlined function. Combined with the cost-array sentinel memset (0xFF fill), this provides a two-layer defense against re-outlining.
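Both defense layers can be sketched in a few lines of C. This is a model, not the recovered code: the block flag is represented as a struct field rather than the raw byte at MBB offset +217, and only the 0xFF fill and the flag check come from the analysis.

```c
#include <stdint.h>
#include <string.h>

/* Layer 1: after rewriting a call site, fill the replaced range of the
 * per-instruction cost array with 0xFF bytes. Every uint32 entry becomes
 * the illegal sentinel 0xFFFFFFFF, so rerun passes treat the region as
 * unmatchable. */
static void mark_range_outlined(uint32_t *cost, size_t start, size_t len) {
    memset(&cost[start], 0xFF, len * sizeof cost[0]);
}

/* Layer 2: the per-block "already outlined" flag, consulted by the
 * candidate finder before a block's instructions are even encoded. */
typedef struct { int already_outlined; } BlockState;

static int block_eligible(const BlockState *b) {
    return !b->already_outlined;
}
```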
Outlining vs. Inlining Tension
The MachineOutliner and the LLVM inliner operate in opposite directions: the inliner copies callee bodies into call sites (increasing code size, reducing call overhead), while the outliner extracts common sequences out of function bodies (decreasing code size, adding call overhead). In CICC, the two passes do not directly coordinate -- the inliner runs during the IR optimization pipeline (CGSCC pass manager), while the MachineOutliner runs late in the machine codegen pipeline after register allocation and scheduling.
The tension manifests in two ways:
- The inliner may create outlining opportunities. Aggressive inlining of small device functions can produce multiple copies of the same instruction sequence in different callers, which the outliner then detects and re-extracts. This round-trip (inline then outline) is wasteful but not incorrect. The net result depends on whether the outliner's shared function is more cache-friendly than the inlined copies.
- The outliner may undo inlining benefits. If the inliner carefully decided that inlining a hot function improves performance by eliminating call overhead and enabling cross-function optimization, the outliner may later extract the inlined sequence back out if it appears in multiple callers. The minsize attribute on outlined functions does not prevent this -- it only signals that the outlined function should be optimized for size rather than speed.
The enable-machine-outliner knob's "guaranteed beneficial" mode addresses this partially by only outlining sequences where the cost model is confident the savings are worthwhile, but it cannot reason about the inliner's original intent.
Configuration Knobs
All knobs are LLVM cl::opt command-line options, passable via -Xllc in CICC:
| Knob | Type | Default | Effect |
|---|---|---|---|
| outliner-benefit-threshold | unsigned | 1 | Minimum net byte savings for a candidate to be accepted. Higher values make outlining more conservative. |
| enable-machine-outliner | enum | target-dependent | Tri-state: disable, enable, guaranteed beneficial. Controls whether the pass runs at all. |
| enable-linkonceodr-outlining | bool | false | Whether to outline from linkonce_odr functions. Off by default because the linker can deduplicate these. Should be enabled under LTO. |
| machine-outliner-reruns | unsigned | 0 | Number of additional outliner passes after the initial run. Each rerun can find new candidates from code modified by previous outlining. |
| outliner-leaf-descendants | bool | true | Consider all leaf descendants of internal suffix-tree nodes as candidates (not just direct leaf children). |
| disable-global-outlining | bool | false | Disable global (cross-module) outlining, ignoring codegen data generation/use. |
The options constructor at ctor_675 (0x5A2820, 10,602 bytes) registers the outliner-specific options including the linkonce-odr and rerun knobs. The benefit threshold is registered separately in the same constructor.
Diagnostic Strings
The outliner emits LLVM optimization remarks under the "machine-outliner" pass name:
| Remark key | Meaning |
|---|---|
"OutlinedFunction" | A new outlined function was created |
"NotOutliningCheaper" | Candidate rejected because outlining would not save bytes |
"Did not outline" | Candidate rejected for other reasons (illegal instructions, safety checks) |
"OutliningBenefit" | Named integer: net byte savings |
"OutliningCost" | Named integer: cost of the outlined call sequence |
"NotOutliningCost" | Named integer: cost of keeping the sequence inline |
"NumOccurrences" | Named integer: how many times the sequence was found |
"Length" | Named integer: number of instructions in the sequence |
"StartLoc" / "OtherStartLoc" | Source locations of the outlined regions |
The remark message format: "Saved {N} bytes by outlining {M} instructions from {K} locations. (Found at: {loc1}, {loc2}, ...)".
Function Map
| Function | Address | Size |
|---|---|---|
| Pass registration (name, ID, factory) | sub_35320A0 | -- |
| Pass factory function | sub_3534A50 | -- |
| Core outlining engine (outline + rewrite) | sub_3537010 | 77KB |
| Candidate finder / suffix-tree builder | sub_3539E80 | 59KB |
| MachineOutliner runOnModule entry (MIR region) | sub_1E3D600 | 62KB |
| insertIntoSuffixTree -- adds MBB instruction hashes | sub_35364E0 | -- |
| SuffixArray::allocateWorkBuffer | sub_3535DB0 | -- |
| SuffixArray::parallelMergeSort | sub_3534120 | -- |
| SuffixArray::inPlaceMergeSort (fallback for small arrays) | sub_3533600 | -- |
| Insertion sort for <= 14 elements | sub_3533450 | -- |
| findIllegalInRange (4-way unrolled sentinel scan) | sub_3532120 | -- |
| buildInstrLegalityMapping -- MBB to suffix alphabet | sub_3508720 | -- |
| buildRegClassMapping -- register-class constraint resolution | sub_3508F10 | -- |
| populateOutlinedFunctionBody -- instruction insertion | sub_35095B0 | -- |
| classifyOperandRegisters -- RB-tree register tracking | sub_3536E40 | -- |
| RBTree::destroyAll -- recursive tree deallocation | sub_3532B90 | -- |
| std::string constructor (for name generation) | sub_35323D0 | -- |
| SmallString SSO-aware deep copy | sub_3532560 | -- |
| RemarkBuilder::appendField | sub_3534BB0 | -- |
| RemarkBuilder::emitOutlinedFunctionRemark | sub_35341F0 | -- |
| Extract calling convention from candidate entry's source function | sub_A746B0 | -- |
| Create callee-saved register mask for non-default CC | sub_A77AA0 | -- |
| hasAttribute("nooutline") -- function attribute check | sub_B2D620 | -- |
| isKernel(func) -- returns true for CC 0x47 or nvvm.kernel metadata | sub_CE9220 | -- |
| isKernelFunction -- .entry vs .func emission branch | sub_1C2F070 | -- |
| Kernel attribute emission (.maxntid, .maxnreg, .minnctapersm) | sub_214DA90 | -- |
| PTX function header orchestrator (.entry / .func branch + params) | sub_215A3C0 | -- |
Differences from Upstream LLVM
| Aspect | Upstream LLVM | CICC v13.0 |
|---|---|---|
| Activation | Default off for most targets; explicit -enable-machine-outliner required | Conditionally enabled via TargetPassConfig::addMachineOutliner(); evidence of "guaranteed beneficial" mode for NVPTX |
| Calling convention | Uses target default CC for outlined functions | Assigns CC 95 to outlined functions -- a dedicated NVPTX convention that bypasses .param-space ABI overhead |
| Kernel interaction | No kernel concept; all functions treated equally | isKernel(func) check (sub_CE9220) for CC 0x47 / nvvm.kernel metadata; kernel attributes (.maxntid, .maxnreg, .minnctapersm) may constrain outlining profitability |
| nooutline attribute | Standard function attribute check | Same check (sub_B2D620 / hasAttribute("nooutline")); kernels with tight __launch_bounds__ may implicitly disable outlining |
| Code size motivation | Reduce instruction cache footprint and binary size | Primary motivation is L0/L1i instruction cache pressure per SM partition; every surviving PTX instruction also costs ptxas compilation time |
| Suffix tree/array | Standard suffix array construction | Same algorithm; parallel merge sort (sub_3534120) with fallback insertion sort for <= 14 elements |
Cross-References
- Inliner Cost Model -- the opposing force: inlining decisions that the outliner may partially reverse
- AsmPrinter & PTX Body Emission -- how outlined .func functions are emitted as PTX
- Register Allocation -- the outliner runs after RA; outlined functions affect register pressure
- Register Coalescing -- coalescing happens before outlining; the outliner operates on already-coalesced code
- Block Placement -- block layout interacts with code size; the outliner reduces the instruction footprint that placement must arrange
- Pipeline & Ordering -- where the outliner sits in the overall pass sequence
- NVPTX Call ABI -- the .param-space calling convention that CC 0 device functions use; CC 95 outlined functions bypass this
- SCEV Analysis -- SCEV budget bypass for CC 42/43 kernel functions; illustrates CC-based dispatch in CICC
Tensor Core / MMA Code Generation
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
CICC v13.0 contains a complete tensor core code generation pipeline spanning five SM generations (Volta through Blackwell), three distinct MMA instruction families (HMMA/IMMA/BMMA), the SM 90 Warp Group MMA (WGMMA) system, and the SM 100 Tensor Core Generation 5 (tcgen05) engine. The pipeline transforms NVVM intrinsic calls through two parallel lowering paths -- one in the NVVM IR lowering layer (sub_955A70) and one in the SelectionDAG backend (sub_33B0210) -- before reaching a common PTX instruction emission layer that constructs MMA instructions from packed 64-bit descriptors encoding shape, type, layout, rounding, and saturation.
This page documents the code generation mechanics: how MMA operations flow from source-level __hmma_* / __wmma_* / __wgmma_* builtins through LLVM intrinsic selection, SelectionDAG lowering, and PTX string emission. For the builtin-to-intrinsic mapping and per-ID reference, see Tensor / MMA Builtins. For the SelectionDAG infrastructure that hosts this lowering, see SelectionDAG.
| NVVM builtin dispatch | sub_955A70 (105KB) -- main NVVM builtin lowering dispatcher |
| SelectionDAG intrinsic switch | sub_33B0210 (343KB, 9,518 lines) -- intrinsic lowering mega-switch, CAT-17 |
| SelectionDAG MMA handler | sub_33A64B0 -- WMMA/MMA DAG node construction (95 intrinsic IDs) |
| WMMA load handler | sub_94CAB0 / sub_94DCB0 -- fragment load codegen |
| WMMA MMA handler | sub_94E0D0 -- matrix multiply-accumulate codegen |
| MMA PTX string builder | sub_21E74C0 (AsmPrinter) / sub_35F3E90 (backend) |
| tcgen05.mma lowering | sub_304E6C0 (SelectionDAG) / sub_36E9630 (instruction emission) |
| tcgen05 infrastructure | sub_30462A0 -- fence/wait/alloc/dealloc/cp/commit |
| Address range | 0x21D0000--0x21F0000 (AsmPrinter MMA), 0x304xxxx--0x36Fxxxx (backend) |
| Upstream | lib/Target/NVPTX/NVPTXISelLowering.cpp (no upstream MMA; entirely NVIDIA-proprietary) |
Pipeline Overview
MMA code generation follows a three-stage pipeline. The first two stages exist in parallel copies; the third is shared.
CUDA source: __hmma_m16n16k16_mma_f32f32(d, a, b, c, 0)
│
┌───────────┴───────────┐
│ NVVM builtin lowering │ SelectionDAG intrinsic lowering
│ (sub_955A70) │ (sub_33B0210, CAT-17)
│ │
│ 3-table lookup: │ sub_33A64B0 -> SDNode construction
│ dword_3F14840/7E0/7A0 │ 95 case labels (0xA4-0xA8, 0x194-0x1EC)
│ │
│ sub_94E0D0 (MMA) │
│ sub_94CAB0 (load) │
│ sub_9493D0 (store) │
└───────────┬───────────┘
│
┌───────────┴───────────┐
│ PTX Instruction Emit │
│ sub_21E74C0 (printer) │
│ sub_1D23DE0 (emitter) │
└───────────────────────┘
The NVVM builtin lowering path handles builtins that arrive as direct function calls from the EDG frontend. The SelectionDAG path handles the same operations when they arrive as LLVM intrinsic calls (the normal path when CUDA C++ compiles through Clang-style IR generation). Both paths converge at the PTX string builder, which reads a packed 64-bit descriptor word and emits text like mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32.
Packed MMA Descriptor
All MMA operations are encoded as a single 64-bit descriptor word stored at *(QWORD*)(*(QWORD*)(a1+16) + 16*a2 + 8). The PTX string builder (sub_21E74C0) queries this descriptor through a string-keyed interface. The caller passes a query string (e.g., "shape", "ety", "mid"), and the builder extracts the relevant bits and emits the corresponding PTX text.
Bit Layout
| Bits | Field | Query key | Values |
|---|---|---|---|
| [0] | rowcol | "rowcol" | 0=row, 1=col |
| [2:1] | mid | "mid" | 0=a, 1=b, 2=c, 3=d |
| [7:4] | opc | "opc" | 0=default, 1=.and.popc, 2=.xor.popc |
| [2:0] | rnd | "rnd" | 0=none, 1=.rn, 2=.rm, 3=.rp, 4=.rz |
| [15:8] | aty | "aty" | A element type enum (see below) |
| [23:16] | bty | "bty" | B element type enum |
| [25:24] | al | "al" | A layout: 0=row, nonzero=col |
| [27:26] | bl | "bl" | B layout: 0=row, nonzero=col |
| [28] | satf | "satf" | 0=off, 1=.satfinite |
| [39:32] | shape | "shape" | Shape enum (see below) |
The "ety" query reads the result/accumulator element type from bits [27:24], sharing bit positions with al/bl in a context-dependent manner -- the builder dispatches on the query string to select the correct extraction mask.
Type Enum
| Value | Type | Bits | PTX string |
|---|---|---|---|
| 1 | b1 | 1 | "b1" |
| 2 | s4 | 4 | "s4" |
| 3 | u4 | 4 | "u4" |
| 4 | s8 | 8 | "s8" |
| 5 | u8 | 8 | "u8" |
| 6 | f16 | 16 | "f16" |
| 7 | bf16 | 16 | "bf16" |
| 8 | tf32 | 19 | "tf32" |
| 9 | f64 | 64 | "f64" |
| 10 | f32 | 32 | "f32" |
| 11 | s32 | 32 | "s32" |
Any other value triggers the fatal error "Wrong MMA element type".
Shape Enum
| Value | Shape | PTX string | Notes |
|---|---|---|---|
| 0x01 | m8n8k4 | "m8n8k4" | Original Volta HMMA |
| 0x02 | m8n8k16 | "m8n8k16" | Integer MMA (s8/u8) |
| 0x03 | m8n8k32 | "m8n8k32" | Sub-byte (s4/u4) |
| 0x04 | m8n8k64 | "m8n8k64" | Extended sub-byte |
| 0x05 | m8n8k128 | "m8n8k128" | Binary MMA (b1) |
| 0x06 | m8n32k16 | "m8n32k16" | Appears unused in standard paths |
| 0x10 | m16n8k4 | "m16n8k4" | Turing HMMA, Ampere f64 |
| 0x11 | m16n8k8 | "m16n8k8" | Turing/Ampere HMMA |
| 0x12 | m16n8k16 | "m16n8k16" | Ampere (bf16, tf32) |
| 0x13 | m16n8k32 | "m16n8k32" | Ampere integer |
| 0x14 | m16n8k64 | "m16n8k64" | Sub-byte integer |
| 0x15 | m16n8k128 | "m16n8k128" | Extended sub-byte |
| 0x16 | m16n8k256 | "m16n8k256" | Largest shape (binary/sub-byte) |
| 0x17 | m16n16k16 | "m16n16k16" | Square shape (Hopper+) |
| 0x18 | m32n8k16 | "m32n8k16" | Tall shape |
| 0x19 | m16n16k8 | "m16n16k8" | WMMA f16 path |
Unrecognized shape values hit the default branch and trigger BUG() abort.
PTX String Emission
The string builder uses an optimized emission pattern: short constant strings are stored as integer literals for single-store writes. For example, "m16n8k16" is emitted as:
*(QWORD*)ptr = 0x36316B386E36316DLL; // "m16n8k16" in little-endian
When the output buffer has sufficient remaining capacity, the builder writes directly via DWORD/WORD/BYTE stores. On buffer overflow, it falls back to sub_16E7EE0 (slow-path string append).
HMMA / IMMA / BMMA Lowering (SM 70--89)
The pre-Hopper MMA families share a common architecture: a three-table builtin-to-intrinsic lookup, per-family handler functions for load/store/MMA, and a consistent operand processing pattern.
Three-Table Intrinsic Lookup
| Table | Entries | Builtin IDs | Description |
|---|---|---|---|
| dword_3F14840 | 0--29 | 678--707 | HMMA (FP16, first-gen) |
| dword_3F147E0 | 0--23 | 708--731 | IMMA (INT8) |
| dword_3F147A0 | 0--12 | 732--744 | BMMA (binary) / INT4 |
Each table maps (builtin_id - base) to an LLVM intrinsic ID. The first table additionally sets a v43=1 flag indicating "first generation WMMA", which affects fragment size determination.
HMMA Handler Family (SM >= 70)
Four functions implement half-precision MMA operations. All share a common pattern:
- Architecture gate: *(target_info + 252) > 0x45 (SM >= 70)
- Fetch debug location
- Validate rowcol operand is constant (opcode 10 or 32 check)
- Resolve address space via sub_21DEF90
- Build operands via sub_1D38BB0 calls
- Emit instruction via sub_1D23DE0
| Function | Address | Operation | Operand Count |
|---|---|---|---|
| sub_21E0360 | 0x21E0360 | hmmaldab (load A/B) | 6 |
| sub_21E0630 | 0x21E0630 | hmmaldc (load C) | 5 |
| sub_21DFBF0 | 0x21DFBF0 | hmmastc (store C/D) | 9 or 13 (shape-dependent) |
| sub_21E0870 | 0x21E0870 | hmmamma (MMA) | 19 or 23 + 1 metadata |
For hmmastc, the operand count depends on the accumulator width: 9 operands for narrow accumulators, 13 for wide (when the a2 shape flag is set).
For hmmamma, the handler loads A fragments (v100 iterations), B fragments (v95 iterations), C fragments (v101 iterations), emits the MMA call via sub_921880, then scatters results through v103 iterations of element-wise stores.
IMMA Handler Family (SM >= 72)
Integer MMA follows the same pattern but with additional SM 72 (Xavier) restrictions:
| Function | Address | Operation | SM Gate |
|---|---|---|---|
| sub_21E1280 | 0x21E1280 | immaldab (load A/B) | SM > 0x47 (>= 72) |
| sub_21E15D0 | 0x21E15D0 | immaldc (load C) | SM > 0x47 |
| sub_21E1830 | 0x21E1830 | immastc (store C) | SM > 0x47 |
| sub_21E1D20 | 0x21E1D20 | immamma (MMA + saturation) | SM > 0x47 |
SM 72 special case. Xavier's tensor cores support only basic IMMA shapes (variant 0 or 1). The gate check is:
if (sm_version <= 0x47 || (sm_version == 72 && shape_variant > 1))
fatal_error("not supported on this architecture");
For immaldc at SM 72, certain intrinsic opcodes (610, 611, 179, 180) are explicitly blocked:
if (sm_version <= 0x47 || ((opcode-610 <= 1 || opcode-179 <= 1) && sm_version == 72))
fatal_error(...);
The immamma handler includes an explicit satf (saturation-to-finite) constant extraction. The .satfinite modifier is appended to the PTX instruction when bit 28 of the descriptor is set. This clamps infinities and NaNs to the largest representable finite value.
IMMA operand counts vary by opcode:
| Opcode | Fragment Count | Shape |
|---|---|---|
| 584 | 12 | Large integer shape |
| 609 | 4 | Compact integer shape |
| other | 13 | Default |
BMMA Handler (SM >= 73/75)
Binary MMA (sub_21E2280, 0x21E2280) handles b1 operations with XOR-POPC and AND-POPC modes. Gate: SM > 0x48 (>= 73, in practice SM 75). The handler takes 8+ operands.
Fragment Size Determination
Fragment size (the number of register-width elements per warp fragment) is computed differently per family:
WMMA (first-gen, v43=1):
| Condition | Fragment Count |
|---|---|
| BF16, store operation (a6==1 && !a5) | 4 |
| Default first-gen | 8 |
| Intrinsic 8914 or 8280 | 2 |
IMMA (v43=0):
| Intrinsic IDs | Fragment Count |
|---|---|
| 0x22B3--0x22B6, 0x22CF | 2 |
| 0x22BB--0x22BC, 0x22C5--0x22C6 | 4 |
| 0x22BD--0x22BE, 0x22C3--0x22C4, 0x22CB--0x22CE | 1 |
| 0x22B7, 0x22BF, 0x22C7 | 8 |
BMMA: Always 2 fragments, with v101=2, v95=1, v100=1.
MMA Codegen (sub_94E0D0)
The WMMA multiply-accumulate handler takes a destination pointer and five input operands:
- v102 -- destination fragment pointer (output)
- v7 -- A matrix fragment pointer
- v93 -- B matrix fragment pointer
- v92 -- C accumulator fragment pointer
- v8 -- rowcol operand (validated range: 0--3 for MMA)
- v9 -- satf flag (validated: 0 or 1; skipped for intrinsic 8279)
Fragment counts for the MMA operation itself:
| Family | v95 (A frags) | v100 (B frags) | v101 (C frags) | v103 (D frags) |
|---|---|---|---|---|
| BMMA | 1 | 1 | 2 | 2 |
| IMMA 0x22C0--0x22C1 | 1 | 4 | 8 | 8 |
| IMMA 0x22B8--0x22B9 | 2 | 2 | 8 | 8 |
| IMMA 0x22C8--0x22C9 | 4 | 1 | 8 | 8 |
| WMMA (default) | 8 | 8 | varies | 4 or 8 |
For first-gen WMMA, v103 (D fragment count) is determined by a bit test:
if ((0x300C003 >> (intrinsic_id + 127)) & 1)
v103 = 4;
else
v103 = 8;
The code generation sequence is:
1. LOAD A fragments: v100 iterations of sub_94B510 (extract from ptr v7)
2. LOAD B fragments: v95 iterations (extract from ptr v93)
3. LOAD C fragments: v101 iterations (extract from ptr v92)
4. EMIT MMA call: sub_90A810(tables, intrinsic_id, 0, 0) -> sub_921880
5. STORE D fragments: v103 iterations of sub_94B940 (scatter to ptr v102)
Address Space Resolution (sub_21DEF90)
MMA load/store operations resolve the target memory address space through sub_21DEF90, which checks the instruction opcode at offset +24:
| Opcode Range | Condition | Address Space |
|---|---|---|
| 185--237 | Bit test against 0x3FFFFD00000003 | varies |
| 44--45 | Bit 1 of byte at offset +26 | varies |
| >= 659 | unconditional | accepted |
| default | -- | generic (0) |
Return values: 0=generic, 1=global, 2=shared, 3=local, 4=constant, 5=special, 404=special (from value 101).
SelectionDAG Path (sub_33B0210 / sub_33A64B0)
In the SelectionDAG intrinsic lowering mega-switch (sub_33B0210), 95 consecutive case labels (IDs 0xA4--0xA8 and 0x194--0x1EC, corresponding to LLVM intrinsic IDs 164--168 and 404--492) all dispatch to a single helper: sub_33A64B0.
This function handles every WMMA/MMA SelectionDAG intrinsic for SM 70--89:
- wmma.load.a / wmma.load.b / wmma.load.c
- wmma.store.d
- wmma.mma for all shape/type combinations
- mma.sync (SM 70+), mma.sp (SM 80+, structured sparsity), mma.f64 (SM 80+)
The SelectionDAG path constructs NVPTXISD target-specific DAG nodes that are later matched by the instruction selection tables. The intrinsic IDs from the mega-switch are distinct from the builtin IDs used in the NVVM path -- the mega-switch IDs are LLVM intrinsic table indices, not CUDA builtin numbers.
WGMMA -- Warp Group MMA (SM 90 Hopper)
WGMMA operates on a warp group (4 warps, 128 threads) instead of a single warp. Four builtin IDs (765--768) expand to over 150 LLVM intrinsic variants through compile-time dimension and type dispatch.
Builtin-to-Intrinsic Expansion
| Builtin ID | Builtin | Variants |
|---|---|---|
| 765 (0x2FD) | __wgmma_mma_async_f16 | Full 6-operand set (a, b, c, scale, negate, sparsity) |
| 766 (0x2FE) | __wgmma_mma_async_bf16 | 2-operand (no scale/negate) |
| 767 (0x2FF) | __wgmma_mma_async_tf32 | Reduced operand set |
| 768 (0x300) | __wgmma_mma_async_f8 | Minimal (2 scale operands only) |
The lowering handler (in sub_955A70, cases 0x2FD--0x300, ~800 lines) extracts 7 levels of chained operands:
v263 -- M dimension (constant)
v512 -- accumulator fragments
v528 -- A descriptor
v524 -- B descriptor
v519 -- scale factors
v264 -- layout params
v540 -- element type info
Dimension-to-Intrinsic Mapping
The N dimension (extracted via sub_620FD0 as a constant integer) maps to one of 144 LLVM intrinsic IDs spanning 10654--10779. The mapping forms a dense table with stride 4 per N step:
| N | Integer-type Intrinsic | Float-type Intrinsic |
|---|---|---|
| 8 | 10774 | 10775 |
| 16 | 10690 | 10691 |
| 32 | 10742 | 10743 |
| 64 | 10758 | 10759 |
| 128 | 10666 | 10667 |
| 256 | 10738 | 10739 |
For intermediate N values (multiples of 8 from 8 to 256), the mapping continues at stride +4 per N increment. Even intrinsic IDs encode integer-element variants; odd IDs encode float-element variants. The element type is determined by checking whether the LLVM type is an integer with width 10 (i.e., tf32 or bf16 packed as i10 -- a quirk of the NVVM type system).
If constant extraction overflows, the compiler emits:
"unexpected constant overflow in __wgmma_mma_async operand"
If N is not a power of two: (N & (N - 1)) != 0 triggers:
"N only supported for powers of two"
WGMMA 5-Dimensional Intrinsic Grid
The full WGMMA intrinsic table (sub_12B2E10) uses a 144-entry grid spanning IDs 5304--5447:
| Dimension | Values | Count |
|---|---|---|
| N | 16, 32, 64, 128 | 4 |
| B_shared | false, true | 2 |
| is_s64 | false, true | 2 |
| A_scale/negate | combo | varies |
| case variant | 0x2FD--0x300 | 4 |
Each WGMMA call packs mode bits into a single integer:
bit 0: accumulate flag (from operand v433)
bit 1: transpose flag (from operand v445)
bit 2: negate-C flag (from operand v433)
bit 3: reserved
bit 4: negate-A flag (from operand v427)
Combined: v79 = bit0 | (bit1 << 1) | (bit2 << 2) | (bit4 << 4).
WGMMA Parameter Lookup (sub_953BA0)
On first call, sub_953BA0 lazily initializes a red-black tree at ctx+560 with 7 entries encoding per-ID shape, transpose, register count, and type information:
| ID | trans_a | shape | a_nregs | b_nregs | a_type | b_type | c_type |
|---|---|---|---|---|---|---|---|
| 745 | 0 | 1 | 1 | 1 | i64 | i64 | -- |
| 746 | 1 | 0 | 9 | 9 | i32 | i32 | i32x2 |
| 747 | 0 | 0 | 8 | 8 | i16x2 | i16x2 | -- |
| 748 | 0 | 0 | 7 | 7 | i32x4 | i32x4 | i32x8 |
| 749 | 0 | 0 | 7 | 7 | i32x4 | i32x4 | i32x8 |
| 750 | 0 | 0 | 7 | 7 | i64 | i32x2 | i32x8 |
The output is packed into a 64-bit value:
bits[3:0] = trans_a
bits[7:4] = shape << 4
bits[15:8] = a_nregs << 8
bits[27:16] = b_nregs << 16
bits[31:28] = padding << 28
bits[63:32] = trans_b << 32
bit[25] = ((rowcol & 2)==0) ? 0x2000000 : 0x1000000
bits[27:26] = ((rowcol & 1)+1) << 26
WGMMA MMA Async Load (sub_9547E0)
A second red-black tree at ctx+656 holds 12 entries for MMA async load parameters:
| ID | Shape | NRegs | Variant | Fragment Type |
|---|---|---|---|---|
| 753 | 1 | 9 | 0 | -- |
| 754 | 1 | 9 | 1 | -- |
| 755 | 1 | 9 | 2 | i16x2 |
| 756 | 25 | 8 | 0 | -- |
| 757 | 25 | 8 | 1 | -- |
| 758 | 25 | 10 | 2 | i32x8 |
| 759 | 23 | 7 | 0 | i32x4 |
| 760 | 23 | 7 | 1 | i32x4 |
| 761 | 24 | 7 | 0 | i32x4 |
| 762 | 24 | 7 | 1 | i32x4 |
| 763 | 6 | 7 | 0 | i32x2/i64 |
| 764 | 6 | 7 | 1 | i32x2/i64 |
WGMMA Fence/Store Dispatch
| IDs | Operation | Intrinsic | Handler |
|---|---|---|---|
| 745--750 | fence_aligned | 9062 (3 type overloads) | sub_953BA0 -> sub_94B510 x3 -> sub_94B940 |
| 751--752 | store | 9145 (2 type overloads) | sub_954350 |
| 753--764 | mma_async load | 9067 (2 type overloads) | sub_9547E0 |
The fence operations pack A/B/C fragment operands via sub_94B510 and scatter results via sub_94B940 with name hint "mmafrag".
tcgen05 -- Tensor Core Generation 5 (SM 100 Blackwell)
SM 100 introduces tcgen05, a completely new tensor core instruction family with support for MX floating-point formats (MXF4, MXF8F6F4), structured sparsity, weight stationary mode, block scaling, and scaled input accumulators. The tcgen05 system includes both computation (tcgen05.mma) and lifecycle management (alloc, dealloc, fence, wait, commit, cp, relinquish) instructions.
Architecture Gate
All tcgen05 operations require SM >= 100. The gate check reads two architecture fields:
v1 = *(int*)(arch_struct + 340); // arch_value: 1000=sm100, 1030=sm103, 1200=sm120
v2 = *(int*)(arch_struct + 336); // ptx_version
// Family-conditional: ptx >= 86
// Arch-conditional: ptx >= 88
if (v1 <= 0x3E8 && v1 <= 0x408) // neither sm_100 nor sm_103
fatal_error("tcgen05.mma supported only on arch-conditional "
"or family-conditional variants from SM100 onwards.");
tcgen05 Infrastructure Operations
All handled by sub_30462A0:
| Operation | Intrinsic Opcode | ISD Opcode | Operands |
|---|---|---|---|
| tcgen05.alloc | 10080 | 4765 | basic allocation |
| tcgen05.alloc (multicast) | 10083 | 4770/4771 | 32-bit flag variant |
| tcgen05.dealloc | 10140 | 4827 | 4 operands |
| tcgen05.commit | 10090 | 4772--4777 | multicast mask variants |
| tcgen05.fence | 10143 | 4830 | 2 operands |
| tcgen05.wait | 10351 | 5020 | 2 operands |
| tcgen05.relinquish.alloc | 10311 | 4941 | 2 operands |
| tcgen05.cp.* | 10101 | 4790 | 4 operands |
Commit operations validate multicast mask size -- only 16-bit and 32-bit masks are supported:
"tcgen05.commit.* supports only 16-bit and 32-bit multicast mask size."
tcgen05.mma Data Types
The "kind" field occupies bits [8:6] of the packed operand word:
| Value | Kind | Description |
|---|---|---|
| 0 | mxf4nvf4 | MX FP4 with NV FP4 |
| 1 | f8f6f4 | FP8/FP6/FP4 standard |
| 2 | mxf8f6f4 | MX variant of f8f6f4 |
| 3 | f16 | Half precision |
| 4 | i8 | 8-bit integer (arch-conditional only) |
| 5 | tf32 | TensorFloat-32 |
| 7 | mxf4 | MX FP4 |
tcgen05.mma Modifiers
Scale vector size (bits [3:2]):
| Value | Modifier | Constraints |
|---|---|---|
| 0/1 | .scale_vec::1X | Cannot use for mxf4nvf4 type |
| 2 | .scale_vec::2X | Cannot use for mxf8f6f4 type |
| 3 | .scale_vec::4X | Cannot use for mxf8f6f4 or mxf4 type |
Block scale alias (bits [10:9]):
| Value | Modifier | Constraint |
|---|---|---|
| 0 | .block16 | Not supported for f16, tf32, f8f6f4, i8 |
| 1 | .block32 | Same constraint |
Weight stationary (bit 0): .ws flag. Not compatible with cta_group::2, mxf8f6f4, or fp4 types.
CTA group (bits [1:0]): .cta_group::1 (bit 1 clear) or .cta_group::2 (bit 1 set).
Sparsity (bit 5): Adds one extra operand. Restricted for MXF4 and MXF4NVF4 types to arch-conditional variants only.
Scale input accumulator (bit 4): Only usable with f16 and tf32 types. Not supported on sm_100a (v=1001) or sm_103a (v=1033), but supported on sm_100 (v=1000), sm_103 (v=1030), and sm_120+ (v>=1101).
Collector modes (emitted by sub_35F38B0):
| Value | PTX modifier |
|---|---|
| 1 | .collector::a::lastuse |
| 2 | .collector::a::fill |
| 3 | .collector::a::use |
Cannot use collector::a::use or collector::a::fill with ashift.
tcgen05.mma ISD Opcode Selection (sub_36E9630)
The intrinsic lowering handler (sub_304E6C0) maps 10 shape cases (intrinsic opcodes 10299--10308) to ISD opcodes 4905--4940:
| Case | Shape Class | Base ISD | +scaleD | +sparsity | +ws | +scaleInputAccum |
|---|---|---|---|---|---|---|
| 10299 | Small | 4906 | -- | 4907 | -- | -- |
| 10300 | Small v2 | 4908 | -- | 4909 | -- | -- |
| 10301 | Medium | 4905 | 4910 | 4911/4912 | 4937/4938 | yes |
| 10302 | Medium v2 | 4913 | 4914 | 4915/4916 | -- | yes |
| 10303 | Large | 4917 | 4918 | 4919/4920 | -- | yes |
| 10304 | Block-scale small | 4922 | -- | 4923 | -- | -- |
| 10305 | Block-scale small v2 | 4924 | -- | 4925 | -- | -- |
| 10306 | Block-scale medium | 4921 | 4926 | 4927/4928 | 4939/4940 | yes |
| 10307 | Block-scale medium v2 | 4929 | 4930 | 4931/4932 | -- | -- |
| 10308 | Block-scale large | 4933 | 4934 | 4935/4936 | -- | -- |
Operand count varies by variant: small shapes take 5--6 base operands plus optional sparsity operand; medium shapes take 6 base plus optional scale factor; large shapes iterate over additional operands spanning offsets 440--600 (or 440--760 on sm_103 extended variants).
tcgen05.mma Validation Errors
The full set of compile-time validation errors (emitted via sub_C64ED0):
| Error Message | Condition |
|---|---|
"INT8 type is supported only on arch-conditional variants." | kind==i8 on family-conditional SM100 |
"MXF4 and MXF4NVF4 types with Sparsity are supported only on arch-conditional variants." | (type+7)%8 > 5 AND sparsity set, on family-conditional |
"Explicit scale vector size is supported only on arch-conditional variants." | scale_vec_size 1--3 on family-conditional |
"Scale input accumulator can only be used with f16 and tf32 types" | bit 4 set but kind not f16 or tf32 |
"Scale input accumulator is not supported on this architecture." | scaleInputAccum on sm_100a or sm_103a |
"Block scale is not supported for f16, tf32, f8f6f4 and i8 types" | block_scale with incompatible type |
"ashift is not supported with tcgen05.mma.block_scale variants" | ashift + block_scale |
"cta_group::2 is not supported with weight stationary" | cta_group::2 + .ws |
"Cannot use weight stationary with mxf8f6f4 and fp4 types" | .ws + mxf8f6f4 or fp4 |
"Cannot use collector::a::use or colletor::a::fill with ashift" | [sic] collector + ashift |
"Cannot use 2X or 4X as scale vector size for mxf8f6f4 type" | scale_vec >= 2X + mxf8f6f4 |
"Cannot use 1X as scale vector size for mxf4nvf4 type" | scale_vec 1X + mxf4nvf4 |
"Cannot use 1X or 4X as scale vector size for mxf4 type" | scale_vec 1X or 4X + mxf4 |
Note the typo "colletor" (missing 'c') in the binary -- this is a genuine NVIDIA binary string, not a transcription error.
tcgen05 Scaled MMA Operand Builder
Two identical copies exist for the tcgen05 scaled MMA descriptor:
| Copy | Address | Layer |
|---|---|---|
| sub_21E8CD0 | 0x21E8CD0 | AsmPrinter / PTX emission |
| sub_35F3E90 | 0x35F3E90 | NVPTX backend / SelectionDAG |
The packed descriptor encodes Blackwell-specific modifiers:
| Bit | Query | Set Value | Clear Value | Semantics |
|---|---|---|---|---|
| 0 | "scaleD" | "1" | "0" | Scale output accumulator |
| 1 | "negA" | "-1" | "1" | Negate A matrix |
| 2 | "negB" | "-1" | "1" | Negate B matrix |
| 3 | "transA" | "1" | "0" | Transpose A |
| 4 | "transB" | "1" | "0" | Transpose B |
scaleD and transA/transB emit boolean "0"/"1" strings. negA and negB emit sign multiplier strings "-1"/"1" because PTX applies negation as a multiplication factor.
tcgen05.cp Copy Operations
Shape variants (bits [3:1]):
| Value | PTX shape |
|---|---|
| 0 | .128x256b |
| 1 | .4x256b |
| 2 | .128x128b |
| 3 | .64x128b |
| 4 | .32x128b |
Destination format variants:
| Condition | PTX format |
|---|---|
| default | .b8x16 |
| bit 7 = 0 | .b6x16_p32 |
| bit 7 = 1 | .b4x16_p64 |
| bit 8 set | error: "Unsupported tcgen05.cp destination format" |
Multicast modes:
| Type | PTX modifier |
|---|---|
| type 1, shape 3 | .warpx2::02_13 |
| type 2, shape 3 | .warpx2::01_23 |
| type 3, shape 4 | .warpx4 |
Duplicate Backend Copies
Several MMA functions exist as near-identical pairs -- one in the AsmPrinter emission layer (0x21Dxxxx--0x21Exxxx) and one in the NVPTX backend layer (0x36Exxxx). The difference is limited to error reporting and reference counting functions:
| AsmPrinter Copy | Backend Copy | Operation |
|---|---|---|
| sub_21DFBF0 | sub_36E91F0 | hmmastc |
| sub_21E0360 | sub_36E72A0 | hmmaldab |
| sub_21E0630 | sub_36E7580 | hmmaldc |
| sub_21E0870 | sub_36E77C0 | hmmamma |
| sub_21E1280 | sub_36E7B50 | immaldab |
| sub_21E15D0 | sub_36E7EA0 | immaldc |
| sub_21E1830 | sub_36E8110 | immastc |
| sub_21E1D20 | sub_36E8630 | immamma |
| sub_21E2280 | sub_36E8BD0 | bmmamma |
| sub_21E8CD0 | sub_35F3E90 | tcgen05 scaled MMA |
AsmPrinter copies use sub_16BD130 for errors; backend copies use sub_C64ED0. AsmPrinter copies use sub_1623A60/sub_161E7C0 for refcounting; backend copies use sub_B96E90/sub_B91220.
Shape x Type x Architecture Matrix
| Shape | A/B Types | Accumulator | Min SM | Notes |
|---|---|---|---|---|
| m8n8k4 | f16 | f16, f32 | SM 70 | Original Volta |
| m16n8k4 | f64 | f64 | SM 80 | Ampere double precision |
| m16n8k8 | f16 | f16, f32 | SM 75 | Turing+ |
| m16n8k16 | f16, bf16, tf32 | f16, f32 | SM 80 | Ampere+ |
| m16n16k8 | f16 | f16, f32 | SM 70 | WMMA path |
| m16n16k16 | f16, bf16 | f16, f32 | SM 90 | Hopper+ |
| m32n8k16 | f16, bf16 | f16, f32 | SM 80 | Tall shape |
| m8n8k16 | s8, u8 | s32 | SM 72 | Integer MMA |
| m16n8k16 | s8, u8 | s32 | SM 75 | Turing+ integer |
| m16n8k32 | s8, u8 | s32 | SM 75 | Turing+ integer |
| m8n8k32 | s4, u4 | s32 | SM 75 | Sub-byte |
| m16n8k64 | s4, u4 | s32 | SM 75 | Sub-byte |
| m8n8k64 | s4, u4 | s32 | SM 75 | Extended sub-byte |
| m16n8k128 | s4, u4 | s32 | SM 75 | Extended sub-byte |
| m8n8k128 | b1 | s32 | SM 75 | Binary (.and.popc / .xor.popc) |
| m16n8k256 | b1 | s32 | SM 75 | Binary extended |
| tcgen05 (10 variants) | mxf4nvf4, f8f6f4, mxf8f6f4, f16, tf32, i8, mxf4 | varies | SM 100 | +block_scale, +sparsity, +ws |
LLVM Intrinsic ID Reference
Key intrinsic IDs used in the MMA code generation pipeline:
| Intrinsic ID | Symbol | Usage |
|---|---|---|
| 8181 | llvm.nvvm.wmma.store (complex) | WMMA complex store |
| 8210 | llvm.nvvm.wmma.store | WMMA store |
| 8279 | (special) | IMMA MMA without satf |
| 8280 | (special) | Fragment count = 2 trigger |
| 8914 | (special) | Fragment count = 2 trigger |
| 9062 | llvm.nvvm.wgmma.fence.aligned | WGMMA fence (3 type overloads) |
| 9067 | llvm.nvvm.wgmma.mma.async | WGMMA MMA async (2 type overloads) |
| 9145 | llvm.nvvm.wgmma.store | WGMMA store |
| 10654--10779 | llvm.nvvm.wgmma.mma.async.* | Per-dimension WGMMA variants (144 entries) |
| 5304--5447 | (WGMMA grid) | 5-dimensional intrinsic grid for WGMMA |
Error Handling
Two error-reporting functions serve the two layers:
| Function | Address | Layer | Behavior |
|---|---|---|---|
| sub_16BD130 | 0x16BD130 | AsmPrinter / PTX emission | Fatal (severity=1 -> abort) |
| sub_C64ED0 | 0xC64ED0 | NVPTX backend / SelectionDAG | Fatal (severity=1 -> abort) |
Error categories:
- Architecture not supported: "X is not supported on this architecture" -- SM gate failure
- Constant validation: "rowcol not constant", "satf not constant" -- non-constant operand
- Type restrictions: "Wrong MMA element type" -- invalid type enum
- Feature combination: "ashift is not supported with tcgen05.mma.block_scale" -- conflicting modifiers
- Scale restrictions: "Cannot use N as scale vector size for X type" -- type/scale mismatch
Differences from Upstream LLVM
Upstream LLVM's NVPTX backend has no MMA code generation. The entire MMA pipeline -- builtin tables, three-table lookup, fragment size computation, WGMMA dimension dispatch, tcgen05 lowering, packed descriptor encoding, and all shape/type validation -- is NVIDIA-proprietary code with no upstream equivalent.
Upstream LLVM handles MMA operations at the PTX level only: the upstream NVPTXAsmPrinter can print PTX mma.sync instructions, but the instruction selection, intrinsic lowering, and code generation logic that produces them exists only in NVIDIA's cicc binary. An open-source reimplementation would need to build the entire pipeline from the WMMA/MMA intrinsic definitions through SelectionDAG lowering and PTX emission.
Cross-References
- Tensor / MMA Builtins -- per-builtin-ID reference table and validation rules
- SelectionDAG & ISel -- DAG infrastructure hosting MMA lowering
- ISel Pattern Matching -- downstream pattern matcher consuming MMA DAG nodes
- SM 90 -- Hopper -- WGMMA feature gate details
- SM 100 -- Blackwell -- tcgen05 feature gate details
- SM 120 -- Blackwell consumer variant features
- NVPTX Machine Opcodes -- ISD opcode reference
- Register Classes -- fragment register allocation
- PTX Emission -- downstream PTX text generation
NVVM Builtin Table Structure
770 builtins mapped to integer IDs (1--770) in a wyhash open-addressing hash table. Dual tables exist: pre-optimization (sub_90AEE0) and post-optimization (sub_126A910), identical in content but built and stored at separate addresses.
| Pre-opt table builder | sub_90AEE0 (109 KB, populates all 770 entries) |
| Pre-opt dispatcher | sub_913450 (name -> ID lookup) |
| Post-opt table builder | sub_126A910 (123 KB) |
| Post-opt dispatcher | sub_12731E0 (name -> ID lookup) |
| Hash function | sub_CBF760 (wyhash v4 family) |
| Hash table insert | sub_90ADD0 -> sub_C92610 -> sub_C92740 |
| Hash table find | sub_C92860 (find-only, quadratic probing) |
| Rehash | sub_C929D0 (75% load factor trigger) |
| Total builtins | 770 (IDs 1--770) |
| Storage | Open-addressing at context+480 (20-byte header) |
Architecture
sub_913450 (public API: name -> builtin ID)
|
+-- Guard: context+492 == 0?
| +-- sub_90AEE0 (lazy init: populate all 770 entries, once)
|
+-- strlen(name)
+-- sub_C92610(name, len) -> compute wyhash
+-- sub_C92860(context+480, ...) -> quadratic probe find
|
+-- return *(uint32*)(entry + 8) -> the builtin ID
Hash Table Infrastructure
The builtin name table uses a specialized 20-byte hash table header at context+480 with a parallel hash cache array and wyhash-v4 string hashing. The table employs quadratic probing with triangular-number increments and grows at 75% load factor. For 770 entries the capacity sequence is 16 -> 32 -> 64 -> 128 -> 256 -> 512 -> 1024.
Full structural details -- table layout, bucket format, string entry format, wyhash length-dispatch table with pseudocode, probing algorithm, triple-gated comparison guard, rehash procedure, and sentinel values -- are documented in Hash Table and Collection Infrastructure. The "wyhash v4 String Hasher" and "Probing Strategy" sections on that page are the canonical references.
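The triangular-increment probe interacts with the power-of-two capacity sequence in a checkable way: cumulative offsets 0, 1, 3, 6, 10, ... form a permutation of the buckets whenever the capacity is a power of two, so a probe chain can never cycle before covering the whole table. A minimal C sketch of that property (function name ours, not a recovered symbol):

```c
#include <stdbool.h>
#include <stdint.h>

/* Quadratic probing with triangular-number increments, as used by the
 * builtin name table: probe i lands at (h + i*(i+1)/2) mod capacity.
 * For power-of-two capacities this sequence is a permutation of the
 * buckets, so a full chain visits every slot exactly once. */
static bool visits_all_buckets(uint32_t h, uint32_t cap /* power of two, <= 1024 */) {
    bool seen[1024] = { false };
    uint32_t idx = h & (cap - 1);
    for (uint32_t i = 0, step = 1; i < cap; i++, step++) {
        if (seen[idx])
            return false;               /* revisited a bucket too early */
        seen[idx] = true;
        idx = (idx + step) & (cap - 1); /* offsets accumulate 1, 2, 3, ... */
    }
    return true;
}
```

Growing only to power-of-two capacities (16 through 1024 here) is what keeps this permutation property intact across rehashes.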
Complete Builtin ID Inventory
Synchronization & Compiler Intrinsics (IDs 1–7)
| ID | Name |
|---|---|
| 1 | __syncthreads |
| 2 | __nvvm_bar0 |
| 3 | __nvvm_membar_cta |
| 4 | __nvvm_membar_gl |
| 5 | __nvvm_membar_sys |
| 6 | __builtin_is_constant_evaluated |
| 7 | __builtin_unreachable |
Cluster Operations — SM 90+ (IDs 8–14)
| ID | Name |
|---|---|
| 8 | __nv_clusterDimIsSpecifed_impl |
| 9 | __nv_clusterRelativeBlockRank_impl |
| 10 | __nv_clusterSizeInBlocks_impl |
| 11 | __nv_cluster_barrier_arrive_impl |
| 12 | __nv_cluster_barrier_wait_impl |
| 13 | __nv_cluster_barrier_arrive_relaxed_impl |
| 14 | __nv_threadfence_cluster_impl |
Barrier Extensions (IDs 15–20)
| ID | Name |
|---|---|
| 15–17 | __nvvm_bar0_{popc,and,or} |
| 18–20 | __nvvm_bar{_sync_all,rier_sync,_warp_sync} |
Bit Manipulation (IDs 21–26)
__nvvm_clz_{i,ll}, __nvvm_popc_{i,ll}, __nvvm_brev{32,64}
Math — Rounding/Abs/Saturate (IDs 27–56)
__nvvm_{floor,ceil,abs,fabs,round,trunc,saturate}_{ftz_f,f,d}, __nvvm_{ex2,lg2,sin,cos}_approx_{ftz_f,f,d}
Reciprocal / Sqrt / Rsqrt (IDs 57–87)
__nvvm_rcp_{rn,rz,rm,rp}_{ftz_f,f,d}, __nvvm_sqrt_{f,rn,rz,rm,rp}_{ftz_f,f,d}, __nvvm_rsqrt_approx_{ftz_f,f,d}
Type Conversions (IDs 88–184)
97 entries covering all float↔int, double↔int, float↔half, bitcast combinations with all four rounding modes and FTZ variants.
Address Space & Memory Queries (IDs 185–204)
| ID | Name |
|---|---|
| 185 | __nv_isGlobal_impl |
| 186–188 | __nv_bswap{16,32,64}_impl |
| 189–192 | __nv_is{Shared,Constant,Local,GridConstant}_impl |
| 193–200 | __nv_cvta_{generic_to,to_generic}_{global,shared,constant,local}_impl |
| 201 | __builtin_assume |
| 202 | __nv_isClusterShared_impl |
| 203 | __nv_cluster_query_shared_rank_impl |
| 204 | __nv_associate_access_property_impl |
Atomic Operations — Legacy NVVM (IDs 207–275)
69 entries: __nvvm_atom_{,cta_,sys_}{add,xchg,min,max,inc,dec,and,or,xor}_gen_{i,ll,f,d,ui,ull,128}
FP Arithmetic (IDs 276–349)
__nvvm_{min,max}_{i,ui,ll,ull}, __nvvm_f{min,max}_{f,ftz_f,d}, __nvvm_mulhi_{i,ui,ll,ull}, __nvvm_mul_{rn,rz,rm,rp}_{ftz_f,f,d}, __nvvm_div_*, __nvvm_add_*
Vote Operations (IDs 351–358)
__nvvm_vote_{all,any,uni,ballot} + _sync variants
Match Operations (IDs 361–364)
__match{32,64}_{any,all}_sync
FMA (IDs 383–403)
__nvvm_fma_{rn,rz,rm,rp}_{ftz_f,f,d,ftz_f2,f2}
C++11 Atomics (IDs 417–473)
Sized variants: __nv_atomic_{load,store,fetch_add,fetch_sub,fetch_and,fetch_or,fetch_xor,fetch_max,fetch_min,exchange,compare_exchange}_{1,2,4,8,16}_{u,s,f}
Surface Stores — sust (IDs 474–638)
165 entries covering __nvvm_sust_b_{1d,1d_array,2d,2d_array,3d}_{i8,...,v4i32}_{clamp,trap,zero}.
Pattern: sust_b_<dim>_<type>_<oob_mode> across 5 dimensions × 11 types × 3 OOB modes.
CUDA Varargs (IDs 639–642)
__cu_va_{start,end,arg,copy}
Tex/Surf Handler (ID 647)
__nv_tex_surf_handler — generic dispatch for texture/surface reads (surface stores use the dedicated sust builtins above).
C++ ABI (IDs 648–677)
__cxa_vec_{ctor,cctor,dtor,new2,new,new3,delete2,delete,delete3}, __gen_nvvm_mem{cpy,set}_*, _Znw{j,m,y}, _Zna{j,m,y}, _ZdlPv{,m,y}, _ZdaPv{,m,y}
WMMA Tensor Core — SM 70+ (IDs 678–707)
30 entries: __hmma_m{16n16k16,32n8k16,8n32k16}_{ld_a,ld_b,ld_c_f16,ld_c_f32,st_c_f16,st_c_f32,mma_f16f16,mma_f32f16,mma_f16f32,mma_f32f32}
Integer/Binary Tensor Core — SM 75+ (IDs 708–745)
38 entries: __imma_m{16n16k16,32n8k16,8n32k16}_{ld_a,ld_b,ld_c,st_c,mma}_{s8,u8}, __imma_m8n8k32_{s4,u4}, __bmma_m8n8k128_{b1}
Extended Tensor Core — SM 80+ (IDs 746–764)
__dmma_m8n8k4_mma_f64, __mma_tf32_m16n16k8_mma_f32, __mma_bf16_m*_mma_f32 + load/store variants
WGMMA — SM 90+ (IDs 765–768)
__wgmma_mma_async_{f16,bf16,tf32,f8}
Alloca (IDs 769–770)
_alloca, __builtin_alloca
Category Summary
| Category | ID Range | Count |
|---|---|---|
| Sync/barriers/cluster | 1–20 | 20 |
| Bit manipulation | 21–26 | 6 |
| Math (floor/ceil/abs/round/etc) | 27–56 | 30 |
| Reciprocal/sqrt/rsqrt | 57–87 | 31 |
| Type conversions | 88–184 | 97 |
| Address space queries/cvta | 185–204 | 20 |
| Atomic ops (NVVM legacy) | 207–275 | 69 |
| FP min/max, mulhi, arithmetic | 276–349 | 74 |
| Vote + match operations | 351–364 | 12 |
| Compare-and-swap | 370–379 | 10 |
| FMA | 383–403 | 21 |
| Shuffle + misc | 404–416 | 13 |
| C++11 atomics (sized) | 417–473 | 57 |
| Surface stores (sust) | 474–638 | 165 |
| CUDA varargs + math shim | 639–646 | 8 |
| Tex/surf handler | 647 | 1 |
| C++ ABI + memgen + new/delete | 648–677 | 30 |
| WMMA tensor core (f16) | 678–707 | 30 |
| IMMA/BMMA tensor core | 708–745 | 38 |
| Extended tensor (dmma/tf32/bf16) | 746–764 | 19 |
| WGMMA (SM 90+ warpgroup) | 765–768 | 4 |
| Alloca | 769–770 | 2 |
| TOTAL | 1–770 | 770 |
SM Generation Coverage
| Generation | Features Enabled |
|---|---|
| SM 70 (Volta) | WMMA (half-precision tensor core) |
| SM 75 (Turing) | IMMA (integer), BMMA (binary) |
| SM 80 (Ampere) | DMMA (double), TF32, BF16 |
| SM 90 (Hopper) | WGMMA (warpgroup), cluster ops, f8 |
All 770 builtins are registered regardless of target SM. Architecture gating happens in the lowering layer that consumes the builtin IDs.
Key Observations
- Lazy initialization: The entire table is built on first lookup. Guard: context+492 != 0.
- No texture reads (suld): Only surface store builtins are registered. Texture/surface reads go through __nv_tex_surf_handler (ID 647).
- Write-once table: Tombstone mechanics exist but deletions never occur for the builtin table.
- Duplicate prefix optimization: IDA shows SSE xmmword constant loads for long common prefixes (__nvvm_sust_b_2d_array_*) — this is compiler optimization of string literal loads, not a different code path.
Atomic Operations Builtins
Atomic builtins constitute the largest and most complex category in the NVVM builtin system, spanning over 130 IDs across two distinct subsystems: the legacy NVVM intrinsic atomics (IDs 207--275, 370--379) and the C++11-model atomics (IDs 366, 417--473). Both families converge in the lowering layer at sub_12AE930 (EDG) / sub_9502D0 (NVVM), a 1495-line handler that generates inline PTX assembly with explicit memory ordering and scope annotations.
Two Atomic Subsystems
The compiler maintains two parallel atomic APIs that reflect CUDA's historical evolution. The legacy NVVM atomics (__nvvm_atom_*) predate the C++ memory model and encode scope directly in the builtin name (e.g., __nvvm_atom_cta_add_gen_i for block-scoped integer add). The C++11 atomics (__nv_atomic_*) accept ordering and scope as runtime parameters, matching the cuda::atomic_ref interface.
Both subsystems lower to identical PTX instructions. The distinction matters only during the EDG frontend phase, where sub_6BBC40 generates the mangled __nv_atomic_* names from C++ source, and the NVVM lowering layer sub_12B3FD0 dispatches them by ID.
Legacy NVVM Atomics (IDs 207--275)
These 69 builtins encode the operation, scope, and type directly in the name. The lowering dispatches through sub_12AA9B0 for exchange-style operations and sub_12ADE80 for load/store/fetch operations. Each operation exists in three scope variants: default (device), _cta_ (block), and _sys_ (system).
| ID Range | Operation | Builtin Pattern | PTX Mnemonic |
|---|---|---|---|
| 207--218 | Add | __nvvm_atom_{,cta_,sys_}add_gen_{i,ll,f,d} | atom.add |
| 219--227 | Exchange | __nvvm_atom_{,cta_,sys_}xchg_gen_{i,ll,128} | atom.exch |
| 228--251 | Min/Max | __nvvm_atom_{,cta_,sys_}{min,max}_gen_{i,ll,ui,ull} | atom.min / atom.max |
| 252--257 | Inc/Dec | __nvvm_atom_{,cta_,sys_}{inc,dec}_gen_ui | atom.inc / atom.dec |
| 258--275 | Bitwise | __nvvm_atom_{,cta_,sys_}{and,or,xor}_gen_{i,ll} | atom.and / atom.or / atom.xor |
Legacy CAS (IDs 370--379)
Compare-and-swap builtins include 128-bit variants for SM 70+ targets. The handler sub_12AA280 builds an AtomicCmpXchg IR node with acquire ordering on both success and failure paths and weak exchange semantics.
| ID Range | Operation | Builtin Pattern |
|---|---|---|
| 370--379 | CAS | __nvvm_atom_{,cta_,sys_}cas_gen_{i,ll,us,128} |
Half-Precision Atomics (IDs 459--468)
Added for SM 90+ (Hopper), these support f16x2 and f16x4 packed atomic adds:
| ID Range | Operation | Builtin Pattern | SM Gate |
|---|---|---|---|
| 459--461 | f16x2 add | __nvvm_atom_{,cta_,sys_}add_gen_f2 | SM 90+ |
| 466--468 | f16x4 add | __nvvm_atom_{,cta_,sys_}add_gen_f4 | SM 100+ (Blackwell) |
C++11 Atomics (IDs 366, 417--473)
These 57 builtins implement the CUDA C++ atomic model with explicit memory ordering and scope parameters. The EDG frontend generator at sub_6BBC40 constructs the mangled names using a __nv_atomic_fetch_{op}_{width}_{type} pattern, where width is the byte count (1, 2, 4, 8, or 16) and the type suffix is _u (unsigned), _s (signed), or _f (float).
Thread Fence (ID 366)
__nv_atomic_thread_fence emits either a volatile fence (SM <= 69) or an explicit fence.{ordering}.{scope}; PTX instruction (SM 70+). Ordering and scope are extracted from constant operand parameters at compile time.
Load/Store (IDs 417--428)
| ID | Builtin | Width | PTX |
|---|---|---|---|
| 417 | __nv_atomic_load | generic | ld.{ordering}.{scope}.{type} |
| 418--422 | __nv_atomic_load_{1,2,4,8,16} | 1--16 bytes | same |
| 423 | __nv_atomic_store | generic | st.{ordering}.{scope}.{type} |
| 424--428 | __nv_atomic_store_{1,2,4,8,16} | 1--16 bytes | same |
Fetch-Op (IDs 429--458)
Arithmetic and bitwise fetch operations are registered with width and type suffixes. Bitwise operations (and, or, xor) omit the type suffix since signedness is irrelevant for bitwise logic.
| ID Range | Operation | Builtin Pattern |
|---|---|---|
| 429--434 | fetch_add | __nv_atomic_fetch_add_{4,8}_{u,s,f} |
| 435--440 | fetch_sub | __nv_atomic_fetch_sub_{4,8}_{u,s,f} |
| 441--446 | fetch_and/or/xor | __nv_atomic_fetch_{and,or,xor}_{4,8} |
| 447--452 | fetch_max | __nv_atomic_fetch_max_{4,8}_{u,s,f} |
| 453--458 | fetch_min | __nv_atomic_fetch_min_{4,8}_{u,s,f} |
For fetch_sub with floating-point types (IDs 437, 440), the lowering negates the operand and emits atom.add rather than a dedicated subtraction instruction.
Exchange and CAS (IDs 462--473)
| ID Range | Operation | Builtin Pattern |
|---|---|---|
| 462--465 | Exchange | __nv_atomic_exchange{,_4,_8,_16} |
| 469--473 | CAS | __nv_atomic_compare_exchange{,_2,_4,_8,_16} |
PTX Inline Assembly Generation
The atomic codegen handler at sub_12AE930 (address 0x12AE930, 41KB) generates PTX inline assembly strings at compile time. The generated instruction format depends on the target SM:
Pre-SM 70 (volatile mode, unk_4D045E8 <= 0x45):
ld.volatile.b32 $0, [$1];
atom.add.volatile.u32 $0, [$1], $2;
SM 70+ (explicit memory model):
ld.acquire.gpu.b32 $0, [$1];
st.release.sys.b32 [$0], $1;
atom.add.acq_rel.cta.u32 $0, [$1], $2;
atom.cas.relaxed.gpu.b64 $0, [$1], $2, $3;
The sub_12AE930 / sub_9502D0 Algorithm in Detail
Both the EDG-side handler (sub_12AE930, 0x12AE930) and its NVVM-side twin (sub_9502D0, 0x9502D0) follow identical logic. They accept five parameters: (result, codegen_state, builtin_id, call_arg_list, type_info). The algorithm proceeds in six phases.
Phase 1: SM Version Check and Path Selection
v186 = (unk_4D045E8 <= 0x45) // SM <= 69 -> volatile mode
When v186 is true, the handler enters the pre-SM 70 "volatile" path. All atomic operations receive a .volatile qualifier instead of explicit memory ordering and scope qualifiers. The 128-bit atomics emit diagnostic 0xEB6 (3766) and are rejected entirely.
When v186 is false (SM 70+), the handler enters the memory model path, which constructs the full {mnemonic}.{ordering}.{scope}.{type} format.
Phase 2: Operand Extraction and Builtin ID Dispatch
The handler extracts between 2 and 5 operands from the call argument list (pointer, value, compare-value for CAS, plus the ordering and scope parameters encoded as compile-time constants). The builtin ID selects the PTX mnemonic via a switch:
switch (builtin_id) {
case 417..422: mnemonic = "ld"; // atomic load
case 423..428: mnemonic = "st"; // atomic store
case 429..434: mnemonic = "atom.add"; // fetch-add (unsigned, signed, float)
case 435..440: mnemonic = "atom.add"; // fetch-sub (negated; see below)
case 441..442: mnemonic = "atom.and"; // fetch-and
case 443..444: mnemonic = "atom.or"; // fetch-or
case 445..446: mnemonic = "atom.xor"; // fetch-xor
case 447..452: mnemonic = "atom.max"; // fetch-max
case 453..458: mnemonic = "atom.min"; // fetch-min
case 462..465: mnemonic = "atom.exch"; // exchange
case 469..473: mnemonic = "atom.cas"; // compare-and-swap
default: fatal("unexpected atomic builtin function");
}
For IDs 435--440 (fetch_sub), the handler does not emit atom.sub (which does not exist in PTX). Instead, for integer types it negates the operand and emits atom.add; for float types it negates via fneg and emits atom.add.f.
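The negate-and-reuse-add lowering is exact for integers, because two's-complement negation followed by addition matches subtraction bit-for-bit. A minimal C11 sketch (helper name ours, not a recovered symbol):

```c
#include <stdatomic.h>

/* fetch_sub lowered the way the handler does it: PTX has no atom.sub,
 * so the operand is negated and the fetch-add primitive is reused. */
static int fetch_sub_via_add(_Atomic int *p, int v) {
    return atomic_fetch_add(p, -v);   /* returns the pre-op value */
}
```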
For thread fence (ID 366), the handler branches to sub_12AE0E0 (volatile fence, pre-SM 70) or sub_12AE4B0 (explicit fence, SM 70+) and returns immediately, bypassing the rest of the atomic pipeline.
Phase 3: Memory Ordering Resolution
The ordering parameter is extracted from the first constant operand of the C++11 atomic call via sub_620EE0. The value (0--5) maps to a PTX qualifier string:
| Value | C++ Ordering | PTX Qualifier | Applies To |
|---|---|---|---|
| 0 | relaxed / monotonic | relaxed | All operations |
| 1 | consume (treated as acquire) | acquire | Loads, RMW |
| 2 | acquire | acquire | Loads, RMW |
| 3 | release | release | Stores, RMW |
| 4 | acq_rel | acq_rel | RMW operations |
| 5 | seq_cst | acquire (loads), release (stores) | All |
Sequential consistency (value 5) is downgraded: loads get acquire, stores get release, and RMW operations get acq_rel. True seq_cst semantics are achieved by inserting explicit fences around the operation (see "Fence Insertion for Seq_Cst" below).
Store-specific validation. For store builtins (IDs 423--428), only ordering values 0, 3, and 5 are legal. Any other value triggers fatal("unexpected memory order."). Value 5 is treated as relaxed for the store instruction itself, with the seq_cst fence handling the ordering guarantee externally.
Load-specific validation. For load builtins (IDs 417--422), values 3 (release) and 4 (acq_rel) are illegal and trigger the same fatal error.
Phase 4: Scope Resolution
The scope parameter is extracted from the second constant operand via sub_620EE0. The value (0--4) maps to a PTX scope qualifier:
switch (scope_value) {
case 0: // fall through
case 1: scope_str = "cta"; break; // thread block
case 2:
if (unk_4D045E8 > 0x59) // SM > 89
scope_str = "cluster"; // SM 90+ (Hopper)
else
scope_str = "gpu"; // SM <= 89: fallback
break;
case 3: scope_str = "gpu"; break; // device
case 4: scope_str = "sys"; break; // system
default: fatal("unexpected atomic operation scope.");
}
The cluster scope fallback is the critical SM gate at line 255 / 424 of sub_12AE930 / sub_9502D0: when the SM version is 89 or below, scope value 2 ("cluster") silently degrades to gpu. No diagnostic is emitted; the scope is simply rewritten. On SM 90+ (Hopper and later), cluster passes through to the PTX output.
Phase 5: Type Suffix Construction
The type suffix is built from two components: a type-class letter and a byte-width number. The type-class lookup uses a 4-entry table stored in local variable v196:
v196[0] = 'b' // bitwise (for exch, and, or, xor, cas)
v196[1] = 'u' // unsigned (for add, inc, dec, max, min on unsigned)
v196[2] = 's' // signed (for max, min on signed)
v196[3] = 'f' // float (for add on float/double)
The type-class index is derived from the LLVM type of the atomic operand:
- Integer type with unsigned semantics: index 1 (u)
- Integer type with signed semantics: index 2 (s)
- Floating-point type: index 3 (f)
- All other cases (exchange, CAS, bitwise): index 0 (b)
The byte-width is the size of the atomic operand in bytes. Valid sizes are validated against the bitmask 0x10116:
valid = ((1LL << byte_size) & 0x10116) != 0
This bitmask has bits set at positions 1, 2, 4, 8, and 16, accepting exactly the byte widths {1, 2, 4, 8, 16}. Any other size triggers fatal("unexpected size1").
The resulting suffix is the letter concatenated with the bit width (byte_size * 8): .u32, .s64, .f32, .b128, etc.
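A hedged C reconstruction of the width check and suffix formatting (function name ours):

```c
#include <stdio.h>
#include <string.h>

/* Phase 5 sketch: validate the operand byte width against the 0x10116
 * bitmask (bits 1, 2, 4, 8, 16 set) and format the ".u32"/".b128"-style
 * suffix from the type-class letter and the bit width. */
static int build_type_suffix(char *out, size_t n,
                             char type_letter, unsigned byte_size) {
    if (((1ULL << byte_size) & 0x10116) == 0)
        return -1;                     /* the fatal("unexpected size1") path */
    return snprintf(out, n, ".%c%u", type_letter, byte_size * 8);
}
```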
Phase 6: Inline ASM String Assembly and Emission
The handler assembles the final PTX string by concatenating the components. Two string buffers are maintained throughout: v190 (ordering string) and v193 (scope string), set during phases 3 and 4.
For SM 70+ (memory model mode):
// Loads:
sprintf(buf, "ld.%s.%s.%c%d $0, [$1];", v190, v193, type_letter, bit_width);
// Stores:
sprintf(buf, "st.%s.%s.%c%d [$0], $1;", v190, v193, type_letter, bit_width);
// RMW atomics:
sprintf(buf, "%s.%s.%s.%c%d $0, [$1], $2;", mnemonic, v190, v193, type_letter, bit_width);
// CAS:
sprintf(buf, "%s.%s.%s.%c%d $0, [$1], $2, $3;", mnemonic, v190, v193, type_letter, bit_width);
For pre-SM 70 (volatile mode):
// Loads:
sprintf(buf, "ld.volatile.%c%d $0, [$1];", type_letter, bit_width);
// RMW atomics:
sprintf(buf, "%s.volatile.%c%d $0, [$1], $2;", mnemonic, type_letter, bit_width);
Constraint string construction. The LLVM inline ASM constraint string is built dynamically to match the operand pattern:
| Pattern | Constraint String | Meaning |
|---|---|---|
| Load (ld) | "=r,l,~{memory}" or "=l,l,~{memory}" | result in reg, address in 64-bit reg, memory clobber |
| Store (st) | "l,r,~{memory}" or "l,l,~{memory}" | address in 64-bit reg, value in reg, memory clobber |
| RMW (atom.*) | "=r,l,r,~{memory}" | result, address, operand, memory clobber |
| CAS (atom.cas) | "=r,l,r,r,~{memory}" | result, address, compare, swap, memory clobber |
The register class for result and value operands is r for 32-bit types and l for 64-bit types. 128-bit types use l with pair operands.
The assembled PTX string and constraint string are passed to sub_B41A60 (NVVM side) or the equivalent EDG-side helper, which creates an LLVM InlineAsm node. The node is then emitted via sub_921880 / sub_1285290.
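Putting phases 3 through 6 together for a single RMW atomic, a sketch of the PTX/constraint string pair handed to the InlineAsm constructor (function name ours; buffer handling simplified):

```c
#include <stdio.h>
#include <string.h>

/* Assemble the inline-ASM string pair for one RMW atomic: the PTX
 * template and its matching LLVM constraint string.  Register class is
 * 'r' for 32-bit operands, 'l' for 64-bit, per the table above. */
static void emit_rmw(char *ptx, size_t pn, char *cons, size_t cn,
                     const char *mnemonic, const char *ordering,
                     const char *scope, char type_letter, unsigned bits) {
    snprintf(ptx, pn, "%s.%s.%s.%c%u $0, [$1], $2;",
             mnemonic, ordering, scope, type_letter, bits);
    char rc = (bits == 64) ? 'l' : 'r';
    snprintf(cons, cn, "=%c,l,%c,~{memory}", rc, rc);
}
```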
Fence Insertion for Seq_Cst
When the memory ordering is sequential consistency (value 5) and the SM version supports explicit fences (SM 70+), the handler does not simply emit atom.sc.{scope}. Instead, it implements seq_cst through a fence-bracketed pattern:
- Pre-fence: If the operation is a store or RMW and ordering >= release, the handler calls sub_94F9E0 (membar) or sub_94FDF0 (fence) to emit a leading fence: sub_94F9E0 emits membar.{scope}; as inline PTX, while sub_94FDF0 emits fence.sc.{scope}; or fence.acq_rel.{scope};.
- The atomic operation: Emitted with downgraded ordering (acquire for loads, release for stores, acq_rel for RMW).
- Post-fence: If the operation is a load or RMW and ordering >= acquire, a trailing fence is emitted.
The fence scope matches the atomic operation's scope. The decision to emit membar vs fence depends on the SM version and the specific ordering level: membar is used for the pre-SM 70 path (though that path should not reach this code), and fence.sc / fence.acq_rel for SM 70+.
The pre/post-fence logic is gated by two conditions in the NVVM-side handler:
PRE-FENCE: if (v186 && (v187 - 3) <= 2) // v187 is ordering; range [3,5] = release, acq_rel, seq_cst
POST-FENCE: if (!v175 && v169 == 5) // v175 = is_store; v169 = ordering = seq_cst
The Volatile Fence Handler (sub_12AE0E0)
For thread fence on SM <= 69, sub_12AE0E0 emits a volatile memory barrier. The function takes an ASM buffer and fence configuration parameters. It produces:
membar.{scope};
where the scope is derived from the fence's scope parameter (cta / gl / sys). This is the pre-memory-model equivalent of the explicit fence path.
The Explicit Fence Handler (sub_12AE4B0)
For thread fence on SM 70+, sub_12AE4B0 constructs an explicit fence.{ordering}.{scope}; instruction. The ordering for fences is a restricted set compared to atomics:
| Ordering Value | Fence Qualifier |
|---|---|
| 3 | sc (sequentially consistent) |
| 4 | acq_rel |
| 5 | sc (same as 3) |
| Other | fatal("unexpected memory order.") |
The scope string follows the same rules as atomics. The assembled string is emitted as LLVM inline ASM with a ~{memory} clobber.
Memory Ordering Encoding
The ordering parameter (values 0--5) maps to PTX qualifiers:
| Value | Ordering | Used For |
|---|---|---|
| 0 | relaxed | Default / monotonic |
| 1, 2 | acquire | Loads, RMW |
| 3 | release | Stores |
| 4 | acq_rel | RMW operations |
| 5 | acquire | Sequential consistency (downgraded) |
Scope Encoding
The scope parameter (values 0--4) maps to PTX scope qualifiers:
| Value | Scope | PTX | SM Requirement |
|---|---|---|---|
| 0, 1 | Block | .cta | All |
| 2 | Cluster | .cluster | SM 90+ (Hopper); falls back to .gpu on SM <= 89 |
| 3 | Device | .gpu | All |
| 4 | System | .sys | All |
Type Suffix Construction
The type suffix is built from a 4-entry table: b (bitwise), u (unsigned), s (signed), f (float). Combined with the byte size, this produces suffixes like .u32, .f64, .b128. Valid sizes are validated against the bitmask 0x10116 (bits for 1, 2, 4, 8, and 16 bytes).
The 13 Atomic Operations at PTX Emission
The PTX emission layer at sub_21E5E70 (base) and sub_21E6420 (L2-hinted) implements the final encoding from the NVPTX MachineInstr opcode to the PTX text. The instruction operand word at this stage encodes both scope and operation:
bits[7:4] — scope: 0 = gpu (default), 1 = cta, 2 = sys
bits[23:16] — atomic operation opcode (BYTE2)
The 13-entry dispatch table:
| Opcode | PTX Suffix | L2-Hinted Suffix | Description |
|---|---|---|---|
| 0x00 | .exch.b | .exch.L2::cache_hint.b | Bitwise exchange |
| 0x01 | .add.u | .add.L2::cache_hint.u | Unsigned add |
| 0x02 | (missing) | (missing) | No .add.s in PTX ISA |
| 0x03 | .and.b | .and.L2::cache_hint.b | Bitwise AND |
| 0x04 | (missing) | (missing) | Unused slot |
| 0x05 | .or.b | .or.L2::cache_hint.b | Bitwise OR |
| 0x06 | .xor.b | .xor.L2::cache_hint.b | Bitwise XOR |
| 0x07 | .max.s | .max.L2::cache_hint.s | Signed max |
| 0x08 | .min.s | .min.L2::cache_hint.s | Signed min |
| 0x09 | .max.u | .max.L2::cache_hint.u | Unsigned max |
| 0x0A | .min.u | .min.L2::cache_hint.u | Unsigned min |
| 0x0B | .add.f | .add.L2::cache_hint.f | Float add |
| 0x0C | .inc.u | .inc.L2::cache_hint.u | Unsigned increment |
| 0x0D | .dec.u | .dec.L2::cache_hint.u | Unsigned decrement |
| 0x0E | .cas.b | .cas.L2::cache_hint.b | Compare-and-swap |
Opcodes 0x02 and 0x04 are unoccupied. There is no signed atomic add in PTX (signed add uses .add.u since two's-complement wrapping is identical). Slot 0x04 is simply skipped.
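The equivalence that makes .add.s unnecessary is easy to demonstrate in C (the cast back from unsigned is implementation-defined in the standard, but two's-complement on every platform cicc targets):

```c
#include <stdint.h>

/* Why slot 0x02 (.add.s) is absent: in two's complement, signed and
 * unsigned addition produce identical bit patterns, so atom.add.u
 * covers both.  The cast round-trip models the wrapping hardware add. */
static int32_t add_as_unsigned(int32_t a, int32_t b) {
    return (int32_t)((uint32_t)a + (uint32_t)b);  /* wraps like atom.add.u32 */
}
```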
The scope prefix is emitted before the operation suffix:
bits[7:4] & 0xF:
0 -> (nothing; implicit .gpu scope)
1 -> ".cta"
2 -> ".sys"
Full PTX emission format:
atom[.scope].{op}.{type}{size}
Example: atom.cta.add.u32, atom.sys.cas.b64, atom.exch.b32.
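A sketch of the operand-word decode that feeds this emission (function names ours; the suffix table abbreviates the 15 slots listed above, with NULL in the two unoccupied positions):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* bits[7:4] of the operand word select the scope prefix. */
static const char *scope_str(uint32_t word) {
    switch ((word >> 4) & 0xF) {
        case 0:  return "";        /* implicit .gpu scope */
        case 1:  return ".cta";
        case 2:  return ".sys";
        default: return NULL;
    }
}

/* bits[23:16] (BYTE2) index the 13-entry operation dispatch table. */
static const char *op_str(uint32_t word) {
    static const char *ops[15] = {
        ".exch.b", ".add.u", NULL, ".and.b", NULL, ".or.b", ".xor.b",
        ".max.s", ".min.s", ".max.u", ".min.u", ".add.f", ".inc.u",
        ".dec.u", ".cas.b"
    };
    unsigned op = (word >> 16) & 0xFF;
    return op < 15 ? ops[op] : NULL;
}
```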
L2 Cache Hint System (SM 80+ / Ampere)
sub_21E6420 (address 0x21E6420) is a parallel version of the base atomic emitter sub_21E5E70. It inserts .L2::cache_hint between the operation and type suffix for all 13 atomic operations:
atom[.scope].{op}.L2::cache_hint.{type}{size}
The L2 cache hint instructs the GPU's L2 cache to retain (or evict) the atomic target data after the operation completes. This is a PTX 7.3+ feature introduced with Ampere (SM 80+).
The L2-hinted path is selected when bit 0x400 is set in the instruction's encoding flags. The hint is applied at the MachineInstr level during instruction selection, not during the inline ASM generation phase of sub_12AE930. Both paths produce identical scope and type encoding; the L2 path adds exactly the .L2::cache_hint substring.
String emission uses SSE (xmm) register loads from precomputed constant data at addresses xmmword_435F590 through xmmword_435F620 to fast-copy the 16-byte prefix of each operation string, then patches the remaining bytes. This avoids branch-heavy string concatenation for the 13 cases.
AtomicExpandPass: IR-Level Expansion (sub_20C9140)
Before sub_12AE930 handles the C++11 atomics, and separately from the legacy builtin lowering, an LLVM FunctionPass named "Expand Atomic instructions" (pass ID "atomic-expand", registered at sub_20CA900) runs on LLVM IR to decide which atomic operations the NVPTX target can handle natively and which must be expanded into CAS loops.
Expansion Decision Tree
For each atomic instruction in the function:
- shouldExpandAtomicCmpXchgInIR (vtable +0x258): Default expands all cmpxchg to LL/SC or CAS-based loops. The NVPTX override may keep native i32/i64 cmpxchg on SM 70+.
- shouldExpandAtomicRMWInIR (vtable +0x280):
  - i32 xchg/add/min/max: kept native on all SM.
  - i64 xchg/add: kept native on SM 70+.
  - i32/i64 sub/nand: always expanded to CAS loop (no native PTX instruction).
  - i8/i16 (any operation): always expanded via partword masking.
  - Float atomicAdd: native on SM 70+ (fp32), SM 80+ (fp16/bf16).
- shouldExpandAtomicLoadInIR (vtable +0x270): Native for aligned i32/i64. Expanded for i8/i16 (widen to i32 load + extract) and i128+ (decompose to multiple loads).
- shouldExpandAtomicStoreInIR (vtable +0x278): Native for aligned i32/i64. Expanded for sub-word and >64-bit types.
Sub-Word Atomic Expansion (sub_20CB200)
No NVIDIA GPU architecture through SM 120 supports native sub-word (i8/i16) atomics. The pass generates mask-and-shift wrappers around word-sized CAS loops. The mask generation function sub_20CB200 (2896 bytes) produces a 6-field output struct:
| Field | Name | Purpose |
|---|---|---|
| +0x00 | AlignedAddr | Pointer masked to word boundary: ptr & ~(word_size - 1) |
| +0x08 | AlignedType | Always i32 |
| +0x10 | PtrLSB | Low address bits: ptr & (word_size - 1) |
| +0x18 | ShiftAmt | Bit position within the word: PtrLSB * 8 (little-endian) |
| +0x20 | Inv_Mask | Inverted mask: ~(((1 << (type_size * 8)) - 1) << ShiftAmt) |
| +0x28 | Mask | Mask: (1 << (type_size * 8)) - 1 |
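The field computations can be restated as a small C helper (struct and function names ours; assumes a 4-byte container word and little-endian layout, per the table):

```c
#include <stdint.h>

/* The six fields sub_20CB200 computes for a sub-word atomic at address
 * p, reconstructed for an i32 container word. */
typedef struct {
    uintptr_t aligned_addr;  /* p & ~3: start of the containing word   */
    unsigned  ptr_lsb;       /* p & 3: byte offset within the word     */
    unsigned  shift_amt;     /* bit position = PtrLSB * 8 (LE layout)  */
    uint32_t  mask;          /* (1 << bits) - 1, unshifted             */
    uint32_t  inv_mask;      /* ~(mask << shift_amt)                   */
} PartwordInfo;

static PartwordInfo partword_info(uintptr_t p, unsigned type_bytes) {
    PartwordInfo f;
    f.aligned_addr = p & ~(uintptr_t)3;
    f.ptr_lsb      = (unsigned)(p & 3);
    f.shift_amt    = f.ptr_lsb * 8;
    f.mask         = (1u << (type_bytes * 8)) - 1;
    f.inv_mask     = ~(f.mask << f.shift_amt);
    return f;
}
```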
The CAS loop (sub_20CBD50, 1646 bytes) then:
- Shifts the new value into position: ValOperand_Shifted = new_val << ShiftAmt.
- Loops: loads the word, applies the RMW operation on the masked sub-word, attempts CAS on the full word.
- On success: extracts the sub-word result via shift + mask.
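The same loop shape can be emulated with C11 atomics for an i8 fetch_add inside an i32 word (helper name ours, not a recovered symbol):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Word-sized CAS loop for a sub-word atomic: extract the byte at
 * `shift`, apply the RMW op, splice it back, and CAS the full word.
 * Returns the pre-op value of the byte, like atomicrmw. */
static uint8_t fetch_add_u8(_Atomic uint32_t *word, unsigned shift, uint8_t v) {
    uint32_t old = atomic_load(word), desired;
    do {
        uint8_t  cur      = (uint8_t)(old >> shift);   /* masked sub-word  */
        uint32_t inv_mask = ~(0xFFu << shift);
        desired = (old & inv_mask) | ((uint32_t)(uint8_t)(cur + v) << shift);
        /* on failure, `old` is reloaded by the CAS and we retry */
    } while (!atomic_compare_exchange_weak(word, &old, desired));
    return (uint8_t)(old >> shift);
}
```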
CAS Loop Generation (sub_20C96A0)
For operations that cannot be handled natively, the pass builds a compare-and-swap loop with three basic blocks:
entry -> "atomicrmw.start" -> (CAS failure) -> "atomicrmw.start" (retry)
-> (CAS success) -> "atomicrmw.end"
Steps:
- Load current value from pointer.
- Compute new value using the RMW operation (dispatched through an 11-case switch at sub_20CC690: Xchg, Add, Sub, And, Or, Xor [implied], Nand, Max, Min, UMax, UMin, FMin, FMax).
- Emit cmpxchg with packed success+failure orderings.
- Branch back to start on failure, fall through to end on success.
Ordering-to-Fence Table (address 0x428C1E0)
The pass uses a 7-entry fence decision table indexed by LLVM AtomicOrdering enum:
| Ordering | Index | Release Fence Before? | Acquire Fence After? |
|---|---|---|---|
| NotAtomic | 0 | No | No |
| Unordered | 1 | No | No |
| Monotonic | 2 | No | No |
| Acquire | 3 | No | Yes |
| Release | 4 | Yes | No |
| AcquireRelease | 5 | Yes | Yes |
| SequentiallyConsistent | 6 | Yes | Yes (+ barrier) |
Fence emission calls sub_15F9C80 which creates an LLVM fence instruction with the specified ordering and sync scope.
Memory Barrier and Fence Emission
The PTX emission layer has two dedicated handlers for barriers and fences, separate from the atomic operation emitters.
Memory Barrier (sub_21E94F0)
Emits membar instructions based on a 4-bit operand encoding:
| Value | Instruction | Scope |
|---|---|---|
| 0 | membar.gpu | Device |
| 1 | membar.cta | Thread block |
| 2 | membar.sys | System |
| 3 | fatal("Bad membar op") | Invalid |
| 4 | fence.sc.cluster | Cluster (SM 90+) |
NVVM-Side Membar (sub_94F9E0)
At the NVVM lowering level, sub_94F9E0 handles membar emission with a different scope encoding:
| Scope Value | Scope String | PTX Instruction |
|---|---|---|
| 0, 1 | cta | membar.cta; |
| 2, 3 | gl | membar.gl; |
| 4 | sys | membar.sys; |
| Other | (none) | fatal("unexpected atomic operation scope.") |
NVVM-Side Fence (sub_94FDF0)
Constructs fence.{ordering}.{scope}; from a state array. The ordering mapping is:
| Value | Ordering String |
|---|---|
| 3 | sc |
| 4 | acq_rel |
| 5 | sc |
| Other | fatal("unexpected memory order.") |
Both membar and fence are emitted as inline PTX assembly (not LLVM IR fence instructions) because PTX-level memory ordering semantics have no direct LLVM IR equivalent at the precision NVIDIA requires.
Architecture Gates
| SM Threshold | Effect |
|---|---|
| SM <= 59 | Diagnostic 0xEB6 warning for certain atomic patterns |
| SM 60--69 | Diagnostic 0xEB2 (3762) for specific atomic patterns |
| SM <= 69 | Volatile mode; 128-bit atomics not supported (diagnostic 0xEB4) |
| SM 70+ | Explicit ordering/scope in PTX output |
| SM <= 89 | Scope value 2 silently falls back from cluster to gpu |
| SM <= 89 | Half-precision (2-byte FP) atomics not supported |
| SM 90+ (Hopper) | Cluster scope (.cluster) becomes available |
| SM 90+ | f16x2 packed atomic add (IDs 459--461) |
| SM 90+ | fence.sc.cluster becomes available |
| SM 100+ (Blackwell datacenter) | f16x4 packed atomic add (IDs 466--468) |
EDG Frontend Name Construction
The EDG atomic builtin generator sub_6BBC40 (address 0x6BBC40, 1251 lines) constructs internal function names from C++ cuda::atomic_ref calls. The algorithm uses a dispatch key v165 = *(uint16_t*)(type_node + 176), the EDG "builtin kind" tag, to select the operation:
| v165 (hex) | v165 (dec) | Operation |
|---|---|---|
| 0x6241, 0x6242 | 25153, 25154 | compare_exchange |
| 0x6248, 0x6249 | 25160, 25161 | exchange |
| 0x624F, 0x6250 | 25167, 25168 | fetch_add |
| 0x6257, 0x6258 | 25175, 25176 | fetch_sub |
| 0x625F, 0x6260 | 25183, 25184 | fetch_and |
| 0x6263, 0x6264 | 25187, 25188 | fetch_xor |
| 0x6267, 0x6268 | 25191, 25192 | fetch_or |
| 0x626B, 0x626C | 25195, 25196 | fetch_max |
| 0x6273, 0x6274 | 25203, 25204 | fetch_min |
| 0x627B, 0x627C | 25211, 25212 | load |
| 0x6280, 0x6281 | 25216, 25217 | store |
| 0x6286 | 25222 | thread_fence |
Within each pair, the odd ID is the "generic" overload that enters the renaming path; the even ID has its base name string set explicitly via strcpy.
Name Construction Algorithm (lines 877--996 of sub_6BBC40)
Step 1 -- Base name. Copy the EDG source name, then overwrite with the canonical base for the seven fetch-op builtins:
| v165 | Base name |
|---|---|
| 0x6250 | "__nv_atomic_fetch_add" |
| 0x6258 | "__nv_atomic_fetch_sub" |
| 0x6260 | "__nv_atomic_fetch_and" |
| 0x6264 | "__nv_atomic_fetch_xor" |
| 0x6268 | "__nv_atomic_fetch_or" |
| 0x626C | "__nv_atomic_fetch_max" |
| 0x6274 | "__nv_atomic_fetch_min" |
Step 2 -- Width suffix. Append "_%u" formatted with the type size in bytes from *(uint32_t*)(type_node + 128). For fetch-op builtins, the size is validated with the unsigned comparison (type_size - 4) <= 4, which admits sizes 4 through 8; since type sizes are powers of two, only 4 and 8 pass in practice.
Step 3 -- Type suffix (only for add/sub/max/min; lines 960--996). Reads type_kind = *(uint8_t*)(type_node + 140):
| type_kind | Meaning | Suffix | Condition |
|---|---|---|---|
| 2 | integer | _s | byte_4B6DF90[signedness_byte] != 0 (signed) |
| 2 | integer | _u | byte_4B6DF90[signedness_byte] == 0 (unsigned) |
| 3 | float | _f | Always |
| 6 | unsigned explicit | _u | Always |
byte_4B6DF90 is a 256-entry lookup table that maps the EDG "integer kind" sub-tag (at type_node + 160) to a boolean: 1 = signed, 0 = unsigned.
Bitwise operations (and/or/xor) omit the type suffix entirely.
Naming Pattern Summary
__nv_atomic_fetch_{op}_{width}[_{type}]
{op} = add | sub | and | xor | or | max | min
{width} = 4 | 8 (bytes)
{type} = _s (signed), _u (unsigned), _f (float), or omitted (bitwise)
For load/store/exchange/compare_exchange, only the width suffix is appended; no type suffix.
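The three steps can be condensed into a sketch; the dispatch keys and suffix rules mirror the recovered tables, but the function shape and parameter names are illustrative:

```python
# Sketch of the sub_6BBC40 name-construction algorithm. The dict keys are
# the recovered v165 builtin-kind tags; is_signed stands in for the
# byte_4B6DF90 signedness lookup.
BASE = {
    0x6250: "__nv_atomic_fetch_add", 0x6258: "__nv_atomic_fetch_sub",
    0x6260: "__nv_atomic_fetch_and", 0x6264: "__nv_atomic_fetch_xor",
    0x6268: "__nv_atomic_fetch_or",  0x626C: "__nv_atomic_fetch_max",
    0x6274: "__nv_atomic_fetch_min",
}
TYPED = {0x6250, 0x6258, 0x626C, 0x6274}   # add/sub/max/min take a type suffix

def build_name(builtin_kind, type_size, type_kind, is_signed):
    name = BASE[builtin_kind]
    # Step 2: width suffix. (size - 4) <= 4 is an unsigned-wrap range
    # check; with power-of-two sizes only 4 and 8 actually pass.
    if (type_size - 4) & 0xFFFFFFFF > 4:
        raise ValueError("fetch_op type size must be 4 or 8 bytes")  # diag 0xEA4
    name += f"_{type_size}"
    # Step 3: type suffix, skipped for the bitwise ops (and/or/xor)
    if builtin_kind in TYPED:
        if type_kind == 3:           # float
            name += "_f"
        elif type_kind == 6:         # explicit unsigned
            name += "_u"
        elif type_kind == 2:         # integer: signedness sub-tag lookup
            name += "_s" if is_signed else "_u"
    return name
```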
Validation Diagnostics
| Diagnostic | Hex | Condition |
|---|---|---|
| 852 | 0x354 | Unsupported atomic operation for target |
| 1645 | 0x66D | Wrong return type for builtin |
| 1646 | 0x66E | Unsupported type size (not in {1,2,4,8,16}) |
| 3745 | 0xEA1 | Atomic not supported for given type |
| 3746 | 0xEA2 | First param scope exceeds range (>5) |
| 3747 | 0xEA3 | Return param scope exceeds range (>4) |
| 3748 | 0xEA4 | fetch_op type size not 4 or 8 bytes |
| 3749 | 0xEA5 | Store with type_size <= 1 (too small) |
| 3750 | 0xEA6 | Load with type_size > 3 (too large) |
| 3756 | 0xEAC | CAS parameter type mismatch |
| 3757 | 0xEAD | Exchange parameter type mismatch |
| 3759 | 0xEAF | Float return not supported below SM 90 |
| 3762 | 0xEB2 | SM 60--69 atomic variant diagnostic |
| 3763 | 0xEB3 | Return type on store (SM <= 89) |
| 3764 | 0xEB4 | 128-bit store/load not supported on this SM |
| 3765 | 0xEB5 | 16-bit store not supported on SM <= 69 |
| 3766 | 0xEB6 | Generic warning for SM <= 59 |
| 3767 | 0xEB7 | Type size not in {1,2,4,8,16} bitmask |
| 3769 | 0xEB9 | Null argument list error |
EDG Type Node Field Map
| Offset | Size | Field |
|---|---|---|
| +128 | 8 | type_size (byte count: 1, 2, 4, 8, 16) |
| +140 | 1 | type_kind (0=void, 2=integer, 3=float, 6=unsigned, 8=pointer, 12=typedef) |
| +160 | varies | For type_kind 12 (typedef): pointer to underlying type. For type_kind 2 (integer): uint8_t signedness sub-tag indexed into byte_4B6DF90. |
| +168 | 8 | Pointer chain (for struct/compound types) |
| +176 | 2 | builtin_kind (the v165 dispatch tag, uint16_t) |
NVPTX MachineInstr Atomic Opcodes
At the SelectionDAG / MachineInstr level, atomic operations map to NVPTX-specific opcodes distinct from the inline ASM emission:
| MachineInstr Opcode | PTX Operation |
|---|---|
| 149 | ATOMIC_LOAD |
| 294--297 | atom.add (f32 / f64 / i32 / i64) |
| 302--305 | atom.min (s32 / s64 / u32 / u64) |
| 314--317 | atom.max (s32 / s64 / u32 / u64) |
| 462 | atom.cas (generic) |
These opcodes are emitted by the SelectionDAG lowering for native atomic operations that survive the AtomicExpandPass without expansion.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| sub_6BBC40 | 0x6BBC40 | ~1251 lines | EDG atomic builtin name generator |
| sub_12AA280 | 0x12AA280 | -- | Legacy CAS IR node builder |
| sub_12AA9B0 | 0x12AA9B0 | -- | Legacy atomic exchange handler |
| sub_12ADE80 | 0x12ADE80 | -- | Scoped atomic load/store/fetch handler |
| sub_12AE010 | 0x12AE010 | -- | Fence acquire/release emitter (EDG only; BUG on NVVM) |
| sub_12AE0E0 | 0x12AE0E0 | -- | Volatile fence emitter (pre-SM 70) |
| sub_12AE4B0 | 0x12AE4B0 | -- | Explicit fence emitter (SM 70+) |
| sub_12AE930 | 0x12AE930 | 41KB | PTX inline ASM atomic codegen (EDG side) |
| sub_12B3FD0 | 0x12B3FD0 | 103KB | Main builtin lowering mega-switch |
| sub_20C7CE0 | 0x20C7CE0 | 1399 | AtomicExpandPass: recursive type walker |
| sub_20C84C0 | 0x20C84C0 | 1656 | AtomicExpandPass: address space checker |
| sub_20C9140 | 0x20C9140 | 1204 | AtomicExpandPass: runOnFunction |
| sub_20C96A0 | 0x20C96A0 | 1814 | AtomicExpandPass: CAS loop generation |
| sub_20CA900 | 0x20CA900 | 218 | AtomicExpandPass: registration |
| sub_20CB200 | 0x20CB200 | 2896 | AtomicExpandPass: sub-word mask generation |
| sub_20CBD50 | 0x20CBD50 | 1646 | AtomicExpandPass: partword RMW expansion |
| sub_20CC690 | 0x20CC690 | 43 | AtomicExpandPass: 11-case operation dispatch |
| sub_20CD3E0 | 0x20CD3E0 | 6030 | AtomicExpandPass: partword CmpXchg expansion |
| sub_20CEB70 | 0x20CEB70 | 10640 | AtomicExpandPass: full CmpXchg LL/SC expansion |
| sub_21E5E70 | 0x21E5E70 | -- | PTX emission: base atomic opcode emitter |
| sub_21E6420 | 0x21E6420 | -- | PTX emission: L2-hinted atomic opcode emitter |
| sub_21E8EA0 | 0x21E8EA0 | -- | PTX emission: cluster barrier emitter |
| sub_21E94F0 | 0x21E94F0 | -- | PTX emission: membar/fence emitter |
| sub_9502D0 | 0x9502D0 | 55KB | PTX inline ASM atomic codegen (NVVM side) |
| sub_94F9E0 | 0x94F9E0 | -- | NVVM membar emitter |
| sub_94FDF0 | 0x94FDF0 | -- | NVVM fence emitter |
Cross-References
- Builtin System Overview -- hash table infrastructure and ID dispatch
- SM 70--89 Feature Gates -- unk_4D045E8 thresholds
- SM 90 Hopper Features -- cluster scope, fence.sc.cluster
- SM 100 Blackwell Features -- f16x4 atomics
- PTX Emission -- instruction printer subsystem
- NVPTX Opcodes Reference -- MachineInstr opcode table
- Inline Assembly Codegen -- general inline ASM infrastructure at sub_1292420
Math Function Builtins
Math builtins cover floating-point rounding, transcendental approximations, reciprocal/square-root operations, type conversions, and precise arithmetic with explicit rounding modes. They span IDs 21--184 and 276--403, totaling over 230 entries. Unlike most other builtin categories, many math builtins fall through the dispatch switch entirely and resolve via the generic LLVM intrinsic path.
Bit Manipulation (IDs 21--26)
These integer utility operations map directly to hardware instructions available on all SM targets.
| ID | Builtin | Operation |
|---|---|---|
| 21--22 | __nvvm_clz_{i,ll} | Count leading zeros (32/64-bit) |
| 23--24 | __nvvm_popc_{i,ll} | Population count (32/64-bit) |
| 25--26 | __nvvm_brev_{i,ll} | Bit reverse (32/64-bit) |
Rounding and Absolute Value (IDs 27--46)
Float rounding and absolute value operations exist in three type variants: flush-to-zero single (ftz_f), IEEE single (f), and double (d).
| ID Range | Operation | Variants |
|---|---|---|
| 27--29 | __nvvm_floor_{ftz_f,f,d} | Floor |
| 30--32 | __nvvm_ceil_{ftz_f,f,d} | Ceiling |
| 33--35 | __nvvm_abs_{ftz_f,f,d} | Absolute value (integer-style) |
| 36--38 | __nvvm_fabs_{ftz_f,f,d} | Absolute value (float) |
| 39--41 | __nvvm_round_{ftz_f,f,d} | Round to nearest |
| 42--44 | __nvvm_trunc_{ftz_f,f,d} | Truncate toward zero |
| 45--46 | __nvvm_saturate_{ftz_f,f} | Clamp to [0.0, 1.0] |
Transcendental Approximations (IDs 47--56)
Hardware-accelerated approximations for transcendental functions. These use the GPU's special function units (SFU) and are not IEEE-compliant.
| ID Range | Operation | Variants |
|---|---|---|
| 47--49 | __nvvm_ex2_approx_{ftz_f,f,d} | Base-2 exponential |
| 50--52 | __nvvm_lg2_approx_{ftz_f,f,d} | Base-2 logarithm |
| 53--55 | __nvvm_sin_approx_{ftz_f,f,d} | Sine |
| 56 | __nvvm_cos_approx_ftz_f | Cosine (FTZ only registered) |
Reciprocal (IDs 57--69)
Full-precision reciprocal with all four IEEE rounding modes and three type variants.
| ID Range | Operation | Rounding Modes |
|---|---|---|
| 57--69 | __nvvm_rcp_{rn,rz,rm,rp}_{ftz_f,f,d} | RN (nearest), RZ (zero), RM (minus), RP (plus) |
The range holds 13 entries: 4 rounding modes x 3 type variants account for 12 names, with one additional registered entry.
Square Root and Reciprocal Square Root (IDs 70--87)
| ID Range | Operation | Description |
|---|---|---|
| 70--84 | __nvvm_sqrt_{f,rn,rz,rm,rp}_{ftz_f,f,d} | Square root (5 modes x 3 types) |
| 85--87 | __nvvm_rsqrt_approx_{ftz_f,f,d} | Reciprocal square root (SFU approximation) |
The sqrt_f variant (without rounding qualifier) uses the default hardware rounding. The rsqrt_approx variants use the SFU fast path.
Type Conversions (IDs 88--184)
The largest math subcategory with 97 entries, covering every combination of source type, destination type, rounding mode, and FTZ flag.
Double-to-Float (IDs 88--95)
__nvvm_d2f_{rn,rz,rm,rp}_{ftz,} -- 4 rounding modes x 2 FTZ variants.
Integer/Float Cross-Conversions (IDs 96--177)
82 entries covering permutations of:
- Source types: d (double), f (float), i (int32), ui (uint32), ll (int64), ull (uint64)
- Destination types: same set
- Rounding modes: rn, rz, rm, rp
Pattern: __nvvm_{src}2{dst}_{rounding} (e.g., __nvvm_d2i_rn, __nvvm_f2ull_rz).
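A sketch of this name space: the full cross product yields 120 candidate names, of which 82 are registered (the exact excluded subset is not recovered here), so the generator below enumerates the full grid:

```python
# Sketch enumerating the cross-conversion builtin name grid described
# above. Generation order is illustrative; the real ID assignment is
# table-driven inside the binary.
SRC_DST = ["d", "f", "i", "ui", "ll", "ull"]
ROUND = ["rn", "rz", "rm", "rp"]

def conversion_names():
    names = []
    for src in SRC_DST:
        for dst in SRC_DST:
            if src == dst:
                continue            # no same-type conversion builtins
            for rnd in ROUND:
                names.append(f"__nvvm_{src}2{dst}_{rnd}")
    return names
```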
Half Precision (IDs 178--180)
| ID | Builtin | Description |
|---|---|---|
| 178 | __nvvm_f2h_rn_ftz | Float to half (FTZ, round nearest) |
| 179 | __nvvm_f2h_rn | Float to half (round nearest) |
| 180 | __nvvm_h2f | Half to float |
Bitcast (IDs 181--184)
Reinterpret-cast between integer and float types without value conversion. Lowered via sub_12A7DA0 which emits opcode 0x31 (49, bitcast).
| ID | Builtin | Direction |
|---|---|---|
| 181 | __nvvm_bitcast_f2i | float -> int32 |
| 182 | __nvvm_bitcast_i2f | int32 -> float |
| 183 | __nvvm_bitcast_ll2d | int64 -> double |
| 184 | __nvvm_bitcast_d2ll | double -> int64 |
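The operation itself is an ordinary bit reinterpretation, sketched here with struct pack/unpack (illustrative host-side equivalent, not CICC code):

```python
import struct

# The four bitcast builtins reinterpret bits without value conversion --
# the same semantics as an LLVM bitcast, modeled via pack/unpack.
def bitcast_f2i(x: float) -> int:
    return struct.unpack("<i", struct.pack("<f", x))[0]

def bitcast_i2f(x: int) -> float:
    return struct.unpack("<f", struct.pack("<i", x))[0]

def bitcast_d2ll(x: float) -> int:
    return struct.unpack("<q", struct.pack("<d", x))[0]
```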
Integer Min/Max and Multiply-High (IDs 276--293)
| ID Range | Operation | Types |
|---|---|---|
| 276--279 | __nvvm_{min,max}_{i,ui} | 32-bit signed/unsigned |
| 280--283 | __nvvm_{min,max}_{ll,ull} | 64-bit signed/unsigned |
| 284--289 | __nvvm_f{min,max}_{f,ftz_f,d} | Float min/max (with FTZ) |
| 290--293 | __nvvm_mulhi_{i,ui,ll,ull} | Upper half of multiplication |
Precise Float Arithmetic (IDs 294--349)
These builtins provide IEEE-compliant arithmetic with explicit rounding mode control. Each operation exists in all four rounding modes and up to five type variants (ftz_f, f, ftz_f2, f2, d).
| ID Range | Operation | Entries |
|---|---|---|
| 294--313 | __nvvm_mul_{rn,rz,rm,rp}_{ftz_f,f,ftz_f2,f2,d} | 20 |
| 314--333 | __nvvm_add_{rn,rz,rm,rp}_{ftz_f,f,ftz_f2,f2,d} | 20 |
| 334--349 | __nvvm_div_{rn,rz,rm,rp}_{ftz_f,f,d} | 16 |
FMA (IDs 383--402)
Fused multiply-add with all rounding/type combinations:
| ID Range | Operation | Entries |
|---|---|---|
| 383--402 | __nvvm_fma_{rn,rz,rm,rp}_{ftz_f,f,d,ftz_f2,f2} | 20 |
Miscellaneous (IDs 350, 380--382, 403)
| ID | Builtin | Description |
|---|---|---|
| 350 | __nvvm_lohi_i2d | Compose double from two 32-bit halves |
| 380 | __nvvm_prmt | Byte permute (PRMT instruction) |
| 381--382 | __nvvm_sad_{i,ui} | Sum of absolute differences |
| 403 | __nvvm_fns | Find Nth set bit |
Table-Based Lowering for Precise Arithmetic
The precise arithmetic builtins (mul, add, div, fma with rounding modes) are lowered through sub_12B3540 (address 0x12B3540, 10KB), which uses two lazily-initialized red-black trees (std::map<int, triple>) to map builtin IDs to IR opcode triples.
Tree 1 serves three-operand builtins (FMA): maps ID ranges to opcode 0xF59 with variant codes encoding the rounding mode and type.
Tree 2 serves two-operand builtins (mul, add, div): maps to opcodes 0xE3A, 0xE3B, 0x105E, 0x1061 depending on the operation.
The lookup procedure:
- Extract up to 4 operand arguments from the call expression
- Find the builtin ID in the appropriate tree to obtain (opcode, variant)
- Look up the IR function via sub_126A190
- Emit the call instruction via sub_1285290
- Generate the inline asm fragment via sub_12A8F50
LLVM Intrinsic Fallback Path
Many standard math builtins (floor, ceil, sin, cos, sqrt, fma, exp, log) are not handled by the switch cases at all. When the builtin table lookup returns ID 0 (name not found), the dispatcher falls through to the generic LLVM intrinsic path at LABEL_4 in sub_955A70. This path:
- Checks if the name starts with "llvm." (prefix constant 0x6D766C6C)
- Looks up the intrinsic via sub_B6ACB0 (LLVM intrinsic name-to-ID)
- Lowers all arguments with type-cast insertion where needed
- Emits a standard LLVM call via sub_921880
This means functions like llvm.floor.f32, llvm.cos.f64, and llvm.fma.f32 bypass the builtin ID system entirely and map directly to LLVM's intrinsic infrastructure.
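The prefix constant is simply the string "llvm" read as a little-endian 32-bit integer, as a quick sketch confirms (illustrative, not binary code):

```python
# 0x6D766C6C is the bytes b"llvm" as a little-endian uint32: the
# dispatcher compares the first four bytes of the name against it
# before consulting the intrinsic table.
def has_llvm_prefix(name: str) -> bool:
    raw = name.encode()[:4]
    return int.from_bytes(raw, "little") == 0x6D766C6C
```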
Float Compatibility Wrappers (IDs 643--646)
Four C runtime float functions are registered as builtins for compatibility:
| ID | Builtin | Maps To |
|---|---|---|
| 643 | __ceilf | __nvvm_ceil_f equivalent |
| 644 | __floorf | __nvvm_floor_f equivalent |
| 645 | __roundf | __nvvm_round_f equivalent |
| 646 | __truncf | __nvvm_trunc_f equivalent |
Tensor Core / MMA Builtins
Tensor core builtins implement the Warp Matrix Multiply-Accumulate (WMMA) and Warp Group MMA (WGMMA) interfaces, spanning IDs 678--770 across four SM generations. Each generation added new data types and matrix shapes, resulting in 91 registered builtins that cover half-precision, integer, binary, double-precision, TF32, BF16, and FP8 matrix operations. SM 100 (Blackwell) adds a fifth generation -- tcgen05 -- documented in Tensor / MMA Codegen.
Key Facts
| Property | Value |
|---|---|
| Builtin IDs | 678--770 (93 entries) |
| WGMMA handler (IDs 753--768) | ~800 lines in sub_12B3FD0 / sub_955A70 |
| LLVM intrinsic range (WGMMA) | 5304--5447 (144-entry 5-D grid) plus 10654--10779 (N-dimension table) |
| NVVM lowering | sub_955A70 (105KB), sub_12B3FD0 (103KB) |
| Backend emission | sub_21E74C0 (PTX builder), sub_36E9630 (tcgen05 ISD selection) |
| SM gates | SM 70+ HMMA, SM 72+ IMMA, SM 75+ BMMA, SM 80+ DMMA/TF32/BF16, SM 90+ WGMMA |
WMMA Architecture Evolution
| SM Generation | Feature | ID Range | Count |
|---|---|---|---|
| SM 70 (Volta) | HMMA: FP16 tensor core | 678--707 | 30 |
| SM 75 (Turing) | IMMA: INT8/INT4, BMMA: binary | 708--745 | 38 |
| SM 80 (Ampere) | DMMA: FP64, TF32, BF16 | 746--764 | 19 |
| SM 90 (Hopper) | WGMMA: warp-group MMA, FP8 | 765--768 | 4 |
| SM 100 (Blackwell) | tcgen05: MX formats, block-scale, sparsity | (intrinsic path) | -- |
HMMA -- Half-Precision (IDs 678--707, SM 70+)
The original tensor core builtins provide 16-bit floating-point matrix multiply for three tile shapes. Each shape has 10 operations: load A, load B, load C (f16 and f32 accumulators), store C (f16 and f32), and four MMA variants for input/output precision combinations.
| ID Range | Shape | Builtin Prefix |
|---|---|---|
| 678--687 | 16x16x16 | __hmma_m16n16k16_* |
| 688--697 | 32x8x16 | __hmma_m32n8k16_* |
| 698--707 | 8x32x16 | __hmma_m8n32k16_* |
Per-shape operations (10 each):
| Suffix | Operation | Description |
|---|---|---|
| ld_a | Load A fragment | Load matrix A tile from memory |
| ld_b | Load B fragment | Load matrix B tile from memory |
| ld_c_f16 | Load C (f16) | Load accumulator as half-precision |
| ld_c_f32 | Load C (f32) | Load accumulator as single-precision |
| st_c_f16 | Store C (f16) | Store result as half-precision |
| st_c_f32 | Store C (f32) | Store result as single-precision |
| mma_f16f16 | MMA f16->f16 | FP16 input, FP16 accumulator |
| mma_f32f16 | MMA f16->f32 | FP16 input, FP32 accumulator |
| mma_f16f32 | MMA f32->f16 | FP32 accumulator, FP16 output |
| mma_f32f32 | MMA f32->f32 | FP32 input and accumulator |
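A sketch of the resulting ID-to-name mapping, assuming the straightforward shape-major ordering implied by the ID ranges above:

```python
# Sketch of the HMMA name space: 3 shapes x 10 operations = the 30
# builtins at IDs 678-707. Suffix order follows the table above;
# the exact per-ID ordering within a shape is an assumption.
SHAPES = ["m16n16k16", "m32n8k16", "m8n32k16"]
OPS = ["ld_a", "ld_b", "ld_c_f16", "ld_c_f32", "st_c_f16", "st_c_f32",
       "mma_f16f16", "mma_f32f16", "mma_f16f32", "mma_f32f32"]

def hmma_builtins(first_id=678):
    return {first_id + 10 * i + j: f"__hmma_{shape}_{op}"
            for i, shape in enumerate(SHAPES)
            for j, op in enumerate(OPS)}
```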
IMMA -- Integer MMA (IDs 708--739, SM 75+)
Integer tensor core operations for INT8 and INT4 data types.
INT8 (IDs 708--731)
Three shapes (16x16x16, 32x8x16, 8x32x16), each with 8 operations:
| Suffix | Description |
|---|---|
| ld_a_s8 / ld_a_u8 | Load A fragment (signed/unsigned INT8) |
| ld_b_s8 / ld_b_u8 | Load B fragment (signed/unsigned INT8) |
| ld_c | Load accumulator (INT32) |
| st_c_i32 | Store result (INT32) |
| mma_s8 / mma_u8 | INT8 MMA (signed/unsigned) |
INT4 (IDs 732--739)
Single shape (8x8x32) with the same operation set but _s4 / _u4 type suffixes.
BMMA -- Binary MMA (IDs 740--745, SM 75+)
Binary (1-bit) matrix multiply with XOR-POPC and AND-POPC accumulation modes. Single shape: 8x8x128.
| ID | Builtin | Description |
|---|---|---|
| 740 | __bmma_m8n8k128_ld_a_b1 | Load A fragment (binary) |
| 741 | __bmma_m8n8k128_ld_b_b1 | Load B fragment (binary) |
| 742 | __bmma_m8n8k128_ld_c | Load accumulator |
| 743 | __bmma_m8n8k128_st_c_i32 | Store result |
| 744 | __bmma_m8n8k128_mma_xor_popc_b1 | Binary MMA (XOR + popcount) |
| 745 | __bmma_m8n8k128_mma_and_popc_b1 | Binary MMA (AND + popcount) |
Extended Tensor Core (IDs 746--764, SM 80+)
SM 80 (Ampere) added double-precision, TF32, and BF16 tensor operations.
DMMA -- Double Precision (IDs 746, 751--754)
| ID | Builtin | Description |
|---|---|---|
| 746 | __dmma_m8n8k4_mma_f64 | FP64 MMA |
| 751 | __dmma_m8n8k4_st_c_f64 | Store FP64 result |
| 752--754 | __dmma_m8n8k4_{ld_a,ld_b,ld_c} | Load fragments |
TF32 (IDs 747, 755--757)
| ID | Builtin | Description |
|---|---|---|
| 747 | __mma_tf32_m16n16k8_mma_f32 | TF32 MMA producing FP32 |
| 755--757 | __mma_tf32_m16n16k8_{ld_a,ld_b,ld_c} | Load fragments |
BF16 (IDs 748--750, 758--764)
| ID | Builtin | Description |
|---|---|---|
| 748 | __mma_bf16_m16n16k16_mma_f32 | BF16 16x16x16 MMA |
| 749 | __mma_bf16_m32n8k16_mma_f32 | BF16 32x8x16 MMA |
| 750 | __mma_bf16_m8n32k16_mma_f32 | BF16 8x32x16 MMA |
| 758--764 | __mma_bf16_m*_{ld_a,ld_b} | Load fragments for each shape |
WMMA Lowering Details
Three-Table Lookup
WMMA builtins use a three-table structure for mapping builtin IDs to LLVM intrinsic IDs:
| Table (NVVM) | Entries | Builtin ID Range | Description |
|---|---|---|---|
| dword_3F14840 | 0--29 | 678--707 | HMMA (first-generation, FP16) |
| dword_3F147E0 | 0--23 | 708--731 | IMMA (INT8) |
| dword_3F147A0 | 0--12 | 732--744 | BMMA (binary) / INT4 |
The EDG-side parallel tables live at dword_42810C0 (678--709), dword_4281060 (708--731), dword_4281020 (732--744), addressed from sub_12AC1A0.
Fragment Size Determination
The number of register-level fragments varies by operation and data type:
| Condition | Fragment Count | Example |
|---|---|---|
| First-gen WMMA, BF16, store | 4 | BF16 store_c |
| First-gen WMMA, default | 8 | FP16 mma |
| IMMA, intrinsic 8914/8280 | 2 | INT8 ld_a compact |
| BMMA | 2 | Binary operations |
| IMMA intrinsic 0x22BB/0x22BC/0x22C5/0x22C6 | 4 | INT4 load A/B |
| IMMA intrinsic 0x22BD/0x22BE/0x22C3/0x22C4/0x22CB--0x22CE | 1 | Sub-byte single-element |
| IMMA intrinsic 0x22B7/0x22BF/0x22C7 | 8 | INT8 full-width |
MMA Codegen Flow
The MMA handler (sub_94E0D0 / sub_12AC5F0) processes 5 input operands:
- dest_ptr -- Pointer to output fragment storage
- A_fragment -- Matrix A input (loaded v100 times)
- B_fragment -- Matrix B input (loaded v95 times)
- C_fragment -- Accumulator input (loaded v101 times)
- rowcol -- Layout operand (validated 0--3 for MMA)
An optional satf flag (saturation, validated 0--1) is consumed for most intrinsics except ID 8279.
The handler emits the MMA call via sub_921880 and scatters results back to the destination fragment through v103 iterations of element-wise stores.
Fragment iteration counts per family (NVVM path, sub_94E0D0):
| Family | v95 (load B) | v100 (load A) | v101 (load C) | v103 (store D) |
|---|---|---|---|---|
| BMMA (b1) | 1 | 1 | 2 | 2 |
| IMMA (0x22C0-0x22C1) | 1 | 4 | 8 | 8 |
| IMMA (0x22B8-0x22B9 = 8888-8889) | 2 | 2 | 8 | 8 |
| IMMA (0x22C8-0x22C9 = 8904-8905) | 4 | 1 | 8 | 8 |
| HMMA (default, first-gen) | 8 | 8 | varies | varies (4 or 8) |
The output fragment count is determined by bit-test: (0x300C003 >> (intrinsic_id + 127)) & 1 selects 4 vs 8 fragments.
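Interpreting that constant (and assuming intrinsic_id + 127 is already reduced to a small table offset, as the decompiled expression implies) gives the offsets that select the 4-fragment case:

```python
# The 4-vs-8 output-fragment selection is a bit test against 0x300C003.
# Enumerating the set bits of the 26-bit constant yields the table
# offsets that select 4 output fragments; all others select 8.
MASK = 0x300C003
four_fragment_offsets = [b for b in range(26) if (MASK >> b) & 1]
```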
Architecture Gating -- Exact Thresholds
The architecture version is stored at *(target_info + 252) as a DWORD.
| Function | Gate Expression | Minimum SM | Notes |
|---|---|---|---|
| sub_21DFBF0 hmmastc | v8 > 0x45 | SM 70 | FP16 store |
| sub_21E0360 hmmaldab | v8 > 0x45 | SM 70 | FP16 load A/B |
| sub_21E0870 hmmamma | v8 > 0x45 | SM 70 | FP16 MMA |
| sub_21E1280 immaldab | v8 > 0x47 | SM 72 | INT load; v8==72 && variant>1 rejected |
| sub_21E1D20 immamma | v8 > 0x47 | SM 72 | INT MMA; variant>1 && v8==72 rejected |
| sub_21E2280 bmmamma | v8 > 0x48 | SM 73/75 | Binary MMA |
| sub_36E9630 tcgen05 | arch >= 0x3E8 | SM 100 | Blackwell only |
SM 72 (Xavier) has a unique partial IMMA implementation: only variant 0/1 shapes are supported, with explicit gating that blocks higher variants. This matches hardware reality where Xavier had limited INT8 tensor cores.
WGMMA -- Warp Group MMA (SM 90+ Hopper)
WGMMA operates on an entire warp group (4 warps, 128 threads) rather than a single warp. The system is split across four builtin IDs, 20 auxiliary IDs for fence/store/load operations, and two massive handler blocks totaling ~800 lines of lowering logic.
Builtin Registration
Four builtins are registered in sub_90AEE0 (NVVM) and sub_126A910 (EDG):
| ID | Builtin | Data Type | Lowering Case |
|---|---|---|---|
| 765 (0x2FD) | __wgmma_mma_async_f16 | FP16 | Full operand set (6 chained: A, B, C, scale, negate, sparsity) |
| 766 (0x2FE) | __wgmma_mma_async_bf16 | BF16 | 2-operand (no scale/negate) |
| 767 (0x2FF) | __wgmma_mma_async_tf32 | TF32 | Reduced operand set |
| 768 (0x300) | __wgmma_mma_async_f8 | FP8 (SM 90a+) | Minimal (2 scale operands only) |
WGMMA ID Space Overview
The full WGMMA ID range spans 745--770, subdivided into four functional groups:
| ID Range | Function | Handler |
|---|---|---|
| 745--750 (0x2E9--0x2EE) | Fence / commit / wait | sub_12B1C20 / sub_953BA0 |
| 751--752 (0x2EF--0x2F0) | Store | sub_12B27B0 / sub_954350 |
| 753--764 (0x2F1--0x2FC) | MMA async load (12 variants) | inline / sub_9547E0 |
| 765--768 (0x2FD--0x300) | MMA async compute (4 type builtins) | inline ~800 lines / sub_12B2E10 |
| 769--770 (0x301--0x302) | Warp-group barrier | inline IR via sub_127FC40 |
WGMMA Fence / Commit / Wait (IDs 745--750)
sub_953BA0 (NVVM) / sub_12B1C20 (EDG) builds a red-black tree on first call with 7 entries keyed by builtin ID. Each entry packs:
struct wgmma_fence_entry {
uint32_t id; // builtin ID (745--751)
uint32_t trans_a; // transpose A flag
uint32_t shape; // shape code (0 or 1)
uint32_t trans_b; // transpose B flag
uint32_t a_nregs; // register count for A fragment
uint32_t b_nregs; // register count for B fragment
uint32_t padding; // unused alignment
llvm_type *a_type; // LLVM type for A (i64, i32, i16x2, i32x4)
llvm_type *b_type; // LLVM type for B
llvm_type *c_type; // LLVM type for C (i32x2, i32x8)
};
Decoded entries from local variables v47--v106:
| ID | trans_a | shape | trans_b | a_nregs | b_nregs | A type | B type | C type |
|---|---|---|---|---|---|---|---|---|
| 745 | 0 | 1 | 5 | 1 | 1 | i64 | i64 | -- |
| 746 | 1 | 0 | 1 | 9 | 9 | i32 | i32 | i32x2 |
| 747 | 0 | 0 | 25 | 8 | 8 | i16x2 | i16x2 | -- |
| 748 | 0 | 0 | 23 | 7 | 7 | i32x4 | i32x4 | i32x8 |
| 749 | 0 | 0 | 24 | 7 | 7 | i32x4 | i32x4 | i32x8 |
| 750 | 0 | 0 | 6 | 7 | 7 | i64 | i32x2 | i32x8 |
Output packed encoding (*a4, 64-bit):
| Bits | Field | Source |
|---|---|---|
| [3:0] | trans_a | *(entry+40) |
| [7:4] | shape | *(entry+48) << 4 |
| [15:8] | a_nregs | *(entry+64) << 8 |
| [27:16] | b_nregs | *(entry+72) << 16 |
| [31:28] | padding | *(entry+80) << 28 |
| [63:32] | trans_b | *(entry+56) << 32 |
| [25] | rowcol bit 1 | (rowcol & 2) == 0 ? 0x2000000 : 0x1000000 |
| [27:26] | rowcol bit 0 | ((rowcol & 1) + 1) << 26 |
The fence dispatch validates the rowcol operand (must be 0--3) and emits a 4-argument call to intrinsic 9062 (llvm.nvvm.wgmma.fence.aligned) with 3 type overloads. Fragment operands are prepared via sub_94B510.
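A sketch of the packed encoding per the bit layout above; the rowcol bits, which overlap the same word, are merged separately in the dispatch path and omitted here:

```python
# Sketch of the 64-bit packed fence encoding. Field names follow the
# wgmma_fence_entry struct; shift amounts come from the bit table above.
def pack_fence(trans_a, shape, a_nregs, b_nregs, padding, trans_b):
    return (trans_a             # bits [3:0]
            | (shape   << 4)    # bits [7:4]
            | (a_nregs << 8)    # bits [15:8]
            | (b_nregs << 16)   # bits [27:16]
            | (padding << 28)   # bits [31:28]
            | (trans_b << 32))  # bits [63:32]
```

For example, the decoded entry for ID 746 (trans_a=1, shape=0, a_nregs=9, b_nregs=9, trans_b=1) packs to 0x100090901.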
WGMMA Store (IDs 751--752)
sub_954350 / sub_12B27B0 builds a separate parameter lookup tree. Store operations validate rowcol (0 or 1) and emit a 5-argument call using intrinsic 9145 (llvm.nvvm.wgmma.store) with 2 type overloads. Operands: {constant, B_fragment, descriptor, rowcol, zero}.
WGMMA MMA Async Load (IDs 753--764)
sub_9547E0 (NVVM) / sub_12B2E10 (EDG) builds a 12-entry red-black tree at ctx+656:
| ID | Shape | nregs | Variant | Fragment Type |
|---|---|---|---|---|
| 753 | 1 | 9 | 0 | -- |
| 754 | 1 | 9 | 1 | -- |
| 755 | 1 | 9 | 2 | i16x2 |
| 756 | 25 | 8 | 0 | -- |
| 757 | 25 | 8 | 1 | -- |
| 758 | 25 | 10 | 2 | i32x8 |
| 759 | 23 | 7 | 0 | i32x4 |
| 760 | 23 | 7 | 1 | i32x4 |
| 761 | 24 | 7 | 0 | i32x4 |
| 762 | 24 | 7 | 1 | i32x4 |
| 763 | 6 | 7 | 0 | i32x2/i64 |
| 764 | 6 | 7 | 1 | i32x2/i64 |
Output packed encoding (*a4, 64-bit):
| Bits | Field |
|---|---|
| [63:32] | *(entry+40) << 32 |
| [31:4] | (*(entry+48) << 4) OR rowcol |
| [1] | *(entry+56) << 1 |
Emits intrinsic 9067 (llvm.nvvm.wgmma.mma.async) with 2 type overloads. Arguments: {constant, B_fragment, rowcol_value, zero_constant}. Results scattered via sub_94B940.
WGMMA MMA Async Compute -- The 800-Line Handler (IDs 765--768)
This is the primary WGMMA lowering path. It lives inline in the mega-switch of sub_955A70 (NVVM, lines ~2850--3138) and sub_12B3FD0 (EDG, lines ~2270--3138). The handler implements two completely different intrinsic selection strategies depending on which builtin ID triggered entry.
Argument Extraction
The handler walks the argument chain 7 levels deep from the call expression:
v263 = M dimension (first constant argument)
v512 = accumulator fragments (pointer to fragment array)
v528 = A descriptor (64-bit matrix descriptor or register fragments)
v524 = B descriptor (64-bit matrix descriptor)
v519 = scale factors (A and D scale constants)
v264 = layout params (rowcol encoding)
v516, v265 = shape params (additional dimension info)
v540 = element type info (integer type tag from AST)
Each constant argument is validated through sub_620FD0 (shared between the EDG and NVVM paths), which extracts the integer value and sets an overflow flag. On overflow:
"unexpected constant overflow in __wgmma_mma_async operand"
This check is applied 5 times: once for N dimension, once for each scale factor, and once for each negate/saturation bit.
Per-Builtin Argument Layouts
| ID | Builtin | Operand Chain |
|---|---|---|
| 765 (0x2FD) | _f16 | 6 chained: A, B, C, scaleA, scaleD, negate/saturation |
| 766 (0x2FE) | _bf16 | Separate branch (LABEL_56 path), 2-operand (no scale/negate) |
| 767 (0x2FF) | _tf32 | Rearranged arguments, fewer config bits |
| 768 (0x300) | _f8 | Simplest form, 2 matrix descriptors + config |
Strategy 1: N-Dimension Dispatch (IDs 765--768, inner path)
When the element type is checked and the first argument yields an N dimension, the handler enters a 33-entry switch mapping N values to LLVM intrinsic IDs in the range 10654--10779:
| N | Integer-type Intrinsic | Float-type Intrinsic |
|---|---|---|
| 8 | 10774 | 10775 |
| 16 | 10690 | 10691 |
| 24 | 10734 | 10735 |
| 32 | 10742 | 10743 |
| 40 | 10746 | 10747 |
| 48 | 10750 | 10751 |
| 56 | 10754 | 10755 |
| 64 | 10758 | 10759 |
| 72 | 10762 | 10763 |
| 80 | 10766 | 10767 |
| 88 | 10770 | 10771 |
| 96 | 10778 | 10779 |
| 104 | 10654 | 10655 |
| 112 | 10658 | 10659 |
| 120 | 10662 | 10663 |
| 128 | 10666 | 10667 |
| 136 | 10670 | 10671 |
| 144 | 10674 | 10675 |
| 152 | 10678 | 10679 |
| 160 | 10682 | 10683 |
| 168 | 10686 | 10687 |
| 176 | 10694 | 10695 |
| 184 | 10698 | 10699 |
| 192 | 10702 | 10703 |
| 200 | 10706 | 10707 |
| 208 | 10710 | 10711 |
| 216 | 10714 | 10715 |
| 224 | 10718 | 10719 |
| 232 | 10722 | 10723 |
| 240 | 10726 | 10727 |
| 248 | 10730 | 10731 |
| 256 | 10738 | 10739 |
The even/odd intrinsic ID pairing encodes the distinction between integer-element and float-element variants. Type discrimination uses the AST element type: if the element type is integer with width 10 (i.e., a 10-bit integer signaling bf16/tf32 internal encoding), the even (integer) intrinsic is selected; otherwise the odd (float) intrinsic.
N dimension validation:
if ((N & (N - 1)) != 0)
error("N only supported for powers of two");
This is applied when the N value does not match any case in the 33-entry switch. The N values 8, 16, 32, 64, 128, 256 are powers of two; the intermediate values (24, 40, 48, ..., 248) are non-power-of-two multiples of 8 that are still valid WGMMA dimensions.
Strategy 2: 5-Dimensional Intrinsic Grid (IDs 753--764 path, shared)
For the full WGMMA async variants (handled through sub_12B2E10), the handler selects from a 144-entry intrinsic table spanning IDs 5304--5447, organized as a 5-dimensional grid:
| Dimension | Values | Description |
|---|---|---|
| 1. N | {16, 32, 64, 128} | Output column dimension |
| 2. B_shared | {false, true} | Is B operand from shared memory? (sub_12A71A0 != 0) |
| 3. is_s64 | {false, true} | Is accumulator type s64/int? (type tag 2, subtype 10) |
| 4. scale/negate | varies | A scale nonzero? D scale nonzero? |
| 5. variant | {0x2FD, 0x2FE, 0x2FF, 0x300} | Which builtin triggered entry |
Base addresses and stride:
| N | Base ID | Stride per N |
|---|---|---|
| 128 | 5304 | 24 variants |
| 64 | ~5328 | 24 |
| 32 | ~5352 | 24 |
| 16 | ~5376 | 24 |
| overflow | ~5400--5447 | remaining |
Size-based opcode selection (for f16, ID 765):
| Accumulator Size | Opcode (integer) | Opcode (float) |
|---|---|---|
| 16 | 5332 | 5333 |
| 32 | 5380 | 5381 |
| 64 | 5404 | 5405 |
| 128 | 5308 | 5309 |
| other | 5356/5428 | 5357/5429 |
The mapping formula: base + N_offset + shared_offset + type_offset + variant_offset. The accumulator size is extracted by sub_12A71A0(expr) from the expression type chain.
WGMMA Config Bit Packing
Multiple boolean arguments are packed into a single configuration word passed to the final intrinsic call:
| Bit | Field | Source | Value Semantics |
|---|---|---|---|
| 0 | Accumulate / saturation flag | Final constant operand (v433) | 1 = accumulate into D, 0 = overwrite |
| 1 | ScaleD / transpose flag | v445 constant | 1 = transpose B descriptor |
| 2 | Negate-C / layout flag | v81 / v433 constant | 1 = negate accumulator input |
| 3 | Sign bit for B | v427 constant (if present) | Reserved / sign extension |
| 4 | Negate-A / additional mode | v80 / v427 constant (if present) | 1 = negate A operand |
Combined via: v79 = bit0 | (bit1 << 1) | (bit2 << 2) | (bit4 << 4).
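As a sketch of that packing formula, with the decompiler's v-variables replaced by the field meanings from the table (bit 3 is reserved in the recovered expression):

```python
# Sketch of the WGMMA config-word packing:
# v79 = bit0 | (bit1 << 1) | (bit2 << 2) | (bit4 << 4).
def pack_wgmma_config(accumulate, scale_d, negate_c, negate_a):
    return (accumulate          # bit 0: 1 = accumulate into D
            | (scale_d  << 1)   # bit 1: ScaleD / transpose flag
            | (negate_c << 2)   # bit 2: negate accumulator input
            | (negate_a << 4))  # bit 4: negate A operand
```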
After intrinsic selection, the handler:
- Converts the accumulator pointer to a vector pointer (.asvecptr tag)
- Extracts bitfields from constant operands for mode flags
- Calls sub_1285290 / sub_921880 with name hint "mmafrag"
- Scatters results via sub_94B940 / sub_1280F50 (size 4 = float elements)
WGMMA Validation Summary
All constant arguments pass through sub_620FD0, which extracts the integer value and sets an overflow flag.
| Check | Error Message | Condition |
|---|---|---|
| Constant overflow | "unexpected constant overflow in __wgmma_mma_async operand" | Any integer operand overflows extraction (5 occurrences) |
| N power-of-two | "N only supported for powers of two" | (N & (N - 1)) != 0 and N not in the 33-entry switch |
| rowcol range (fence) | "'rowcol' operand can be 0 or 1 only" | rowcol > 1 for load/store |
| rowcol range (MMA) | (implicit -- validated 0--3) | rowcol > 3 for MMA operations |
WGMMA Support Functions
| Function | Address | EDG Parallel | Purpose |
|---|---|---|---|
| sub_953BA0 | 0x953BA0 | sub_12B1C20 | Fence/commit/wait parameter lookup, builds packed 64-bit encoding |
| sub_9547E0 | 0x9547E0 | sub_12B2E10 | MMA async load parameter lookup, 12-entry red-black tree |
| sub_954350 | 0x954350 | sub_12B27B0 | Store variant parameter lookup |
| sub_94B510 | 0x94B510 | -- | Prepare fragment operand for WGMMA call |
| sub_94B940 | 0x94B940 | sub_1280F50 | Scatter MMA results back to fragment outputs |
| sub_94B2B0 | 0x94B2B0 | -- | Extract fragment element at index (WMMA shared) |
| sub_12A71A0 | 0x12A71A0 | -- | Extract size/dimension from expression type (EDG-only) |
| sub_12A6F10 | 0x12A6F10 | -- | Validate constant integer in range (EDG-only) |
| sub_620FD0 | 0x620FD0 | -- | Extract constant integer with overflow detection (shared) |
Packed MMA Descriptor Word
The MMA PTX string builder at sub_21E74C0 (AsmPrinter) / sub_35F_range (NVPTX backend) reads a packed 64-bit descriptor for all MMA instruction emission. The descriptor is stored at:
v22 = *(QWORD *)(*(QWORD *)(a1 + 16) + 16 * a2 + 8)
| Bits | Field | Query Key | Values |
|---|---|---|---|
| [0] | Row/col layout | "rowcol" | 0=row, 1=col |
| [2:1] | Matrix ID | "mid" | 0=a, 1=b, 2=c, 3=d |
| [7:4] | Binary opcode | "opc" | 0=default, 1=.and.popc, 2=.xor.popc |
| [2:0] | Rounding mode | "rnd" | 0=none, 1=.rn, 2=.rm, 3=.rp, 4=.rz |
| [15:8] | A element type | "aty" | Type enum 1--11 |
| [23:16] | B element type | "bty" | Type enum 1--11 |
| [25:24] | A layout | "al" | 0=row, nonzero=col |
| [27:26] | B layout | "bl" | 0=row, nonzero=col |
| [28] | Saturation | "satf" | 1=.satfinite |
| [39:32] | Shape enum | "shape" | 0x01--0x19, 18 entries |
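As a worked illustration of the layout, a small helper can pull each field out of a descriptor word. This is a sketch built only from the table above; note that the low bits overlap ([0] rowcol, [2:1] mid, [2:0] rnd) because the builder queries fields by key, so which low-bit decoding applies depends on the query context.

```python
# Sketch: field extraction from the packed 64-bit MMA descriptor.
# The low bits overlap ([0] rowcol, [2:1] mid, [2:0] rnd) -- the builder
# queries them by key, so the applicable decoding is context-dependent.
def mma_field(desc, lo, hi):
    """Extract bits [hi:lo] of the descriptor word."""
    return (desc >> lo) & ((1 << (hi - lo + 1)) - 1)

# Hypothetical descriptor: shape=0x12 (m16n8k16), satf set, bl=col,
# bty=7 (bf16), aty=6 (f16).
desc = (0x12 << 32) | (1 << 28) | (1 << 26) | (7 << 16) | (6 << 8)
assert mma_field(desc, 32, 39) == 0x12   # shape: m16n8k16
assert mma_field(desc, 8, 15) == 6       # aty: f16
assert mma_field(desc, 16, 23) == 7      # bty: bf16
assert mma_field(desc, 28, 28) == 1      # satf: .satfinite
```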
Shape Enum
| Enum | Shape | PTX String | Min SM | Notes |
|---|---|---|---|---|
| 0x01 | m8n8k4 | "m8n8k4" | SM 70 | Original Volta HMMA |
| 0x02 | m8n8k16 | "m8n8k16" | SM 72 | Integer MMA (s8/u8) |
| 0x03 | m8n8k32 | "m8n8k32" | SM 75 | Sub-byte (s4/u4) |
| 0x04 | m8n8k64 | "m8n8k64" | SM 75 | Extended sub-byte |
| 0x05 | m8n8k128 | "m8n8k128" | SM 75 | Binary MMA (b1) |
| 0x06 | m8n32k16 | "m8n32k16" | -- | Appears unused in standard paths |
| 0x10 | m16n8k4 | "m16n8k4" | SM 75 | Turing HMMA, f64 on Ampere |
| 0x11 | m16n8k8 | "m16n8k8" | SM 75 | Turing/Ampere HMMA |
| 0x12 | m16n8k16 | "m16n8k16" | SM 80 | Ampere HMMA (bf16, tf32) |
| 0x13 | m16n8k32 | "m16n8k32" | SM 75 | Ampere integer |
| 0x14 | m16n8k64 | "m16n8k64" | SM 75 | Sub-byte integer |
| 0x15 | m16n8k128 | "m16n8k128" | SM 75 | Extended sub-byte |
| 0x16 | m16n8k256 | "m16n8k256" | SM 75 | Binary/sub-byte (largest) |
| 0x17 | m16n16k16 | "m16n16k16" | SM 90 | Square shape, Hopper+ |
| 0x18 | m32n8k16 | "m32n8k16" | SM 80 | Tall shape |
| 0x19 | m16n16k8 | "m16n16k8" | SM 70 | WMMA f16 path |
Unknown shape codes hit the default branch and abort via BUG(). String emission uses fast-path integer stores: *(QWORD *)ptr = 0x36316B386E36316DLL emits "m16n8k16" as a single 8-byte write.
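The fast-path constant can be verified directly: interpreting the QWORD little-endian reproduces the shape string byte for byte.

```python
import struct

# The 8-byte constant 0x36316B386E36316D is ASCII "m16n8k16" stored
# little-endian, so one QWORD store emits the whole shape string.
assert struct.pack("<Q", 0x36316B386E36316D) == b"m16n8k16"
```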
Type Enum
| Enum | Type | Bits | PTX String |
|---|---|---|---|
| 1 | b1 | 1 | "b1" |
| 2 | s4 | 4 | "s4" |
| 3 | u4 | 4 | "u4" |
| 4 | s8 | 8 | "s8" |
| 5 | u8 | 8 | "u8" |
| 6 | f16 | 16 | "f16" |
| 7 | bf16 | 16 | "bf16" |
| 8 | tf32 | 19 | "tf32" |
| 9 | f64 | 64 | "f64" |
| 10 | f32 | 32 | "f32" |
| 11 | s32 | 32 | "s32" |
Any other type code produces fatal error: "Wrong MMA element type".
Shape x Type x Architecture Summary
| Shape | A/B Types | Acc Types | Min SM | Notes |
|---|---|---|---|---|
| m8n8k4 | f16 | f16, f32 | SM 70 | Original Volta |
| m16n8k4 | f64 | f64 | SM 80 | Ampere f64 |
| m16n8k8 | f16 | f16, f32 | SM 75 | Turing+ |
| m16n8k16 | f16, bf16, tf32 | f16, f32 | SM 80 | Ampere+ |
| m16n16k8 | f16 | f16, f32 | SM 70 | WMMA path |
| m16n16k16 | f16, bf16 | f16, f32 | SM 90 | Hopper+ |
| m32n8k16 | f16, bf16 | f16, f32 | SM 80 | Tall shape |
| m8n8k16 | s8, u8 | s32 | SM 72 | Integer MMA |
| m16n8k16 | s8, u8 | s32 | SM 75 | Turing+ |
| m16n8k32 | s8, u8 | s32 | SM 75 | Turing+ |
| m8n8k32 | s4, u4 | s32 | SM 75 | Sub-byte |
| m16n8k64 | s4, u4 | s32 | SM 75 | Sub-byte |
| m8n8k64 | s4, u4 | s32 | SM 75 | Extended sub-byte |
| m16n8k128 | s4, u4 | s32 | SM 75 | Extended sub-byte |
| m8n8k128 | b1 | s32 | SM 75 | Binary (.and.popc, .xor.popc) |
| m16n8k256 | b1 | s32 | SM 75 | Binary extended |
| WGMMA (N=8..256) | f16, bf16, tf32, f8 | f16, f32 | SM 90 | Warp-group, 33 N values |
| tcgen05 (10 variants) | mxf8f6f4, mxf4, mxf4nvf4, f16, bf16, tf32, i8, fp4 | varies | SM 100 | See mma-codegen |
tcgen05 Blackwell Overview (SM 100+)
Full tcgen05 documentation lives in Tensor / MMA Codegen. Key points summarized here for cross-reference:
Data type kinds (bits [8:6] of the tcgen05 operand, emitted by sub_35F3330):
| Value | Kind | Notes |
|---|---|---|
| 0 | mxf4nvf4 | MX FP4 with NV FP4 |
| 1 | f8f6f4 | FP8/FP6/FP4 standard |
| 2 | mxf8f6f4 | MX variant of f8f6f4 |
| 3 | f16 | Half precision |
| 4 | i8 | 8-bit integer (arch-conditional only) |
| 5 | tf32 | TensorFloat-32 |
| 7 | mxf4 | MX FP4 |
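A sketch of the kind decoding, assuming the bits [8:6] placement stated above (value 6 does not appear in the recovered table and is left unmapped):

```python
# Sketch: decode the tcgen05 data-type kind from bits [8:6] of the operand.
# Value 6 is absent from the recovered table and raises KeyError here.
KINDS = {0: "mxf4nvf4", 1: "f8f6f4", 2: "mxf8f6f4", 3: "f16",
         4: "i8", 5: "tf32", 7: "mxf4"}

def tcgen05_kind(operand):
    return KINDS[(operand >> 6) & 0b111]

assert tcgen05_kind(5 << 6) == "tf32"
assert tcgen05_kind(0) == "mxf4nvf4"
```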
Modifier fields:
| Modifier | Bits | Description |
|---|---|---|
| Weight stationary (.ws) | bit 0 | NOT compatible with cta_group::2, mxf8f6f4, fp4 |
| CTA group | bit 1 | cta_group::1 (clear) or cta_group::2 (set) |
| Scale vector size | [3:2] | .scale_vec::1X/2X/4X with per-type constraints |
| Scale input accumulator | bit 4 | f16/tf32 only; NOT on sm_100a/sm_103a |
| Sparsity | bit 5 | MXF4/MXF4NVF4 restricted to arch-conditional |
| Block scale alias | [10:9] | .block16 (0) or .block32 (1) |
Collector modes (emitted by sub_35F38B0):
| Value | Modifier | Constraint |
|---|---|---|
| 1 | .collector::a::lastuse | -- |
| 2 | .collector::a::fill | Cannot combine with .ashift |
| 3 | .collector::a::use | Cannot combine with .ashift |
tcgen05 scaled MMA operand builder (sub_21E8CD0 / sub_35F3E90):
| Bit | Query | Clear | Set |
|---|---|---|---|
| 0 | "scaleD" | "0" | "1" |
| 1 | "negA" | "1" (no negate) | "-1" (negate) |
| 2 | "negB" | "1" | "-1" |
| 3 | "transA" | "0" | "1" |
| 4 | "transB" | "0" | "1" |
Note the asymmetry: scaleD/transA/transB emit boolean "0"/"1" strings, while negA/negB emit sign multiplier "1"/"-1" strings. This reflects the PTX encoding where negation is a multiplication factor and transpose is a boolean flag.
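A minimal sketch of the emission rule, derived only from the bit table above (the bit-to-field assignment and string values are taken from it; everything else is illustrative):

```python
# Sketch: per-bit operand-string emission for the tcgen05 scaled MMA
# builder. scaleD/transA/transB emit boolean "0"/"1"; negA/negB emit
# sign multipliers "1"/"-1", matching the PTX encoding.
FIELDS = [("scaleD", "0", "1"), ("negA", "1", "-1"), ("negB", "1", "-1"),
          ("transA", "0", "1"), ("transB", "0", "1")]

def tcgen05_operand_strings(word):
    return {name: on if (word >> bit) & 1 else off
            for bit, (name, off, on) in enumerate(FIELDS)}

ops = tcgen05_operand_strings(0b00010)  # only bit 1 (negA) set
assert ops["negA"] == "-1"
assert ops["scaleD"] == "0"
assert ops["transA"] == "0"
```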
LLVM Intrinsic Reference
| Intrinsic ID | Name | Usage |
|---|---|---|
| 9062 | llvm.nvvm.wgmma.fence.aligned | WGMMA fence (3 type overloads) |
| 9067 | llvm.nvvm.wgmma.mma.async | WGMMA MMA async load (2 type overloads) |
| 9145 | llvm.nvvm.wgmma.store | WGMMA store (2 type overloads) |
| 10654--10779 | llvm.nvvm.wgmma.mma.async.* | Per-N-dimension variants (126 entries, even=int, odd=float) |
| 5304--5447 | (WGMMA 5-D grid) | Per-N x shared x type x scale x variant (144 entries) |
| 4905--4940 | (tcgen05 ISD opcodes) | tcgen05.mma variants (36 opcodes via 10-way shape switch) |
NVPTX Backend Duplicate Functions
All MMA emission functions exist in two structurally identical copies:
| AsmPrinter (0x21Dxxxx) | NVPTX Backend (0x36Exxxx) | Function |
|---|---|---|
| sub_21DFBF0 | sub_36E91F0 | hmmastc (HMMA store C) |
| sub_21E0360 | sub_36E72A0 | hmmaldab (HMMA load A/B) |
| sub_21E0630 | sub_36E7580 | hmmaldc (HMMA load C) |
| sub_21E0870 | sub_36E77C0 | hmmamma (HMMA MMA) |
| sub_21E1280 | sub_36E7B50 | immaldab (IMMA load A/B) |
| sub_21E15D0 | sub_36E7EA0 | immaldc (IMMA load C) |
| sub_21E1830 | sub_36E8110 | immastc (IMMA store C) |
| sub_21E1D20 | sub_36E8630 | immamma (IMMA MMA) |
| sub_21E2280 | sub_36E8BD0 | bmmamma (Binary MMA) |
| sub_21E8CD0 | sub_35F3E90 | tcgen05 scaled MMA operands |
The pairs differ only in error reporting (sub_16BD130 vs sub_C64ED0) and reference counting functions (sub_1623A60/sub_161E7C0 vs sub_B96E90/sub_B91220).
Cross-References
- Tensor / MMA Codegen -- backend PTX emission, tcgen05 full detail
- NVPTX Opcodes -- ISD opcode numbers
- SM 90 (Hopper) -- WGMMA architecture context, TMA, cluster
- SM 100 (Blackwell) -- tcgen05 architecture context
- Builtin System -- hash table, registration, dispatch architecture
Surface and Texture Builtins
Surface and texture builtins form the largest contiguous block in the builtin table, with 165 surface store entries (IDs 474--638) plus a generic texture/surface handler (ID 647). CUDA separates texture reads (which go through a unified handler) from surface writes (which have dedicated per-format builtins). This asymmetry reflects the hardware: texture reads use a programmable texture pipeline, while surface stores map directly to typed sust (surface store) instructions.
Surface Store Builtins (IDs 474--638)
The 165 sust (surface store) builtins encode the dimensionality, data type, and out-of-bounds behavior directly in the builtin name. They follow the pattern:
__nvvm_sust_b_{dim}_{type}_{oob_mode}
Dimensions (5 variants)
| Dimension | Description |
|---|---|
| 1d | One-dimensional surface |
| 2d | Two-dimensional surface |
| 3d | Three-dimensional surface |
| 1d_array | Array of 1D surfaces |
| 2d_array | Array of 2D surfaces |
Data Types (11 variants)
| Type Suffix | Element Size | Vector |
|---|---|---|
| i8 | 8-bit integer | Scalar |
| i16 | 16-bit integer | Scalar |
| i32 | 32-bit integer | Scalar |
| i64 | 64-bit integer | Scalar |
| v2i8 | 8-bit integer | 2-element vector |
| v2i16 | 16-bit integer | 2-element vector |
| v2i32 | 32-bit integer | 2-element vector |
| v2i64 | 64-bit integer | 2-element vector |
| v4i8 | 8-bit integer | 4-element vector |
| v4i16 | 16-bit integer | 4-element vector |
| v4i32 | 32-bit integer | 4-element vector |
Out-of-Bounds Modes (3 variants)
| Mode | ID Range | Behavior |
|---|---|---|
| clamp | 474--528 | Clamp coordinates to valid range |
| trap | 529--583 | Trigger hardware trap on OOB access |
| zero | 584--638 | Write zero for OOB coordinates |
The total 5 x 11 x 3 = 165 entries are registered as a contiguous block. IDA shows SSE xmmword constant loads for the long common prefix strings (__nvvm_sust_b_2d_array_*), which is the compiler's optimization of string literal initialization during registration.
Surface Store ID Layout
Within each OOB-mode block of 55 entries, the ordering is dimension-major, type-minor:
base + 0..10: 1d x {i8,i16,i32,i64,v2i8,v2i16,v2i32,v2i64,v4i8,v4i16,v4i32}
base + 11..21: 1d_array x {i8,i16,i32,i64,v2i8,v2i16,v2i32,v2i64,v4i8,v4i16,v4i32}
base + 22..32: 2d x {i8,i16,i32,i64,v2i8,v2i16,v2i32,v2i64,v4i8,v4i16,v4i32}
base + 33..43: 2d_array x {i8,i16,i32,i64,v2i8,v2i16,v2i32,v2i64,v4i8,v4i16,v4i32}
base + 44..54: 3d x {i8,i16,i32,i64,v2i8,v2i16,v2i32,v2i64,v4i8,v4i16,v4i32}
Given a surface store builtin ID, the decomposition is:
mode_offset = (id - 474)
oob_block = mode_offset / 55 // 0=clamp, 1=trap, 2=zero
within_block = mode_offset % 55
dim_index = within_block / 11 // 0=1d, 1=1d_array, 2=2d, 3=2d_array, 4=3d
type_index = within_block % 11 // 0=i8 .. 10=v4i32
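The decomposition above can be written out directly; this sketch just inverts the dimension-major, type-minor layout using the tables in this section.

```python
# Invert the sust builtin ID layout: 3 OOB blocks of 55 entries,
# each dimension-major (5 dims) and type-minor (11 types).
DIMS = ["1d", "1d_array", "2d", "2d_array", "3d"]
TYPES = ["i8", "i16", "i32", "i64", "v2i8", "v2i16", "v2i32", "v2i64",
         "v4i8", "v4i16", "v4i32"]
OOB = ["clamp", "trap", "zero"]

def decompose_sust_id(builtin_id):
    assert 474 <= builtin_id <= 638
    off = builtin_id - 474
    return OOB[off // 55], DIMS[off % 55 // 11], TYPES[off % 11]

assert decompose_sust_id(474) == ("clamp", "1d", "i8")
assert decompose_sust_id(529) == ("trap", "1d", "i8")
assert decompose_sust_id(638) == ("zero", "3d", "v4i32")
```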
Texture/Surface Read Handler (ID 647)
All texture reads and surface reads are funneled through a single generic handler:
| ID | Builtin | Description |
|---|---|---|
| 647 | __nv_tex_surf_handler | Dispatch for all texture/surface read operations |
Unlike the surface stores, which have 165 dedicated builtins, texture reads use a string-based dispatch mechanism. The handler is a single builtin that receives the texture/surface operation name as a string operand, then dynamically constructs the appropriate LLVM intrinsic name and emits the call.
Handler Dispatch Algorithm (case 0x287 in sub_955A70)
The NVVM-side lowering for __nv_tex_surf_handler (builtin ID 647, hex 0x287) is the most complex string-based builtin dispatch in cicc. It performs five steps:
Step 1 -- String extraction. Walks the AST operand tree from the call expression to locate the constant string naming the texture/surface operation. Validates that byte 173 of the operand node equals 2 (the constant-string-type marker in the EDG AST). The string is the NVVM intrinsic base name, for example __tex_fetch or __surf_read.
Step 2 -- Element type determination. Decodes the return element type from the AST type node attached to the call. The type switch maps to suffix strings:
| AST Type | Suffix String | LLVM Type |
|---|---|---|
| void | "void" | void |
| char (as signed) | "char_as_schar" | i8 |
| char (as unsigned) | "char_as_uchar" | i8 |
| signed char | "schar" | i8 |
| unsigned char | "uchar" | i8 |
| short | "short" | i16 |
| unsigned short | "ushort" | i16 |
| int | "int" | i32 |
| unsigned int | "uint" | i32 |
| long | "long" | i32/i64 |
| unsigned long | "ulong" | i32/i64 |
| long long | "longlong" | i64 |
| unsigned long long | "ulonglong" | i64 |
| float | "float" | float |
The long/ulong width follows the host ABI convention (32-bit on NVPTX).
Step 3 -- Intrinsic name construction. Concatenates the operation base name with the element type suffix using underscore separation:
intrinsic_name = "{operation_string}_{element_type_suffix}"
For example, __tex_fetch_v4 + float yields __tex_fetch_v4_float.
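Steps 2 and 3 amount to a table lookup plus string concatenation. A minimal sketch (the suffix map is a subset of the table above; the real handler derives the key from the EDG AST type node, not a string):

```python
# Sketch of steps 2-3: map the element type to its suffix string and
# concatenate onto the operation base name with an underscore.
SUFFIX = {"signed char": "schar", "unsigned char": "uchar",
          "short": "short", "unsigned short": "ushort",
          "int": "int", "unsigned int": "uint",
          "long long": "longlong", "unsigned long long": "ulonglong",
          "float": "float", "void": "void"}

def build_intrinsic_name(op_base, element_type):
    return f"{op_base}_{SUFFIX[element_type]}"

assert build_intrinsic_name("__tex_fetch_v4", "float") == "__tex_fetch_v4_float"
```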
Step 4 -- Intrinsic lookup. Resolves the constructed name string via sub_BA8CA0 (NVVM intrinsic table lookup) to obtain the corresponding LLVM intrinsic function declaration. The EDG-side parallel path uses sub_1632190. If the intrinsic is not found, this is a fatal error.
Step 5 -- Call emission. Collects all arguments from the call expression, builds the LLVM function type signature from the argument types, and emits the intrinsic call via sub_921880. Returns a dummy i32 value via sub_AD6530.
This design allows the compiler to support an arbitrary number of texture/surface read variants without enumerating them in the builtin table. The single ID 647 entry is a trampoline that dispatches to hundreds of different NVVM intrinsics at runtime.
__nv_tex_surf_handle_t Built-in Type
The EDG parser recognizes __nv_tex_surf_handle_t as a built-in type (keyword index 277 in the keyword table at sub_72BA30). This opaque type is the C++-level representation of a texture or surface reference handle. When the type appears as a function parameter, the PTX emitter (sub_21502D0, 22KB) produces one of:
| Parameter ABI | PTX Syntax |
|---|---|
| By-value .texref | .param .texref NAME |
| By-value .surfref | .param .surfref NAME |
| By-value .samplerref | .param .samplerref NAME |
| Pointer to .texref | .param .u64 .ptr .texref NAME |
| Pointer to .surfref | .param .u64 .ptr .surfref NAME |
| Pointer to .samplerref | .param .u64 .ptr .samplerref NAME |
The selection between .texref / .surfref / .samplerref is determined by the NVVM metadata attached to the GlobalVariable that the handle references. The NVPTXReplaceImageHandles pass (sub_21DBEA0) performs the final substitution of IR-level image handles into PTX-level texture/surface references during machine-level code emission.
Texture/Surface Map Initialization
The NVVM-side handler sub_954F10 maintains two lazily-initialized red-black tree maps for resolving texture and surface operations. These maps are built once (guarded by flag bytes byte_4F6D3B0 and byte_4F6D378) and cleaned up via __cxa_atexit.
Surface Operation Map (unk_4F6D3C0)
Used when the handler's v8 flag is nonzero (surface path). Contains entries mapping builtin IDs to LLVM intrinsic IDs for surface read operations. Each entry is a 12-byte packed triple:
| Intrinsic ID | Description |
|---|---|
| 0x21CA (8650) | Surface read (primary) |
The map contains 4 entries covering surface read and write variants with address space 4 (constant memory surface descriptors).
Texture Operation Map (unk_4F6D380)
Contains entries for texture fetch operations. The map has 12 entries covering the full matrix of texture modes:
| Intrinsic ID | Mapped Builtin Base | Description |
|---|---|---|
0x1FC6 (8134) | ID 338 | Texture fetch (sync variant) |
0x23C5 (9157) | ID 302 | Texture fetch (base variant) |
0x23C8 (9160) | ID 303 | Texture fetch (alternate) |
These 12 entries span the following texture fetch modes:
| Mode | Behavior |
|---|---|
| Unfiltered fetch | Direct texel access at integer coordinates |
| Filtered fetch | Hardware-interpolated fetch at float coordinates |
| LOD fetch | Explicit level-of-detail selection |
| Gradient fetch | Gradient-based LOD computation |
Map Lookup and Dispatch (sub_954F10)
function TexSurfSampleHandler(retval, ctx, builtin_id, arglist):
// Determine surface vs texture path
is_surface = (v8 flag != 0)
if is_surface:
map = unk_4F6D3C0 // surface map
if not initialized:
populate 4 entries into red-black tree
byte_4F6D3B0 = 1
else:
map = unk_4F6D380 // texture map
if not initialized:
populate 12 entries into red-black tree
byte_4F6D378 = 1
// Tree lookup
entry = rbtree_find(map, builtin_id)
if found:
intrinsic_id = entry.intrinsic_id // e.g. 0x1FC6
else:
intrinsic_id = 0
default_mode = 1
// Create type constant from element type
type_const = sub_BCB2D0(sub_ACD640(...))
// Process 4 standard operands
for operand in [sampler, coordinate, lod, bias]:
if operand != null:
lowered = type_cast(operand, expected_llvm_type)
emit_store(lowered) // sub_B4D190 or sub_B4D3C0
// Build and emit intrinsic call
fn_decl = sub_90A810(intrinsic_tables, intrinsic_id, ...)
sub_921880(fn_decl, args) // emit call
sub_B4D3C0(result) // store result
Operand Processing
For each of the 4 standard texture operands (sampler, coordinate, LOD, bias), the handler:
- Checks if the operand is non-null
- Type-casts to match the expected LLVM type
- Creates a store instruction via sub_B4D190 (loads) or sub_B4D3C0 (stores)
- Builds the LLVM call via sub_90A810 with the resolved intrinsic ID
SelectionDAG Lowering Layer
After NVVM builtin lowering produces LLVM IR intrinsic calls, the SelectionDAG layer translates these into NVPTX-specific DAG nodes. Three subsystems handle different aspects.
Intrinsic Lowering Dispatch (sub_33B0210, 343KB)
The central intrinsic lowering function dispatches on LLVM intrinsic IDs via a giant switch covering ~440 case labels. Texture and surface operations occupy three distinct ID ranges:
| Intrinsic ID Range | Handler | Category |
|---|---|---|
| 0x5D--0x8D (93--141) | sub_33A4350 | Texture fetch bulk handler (50 IDs) |
| 0x8E--0x90 (142--144) | sub_33A3180 | Surface read/write handler (3 IDs) |
| 0x91 (145) | Inline | Complex texture sample with LOD/bias |
| 0x92--0x98 (146--152) | Various | Surface store variants |
| 0x9C--0x9D (156--157) | sub_33AEC60 | Surface atomics |
| 0x9E--0x9F (158--159) | sub_33AFBA0 / sub_340EC60 | Surface special ops |
| 0xA0--0xA2 (160--162) | Various | Surface/texture helpers |
| 0x2952 (10578) | Inline | nvvm_texsurf_handle binding |
| 0x254D+ (9549+) | sub_34B8FD0 | Unified texture sample core |
Texture Fetch Bulk Handler: sub_33A4350
The 50 consecutive intrinsic IDs 0x5D through 0x8D all delegate to a single helper sub_33A4350(state, dag_node). This function maps the intrinsic ID to an NVPTXISD opcode for one of the tex.1d, tex.2d, tex.3d, or tex.a1d/tex.a2d (array) variants.
The intrinsic-to-opcode mapping encodes:
dimension: 1d / 2d / 3d / 1d_array / 2d_array / cubemap
data_type: u32 / s32 / f32 / f32f32 (filtered)
return_width: scalar / v2 / v4
access_mode: level / grad / unified
Each opcode corresponds to a PTX texture instruction pattern that the instruction emitter will later produce.
Complex Texture Sample (Intrinsic ID 0x91)
The most complex texture lowering path. Handles hardware-filtered texture sampling with programmable LOD computation:
- sub_3281100 -- Determines element count for the return type
- sub_3281590 -- Computes alignment for the result buffer
- sub_327FD70 -- Resolves the return MVT (machine value type)
- sub_33CC4A0 -- SM-specific path selection (some SM levels use different instruction encodings)
- sub_3406EB0 (opcode=57) -- Creates the core sample DAG node
- sub_33FAF80 (opcode=213) -- LOD computation DAG node
- sub_3406EB0 (opcode=186) -- Merge result node
- sub_33FAF80 (opcode=389) -- Final type fixup
- Fallback via sub_33A1E80 if the target architecture does not support this texture mode
Surface Read/Write Handler: sub_33A3180
Intrinsic IDs 0x8E (surf1Dread), 0x8F (surf2Dread), 0x90 (surf3Dread) delegate to sub_33A3180(state, dag_node, intrinsic_id). The intrinsic_id parameter selects the dimensionality. This handler produces NVPTXISD suld (surface load) DAG nodes.
Texture/Surface Handle Binding (Intrinsic 0x2952)
The nvvm_texsurf_handle intrinsic (ID 10578) is the mechanism for binding a GlobalVariable to a texture or surface reference. The DAG lowering:
- Validates that operand 0 is metadata wrapping a GlobalVariable -- errors with "nvvm_texsurf_handle op0 must be metadata wrapping a GlobalVariable" otherwise
- Creates a DAG constant node for the handle via sub_3400BD0 (opcode=10579)
- Binds the handle via sub_3406EB0 (opcode=46)
The NVPTXReplaceImageHandles pass (sub_21DBEA0) later resolves these abstract handles into concrete PTX .texref / .surfref globals during machine-level emission.
Unified Texture Sample Core (Intrinsic IDs 0x254D+)
For SM 30+ unified texture mode, a more complex sampling path handles the full matrix of texture configurations:
- sub_34B8FD0 -- Unpacks the parameter block encoding dimension, filtering, coordinate type
- Vtable dispatch at *src+88 -- Selects the sampling mode (point, linear, etc.)
- sub_3409320 -- Creates the sampler state DAG node
- sub_33EB1C0 (opcode=47) -- Creates the core tex/surf sample DAG node with memory semantics
- sub_33FC220 (opcode=2) -- Merges vector result components
- sub_33E5830 + sub_3411630 (opcode=55) -- Packages the final result
- sub_B91FC0 -- Attaches debug info
Two modes exist: v2637=true (unified texture) and v2637=false (legacy separate-handle texture). The unified path is the modern default.
Texture/Surface Binding Lowering (Intrinsic IDs 0x44, 0x45, 0x47)
These intrinsics handle the compile-time binding of texture and surface references. The lowering checks the a1+120 flag to determine whether the reference is a .texref or .surfref:
- sub_3382030 -- Initial binding setup
- sub_3382930 -- Variant analysis via sub_3380DB0 and sub_B58DC0
- sub_3386E40 -- Final binding emission
Intrinsic 0x48 (opcode 332) handles global texture handles, while 0x162 (opcode 331) handles sampler handles. Intrinsic 0x169 dispatches to sub_3400BD0 + sub_3406EB0(opcode=333) for indirect texture access.
Instruction Selection: sub_306A930 (52KB)
The NVPTX instruction selection pass contains a 52KB handler (sub_306A930) dedicated to matching texture/surface DAG nodes to machine instructions. It calls five helper functions:
| Helper | Address | Role |
|---|---|---|
| sub_2FE5F00 | 0x2FE5F00 | Texture instruction type selection |
| sub_2FE5F30 | 0x2FE5F30 | Surface instruction type selection |
| sub_2FE5F60 | 0x2FE5F60 | Image type validation |
| sub_2FE69A0 | 0x2FE69A0 | Coordinate mode encoding |
| sub_2FE6CC0 | 0x2FE6CC0 | Return type dispatch |
The ISel handler selects among tex, suld, sust machine instruction patterns, with address space awareness for the different texture/surface memory regions.
Image Type Validation: sub_21DD1A0 (16KB)
A dedicated 16KB validation function (sub_21DD1A0) checks that the image type encoding is legal for the instruction class. Four error messages cover the instruction categories:
| Error String | Instruction Class |
|---|---|
"Invalid image type in .tex" | Texture fetch |
"Invalid image type in .suld" | Surface load |
"Invalid image type in suq." | Surface query |
"Invalid image type in .sust" | Surface store |
This validation occurs during instruction emission, catching type mismatches that survived earlier lowering.
Surface Store Lowering Details
Surface store builtins in the 474--638 range are handled by the main dispatch switch with a block of consecutive cases. Each case:
- Extracts the surface handle, coordinate(s), and data value(s) from the argument list
- The number of coordinate arguments varies by dimensionality (1D: 1, 2D: 2, 3D: 3, arrays: +1 for layer index)
- The number of data arguments varies by vector width (scalar: 1, v2: 2, v4: 4)
- Emits a call to the corresponding llvm.nvvm.sust.b.* intrinsic
The out-of-bounds mode is encoded in the intrinsic name itself, not as a parameter, which is why each mode requires a separate builtin ID.
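The argument-count arithmetic implied by these rules can be sketched as follows (the +1 handle slot reflects the handle/coordinate/data argument order described above; the exact calling convention of the recovered handler is not confirmed):

```python
# Sketch: argument counts for a sust builtin, per the rules above
# (coords: 1 per dimension, +1 layer index for arrays; data: 1/2/4
# by vector width; plus the surface handle itself).
def sust_arg_counts(dim, type_suffix):
    coords = {"1d": 1, "2d": 2, "3d": 3, "1d_array": 2, "2d_array": 3}[dim]
    if type_suffix.startswith("v4"):
        data = 4
    elif type_suffix.startswith("v2"):
        data = 2
    else:
        data = 1
    return 1 + coords + data  # handle + coordinates + data values

assert sust_arg_counts("1d", "i32") == 3
assert sust_arg_counts("2d_array", "v4i16") == 8
```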
PTX Emission: Sampler State Initializers
The PTX emitter sub_2156420 (20KB) handles module-level emission of texture, surface, and sampler global variables. Sampler references receive structured initializers:
.global .samplerref my_sampler = {
addr_mode_0 = wrap, // or clamp_to_border, clamp_to_edge, mirror
addr_mode_1 = clamp_to_edge,
addr_mode_2 = clamp_to_edge,
filter_mode = linear, // or nearest
force_unnormalized_coords = 1
};
The addressing mode and filter mode values are extracted from NVVM metadata attached to the sampler GlobalVariable. The emitter recognizes these sampler reference types via sub_1C2E890 and generates the structured PTX initializer. Texture and surface references use the simpler forms:
.global .texref my_texture;
.global .surfref my_surface;
End-to-End Pipeline
The complete texture/surface compilation pipeline spans five compiler phases:
| Phase | Function(s) | What Happens |
|---|---|---|
| EDG Frontend | sub_72BA30 | Parses __nv_tex_surf_handle_t as built-in type; keyword 277 |
| NVVM Builtin Lowering | sub_955A70 case 0x287 / sub_954F10 | String-based dispatch constructs LLVM intrinsic names; red-black tree maps resolve builtin IDs to intrinsic IDs |
| SelectionDAG Lowering | sub_33B0210 / sub_33A4350 / sub_33A3180 | 50+ texture intrinsic IDs become NVPTXISD DAG nodes; handle binding validated against GlobalVariable metadata |
| Instruction Selection | sub_306A930 (52KB) | DAG nodes matched to tex.* / suld.* / sust.* machine instructions |
| PTX Emission | sub_2156420 / sub_21DD1A0 | .texref/.surfref/.samplerref globals emitted; image type validated; NVPTXReplaceImageHandles substitutes abstract handles |
Architecture Considerations
Surface and texture operations are available on all SM architectures. However, the texture pipeline has evolved significantly:
- All SM: Basic texture fetch, surface read/write with clamp/trap/zero modes
- SM 30+: Unified texture mode via __nv_tex_surf_handler generic dispatch; v2637=true path in DAG lowering
- SM 90+ (Hopper): Tensor memory accelerator (TMA) operations provide an alternative high-throughput path for bulk data movement, partially overlapping with texture/surface functionality but handled through separate builtins (IDs 411--412)
The 165 surface store builtins are registered unconditionally regardless of target SM. Architecture gating occurs at the PTX emission layer, not during builtin registration or lowering. The complex texture sample path (intrinsic 0x91) has an explicit SM feature gate via sub_33CC4A0 that selects alternate instruction encodings for older architectures, with sub_33A1E80 as the fallback for unsupported targets.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| NVVM builtin lowering dispatch | sub_955A70 | -- | Main switch; case 0x287 handles __nv_tex_surf_handler |
| Texture/surface sample handler | sub_954F10 | -- | Red-black tree dispatch for IDs 302--309, 338--345, 395--402 |
| EDG keyword handler | sub_72BA30 | -- | Parses __nv_tex_surf_handle_t built-in type (keyword 277) |
| NVPTX intrinsic lowering | sub_33B0210 | -- | 343KB central dispatch; tex IDs 0x5D--0x8D, surf IDs 0x8E--0x90 |
| Texture fetch bulk handler | sub_33A4350 | -- | 50 consecutive intrinsic IDs for all tex1D/2D/3D/array variants |
| Surface read/write handler | sub_33A3180 | -- | 3 intrinsic IDs for surf1D/2D/3D read |
| Tex/surf sample DAG node builder | sub_33EB1C0 | -- | Creates memory-typed NVPTXISD sample nodes (opcode 47) |
| Sampler state DAG node builder | sub_3409320 | -- | Creates sampler state binding nodes |
| Surface atomics handler | sub_33AEC60 | -- | Intrinsic IDs 0x9C--0x9D |
| Surface special handler | sub_33AFBA0 | -- | Intrinsic ID 0x9E |
| Texture/surface ISel | sub_306A930 | -- | 52KB instruction selection for tex/suld/sust patterns |
| Image type validator | sub_21DD1A0 | -- | 16KB; validates .tex/.suld/.sust/suq. image types |
| NVPTXReplaceImageHandles | sub_21DBEA0 | -- | Replaces IR image handles with PTX .texref/.surfref |
| Global variable emitter | sub_2156420 | -- | 20KB; emits .texref/.surfref/.samplerref with initializers |
| Parameter list emitter | sub_21502D0 | -- | 22KB; emits .param .texref/.surfref/.samplerref in function signatures |
| visitNVVMTexSurf | sub_2077400 | -- | 20KB SelectionDAGBuilder extension for tex/surf handle lowering |
| NVVM intrinsic lookup | sub_BA8CA0 | -- | Resolves constructed intrinsic name string to LLVM function declaration |
| Intrinsic table lookup | sub_90A810 | -- | Resolves intrinsic ID to function declaration with type overloads |
Cross-References
- Builtin System Overview -- Hash table infrastructure and ID assignment
- Atomics Builtins -- PTX inline asm generation pattern shared by surface stores
- NVPTX Instruction Selection -- ISel pattern matching context
- SelectionDAG Lowering -- DAG node construction infrastructure
- PTX Emission -- Final instruction text generation
- Address Spaces -- Memory space qualifiers for tex/surf
Barrier and Synchronization Builtins
Barrier builtins handle thread synchronization, memory fencing, and cluster-level coordination. They span IDs 1--5 (core barriers), 8--20 (cluster and barrier extensions), and several scattered IDs for memory barriers and fences. The lowering layer emits either LLVM intrinsic calls or inline PTX assembly, depending on whether the operation has a direct LLVM IR equivalent.
Core Barriers (IDs 1--5)
The most fundamental synchronization primitives in CUDA map to the lowest builtin IDs.
| ID | Builtin | PTX Equivalent | Description |
|---|---|---|---|
| 1 | __syncthreads | bar.sync 0 | Block-wide barrier |
| 2 | __nvvm_bar0 | bar.sync 0 | Alias for __syncthreads |
| 3 | __nvvm_membar_cta | membar.cta | CTA-scope memory fence |
| 4 | __nvvm_membar_gl | membar.gl | Device-scope memory fence |
| 5 | __nvvm_membar_sys | membar.sys | System-scope memory fence |
The core __syncthreads (ID 1) lowers to the LLVM intrinsic llvm.nvvm.barrier0 (intrinsic ID 8259). Memory barriers at IDs 3--5 are lowered via inline IR generation: the handler builds a barrier store node through sub_128B420 / sub_92C9E0 and inserts it into the current basic block.
Barrier Extensions (IDs 15--20)
These builtins extend the basic barrier with predicate reduction and explicit warp/block synchronization.
| ID | Builtin | Intrinsic | Description |
|---|---|---|---|
| 15 | __nvvm_bar0_popc | llvm.nvvm.barrier0.popc | Barrier + population count of predicate |
| 16 | __nvvm_bar0_and | llvm.nvvm.barrier0.and | Barrier + AND reduction of predicate |
| 17 | __nvvm_bar0_or | llvm.nvvm.barrier0.or | Barrier + OR reduction of predicate |
| 18 | __nvvm_bar_sync_all | llvm.nvvm.barrier.sync (8925) | Named barrier sync (all threads) |
| 19 | __nvvm_barrier_sync | llvm.nvvm.barrier.sync.cnt (9296) | Named barrier sync with count |
| 20 | __nvvm_bar_warp_sync | llvm.nvvm.bar.warp.sync (8258) | Warp-level barrier |
The reduction barriers (IDs 15--17) are dispatched through sub_12AB550 / sub_94C360. The handler looks up intrinsic 3767 (EDG) or the corresponding entry from dword_3F14778[] (NVVM) and emits a function call via sub_1285290 / sub_921880. ID 16 sets flag=1 (AND) and ID 17 sets flag=16|0 (OR); the population count variant uses the default flag.
Barriers with explicit count (IDs 205--206, __nvvm_bar_sync_all_cnt and __nvvm_barrier_sync_cnt) follow the same pattern with additional count arguments.
Cluster Operations (IDs 8--14, SM 90+)
Thread block cluster operations were introduced with SM 90 (Hopper). These builtins query cluster geometry and perform inter-block synchronization within a cluster.
Cluster Geometry Queries (IDs 8--10, 405--408)
| ID | Builtin | Handler | Description |
|---|---|---|---|
| 8 | __nv_clusterDimIsSpecified_impl | sub_12AB0E0(ctx, 0) | Whether cluster dimensions are explicit |
| 9 | __nv_clusterRelativeBlockRank_impl | sub_12AB0E0(ctx, 1) | Block rank within cluster |
| 10 | __nv_clusterSizeInBlocks_impl | sub_12AB0E0(ctx, 2) | Number of blocks in cluster |
| 405 | __nv_clusterDim_impl | -- | Cluster dimension |
| 406 | __nv_clusterRelativeBlockIdx_impl | -- | Block index within cluster |
| 407 | __nv_clusterGridDimInClusters_impl | -- | Grid dimension in cluster units |
| 408 | __nv_clusterIdx_impl | -- | Cluster index |
Cluster Barriers (IDs 11--14)
| ID | Builtin | Intrinsic ID | Description |
|---|---|---|---|
| 11 | __nv_cluster_barrier_arrive_impl | 3767 | Signal arrival at cluster barrier |
| 12 | __nv_cluster_barrier_wait_impl | 3767 | Wait at cluster barrier |
| 13 | __nv_cluster_barrier_arrive_relaxed_impl | 3767 | Relaxed arrival (no ordering guarantee) |
| 14 | __nv_threadfence_cluster_impl | 4159 / 9052 | Cluster-scope memory fence |
The cluster fence at ID 14 emits intrinsic llvm.nvvm.cp.async.commit.group (EDG intrinsic 4159, NVVM intrinsic 9052) with a flag constant of 4, encoding the thread-fence semantic.
Cluster Shared Memory (IDs 202--203, 365)
| ID | Builtin | Description |
|---|---|---|
| 202 | __nv_isClusterShared_impl | Query if address is in cluster shared memory |
| 203 | __nv_cluster_query_shared_rank_impl | Get rank of block that owns shared address |
| 365 | __nv_cluster_map_shared_rank_impl | Map address to another block's shared memory |
ID 203 has an SM-dependent lowering path: on SM <= 63, the handler returns an inline constant (passthrough); on SM 64+, it emits intrinsic 3769 (EDG) / 8825 (NVVM). The same pattern applies to ID 365, which gates on intrinsic 3770 / 9005.
Memory Fence Lowering
Memory fences are emitted as inline PTX assembly because they have no direct LLVM IR equivalent. Two handlers exist:
sub_94F9E0 -- membar (CTA/Device/System)
Generates membar.{scope}; where scope is determined by the scope parameter:
| Scope Value | PTX Output |
|---|---|
| 0, 1 | membar.cta; |
| 2, 3 | membar.gl; |
| 4 | membar.sys; |
The constraint string is ~{memory} to ensure the compiler treats the fence as a full memory clobber. The emitted node receives two memory attributes: inaccessiblemem (attribute 41) and a readonly fence marker (attribute 6).
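The scope-to-mnemonic mapping above is simple enough to restate as a behavioral model. The sketch below is illustrative only: `membar_ptx` is a hypothetical name, and the real handler emits a full inline-asm node with the `~{memory}` clobber rather than a bare string.

```python
# Behavioral model of the membar scope dispatch described above
# (function name hypothetical; reconstructed from the scope table).
def membar_ptx(scope: int) -> str:
    """Return the PTX mnemonic the membar handler selects for a scope value."""
    if scope in (0, 1):
        return "membar.cta;"
    if scope in (2, 3):
        return "membar.gl;"
    if scope == 4:
        return "membar.sys;"
    raise ValueError(f"unknown membar scope {scope}")

print(membar_ptx(0))  # membar.cta;
print(membar_ptx(4))  # membar.sys;
```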
sub_94FDF0 -- fence (with explicit ordering)
Generates fence.{ordering}.{scope}; for SM 70+ targets:
| Ordering Value | PTX Qualifier |
|---|---|
| 3 | sc (sequentially consistent) |
| 4 | acq_rel |
| 5 | sc (same as 3) |
Both fence handlers use sub_B41A60 to create the inline assembly call and sub_921880 to emit it into the instruction stream.
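The ordering table plus the SM 70+ gate reduce to a small mapping, sketched below under the stated assumptions (the helper name and scope-string parameter are hypothetical; the decompiled handler works on IR nodes, not strings).

```python
# Sketch of sub_94FDF0's ordering -> PTX qualifier mapping with the SM 70+ gate
# (illustrative model, not the decompiled code).
_ORDERING = {3: "sc", 4: "acq_rel", 5: "sc"}  # ordering 5 behaves the same as 3

def fence_ptx(ordering: int, scope: str, sm: int) -> str:
    if sm < 70:
        raise ValueError("explicit fence ordering requires sm_70+")
    return f"fence.{_ORDERING[ordering]}.{scope};"

print(fence_ptx(3, "cta", 75))  # fence.sc.cta;
print(fence_ptx(4, "sys", 90))  # fence.acq_rel.sys;
```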
Async Memory Copy Barriers (IDs 367--369)
The cp.async instructions for asynchronous global-to-shared memory copies include implicit barrier semantics:
| ID | Builtin | Size | Description |
|---|---|---|---|
| 367 | __nv_memcpy_async_shared_global_4_impl | 4 bytes | Async copy with barrier |
| 368 | __nv_memcpy_async_shared_global_8_impl | 8 bytes | Async copy with barrier |
| 369 | __nv_memcpy_async_shared_global_16_impl | 16 bytes | Async copy with barrier |
These are lowered through sub_12AB730 / sub_94C5F0, which builds the cp.async PTX instruction with the specified transfer size.
Architecture Gates
| SM Threshold | Barrier Feature |
|---|---|
| All SM | __syncthreads, membar.{cta,gl,sys}, barrier reductions |
| SM 70+ | Explicit fence ordering (fence.{ordering}.{scope}) |
| SM 70+ | cp.async asynchronous memory copy with barrier |
| SM 90+ (Hopper) | Cluster barriers, cluster fence, cluster shared memory queries |
Lowering Strategy Summary
Barrier builtins use three distinct lowering strategies:
- LLVM intrinsic call -- __syncthreads, barrier reductions, cluster barriers. These map to well-known LLVM/NVVM intrinsic IDs (8259, 8925, 9296, etc.) and emit via sub_1285290.
- Inline IR generation -- Memory barriers (__nvvm_membar_*). The handler directly constructs barrier store IR nodes without going through an intrinsic lookup.
- Inline PTX assembly -- Memory fences (membar.*, fence.*). These have no LLVM IR equivalent and are emitted as inline asm strings with ~{memory} clobber constraints.
Warp-Level Operation Builtins
Warp-level builtins provide lane-to-lane communication within a 32-thread warp. They cover four major categories: shuffle (data exchange between lanes), vote (predicate aggregation), match (value matching across lanes), and redux (warp-wide reductions). The shuffle operations also serve as the lowering target for the WMMA fragment load/store operations described in the tensor core page.
Shuffle Operations (IDs 413--416)
The __shfl_sync family enables direct register-to-register communication between warp lanes. Four shuffle modes exist, each registered as a _sync variant:
| ID | Builtin | Mode | Description |
|---|---|---|---|
| 413 | __nvvm_shfl_up_sync | Up | Lane reads from lane - delta |
| 414 | __nvvm_shfl_down_sync | Down | Lane reads from lane + delta |
| 415 | __nvvm_shfl_bfly_sync | Butterfly | Lane reads from lane XOR delta |
| 416 | __nvvm_shfl_idx_sync | Index | Lane reads from arbitrary srcLane |
Shuffle Dispatch via Table Lookup
All shuffle builtins route through sub_12B3540 (EDG) / sub_954F10 (NVVM), the table-based lowering handler. Three groups of 8 IDs each cover the complete shuffle interface:
| ID Range | Group | Description |
|---|---|---|
| 302--309 | Legacy __shfl | Non-sync variants (4 modes x 2 types: i32/f32) |
| 338--345 | __shfl_sync | Sync variants with mask (4 modes x 2 types) |
| 395--402 | __shfl_*_sync | Newer SM interface (4 modes x 2 types) |
Within each group of 8, the layout is:
| Offset | Mode | i32 Variant | f32 Variant |
|---|---|---|---|
| +0, +1 | shfl_up | offset +0 | offset +1 |
| +2, +3 | shfl_down | offset +2 | offset +3 |
| +4, +5 | shfl_xor | offset +4 | offset +5 |
| +6, +7 | shfl_idx | offset +6 | offset +7 |
The handler builds the argument list (mask, value, delta/lane, width), looks up the target intrinsic by shuffle mode and data type from its red-black tree map, and emits a function call.
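Because the group layout is purely positional, a builtin ID can be decoded into (group, mode, element type) arithmetically. A minimal sketch, with group bases and slot order taken from the tables above (the helper name is hypothetical):

```python
# Decode a shuffle builtin ID per the 8-slot group layout:
# slots +0/+1 = up, +2/+3 = down, +4/+5 = xor, +6/+7 = idx;
# even slot = i32 variant, odd slot = f32 variant.
GROUP_BASES = {302: "legacy __shfl", 338: "__shfl_sync", 395: "__shfl_*_sync"}
MODES = ("up", "down", "xor", "idx")

def decode_shuffle(builtin_id: int):
    for base, group in GROUP_BASES.items():
        off = builtin_id - base
        if 0 <= off < 8:
            return group, MODES[off // 2], "f32" if off % 2 else "i32"
    raise KeyError(f"not a shuffle builtin: {builtin_id}")

print(decode_shuffle(341))  # ('__shfl_sync', 'down', 'f32')
print(decode_shuffle(302))  # ('legacy __shfl', 'up', 'i32')
```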
Vote Operations (IDs 351--358)
Warp vote builtins aggregate a boolean predicate across all participating lanes. Both legacy (non-sync) and sync variants are registered.
| ID | Builtin | Operation | Sync |
|---|---|---|---|
| 351 | __nvvm_vote_all | All predicates true? | No |
| 352 | __nvvm_vote_any | Any predicate true? | No |
| 353 | __nvvm_vote_uni | All predicates equal? | No |
| 354 | __nvvm_vote_ballot | Bitmask of predicates | No |
| 355 | __nvvm_vote_all_sync | All predicates true? | Yes |
| 356 | __nvvm_vote_any_sync | Any predicate true? | Yes |
| 357 | __nvvm_vote_uni_sync | All predicates equal? | Yes |
| 358 | __nvvm_vote_ballot_sync | Bitmask of predicates | Yes |
Vote Lowering
The handler sub_12ABB90 (EDG) / sub_94D570 (NVVM) takes parameters:
(result, ctx, vote_op, args, is_ballot, is_sync)
The vote_op encoding: 0 = all, 1 = any, 2 = uni, 3 = ballot.
When is_sync=1, an extra mask argument is consumed from the call arguments. For non-sync variants, the handler looks up intrinsic 5301 (llvm.nvvm.vote). For sync variants, it generates an inline predicate pattern. The ballot variant (vote_op=3) sets is_ballot=1, which changes the return type from i1 (predicate) to i32 (bitmask).
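The parameter encoding can be restated as a decode from the builtin ID. The sketch below models only the ID-to-parameter mapping described above (function name hypothetical, not the handler itself):

```python
# Map vote builtin IDs 351-358 to (vote_op name, is_ballot, is_sync),
# per the encoding 0 = all, 1 = any, 2 = uni, 3 = ballot; IDs 355-358
# are the _sync variants that consume an extra mask argument.
VOTE_OPS = ("all", "any", "uni", "ballot")

def lower_vote(builtin_id: int):
    off = builtin_id - 351
    if not 0 <= off <= 7:
        raise KeyError(f"not a vote builtin: {builtin_id}")
    vote_op = off % 4
    return VOTE_OPS[vote_op], vote_op == 3, off >= 4

print(lower_vote(354))  # ('ballot', True, False)
print(lower_vote(358))  # ('ballot', True, True)
```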
Match Operations (IDs 361--364)
Match builtins find lanes with equal values and return a bitmask of matching lanes. Available in 32-bit and 64-bit variants with two matching modes.
| ID | Builtin | Width | Mode | Intrinsic |
|---|---|---|---|---|
| 361 | __match32_any_sync | 32-bit | Any match | 0x1011 |
| 362 | __match64_any_sync | 64-bit | Any match | 0x1011 |
| 363 | __match32_all_sync | 32-bit | All match | 0x100F |
| 364 | __match64_all_sync | 64-bit | All match | 0x100F |
The handler sub_12AD230 (EDG) dispatches on two opcodes: 0x1011 for any-match and 0x100F for all-match. The NVVM-side handler sub_94F430 uses intrinsic pairs 0x2017 / 0x2018 with mode variants 0, 1, 2 to encode the width and match type.
Warp Redux (via sub_12ADD20)
Warp-wide reduction operations perform arithmetic reductions across all active lanes in a single instruction. These are dispatched through sub_12ADD20 (EDG) / sub_94F250 (NVVM).
| Operation | NVVM Intrinsic | Reduction | Description |
|---|---|---|---|
| redux.sync.add | 0x24F5 (9461) | Sum reduction | Sum of values across warp |
| redux.sync.min | 0x24ED (9453) | Minimum reduction | Minimum value across warp |
| redux.sync.max | 0x24E9 (9449) | Maximum reduction | Maximum value across warp |
| redux.sync.or | 0x24F1 (9457) | Bitwise OR reduction | OR of values across warp |
The EDG side uses intrinsic codes 0x2332 and 0x2330 for the two redux variant families.
Activemask and Lanemask
The active mask and per-lane mask builtins are handled through sub_12ADB00 (EDG) / sub_94CF30 (NVVM):
These builtins return the set of currently active lanes (__activemask()) or per-lane positional masks (__lanemask_lt(), __lanemask_le(), __lanemask_eq(), __lanemask_ge(), __lanemask_gt()). They compile to PTX special register reads (%lanemask_*).
Predicate-Register Conversion (IDs 411--412)
Two builtins convert between predicate registers and general-purpose registers:
| ID | Builtin | Direction | Description |
|---|---|---|---|
| 411 | __nv_p2r | Predicate -> Register | Pack predicates into a 32-bit register |
| 412 | __nv_r2p | Register -> Predicate | Unpack a 32-bit register into predicates |
The handler generates element-wise operations: sub_9483E0 iterates over vector elements using sub_39FAC40 to compute the element count, then builds per-element extractelement + store (for p2r) or load + insertelement (for r2p) chains.
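The net effect of those chains can be modeled in scalar Python. This is a semantic sketch of what the packed result looks like, not the element-wise IR the handler actually emits:

```python
# __nv_p2r semantics: pack up to 32 predicates into one 32-bit register,
# bit i holding predicate i.
def p2r(predicates):
    reg = 0
    for i, p in enumerate(predicates[:32]):
        if p:
            reg |= 1 << i
    return reg

# __nv_r2p semantics: unpack a 32-bit register into per-bit predicates.
def r2p(reg, n=32):
    return [bool((reg >> i) & 1) for i in range(n)]

packed = p2r([True, False, True, True])
print(packed)          # 13
print(r2p(packed, 4))  # [True, False, True, True]
```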
Nanosleep and CP.Async
Warp-adjacent utility builtins handled through sub_12AD230 / sub_94ED50:
| ID Range | Operation | Description |
|---|---|---|
| 367--369 | __nv_memcpy_async_shared_global_{4,8,16}_impl | Asynchronous copy (cp.async) |
These builtins combine data movement with implicit synchronization and are lowered through sub_12AB730 / sub_94C5F0, which builds the cp.async PTX instruction with the specified transfer size (4, 8, or 16 bytes).
Architecture Requirements
| Feature | Minimum SM | Notes |
|---|---|---|
| __shfl (legacy, non-sync) | SM 30+ | Deprecated; requires full warp convergence |
| __shfl_sync | SM 70+ (Volta) | Explicit mask; independent thread scheduling |
| Vote (non-sync) | SM 30+ | Deprecated |
| Vote (_sync) | SM 70+ | Explicit mask required |
| Match (_sync) | SM 70+ | Warp-level value matching |
| Redux (redux.sync.*) | SM 80+ (Ampere) | Hardware-accelerated warp reduction |
| Elect sync | SM 90+ (Hopper) | Single-lane election from active mask |
| cp.async | SM 80+ | Asynchronous shared memory copy |
GPU Target Architecture
45 SM variants across 6 generations. Processor table at qword_502A920 (stride-2 layout: name + PTX version). Architecture gating throughout the binary controls feature availability.
| SM table | qword_502A920 (45 entries, ctor_605 at 0x584510) |
| Arch detection | sub_95EB40 (38KB, CLI -> 3-column mapping) |
| NVVM arch enum | sub_CD09E0 (14.5KB, NVVM_ARCH_* strings) |
| EDG arch gates | sub_60E7C0 (~60 feature flags based on SM version) |
| Backend subtarget | NVPTXSubtarget (feature offsets at +2498, +2584, +2843) |
| Target triples | nvptx64-nvidia-cuda, nvsass-nvidia-directx, nvsass-nvidia-spirv |
Per-SM Deep Dives:
- SM 70-89 (Volta through Ada Lovelace) -- Feature configuration call order, complete sub_60E7C0 flag table, atomic lowering, cumulative flag profiles
- SM 90 -- Hopper -- Thread block clusters, TMA descriptor format and lowering, WGMMA, setmaxnreg, distributed shared memory
- SM 100 -- Blackwell Datacenter -- tcgen05 tensor core ISA, arch-conditional vs. family-conditional gating, cvt_packfloat FP4/FP6/MX formats
- SM 120 -- Blackwell Consumer -- No tcgen05, .offset.bindless texture intrinsics, f16 texture support, mma.sync.block_scale (future)
Complete SM Table
| SM | __CUDA_ARCH | PTX Ver | Generation | Suffix | Status | Deep Dive |
|---|---|---|---|---|---|---|
| sm_75 | 750 | 5 | Turing | -- | Production | sm70-89 |
| sm_80 | 800 | 5 | Ampere | -- | Production | sm70-89 |
| sm_82 | 820 | 5 | Ampere | -- | Undocumented | sm70-89 |
| sm_86 | 860 | 5 | Ampere | -- | Production | sm70-89 |
| sm_87 | 870 | 5 | Ampere | -- | Production | sm70-89 |
| sm_88 | 880 | 5 | Ada | -- | Undocumented | sm70-89 |
| sm_89 | 890 | 5 | Ada | -- | Production | sm70-89 |
| sm_90 | 900 | 5 | Hopper | -- | Production | sm90 |
| sm_90a | 900 | 6 | Hopper | a | Production | sm90 |
| sm_100 | 1000 | 6 | Blackwell | -- | Production | sm100 |
| sm_100a | 1000 | 7 | Blackwell | a | Production | sm100 |
| sm_100f | 1000 | 7 | Blackwell | f | Production | sm100 |
| sm_101 | 1010 | 6 | Jetson Thor (pre-rename) | -- | Undocumented | sm100 |
| sm_101a | 1010 | 7 | Jetson Thor (pre-rename) | a | Undocumented | sm100 |
| sm_101f | 1010 | 7 | Jetson Thor (pre-rename) | f | Undocumented | sm100 |
| sm_102 | 1020 | 6 | Blackwell | -- | Undocumented | sm100 |
| sm_102a | 1020 | 7 | Blackwell | a | Undocumented | sm100 |
| sm_102f | 1020 | 7 | Blackwell | f | Undocumented | sm100 |
| sm_103 | 1030 | 6 | Blackwell | -- | Production | sm100 |
| sm_103a | 1030 | 7 | Blackwell | a | Production | sm100 |
| sm_103f | 1030 | 7 | Blackwell | f | Production | sm100 |
| sm_110 | 1100 | 6 | Jetson Thor | -- | Production | sm120 |
| sm_110a | 1100 | 7 | Jetson Thor | a | Production | sm120 |
| sm_110f | 1100 | 7 | Jetson Thor | f | Production | sm120 |
| sm_120 | 1200 | 6 | Blackwell (sm120) | -- | Production | sm120 |
| sm_120a | 1200 | 7 | Blackwell (sm120) | a | Production | sm120 |
| sm_120f | 1200 | 7 | Blackwell (sm120) | f | Production | sm120 |
| sm_121 | 1210 | 6 | Blackwell (sm120) | -- | Production | sm120 |
| sm_121a | 1210 | 7 | Blackwell (sm120) | a | Production | sm120 |
| sm_121f | 1210 | 7 | Blackwell (sm120) | f | Production | sm120 |
Legacy architectures also present in the table but not in the CLI mapping: sm_20, sm_21, sm_30, sm_32, sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_73.
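The __CUDA_ARCH column follows a simple rule visible in the table: the SM number times ten, with a/f suffixed variants sharing the base value. A sketch (helper name hypothetical, derived from the table rather than any decompiled function):

```python
# Compute the __CUDA_ARCH__ value implied by the SM table above:
# strip the sm_ prefix and any a/f suffix, then multiply by ten.
def cuda_arch(sm: str) -> int:
    digits = sm.removeprefix("sm_").rstrip("af")
    return int(digits) * 10

print(cuda_arch("sm_75"))    # 750
print(cuda_arch("sm_100a"))  # 1000
print(cuda_arch("sm_121f"))  # 1210
```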
Suffix Meanings
| Suffix | Meaning | PTX Version | Detail |
|---|---|---|---|
| (none) | Base feature set | 5 (legacy) or 6 (sm_100+) | All architectures; sm70-89 has no suffix-gated logic |
| a | Accelerated / advanced features | 6 (sm_90a) or 7 (sm_100a+) | sm_90a enables one EDG gate (sm90); sm_100a+ enables tcgen05 arch-conditional path (sm100) |
| f | Forward-compatible feature set | 7 | Implies a; never read by cicc logic (sm120); reserved for ptxas |
PTX Version Mapping
| PTX Version | SM Range | Notes |
|---|---|---|
| 5 | sm_20 through sm_90 (legacy/base) | All pre-Blackwell base variants |
| 6 | sm_90a, sm_100/101/102/103/110/120/121 (base) | sm_90a is the sole pre-Blackwell PTX 6 target (sm90) |
| 7 | sm_100a/f through sm_121a/f (extended features) | Required for tcgen05 arch-conditional intrinsics (sm100) |
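The three tables above condense into one selection rule. The following is a reconstruction from those tables (function name hypothetical; not a decompiled routine):

```python
# PTX version selection per the mapping above: base Blackwell targets (sm_100+)
# get 6 and their a/f sub-variants get 7; sm_90a is the lone pre-Blackwell
# PTX 6 target; everything else stays on PTX 5.
def ptx_version(sm: int, suffix: str = "") -> int:
    if sm >= 100:
        return 7 if suffix in ("a", "f") else 6
    if sm == 90 and suffix == "a":
        return 6
    return 5

print(ptx_version(75))        # 5
print(ptx_version(90, "a"))   # 6
print(ptx_version(100, "f"))  # 7
```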
Architecture Gating
Four subsystems cooperate to configure feature flags from the SM version. The master configurator sub_60E7C0 runs last and has the highest non-CLI priority. For the complete flag table per tier, see SM 70-89 Complete sub_60E7C0 Flag Table.
Feature Configuration Pipeline
1. CLI parser (sub_617BD0) -- sets byte_4CF8* override flags
2. sub_60DFC0 (secondary) -- sets unk_4D041B8 at sm_80+ (__VA_OPT__)
3. sub_60D650 (optimization level) -- ~109 flags from -O level
4. sub_60E7C0 (master SM configurator) -- ~60 flags via SM threshold comparisons; calls sub_60E530 (tertiary) for supplementary progressive unlocks
5. sub_982C80 (NVPTX subtarget) -- 224-byte bitfield for LLVM backend
Override priority: CLI flag > SM version > Optimization level > C++ standard version > CUDA mode > Virtual arch flag. See CLI Flag Inventory for the complete CLI flag-to-pipeline routing and Optimization Levels for per-level flag differences.
EDG-Level Gates -- sub_60E7C0
Sets ~60 unk_4D04* feature flags based on SM version thresholds. Each flag is gated by a byte_4CF8* user-override check.
| Threshold | SM Boundary | Features Enabled | Detail |
|---|---|---|---|
| > 30399 | sm_75 (Turing) | Base CUDA features, dynamic parallelism | sm70-89 Turing |
| > 40000 | sm_80 (Ampere) | C++20 __VA_OPT__, L2 cache hints, extended atomics | sm70-89 Ampere |
| > 89999 | sm_90 (Hopper) | Cluster ops, TMA, setmaxnreg, WGMMA fence | sm90 Feature Flags |
| > 109999 | sm_100 (Blackwell) | tcgen05, match instruction, dword_4D041AC | sm100 Feature Flags |
| > 119999 | sm_120 | unk_4D047BC disabled, unk_4D0428C | sm120 Feature Flags |
Backend Subtarget Feature Offsets (NVPTXSubtarget)
| Offset | Purpose | Stride | Detail |
|---|---|---|---|
| +2498 | Type legality flags (per MVT) | 259 bytes | See Type Legalization |
| +2584 | Float legality flags (per MVT) | 259 bytes | See Type Legalization |
| +2843 | Integer type support flag | 1 byte | -- |
| +2870 | Branch distance flag | 1 byte | See Block Placement |
| +2871 | Jump table eligibility flag | 1 byte | See BranchFolding |
For the complete NVPTXSubtarget analysis, see NVPTX Target Infrastructure.
Intrinsic Verifier Architecture Gates -- sub_2C7B6A0
The NVVMIntrinsicVerifier (143KB) gates intrinsics by SM version. For the complete three-layer verification architecture, see NVVM IR Verifier.
| SM Gate | Intrinsics | Detail |
|---|---|---|
| sm_72 (Volta) | Convergent branch intrinsics, some atomic ops | sm70-89 Volta |
| sm_75 (Turing) | Conversion type intrinsics | sm70-89 Turing |
| sm_89 (Ada) | Specific intrinsics | sm70-89 Ada |
| sm_90 (Hopper) | Cluster dimensions, TMA, WGMMA | sm90 TMA, sm90 WGMMA |
| sm_100+ (Blackwell) | .offset.bindless intrinsics, tcgen05 | sm100 tcgen05, sm120 .offset.bindless |
Feature Gate Matrix
This matrix shows which major compiler features are available at each SM tier. Each cell links to the detailed discussion in the per-SM deep-dive page.
Tensor Core / MMA Instructions
| Feature | sm_70-75 | sm_80-89 | sm_90/90a | sm_100/103 | sm_110 | sm_120/121 |
|---|---|---|---|---|---|---|
| HMMA m16n16k16 (f16) | Yes | Yes | Yes | Yes | Yes | Yes |
| IMMA int8/int4, BMMA | sm_75+ | Yes | Yes | Yes | Yes | Yes |
| DMMA fp64, TF32, BF16 | -- | sm_80+ | Yes | Yes | Yes | Yes |
| WGMMA async (f16/bf16/tf32/f8) | -- | -- | Yes | Yes | Yes | -- |
| tcgen05.mma (MX formats) | -- | -- | -- | a/f only | a/f only | No |
| mma.sync.block_scale | -- | -- | -- | -- | -- | Future |
See Tensor / MMA Builtins for the per-builtin ID reference and Tensor / MMA Codegen for the code generation pipeline.
Memory and Synchronization
| Feature | sm_70-75 | sm_80-89 | sm_90/90a | sm_100/103 | sm_110 | sm_120/121 |
|---|---|---|---|---|---|---|
| Full atomic memory ordering | sm_70+ | Yes | Yes | Yes | Yes | Yes |
| 128-bit atomics | sm_70+ | Yes | Yes | Yes | Yes | Yes |
| L2 cache hint atomics | -- | sm_80+ | Yes | Yes | Yes | Yes |
| Cluster scope atomics | -- | -- | Yes | Yes | Yes | Yes |
| cp.async | -- | sm_80+ | Yes | Yes | Yes | Yes |
| TMA (tensor memory access) | -- | -- | Yes | Yes | Yes | Yes |
| TMA 2CTA mode, Im2Col_W | -- | -- | -- | sm_100+ | sm_100+ | sm_100+ |
| setmaxnreg | -- | -- | Yes | Yes | Yes | Yes |
| fence.sc.cluster | -- | -- | Yes | Yes | Yes | Yes |
See Atomics Builtins for atomic PTX generation detail and Barriers & Sync for barrier builtins.
Thread Block Clusters
| Feature | sm_70-89 | sm_90/90a | sm_100+ |
|---|---|---|---|
| __cluster_dims__ attribute | Diagnostic 3687 | Yes | Yes |
| __launch_bounds__ 3rd param | Diagnostic 3704 | Yes | Yes |
| __block_size__ 5th arg | Diagnostic 3790 | Yes | Yes |
| Cluster special registers (15) | -- | Yes | Yes |
| barrier.cluster.arrive/wait | -- | Yes | Yes |
| Cluster query builtins (9) | -- | Yes | Yes |
| Distributed shared memory | -- | Yes | Yes |
| .blocksareclusters directive | -- | Yes | Yes |
Numeric Formats
| Format | First Available | Gate Location | Detail |
|---|---|---|---|
| f16, f32, f64 | All | -- | Standard types |
| bf16 (bfloat16) | sm_80+ | Ampere tensor core | Tensor core and cvt |
| tf32 (TensorFloat-32) | sm_80+ | Ampere tensor core | Tensor core only |
| fp8 e4m3, e5m2 | sm_90+ | WGMMA | cvt_packfloat cases 2-3 |
| fp6 e2m3, e3m2 | sm_100+ | cvt_packfloat | Arch-conditional only |
| fp4 e2m1 | sm_100+ | cvt_packfloat | Arch-conditional only |
| ue8m0 (scale factor) | sm_100+ | cvt_packfloat | Both arch and family-conditional |
| MX formats (mxf4, mxf8f6f4, mxf4nvf4) | sm_100+ | tcgen05.mma | tcgen05 a/f sub-variants only |
Texture and Surface
| Feature | sm_70-89 | sm_90 | sm_100/103 | sm_120/121 |
|---|---|---|---|---|
| Standard texture intrinsics | Yes | Yes | Yes | Yes |
| .offset.bindless intrinsics (68 variants) | -- | -- | -- | sm_120+ |
| f16 texture element types | Limited (builtin 3811 only) | Limited | Limited | Full support |
See Surface & Texture Builtins for the tex_surf_handler dispatch algorithm.
EDG Frontend Feature Flags
| Feature | Threshold | Flag | Detail |
|---|---|---|---|
| C++17 feature gates (EDG) | sm_70+ | unk_4D041DC, unk_4D04858, unk_4D041EC | sm70-89 Flag Table |
| C++20 __VA_OPT__ | sm_80+ | unk_4D041B8 | sm70-89 sub_60DFC0 |
| C++23 extended float suffixes | sm_70+ | unk_4D0428C | sm70-89 Tertiary Cascade |
| C++20 feature gates | sm_90+ | unk_4D043D0, unk_4D041B0, unk_4D04814 | sm90 Feature Flags |
| Blackwell extended features | sm_100+ | unk_4D04184, dword_4D041AC | sm100 Feature Flags |
See EDG 6.6 Frontend for the 737-define configuration system.
tcgen05 Sub-Variant Access Table
The tcgen05 instruction family uses a two-tier gating system unique to Blackwell. Base variants (sm_100, sm_103, sm_110) are excluded; only a and f sub-variants pass the bitmask check.
| SmVersion | Target | tcgen05 | Detail |
|---|---|---|---|
| 1001 | sm_100a | Allowed | sm100 Arch-Conditional Gate |
| 1002 | sm_100f | Allowed | sm100 Arch-Conditional Gate |
| 1031 | sm_103a | Allowed | sm100 Arch-Conditional Gate |
| 1032 | sm_103f | Allowed | sm100 Arch-Conditional Gate |
| 1101 | sm_110a | Allowed | sm120: Jetson Thor |
| 1102 | sm_110f | Allowed | sm120: Jetson Thor |
| 1000, 1030, 1100 | base variants | Blocked | Bitmask 0xC0000C03 rejects; see sm100 |
| 1200-1212 | all sm_120/121 | Blocked | v-1101 > 1; see sm120 No tcgen05 |
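The access table reduces to a membership test on the internal SmVersion value, which encodes the suffix in the final digit (0 = base, 1 = a, 2 = f). The sketch below is reconstructed from the table rows above, not from the actual bitmask logic:

```python
# tcgen05 availability per the table: only the a/f sub-variants of sm_100,
# sm_103, and sm_110 pass; base variants and all sm_120/121 are blocked.
TCGEN05_OK = {1001, 1002, 1031, 1032, 1101, 1102}

def tcgen05_allowed(sm_version: int) -> bool:
    return sm_version in TCGEN05_OK

print(tcgen05_allowed(1001))  # True  (sm_100a)
print(tcgen05_allowed(1000))  # False (sm_100 base variant)
print(tcgen05_allowed(1201))  # False (all sm_120/121)
```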
Generation-Specific Features
Turing (sm_75)
sm_75 is the default architecture for cicc v13.0, hardcoded as "compute_75" in sub_900130 and sub_125FB30.
- Base tensor core (HMMA m16n16k16) -- see Tensor / MMA Builtins
- Conversion intrinsics
- Baseline for cicc v13.0 (default architecture) -- see CLI Flag Inventory
Full detail: SM 70-89 (Volta through Ada)
Ampere (sm_80-sm_89)
- L2 cache hint atomic operations (sub_21E6420) -- see Atomics Builtins
- Extended tensor core shapes (tf32, bf16) -- see Tensor / MMA Builtins
- Async copy (cp.async) -- see SM 70-89: Ampere
- C++20 __VA_OPT__ support -- the sole differentiator between sm_75 and sm_80+ in sub_60E7C0 / sub_60DFC0
Full detail: SM 70-89 (Volta through Ada)
Hopper (sm_90/90a)
- Cluster operations: barrier.cluster.arrive/wait, fence.sc.cluster -- see Cluster Barriers
- Cluster registers: %cluster_ctarank, %clusterid.x/y/z, %is_explicit_cluster -- see Cluster Special Registers
- Kernel attributes: .blocksareclusters, .maxclusterrank, .reqnctapercluster, .cluster_dim -- see PTX Directives
- setmaxnreg: Dynamic register allocation limit (sub_21EA5F0) -- see setmaxnreg
- TMA: Tensor Memory Access with Im2Col, dimension validation, 2CTA mode -- see TMA
- WGMMA: Warpgroup MMA async (f16, bf16, tf32, f8) -- see WGMMA
- Distributed shared memory: .shared::cluster qualifier for cross-CTA access -- see DSMEM
- Mbarrier extensions: DMA fence/arrive/wait for TMA coordination -- see Mbarrier
Full detail: SM 90 -- Hopper
Blackwell Datacenter (sm_100-sm_103)
- tcgen05: Next-gen tensor core instruction set (scaleD, transA, negA, negB at sub_21E8CD0) -- see tcgen05
- Arch-conditional vs. family-conditional gating: Two-tier feature system for tcgen05 sub-instructions -- see Gating
- match instruction: Architecture-gated ("match instruction not supported on this architecture!") -- see sm100
- Extended MMA shapes: m16n8k256 with MX format support
- .offset.bindless intrinsics -- gated at sm_120+, NOT sm_100 (see sm120 .offset.bindless)
- cvt_packfloat extended types: FP4, FP6, MX formats -- see cvt_packfloat
Full detail: SM 100 -- Blackwell Datacenter
Jetson Thor (sm_110)
sm_110 is architecturally a datacenter Blackwell derivative (originally sm_101 before rename). It retains full tcgen05/TMEM hardware on a/f sub-variants. The sm_110 section is documented on the sm_120 page because the two are often compared.
Full detail: SM 120 -- Jetson Thor section
Blackwell Consumer (sm_120, sm_121)
- No tcgen05: The entire tcgen05 ISA is rejected by cicc for all sm_120/121 variants -- see No tcgen05
- .offset.bindless texture intrinsics (68 variants) -- see .offset.bindless
- 16-bit texture element types -- see f16 Texture
- mma.sync.block_scale: Present in upstream LLVM 22 but NOT emitted by cicc v13.0 -- see block_scale
- Tensor core falls back to the HMMA/IMMA path inherited from sm_70-sm_90
Full detail: SM 120 -- Blackwell Consumer
NVVM Container Architecture Enum -- sub_CD09E0
The NVVM container format uses an architecture enumeration. See NVVM Container for the complete tag inventory.
| Enum String | Implied SM | Detail |
|---|---|---|
| NVVM_ARCH_BLACKWELL_10_0 | sm_100 | sm100 |
| NVVM_ARCH_BLACKWELL_10_1 | sm_101 | Undocumented |
| NVVM_ARCH_BLACKWELL_10_3 | sm_103 | sm100 |
| NVVM_ARCH_BLACKWELL_11_0 | sm_110 | sm120: Jetson Thor |
| NVVM_ARCH_BLACKWELL_12_0 | sm_120 | sm120 |
| NVVM_ARCH_BLACKWELL_12_1 | sm_121 | sm120 |
| NVVM_ARCH_HOPPER_9_0 | sm_90 | sm90 |
| NVVM_ARCH_ADA_8_9 | sm_89 | sm70-89 |
| NVVM_ARCH_AMPERE_8_0 through 8_8 | sm_80-sm_88 | sm70-89 |
| NVVM_ARCH_HW_SM_5_0 through 10_4 | sm_50-sm_104 | Hardware SM enum |
Notable: NVVM_ARCH_HW_SM_10_4 (sm_104) and NVVM_ARCH_BLACKWELL_11_0 are not publicly documented. NVIDIA's internal naming uses "BLACKWELL" for all sm_100-sm_121 variants, even though sm_110 is marketed as Jetson Thor and sm_120/121 are a distinct consumer microarchitecture (RTX 50xx). See SM 120: Architecture Identity for the "SM 10.4" internal designation.
Target Triples
| Triple | Purpose | Detail |
|---|---|---|
| nvptx64-nvidia-cuda | Standard 64-bit CUDA compilation | Default; see NVPTX Target Infrastructure |
| nvptx-nvidia-cuda | 32-bit CUDA compilation | Legacy |
| nvptx64-nvidia-nvcl | OpenCL target | -- |
| nvsass-nvidia-cuda | SASS backend (native assembly) | -- |
| nvsass-nvidia-directx | DirectX SASS backend | Discovered in sub_2C80C90; see NVVM IR Verifier |
| nvsass-nvidia-spirv | SPIR-V SASS backend | Discovered in sub_2C80C90 |
The nvsass-nvidia-directx and nvsass-nvidia-spirv triples (discovered in sub_2C80C90) reveal that NVIDIA's SASS-level backend supports DirectX and SPIR-V targets alongside traditional CUDA and OpenCL.
Data Layout Strings
| Mode | Layout | Notes |
|---|---|---|
| 64-bit + shared | e-p:64:64:64-p3:32:32:32-i1:8:8-...-n16:32:64 | p3:32:32:32 = 32-bit shared mem pointers |
| 64-bit | e-p:64:64:64-i1:8:8-...-n16:32:64 | No shared memory specialization |
| 32-bit | e-p:32:32:32-i1:8:8-...-n16:32:64 | 32-bit mode |
Address space 3 (shared memory) uses 32-bit pointers even in 64-bit mode, controlled by nvptx-short-ptr and nvptx-32-bit-smem flags. See Address Spaces for the complete address space reference.
SM Version Encoding
Two parallel version tracking systems coexist in the binary:
- qword_4F077A8 -- Encodes SM_MAJOR * 10000 + SM_MINOR * 100. Used in approximately 309 decompiled files, primarily in the NVVM frontend and optimizer. Boundary thresholds use the XX99 pattern (e.g., 69999 for pre-Volta, 89999 for pre-Hopper). See SM 70-89: SM Version Encoding for full detail.
- unk_4D045E8 -- Stores the raw SM number as a decimal (e.g., 75 for sm_75, 89 for sm_89). Used in approximately 12 decompiled files, primarily in the builtin checker and atomic lowering logic. See SM 70-89: unk_4D045E8 Frontend Gates for the complete gate table.
Cross-References
- NVPTX Target Infrastructure -- NVPTXTargetMachine, NVPTXSubtarget, TTI hooks
- Tensor / MMA Builtins -- Per-builtin-ID reference for all MMA generations
- Tensor / MMA Codegen -- Code generation pipeline for tensor core operations
- Atomics Builtins -- Atomic PTX generation and scope validation
- Surface & Texture Builtins -- Texture intrinsic dispatch algorithm
- NVVM IR Verifier -- SM-gated intrinsic verification
- NVVM Container -- Architecture enum and tag inventory
- CLI Flag Inventory -- -arch=compute_XX parsing and flag routing
- Optimization Levels -- Per-level flag differences that interact with SM gates
- EDG 6.6 Frontend -- 737-define configuration, CUDA keyword handling
- Address Spaces -- Address space 3 shared memory and data layout strings
- GPU Execution Model -- CTA, warp, and cluster execution model context
Volta through Ada Lovelace (sm_70 – sm_89)
The sm_70 through sm_89 range spans four GPU generations — Volta, Turing, Ampere, and Ada Lovelace — and represents the most mature feature tier in cicc v13.0. Turing (sm_75) serves as the compiler's default architecture. Volta (sm_70/72) is no longer directly targetable: no compute_70 or compute_72 entry exists in the CLI parser, though the sm_70 feature boundary is still checked at 23 locations throughout the binary.
Supported Compute Capabilities
The architecture registration table at sub_95EB40 maps CLI strings to internal flags. Only the following are accepted for this generation range:
| Compute Capability | Internal Target | __CUDA_ARCH | PTX Version | Generation |
|---|---|---|---|---|
| compute_75 | sm_75 | 750 | 5 | Turing |
| compute_80 | sm_80 | 800 | 5 | Ampere |
| compute_86 | sm_86 | 860 | 5 | Ampere |
| compute_87 | sm_87 | 870 | 5 | Ampere (Jetson Orin) |
| compute_88 | sm_88 | 880 | 5 | Ada Lovelace |
| compute_89 | sm_89 | 890 | 5 | Ada Lovelace |
There is no compute_70, compute_72, compute_73, or compute_82. The sm_73 and sm_82 targets exist only as internal processor table entries, and sm_88 (though accepted via compute_88) has no publicly documented differentiation and no unique feature gates in the compiler.
SM Version Encoding
Two parallel version tracking systems coexist in the binary:
- qword_4F077A8 -- Encodes SM_MAJOR * 10000 + SM_MINOR * 100. Used in approximately 309 decompiled files, primarily in the NVVM frontend and optimizer. Boundary thresholds use the XX99 pattern (e.g., 69999 for pre-Volta, 79999 for pre-Ampere, 89999 for pre-Hopper).
- unk_4D045E8 -- Stores the raw SM number as a decimal (e.g., 75 for sm_75, 89 for sm_89). Used in approximately 12 decompiled files, primarily in the builtin checker and atomic lowering logic.
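The qword_4F077A8 encoding and its XX99 boundary idiom can be made concrete with a couple of lines (illustrative helpers only; the function names are hypothetical, and the 89999 constant is the pre-Hopper boundary quoted above):

```python
# SM version encoding used by qword_4F077A8: SM_MAJOR * 10000 + SM_MINOR * 100.
def encode_sm(major: int, minor: int) -> int:
    return major * 10000 + minor * 100

# Threshold checks compare against XX99 boundaries, e.g. 89999 for pre-Hopper.
def is_pre_hopper(encoded: int) -> bool:
    return encoded <= 89999

print(encode_sm(7, 5))                 # 70500 (sm_75)
print(is_pre_hopper(encode_sm(8, 9)))  # True  (sm_89)
print(is_pre_hopper(encode_sm(9, 0)))  # False (sm_90)
```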
Feature Configuration Call Order
The compiler configures feature flags through a strict four-function call sequence. Each subsequent function can override or augment the previous one's settings:
1. CLI parser -- Sets byte_4CF8* override flags from user-specified options. These prevent any subsequent auto-configuration from touching the guarded flag.
2. sub_60DFC0 -- Basic initialization. Sets unk_4D041B8 for sm_80+ (C++20 __VA_OPT__ support).
3. sub_60D650 (opt_level) -- Optimization-level-based flag configuration. Sets approximately 109 flags based on the -O level. Many of the same unk_4D04* flags set by SM gates are also set here under C++17/C++20 language-version conditions.
4. sub_60E7C0 -- Master SM architecture feature configurator. Reads qword_4F077A8 and sets approximately 60 backend flags through threshold comparisons. Also calls sub_60E530 (tertiary cascade) for supplementary flags.
5. sub_982C80 -- NVPTX subtarget feature table initialization (224-byte bitfield for the LLVM backend). This is a separate path from the EDG flags above.
Override priority: CLI flag > SM version > Optimization level > C++ standard version > CUDA mode > Virtual arch flag.
Feature Gates by Generation
Volta (sm_70+) — Threshold qword_4F077A8 > 69999
Volta introduced the first tensor core generation and independent thread scheduling. Although not directly targetable in this compiler version, the sm_70 boundary enables:
- HMMA tensor core intrinsics — Builtin IDs 678–707 registered in sub_90AEE0. Three shape variants (m16n16k16, m32n8k16, m8n32k16) with load, store, and MMA operations across f16/f32 accumulator combinations.
- Convergent branch intrinsic — llvm.nvvm.branch.if.all.convergent (builtin 3755/8282) requires sm_70+. Error: "not supported on pre-Volta Architectures" (checked in sub_1C36530 and sub_2C7B6A0).
- Proper atomic memory ordering — At sm_70+, atomics use acquire/release/relaxed semantics instead of falling back to volatile qualification. The gate is unk_4D045E8 > 69.
- 128-bit atomic operations — Enabled at sm_70+. Below this threshold, diagnostic 3758 is emitted: "16-byte atomics only supported on sm_70+".
- Optimizer feature flags — unk_4D041DC, unk_4D04858, and unk_4D041EC are set by sub_60E7C0. The tertiary cascade sub_60E530 additionally sets unk_4D0428C (extended float suffix support for C++23 std::float*_t / std::bfloat16_t). Multiple SelectionDAG patterns in sub_706250 activate for sm_70+ codegen.
- Variant-flag-gated features — When dword_4F077BC (the SM variant flag, the a/f suffix) is set and sm_70+ is active, unk_4D043C4 is enabled. When compiling for a virtual architecture with effective SM > 69999, unk_4D04740 is set for multi-arch optimization.
- WMMA memory space optimization — The wmma-memory-space-opt pass (registered at ctor_267, ctor_531) optimizes memory access patterns for tensor core operations.
Turing (sm_75) — Default Architecture
sm_75 is the baseline for cicc v13.0. The default is hardcoded in sub_900130 and sub_125FB30 via strcpy("compute_75"), and in sub_95EB40 as "-arch=compute_75".
No explicit sm_75-specific feature gates exist beyond the sm_70 tier. All Volta-era features are available. The key behavioral distinction is that sm_75 passes all pre-Volta gates cleanly — no diagnostic 3703 (sub_5C68F0), no volatile atomic fallback, no 128-bit atomic restrictions.
Ampere (sm_80+) — Threshold qword_4F077A8 > 79999
- C++20 __VA_OPT__ support — unk_4D041B8 is set at sub_60DFC0 lines 132–133. This is the only flag set exclusively by sub_60DFC0 at the sm_80 threshold. It enables __VA_OPT__ recognition in the EDG macro expander (sub_A03 line 1010), variadic trailing argument elision (line 1584), and diagnostic 2939 for misuse.
- Additional convergent branch — llvm.nvvm.branch.if.convergent (builtin 3754/8283) requires sm_80+. Error: "not supported on pre-Ampere Architectures". Note the distinction: branch.if.all.convergent requires only sm_70+, while branch.if.convergent requires sm_80+.
- L2 cache hint atomics — The L2::cache_hint suffix on atomic operations, emitted from sub_21E6DD0 when bit 0x400 is set in the instruction encoding flags. Supported operations: exch, add, and, or, xor, max, min, cas, and floating-point add. These are PTX 7.3+ features. Emission logic lives in sub_21E6420.
- cp.async.bulk patterns — String matching for the cp.async.bulk.tensor.g2s. and cp.async.bulk. prefixes in inline assembly validation at sub_A8E250.
Important correction: The master SM feature configurator sub_60E7C0 does NOT set any new flags at the sm_80 boundary (> 79999). The Ampere-specific unk_4D041B8 is set by the secondary configurator sub_60DFC0. The next threshold in sub_60E7C0 after sm_70+ (> 69999) is sm_90+ (> 89999). This means sm_80 through sm_89 share the same sub_60E7C0 flag profile as sm_75.
Ada Lovelace and Ampere Variants (sm_86 – sm_89)
All of sm_86, sm_87, sm_88, and sm_89 share identical feature gates within cicc. They occupy unk_4D045E8 values 86–89 and qword_4F077A8 values 80600–80900 (under the MAJOR * 10000 + MINOR * 100 encoding), all below the 89999 Hopper boundary.
The primary gate at this tier is unk_4D045E8 <= 89, which delineates pre-Hopper from Hopper+:
| Location | Feature | Behavior at sm_89 and below |
|---|---|---|
sub_5D1A60 | __block_size__ attribute | Diagnostic 3790; only 4 args parsed (5th cluster arg is sm_90+) |
sub_5D1FE0 | __cluster_dims__ attribute | Diagnostic 3687 emitted (cluster dimensions are Hopper-only) |
sub_5D2430 | __launch_bounds__ 3rd param | Diagnostic 3704 emitted (cluster launch bounds) |
sub_6BBC40 | Atomic scope "cluster" | Falls through to "gpu" scope; diagnostic 3763/3759 |
sub_6BBC40 | 16-byte extended atomics | Diagnostic 3764 for certain scope+type combinations |
sub_9502D0 / sub_12AE930 | Atomic scope emission | "gpu" used instead of "cluster" |
sub_214DA90 | Cluster PTX directives | Skipped entirely (arch_id <= 89) |
No code path differentiates sm_89 from sm_86/87/88. Hardware differences between these sub-architectures (e.g., Ada Lovelace RTX 4090 at sm_89 vs. Jetson Orin at sm_87) are resolved at the ptxas assembler level, not in cicc.
Atomic Lowering Detail
The atomic builtin lowering (sub_12AE930 / sub_9502D0) follows two paths split at the sm_70 boundary:
Pre-sm_70 path (unk_4D045E8 <= 69): Atomics are emitted with a volatile qualifier instead of memory ordering. Scope (cta/gpu/sys) is parsed but ordering is forced to volatile. 128-bit atomics emit diagnostic 3758.
sm_70+ path (unk_4D045E8 > 69): Full memory ordering support — relaxed, acquire, release, acq_rel. Scope resolution: cta (scope 0–1), gpu (scope 3), sys (scope 4). Cluster scope (scope 2) is only available at sm_90+; on sm_70–89, scope 2 falls through to "gpu".
Operations: ld, st, atom.add, atom.and, atom.or, atom.xor, atom.max, atom.min, atom.exch, atom.cas. Type suffixes via lookup table: b (bitwise), u (unsigned), s (signed), f (float).
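The two-path split can be summarized as a small decision function. This is an illustrative model of the logic described above, not decompiled code; the function names are hypothetical:

```python
# Model of the atomic lowering split in sub_12AE930 / sub_9502D0.

def resolve_scope(scope: int, sm: int) -> str:
    """Map an NVVM atomic scope operand to a PTX scope string."""
    if scope in (0, 1):
        return "cta"
    if scope == 2:                      # cluster scope: sm_90+ only
        return "cluster" if sm >= 90 else "gpu"
    if scope == 3:
        return "gpu"
    if scope == 4:
        return "sys"
    raise ValueError(f"unknown scope {scope}")

def ordering_for(sm: int, requested: str) -> str:
    """Pre-sm_70 atomics fall back to volatile qualification."""
    return requested if sm > 69 else "volatile"
```

For example, a scope-2 atomic compiled for sm_89 silently widens to gpu scope, while the same IR compiled for sm_90 produces a cluster-scoped operation.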
Hopper-Gated Intrinsics Rejected on sm_70–89
Multiple intrinsics emit "this intrinsic is only supported for Hopper+" when the SM version field is non-zero and <= 899:
| Builtin ID | Description |
|---|---|
| 0x10B3 (4275) | Hopper+ intrinsic requiring i1 or i32 return |
| 0xFD5 (4053) | Hopper+ intrinsic |
| 0xEB7 (3767) | Memory ordering/fence intrinsic with operation modes |
| 0xEB9–0xEBA (3769–3770) | Pointer-size-dependent intrinsics (>= 64-bit) |
Complete sub_60E7C0 Flag Table
The master feature configurator sub_60E7C0 (address 0x60E7C0, 12,466 bytes, 56 qword_4F077A8 comparisons) is the primary SM-architecture-to-feature-flag mapper. Every flag assignment follows a guarded pattern: if the corresponding byte_4CF8* override byte is nonzero (set by a CLI flag), the auto-configuration is skipped and the user's explicit value is preserved.
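The guarded pattern amounts to a check-then-set on each flag. A minimal sketch, assuming the guard-byte semantics described above (the dictionary-based state and helper name are illustrative):

```python
# Sketch of the byte_4CF8* guarded-assignment pattern in sub_60E7C0.

guards = {}   # byte_4CF8* override bytes: nonzero => user set it via CLI
flags  = {}   # unk_4D04* / dword_4D04* feature flags

def guarded_set(guard_name: str, flag_name: str, value: int) -> None:
    if guards.get(guard_name, 0):
        return                       # CLI override wins; keep user's value
    flags[flag_name] = value

# Auto-configuration path: guard clear, so the default is applied.
guarded_set("byte_4CF807B", "dword_4D048B8", 1)

# User-override path: the guard byte blocks the auto value.
guards["byte_4CF810C"] = 1
flags["dword_4D04824"] = 0           # user's explicit CLI value
guarded_set("byte_4CF810C", "dword_4D04824", 1)
```

This is why the override priority table places CLI flags above every other configuration source: a nonzero guard byte short-circuits all four configurator functions.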
Unconditional Assignments
These flags are set regardless of SM version with no user override check:
| Flag | Value | Notes |
|---|---|---|
unk_4D047C0 | 1 | Always enabled |
unk_4D047B4 | 1 | Always enabled |
unk_4F07584 | 0 | Always cleared |
unk_4D0423C | 1 | Always enabled |
unk_4D04208 | 0 | Always cleared |
unk_4D04218 | 1 | Always enabled |
unk_4D04214 | 1 | Always enabled |
unk_4F06970 | 0 | Always cleared |
unk_4F06964 | 0 | Always cleared |
unk_4F06904 | 0 | Always cleared |
SM-Dependent Unconditional Flags
These depend on SM version but have no user override check:
| Flag | Condition | Value | Notes |
|---|---|---|---|
unk_4D047BC | SM <= 119999 | 1 | Disabled only for sm_120+ |
unk_4D04758 | SM <= 30300 | 1 | sm_32 and below only |
unk_4D04764 | SM <= 30399 | 1 | sm_32 and below only |
unk_4D044B8 | SM <= 40299 | 1 | Pre-Maxwell only |
Guarded Flags (byte_4CF8* Override Bypass)
Each flag is set only when its guard byte is zero (user has not overridden via CLI):
| Guard | Flag | Default Value (guard=0) |
|---|---|---|
byte_4CF807B | dword_4D048B8 | = 1 |
byte_4CF810C | dword_4D04824 | = 1 |
byte_4CF80F0 | unk_4D04388 | = 1 |
byte_4CF8108 | unk_4D04338 | = (dword_4F077BC && !dword_4F077B4 && SM <= 30399) ? 1 : 0 |
byte_4CF8123 | unk_4D047C8 | = 1 (only if SM > 30399, sm_35+) |
byte_4CF8125 | dword_4D047B0 | = 1 (only if SM > 30399, sm_35+) |
byte_4CF8139 | unk_4D04314 | = (SM <= 40299), pre-Maxwell |
byte_4CF814D | unk_4D047C4 | = 0 |
byte_4CF8119 | unk_4D047D0 | = (SM <= 40000) |
byte_4CF8119 | unk_4D047CC | = (SM <= 40099) |
byte_4CF810F | unk_4D047EC | = 1 |
byte_4CF8107 | unk_4D04340 | = 0 |
byte_4CF8116 | unk_4D047E0 | = 1 |
byte_4CF815F | unk_4F0771C | = 1 |
byte_4CF813E | unk_4D044B0 | = (SM > 40299), Maxwell+ |
byte_4CF8149 | unk_4D04470 | Complex Maxwell+ gate |
byte_4CF8172 | dword_4D041AC | = 0 (when SM <= 109999) |
byte_4CF8159 | dword_4D048B0 | = (dword_4D048B8 && dword_4D048B4 && SM > 40799) |
byte_4CF811C | unk_4D04790 | = 0 (when virtual arch flag set) |
byte_4CF813C | dword_4D047AC | sm_35+ feature gate |
byte_4CF8156 | unk_4D04408 | CUDA C++ feature gate |
byte_4CF815D | unk_4D048A0 | = 1 (when SM > 40699) |
Total: 21 override bytes controlling approximately 25 feature flags.
Feature Escalation by SM Version
The cumulative flag-setting cascade. Each tier inherits all flags from lower tiers. Only the tiers relevant to sm_70–89 plus their immediate predecessors and successors are shown.
SM > 59999 (sm_60+, Pascal+):
| Flag | Identified Meaning |
|---|---|
unk_4D043CC | EDG C++17 feature gate (also set by C++17 language block) |
unk_4D04404 | EDG extended feature gate (also set by C++17 language block) |
unk_4D043D8 | EDG C++17 feature gate (also set by C++17 language block) |
unk_4D043D4 | EDG feature gate (also set via virtual arch > 30599) |
dword_4F07760 | PTX generation mode flag |
unk_4D04870 | EDG C++20 feature gate (also set by C++20 language block) |
SM > 69999 (sm_70+, Volta+):
| Flag | Identified Meaning |
|---|---|
unk_4D041DC | EDG C++17 feature gate (also set by C++17 language block) |
unk_4D04858 | EDG C++17 feature gate (also set by C++17 language block) |
unk_4D041EC | EDG C++17/Pascal virtual arch feature gate |
SM > 89999 (sm_90+, Hopper+) — NOT active for sm_70–89:
| Flag | Identified Meaning |
|---|---|
unk_4D043D0 | EDG C++20 feature gate (also set by C++20 language block) |
unk_4D041B0 | EDG C++20 feature gate (also set by C++20 language block) |
unk_4D04814 | EDG C++20 feature gate (also set by C++20 language block) |
unk_4D0486C | (with additional C++ version check) |
sub_60E530 Tertiary Cascade
This supplementary function provides additional progressive unlocks. For the sm_70–89 range:
| Threshold | Hex | Flags Set |
|---|---|---|
| > 40599 | 0x9E97 | unk_4F07764 |
| > 40699 | 0x9EFB | unk_4D043F0, unk_4D043F4 |
| > 40899 | 0x9FC3 | unk_4D04220, unk_4D044D0 |
| > 59999 | 0xEA5F | unk_4D043CC (duplicates sub_60E7C0) |
| > 69999 | 0x1116F | unk_4D0428C (extended float suffixes: C++23 std::float*_t / std::bfloat16_t) |
| > 89999 | 0x15F8F | dword_4F07760 (duplicates sub_60E7C0) |
| > 99999 | 0x1869F | dword_4D043F8, dword_4D041E8 |
Note: unk_4D0428C is set at > 69999 (sm_70+) by the cascade but at > 119999 (sm_120+) by sub_60E7C0. The cascade runs as part of sub_60E7C0, so the sm_70+ activation wins for all practical SM versions. This flag gates C++23 extended float suffixes (std::float16_t, std::float32_t, std::float64_t, std::bfloat16_t) in the EDG numeric parser at sub_A02 line 1612.
sub_60DFC0 SM-Gated Flags
The secondary configurator adds one flag at the sm_80 boundary:
| Threshold | Flag | Identified Meaning |
|---|---|---|
| > 79999 (sm_80+) | unk_4D041B8 | C++20 __VA_OPT__ support in EDG macro expander. Enables __VA_OPT__ recognition, variadic trailing argument elision, and diagnostic 2939. |
Virtual Architecture Downgrade Path
When compiling for a virtual architecture (dword_4F077B4 = 1), sub_60E7C0 uses unk_4F077A0 (the effective/real SM) for a secondary tier of feature decisions:
| Effective SM > | Flags Set |
|---|---|
| 29999 | unk_4D043E4 |
| 30099 | unk_4D044D0 |
| 30199 | unk_4D043F0 |
| 30299 | unk_4D04220 |
| 30599 | unk_4D043D4 |
| 59999 | unk_4D041EC, unk_4D043D8, unk_4D04404 |
| 69999 | unk_4D04740 |
| 79999 | unk_4D043D0 |
| 89999 | unk_4D043D0 (redundant — already set at > 79999) |
| 129999 | unk_4D04184 |
Note: In the virtual arch path, unk_4D043D0 is set at > 79999 (sm_80+), while in the primary path it requires > 89999 (sm_90+). Virtual arch compilation is more conservative, enabling features the real target supports even if the virtual arch normally gates them.
unk_4D045E8 Frontend Gates
These gates use the raw SM number and control frontend semantic checks rather than backend flags:
| Gate | Locations | Effect |
|---|---|---|
| <= 69 | sub_12AE930 ln 241, sub_9502D0 ln 294 | Atomic volatile fallback |
| <= 69 | sub_6BBC40 ln 763 | 128-bit atomic error 3758 |
| <= 69 | sub_5C68F0 | Diagnostic 3703 |
| <= 51 | sub_691790 ln 126 | Surface builtin warning |
| <= 59 | sub_6BBC40 ln 639 | Atomic scope restriction |
| 60–69 | sub_6BBC40 ln 814 | Diagnostic 3762 |
| <= 79 | sub_5C6950 ln 15 | Diagnostic 3660 |
| <= 89 | sub_5D1A60 ln 35 | __block_size__ 5th arg blocked |
| <= 89 | sub_5D1FE0 ln 19 | __cluster_dims__ diagnostic 3687 |
| <= 89 | sub_5D2430 ln 33 | __launch_bounds__ 3rd param diagnostic 3704 |
| <= 89 | sub_6BBC40 ln 684 | Atomic scope diagnostic 3763/3759 |
| <= 89 | sub_6BBC40 ln 805, 827 | 16-byte atomic diagnostic 3764 |
| <= 89 | sub_9502D0 ln 424, sub_12AE930 ln 255 | Cluster scope falls through to "gpu" |
| <= 89 | sub_214DA90 ln 66 | Cluster PTX directives skipped |
Cumulative Flag Profile per SM Version
This table shows the net flag state for each SM version in the range, combining all three configurators (sub_60E7C0 + sub_60E530 + sub_60DFC0). Only flags that differ across the sm_70–89 range are shown.
| Flag | sm_75 | sm_80 | sm_86–89 | Set By | Identified Role |
|---|---|---|---|---|---|
unk_4D041DC | 1 | 1 | 1 | sub_60E7C0 > 69999 | EDG C++17 feature gate |
unk_4D04858 | 1 | 1 | 1 | sub_60E7C0 > 69999 | EDG C++17 feature gate |
unk_4D041EC | 1 | 1 | 1 | sub_60E7C0 > 69999 | EDG C++17 / virtual arch feature gate |
unk_4D0428C | 1 | 1 | 1 | sub_60E530 > 69999 | Extended float suffixes (C++23) |
unk_4D041B8 | 0 | 1 | 1 | sub_60DFC0 > 79999 | C++20 __VA_OPT__ support |
unk_4D043D0 | 0 | 0 | 0 | sub_60E7C0 > 89999 | (sm_90+ only) |
unk_4D041B0 | 0 | 0 | 0 | sub_60E7C0 > 89999 | (sm_90+ only) |
unk_4D04814 | 0 | 0 | 0 | sub_60E7C0 > 89999 | (sm_90+ only) |
unk_4D0486C | 0 | 0 | 0 | sub_60E7C0 > 89999 | (sm_90+ only) |
The sole differentiator between sm_75 and sm_80+ within sub_60E7C0/sub_60DFC0 is unk_4D041B8. All flags set at > 69999 are shared by all sm_70–89 targets. All flags set at > 89999 are absent from all sm_70–89 targets. There is no per-flag difference between sm_86, sm_87, sm_88, and sm_89.
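The cumulative profile can be expressed as a single function over the qword_4F077A8 encoding. This is a sketch reproducing only the table above (flag names from the table; not an exhaustive model of all three configurators):

```python
# Net flag profile for the sm_70-89 range, per the table above.

def flag_profile(sm_encoded: int) -> set:
    on = set()
    if sm_encoded > 69999:           # sub_60E7C0 + sub_60E530 Volta tier
        on |= {"unk_4D041DC", "unk_4D04858", "unk_4D041EC", "unk_4D0428C"}
    if sm_encoded > 79999:           # sub_60DFC0 Ampere tier
        on.add("unk_4D041B8")
    if sm_encoded > 89999:           # Hopper tier: unreachable for sm_70-89
        on |= {"unk_4D043D0", "unk_4D041B0", "unk_4D04814", "unk_4D0486C"}
    return on
```

Evaluating it at 70500 (sm_75), 80000 (sm_80), and 80600–80900 (sm_86–89) reproduces the three columns of the table, including the identical profiles for all the Ada-tier variants.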
Identified Flag Semantics
Where flag consumers have been positively identified in the decompiled binary:
| Flag | Set At | Consumer | Meaning |
|---|---|---|---|
unk_4D041B8 | sm_80+ (sub_60DFC0) | EDG macro expander (sub_A03 ln 1010) | C++20 __VA_OPT__ support: recognition, variadic trailing argument elision, diagnostic 2939 |
unk_4D0428C | sm_70+ (sub_60E530), sm_120+ (sub_60E7C0) | EDG numeric parser (sub_A02 ln 1612) | Extended float suffixes: C++23 std::float16_t, std::float32_t, std::float64_t, std::bfloat16_t |
dword_4F07760 | sm_60+ (sub_60E7C0, sub_60E530) | PTX generation path | PTX emission mode flag |
unk_4D047C8 | sm_35+ (sub_60E7C0) | Backend | Dynamic parallelism optimization |
dword_4D047B0 | sm_35+ (sub_60E7C0) | Backend | Dynamic parallelism support |
unk_4D04780 | always | EDG macro expander | GNU ##__VA_ARGS__ comma-deletion extension |
The remaining approximately 50 flags feed into the EDG frontend and NVVM IR generation pipeline. Based on the pattern that sub_60D650 (optimization level) and sub_60E7C0 (SM version) set the same flags with overlapping conditions, most are language feature gates (C++17/20/23 features that are also SM-gated) or optimization pass enables that depend on target capability.
Key Binary Locations
| Function | Address | Role |
|---|---|---|
sub_60E7C0 | 0x60E7C0 | Master SM feature flag initialization (12,466 bytes, 56 comparisons) |
sub_60DFC0 | 0x60DFC0 | Secondary feature flag initialization (unk_4D041B8 at sm_80+) |
sub_60E530 | 0x60E530 | Tertiary feature cascade (unk_4D0428C at sm_70+) |
sub_60D650 | 0x60D650 | Optimization-level flag configurator (~109 flags) |
sub_982C80 | 0x982C80 | NVPTX subtarget 224-byte feature bitfield |
sub_617BD0 | 0x617BD0 | CLI parser; sets unk_4D045E8 per compute_XX |
sub_12AE930 | 0x12AE930 | Atomic builtin lowering (volatile vs. ordering) |
sub_9502D0 | 0x9502D0 | Duplicate atomic lowering (standalone pipeline) |
sub_6BBC40 | 0x6BBC40 | Builtin semantic checker (atomics, scope validation) |
sub_90AEE0 | 0x90AEE0 | Builtin registration table (HMMA builtins 678–707) |
sub_95EB40 | 0x95EB40 | Architecture registration (compute_XX to sm_XX) |
sub_1C36530 | 0x1C36530 | NVVM verifier (convergent intrinsic SM gates) |
sub_2C7B6A0 | 0x2C7B6A0 | NVVM lowering (convergent intrinsic SM gates) |
sub_21E6DD0 | 0x21E6DD0 | PTX emission (volatile / L2::cache_hint / .unified) |
sub_21E6420 | 0x21E6420 | Atomic L2 cache hint PTX emission |
sub_214DA90 | 0x214DA90 | Kernel attribute PTX emitter (cluster directives gated at arch_id > 89) |
sub_5D1A60 | 0x5D1A60 | __block_size__ attribute (cluster dims at sm_90+) |
sub_5D1FE0 | 0x5D1FE0 | __cluster_dims__ attribute (sm_90+ feature) |
sub_5D2430 | 0x5D2430 | __launch_bounds__ 3rd param (sm_90+ cluster) |
sub_5C68F0 | 0x5C68F0 | Pre-sm_70 diagnostic 3703 |
sub_5C6950 | 0x5C6950 | Pre-sm_80 diagnostic 3660 |
Hopper (sm_90, sm_90a)
Hopper represents the largest single-generation feature expansion in cicc v13.0. The sm_90 gate at qword_4F077A8 > 89999 unlocks thread block clusters, distributed shared memory, Tensor Memory Access (TMA), Warpgroup Matrix Multiply-Accumulate (WGMMA), dynamic register count control, and a new fence instruction. The sm_90a "accelerated" sub-variant shares __CUDA_ARCH=900 with sm_90 but uses a higher PTX version and enables one additional feature gate in the EDG frontend.
Architecture Identity
The NVVM container format registers Hopper as NVVM_ARCH_HOPPER_9_0 with numeric value 900, assigned in sub_CD09E0 (line 255) and sub_1C1B150 (line 270) via the pattern v62(a1, "NVVM_ARCH_HOPPER_9_0", v64) => *a2 = 900.
| Variant | Subtarget Enum | __CUDA_ARCH | PTX Version | -opt-arch | -mcpu |
|---|---|---|---|---|---|
sm_90 | 38 | 900 | 5 | sm_90 | sm_90 |
sm_90a | 39 | 900 | 6 | sm_90a | sm_90a |
Both variants share __CUDA_ARCH=900. The distinction lies in the -opt-arch and -mcpu flags passed through the internal pipeline (sub_95EB40 lines 461–469, sub_12C8DD0 lines 435–457). The sm_90a variant is the only pre-Blackwell SM that uses PTX version 6; all sm_20 through sm_90 base variants use PTX version 5.
The a flag is stored in unk_4D045E4 and read in exactly one location: sub_6C4D80 line 167, where the check unk_4D045E8 != 90 || !unk_4D045E4 gates a specific sm_90a-only feature (error code 0xE90 = 3728).
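Since the decompiled condition reads as a rejection guard, the accept logic inverts it. A sketch of that single gate, assuming the two globals hold the raw SM number and the variant flag as described (the function name is hypothetical):

```python
# Model of the sm_90a-only gate at sub_6C4D80 line 167.
# Decompiled rejection: unk_4D045E8 != 90 || !unk_4D045E4  => error 3728

def sm90a_feature_allowed(unk_4D045E8: int, unk_4D045E4: int) -> bool:
    if unk_4D045E8 != 90 or not unk_4D045E4:
        return False                 # diagnostic 0xE90 (3728)
    return True
```

Note the strict equality: the gate admits only sm_90a itself, not later architectures, which is consistent with the "accelerated" variants being arch-exact feature sets.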
Thread Block Cluster Infrastructure
Clusters are the headline Hopper feature. The compiler gates all cluster functionality at arch_id >= 90 (unk_4D045E8 > 89).
Frontend Attributes
The EDG frontend recognizes three cluster-related kernel attributes:
__cluster_dims__ — Attribute code k in sub_5C79F0. Processing in sub_5D1FE0 validates three integer arguments (x, y, z) and stores them at offsets +20, +24, +28 of the kernel metadata structure. Error codes 3685/3686 on invalid values. On sm_89 and below, diagnostic 3687 is emitted as a warning.
__launch_bounds__ 3rd parameter — The cluster dimension extension to __launch_bounds__ is processed in sub_5D2430. On sm_89 and below, diagnostic 3704 is emitted.
__block_size__ attribute — Handled in sub_5D1A60. At sm_90+, five block dimension arguments are parsed (including the cluster dimension). At sm_89 and below, diagnostic 3790 is emitted and only four arguments are accepted.
NVVM Metadata
Cluster configuration propagates through NVVM IR via several metadata keys:
| Metadata Key | Writers | Readers |
|---|---|---|
nvvm.cluster_dim | sub_93AE30, sub_129A750 | sub_A84F90, sub_CE8EA0 |
cluster_dim_x/y/z | sub_913C80, sub_1273830 | sub_CE8C00/40/80 |
cluster_max_blocks | sub_913C80, sub_1273830 | (kernel metadata) |
nvvm.blocksareclusters | sub_93AE30, sub_129A750 | sub_214DA90 |
nvvm.maxclusterrank | (external) | sub_A84F90, sub_CE9030 |
The blocksareclusters metadata requires reqntid to be set — error message: "blocksareclusters requires reqntid" (sub_214DA90 line 111).
PTX Directives
The kernel attribute emitter at sub_214DA90 gates cluster directives at arch_id >= 90. When the gate passes, four directives may be emitted:
- .blocksareclusters — Declares that thread blocks form clusters.
- .explicitcluster — Emitted when all three cluster dimensions are present.
- .reqnctapercluster X, Y, Z — Required CTA count per cluster.
- .maxclusterrank N — Maximum cluster rank.
Cluster Special Registers
The PTX emitter at sub_21E9060 handles 15 cluster special registers via a switch statement:
| Case | Register | Description |
|---|---|---|
| 0 | %is_explicit_cluster | Boolean: was cluster explicitly set |
| 1 | %cluster_ctarank | CTA rank within the cluster |
| 2 | %cluster_nctarank | Number of CTAs in cluster |
| 3–5 | %cluster_nctaid.{x,y,z} | Cluster grid dimensions |
| 6–8 | %cluster_ctaid.{x,y,z} | CTA position within cluster |
| 9–11 | %nclusterid.{x,y,z} | Cluster grid count |
| 12–14 | %clusterid.{x,y,z} | Cluster ID |
Cluster Barrier Operations
The barrier.cluster instruction is emitted from sub_21E8EA0 with two operation modes and two memory ordering modes:
| Opcode (bits 0–3) | Operation | Memory Mode (bits 4–7) | Qualifier |
|---|---|---|---|
| 0 | arrive | 0 | (default acquire/release) |
| 1 | wait | 1 | .relaxed |
Error strings: "bad cluster barrier op" for invalid opcode, "bad cluster barrier mem mode" for invalid memory mode.
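The operand encoding above can be checked with a small decoder. This is an illustrative model of the bit layout and error strings, not the decompiled emitter itself:

```python
# Decoder sketch for the barrier.cluster operand encoding in sub_21E8EA0:
# opcode in bits 0-3, memory ordering mode in bits 4-7.

OPS   = {0: "arrive", 1: "wait"}
MODES = {0: "", 1: ".relaxed"}       # 0 = default acquire/release ordering

def print_cluster_barrier(operand: int) -> str:
    op   = operand & 0xF
    mode = (operand >> 4) & 0xF
    if op not in OPS:
        raise ValueError("bad cluster barrier op")
    if mode not in MODES:
        raise ValueError("bad cluster barrier mem mode")
    return f"barrier.cluster.{OPS[op]}{MODES[mode]}"
```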
Three corresponding builtins are registered in sub_90AEE0:
| Builtin | ID |
|---|---|
__nv_cluster_barrier_arrive_impl | 11 |
__nv_cluster_barrier_wait_impl | 12 |
__nv_cluster_barrier_arrive_relaxed_impl | 13 |
Cluster Query Builtins
Nine cluster information builtins are registered in sub_90AEE0:
| Builtin | ID | Purpose |
|---|---|---|
__nv_clusterDimIsSpecifed_impl | 8 | Check if cluster dims are set |
__nv_clusterRelativeBlockRank_impl | 9 | Block rank within cluster |
__nv_clusterSizeInBlocks_impl | 10 | Total blocks in cluster |
__nv_cluster_query_shared_rank_impl | 203 | Query shared memory rank |
__nv_cluster_map_shared_rank_impl | 365 | Map to shared memory rank |
__nv_clusterDim_impl | 405 | Get cluster dimensions |
__nv_clusterRelativeBlockIdx_impl | 406 | Relative block index |
__nv_clusterGridDimInClusters_impl | 407 | Grid dimension in clusters |
__nv_clusterIdx_impl | 408 | Cluster index |
fence.sc.cluster Instruction
A new fence instruction is emitted from sub_21E94F0, the membar/fence printer. The opcode encoding uses the low 4 bits of the operand:
| Value | Instruction | Generation |
|---|---|---|
| 0 | membar.gpu | All |
| 1 | membar.cta | All |
| 2 | membar.sys | All |
| 4 | fence.sc.cluster | Hopper+ |
A duplicate implementation exists in the NVPTX backend at sub_35F18E0.
Atomic Cluster Scope
At sm_90+, the atomic lowering paths (sub_12AE930 line 255, sub_9502D0 line 424) add cluster scope support. Scope value 2 now resolves to "cluster" instead of falling through to "gpu" as it does on sm_70–89. This enables atom.*.cluster operations for intra-cluster synchronization.
setmaxnreg — Dynamic Register Count
Hopper introduces dynamic register count adjustment via setmaxnreg.{inc,dec}.sync.aligned.u32.
NVVM IR validation (sub_BFC6A0 lines 1732–1754): Builtin IDs 9431–9432 correspond to nvvm.setmaxnreg.inc and nvvm.setmaxnreg.dec. Validation rules enforce that the register count must be a multiple of 8 and within the range [24, 256].
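The validation rule reduces to two arithmetic checks. A minimal sketch of the constraint as stated above (the function name is illustrative):

```python
# setmaxnreg operand validation per sub_BFC6A0 lines 1732-1754:
# the register count must be a multiple of 8 within [24, 256].

def validate_setmaxnreg(count: int) -> bool:
    return count % 8 == 0 and 24 <= count <= 256
```

So 24, 32, ..., 256 are the only legal immediates; anything else is rejected during NVVM IR validation rather than deferred to ptxas.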
Inline assembly recognition (sub_FCDCB0, sub_21EA5F0): The compiler scans inline asm for setmaxnreg. followed by .sync.aligned.u32, extracting the immediate operand from either a $0 placeholder or a literal integer. Backend duplicates exist at sub_307BA30 and sub_3953170.
WGMMA — Warpgroup Matrix Multiply-Accumulate
WGMMA is Hopper's primary tensor core interface, superseding HMMA for large matrix operations.
Registered Builtins
Four type variants are registered in sub_90AEE0 (lines 2941–2944) with a duplicate table in sub_126A910:
| Builtin | ID | Accumulator Type |
|---|---|---|
__wgmma_mma_async_f16 | 765 | FP16 |
__wgmma_mma_async_bf16 | 766 | BF16 |
__wgmma_mma_async_tf32 | 767 | TF32 |
__wgmma_mma_async_f8 | 768 | FP8 |
Shape Selection
The WGMMA lowering at sub_955A70 (lines 2850–2910+) uses a switch on the M dimension (output rows) to select MachineInstr opcodes:
| M Dimension | Opcode |
|---|---|
| 8 | 10774 |
| 16 | 10690 |
| 24 | 10734 |
| 32 | 10742 |
| 40–88 (stride 8) | 10746–10770 |
Error on invalid M: "unexpected constant overflow in __wgmma_mma_async operand".
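The switch can be modeled from the table. This sketch assumes the 40–88 rows advance linearly (stride 8 in M, +4 in opcode), which matches the two endpoints given; the fixed cases are taken verbatim:

```python
# Model of the M-dimension -> MachineInstr opcode switch in sub_955A70.

def wgmma_opcode(m: int) -> int:
    fixed = {8: 10774, 16: 10690, 24: 10734, 32: 10742}
    if m in fixed:
        return fixed[m]
    if 40 <= m <= 88 and m % 8 == 0:
        # Assumed linear stride: 10746 at M=40 up to 10770 at M=88.
        return 10746 + ((m - 40) // 8) * 4
    raise ValueError(
        "unexpected constant overflow in __wgmma_mma_async operand")
```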
Operand Modifiers
The NVPTX printer at sub_35F3330 emits WGMMA operand modifiers encoded in bitfields:
- kind (bits 6–8): mxf4nvf4 (0), f8f6f4 (1), mxf8f6f4 (2), f16 (3), i8 (4), tf32 (5), mxf4 (7)
- cta_group (bit 1): cta_group::1 (clear) or cta_group::2 (set)
- scale (bits 2–3): Additional scaling modifier
TMA — Tensor Memory Access
TMA provides hardware-accelerated bulk data movement between global and shared memory, driven by a tensor map descriptor that encodes the multi-dimensional layout. Three independent subsystems in cicc cooperate to implement TMA: the intrinsic name parser (sub_A8E250), the SelectionDAG lowering handler (sub_33AD3D0), and the NVPTX ISel pattern matcher for CpAsyncBulkTensor (sub_36EC510).
TMA Descriptor Format (NVVM Container Tag 401)
The host-side tensor map descriptor is embedded in the NVVM container under tag 401. The tag is conditional on ExtOpt.Field344 (tag 301) having value 1, which identifies the Hopper TMA path. (Blackwell uses tag 402 for TCGen05Config instead, gated by Field344==4; the two are mutually exclusive.)
| Component | Size | Description |
|---|---|---|
| Fixed header | 44 bytes | Tensor map metadata (dimensions, strides, element type, interleave, swizzle, fill, OOB policy) |
| Per-descriptor entry | 16 bytes each | One entry per cp.async.bulk.tensor call site in the kernel |
| Total struct at offset 408 | 44 + 16*N bytes | N = number of distinct TMA operations |
The compiler serializes this into the NVVM container (sub_CDD2D0) so ptxas can validate shared memory allocation sizes and descriptor compatibility at link time.
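The struct size follows directly from the layout table. A one-line sketch of the computation (the function name is illustrative):

```python
# Size of the tag-401 TMA descriptor struct: a 44-byte fixed header plus
# one 16-byte entry per cp.async.bulk.tensor call site in the kernel.

def tma_descriptor_size(num_ops: int) -> int:
    HEADER, ENTRY = 44, 16
    return HEADER + ENTRY * num_ops
```

A kernel with three distinct TMA operations therefore serializes a 92-byte struct at container offset 408.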
TMA Descriptor ABI in Kernel Parameters
The EDG frontend detects TMA descriptor parameters during kernel registration stub generation. The detection function sub_8D4C10 (edg::get_tma_descriptor_flags) checks:
if (unk_4F068E0
    && arch > 0x9EFB                          // 0x9EFB = 40699
    && type_is_struct_or_class(type)
    && (*(type+140) & ~4) == 8
    && (get_tma_descriptor_flags(type) & 4)):
    insert copy_node(sub_7E7ED0, calling_convention=7)
    byte_at(node+88) |= 4                     // TMA descriptor flag
This gives TMA descriptors a distinct ABI: calling convention 7 with flag bit 4, separate from normal struct-by-value passing. The copy node ensures the descriptor is materialized at the correct address space boundary before kernel launch.
TMA Intrinsic Name Parsing (sub_A8E250)
The intrinsic dispatcher sub_A8E250 (52 KB) matches TMA intrinsic names via string comparison and assigns internal opcode IDs. Two families exist:
Tensor-structured copies (require a tensor map descriptor):
| Intrinsic Pattern | Dimensions | Opcode |
|---|---|---|
cp.async.bulk.tensor.g2s.tile.1d | 1D | 9222 |
cp.async.bulk.tensor.g2s.tile.2d | 2D | 9223 |
cp.async.bulk.tensor.g2s.tile.3d | 3D | 9224 |
cp.async.bulk.tensor.g2s.tile.4d | 4D | 9225 |
cp.async.bulk.tensor.g2s.tile.5d | 5D | 9226 |
cp.async.bulk.tensor.g2s.im2col.3d | 3D | 9213 |
cp.async.bulk.tensor.g2s.im2col.4d | 4D | 9214 |
cp.async.bulk.tensor.g2s.im2col.5d | 5D | 9215 |
cp.async.bulk.tensor.gmem.to.smem.1d | 1D | 8324 |
cp.async.bulk.tensor.gmem.to.smem.2d | 2D | 8325 |
cp.async.bulk.tensor.gmem.to.smem.3d | 3D | 8326 |
cp.async.bulk.tensor.gmem.to.smem.4d | 4D | 8327 |
cp.async.bulk.tensor.gmem.to.smem.5d | 5D | 8328 |
cp.async.bulk.tensor.gmem.to.smem.im2col.w.3d | 3D | 8329 |
cp.async.bulk.tensor.gmem.to.smem.im2col.w.4d | 4D | 8330 |
cp.async.bulk.tensor.gmem.to.smem.im2col.w.5d | 5D | 8331 |
Unstructured bulk copies (byte-level, no tensor map descriptor):
| Intrinsic Pattern | Opcode |
|---|---|
cp.async.bulk.global.to.shared.cluster | 8315 |
cp.async.bulk.gmem.to.dsmem | 8316 |
Fragment-indexed TMA (from builtin IDs 411/412 via sub_9483E0):
| LLVM Intrinsic | Base Opcode | Index Range |
|---|---|---|
llvm.nvvm.tma.load | 9233 | 9227–9232 (6 entries, indexed by fragment count) |
llvm.nvvm.tma.store | 9257 | (corresponding store entries) |
TMA SelectionDAG Lowering (sub_33AD3D0)
The unified TMA handler sub_33AD3D0 receives a mode argument from the main intrinsic lowering switch in sub_33B0210:
| Case | Mode | Operation | Memory Direction |
|---|---|---|---|
0x179 | 2 | TMA load | global -> shared |
0x17A | 3 | TMA store | shared -> global |
0x17B | 5 | TMA prefetch | global (read-only) |
0x17C | 7 | TMA multicast load | global -> N shared (across cluster) |
Related cp.async handlers in the same dispatch table:
| Case | Handler | Operation |
|---|---|---|
0x175 | sub_33AC2B0 | cp.async (non-TMA async copy) |
0x176 | sub_33AC130 | cp.async.wait |
0x177 | sub_33AB690 | cp.async.bulk (non-tensor bulk copy) |
0x178 | goto LABEL_32 | No-op — commit/barrier (scheduling fence only) |
The 0x178 no-op is significant: it represents the cp.async.bulk commit/barrier intrinsic that exists purely for scheduling purposes. The compiler preserves it as a DAG ordering constraint even though it produces no data-flow SDNode.
CpAsyncBulkTensor G2S Lowering (sub_36EC510)
The 27 KB function sub_36EC510 (1185 lines) implements the complete cp.async.bulk.tensor global-to-shared lowering with full architecture gating and mode validation.
Architecture gates (read from offset+340 of the subtarget object):
| SM Value | Hex | Features Unlocked |
|---|---|---|
| >= 1000 | 0x3E8 | SM 90: tile mode (1D–5D), Im2Col mode (3D–5D) |
| >= 1032 | 0x408 | SM 100: adds 2CTA mode, Im2Col_W, Im2Col_W128 |
Mode bit decoding from operand v11:
| Bits | Mask | Meaning |
|---|---|---|
| 2–4 | v11 & 0x1C | Im2Col variant: Im2Col, Im2Col_W, Im2Col_W128 |
| 3–4 | v11 & 0x18 | 2CTA mode flag |
Validation error strings (emitted as fatal diagnostics):
- "NumDims should be at least 3 for Im2Col or Im2Col_W or Im2Col_W128 mode" — Im2Col requires >= 3D tensors
- "Im2Col_W and Im2Col_W128 modes are not supported on this architecture." — SM 90 does not support Im2Col_W/W128; requires SM 100+
- "2CTA Mode for CpAsyncBulkTensorG2S not supported on this architecture" — 2CTA mode requires SM 100+
TMA Builtin Codegen (EDG -> LLVM IR)
The EDG-to-LLVM builtin lowering handles TMA as builtin IDs 411 and 412 (hex 0x19B / 0x19C).
ID 411 (scatter/store path) — sub_12A7070 extracts TMA descriptor info, then an iterative loop builds a vector of per-element store nodes. The intrinsic table 0x107A–0x107F (4218–4223) selects among 6 entries indexed by element count. Approximately 300 lines of handler code (lines 1256–1501 of sub_12A71A0).
ID 412 (gather/load path) — Similar structure but for the load direction. Uses intrinsic table 0x1094–0x109A (4244–4250). Includes bitcast insertion (opcode 47) for type mismatches between the descriptor element type and the destination register type. Approximately 450 lines (lines 1503–1713).
Both paths use:
- sub_12AA280 — TMA descriptor builder (constructs the multi-operand struct from the builtin arguments)
- sub_12A9E60 — extractvalue emission (decomposes aggregate returns into individual registers)
- sub_39FAC40 — Fragment count computation (determines how many load/store fragments the TMA operation expands into)
TMA Scheduling Constraints
TMA operations impose specific scheduling constraints visible in cicc's SelectionDAG construction:
1. Chain dependencies by mode. Every TMA operation produces a memory chain in the SelectionDAG. The mode parameter determines the chain direction:

| Mode | Reads | Writes | Chain Effect |
|---|---|---|---|
| 2 (load) | global | shared | Load chain |
| 3 (store) | shared | global | Store chain |
| 5 (prefetch) | global | (none) | Load chain |
| 7 (multicast) | global | N x shared | Load chain |

2. Commit-as-fence. Intrinsic ID 0x178 lowers to a no-op (goto LABEL_32), functioning as a pure scheduling barrier. This prevents the DAG scheduler from reordering TMA operations past their commit point.

3. Async qualifier hierarchy. The memory space qualifiers emitted by sub_35F4B50 form an ordered fence hierarchy:

| Qualifier | Scope | Strength |
|---|---|---|
| .async | Unscoped | Weakest |
| .async.global | Global memory domain | |
| .async.shared::cta | CTA-local shared memory | |
| .async.shared::cluster | Cluster shared memory (DSMEM) | Strongest |
Distributed Shared Memory
Hopper's cluster architecture enables distributed shared memory (DSMEM) across CTAs in a cluster. The NVPTX backend emits memory space qualifiers from two functions:
sub_35F4B50 — Async memory space qualifier emission (switch on operand):
| Line | Qualifier | Semantic |
|---|---|---|
| 20 | .async | Base async qualifier (unscoped) |
| 32 | .async.global | Async from global memory |
| 45 | .async.shared::cta | Async to CTA-local shared memory |
| 59 | .async.shared::cluster | Async to cluster distributed shared memory |
| 73 | .alias | Aliased access modifier (permits overlapping accesses) |
sub_35F4E30 — Commit modifier emission (switch on operand):
| Line | Qualifier | Semantic |
|---|---|---|
| 28 | .cta_group::1 | CTA group 1 selection |
| 38 | .cta_group::2 | CTA group 2 selection |
| 51 | .mbarrier::arrive::one | Single-thread mbarrier arrive |
| 67 | .shared::cluster | Cluster shared memory scope |
| 80 | .multicast::cluster | Multicast to all CTAs in cluster |
sub_35F4080 — Secondary .shared::cluster emission (line 68), used in non-commit contexts.
These qualifiers attach to cp.async.bulk and mbarrier instructions to specify the scope and direction of asynchronous data movement within the cluster.
Mbarrier Extensions — DMA Fence/Arrive/Wait
Hopper extends the async barrier (mbarrier) mechanism to coordinate TMA data movement. The TMA DMA pipeline follows a three-phase synchronization protocol:
Phase 1: Initialization
.mbarrier_init (emitted from sub_35F4AD0) initializes the async barrier with the expected transaction byte count. The arrive_expect_tx variant sets both the expected arrival count and the transaction byte count atomically.
Phase 2: Arrive (Producer Signals Completion)
When a TMA operation completes, it signals the mbarrier:
- .mbarrier::arrive::one (sub_35F4E30 line 51) — single-thread arrive notification. The TMA hardware auto-arrives with the transferred byte count.
- .cta_group::1 / .cta_group::2 (sub_35F4E30 lines 28/38) — selects which CTA group the arrive targets, enabling pipelined producer-consumer patterns where two groups alternate roles.
Phase 3: Wait (Consumer Blocks)
The consumer thread issues mbarrier.try_wait with a phase bit. The phase alternates each time the barrier completes a full cycle, enabling pipelined double-buffered access patterns. No additional cicc emission function is needed; the standard mbarrier wait path handles this.
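The three-phase protocol can be modeled with a toy phase-bit barrier. This is our illustrative sketch, not hardware-accurate and not recovered code; it only demonstrates how the phase bit alternates per completed cycle, which is what enables double buffering.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model (ours) of the mbarrier phase-bit protocol: the barrier
   flips its phase each time the expected arrival count is reached. */
typedef struct {
    uint32_t expected;   /* arrivals per phase */
    uint32_t pending;    /* arrivals still outstanding */
    uint32_t phase;      /* alternates 0/1 each completed cycle */
} mbarrier_t;

static void mbarrier_init(mbarrier_t *b, uint32_t expected) {
    b->expected = expected; b->pending = expected; b->phase = 0;
}
static void mbarrier_arrive(mbarrier_t *b) {
    if (--b->pending == 0) { b->phase ^= 1; b->pending = b->expected; }
}
/* try_wait succeeds once the barrier has moved past the given parity */
static int mbarrier_try_wait(const mbarrier_t *b, uint32_t parity) {
    return b->phase != parity;
}
```

A consumer polling with parity 0 unblocks exactly when the producer's arrivals complete the cycle and the phase flips to 1; it then waits on parity 1 for the next buffer, mirroring the pipelined double-buffered pattern described above.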
WGMMA Fence/Commit/Wait (Distinct Pipeline)
WGMMA has its own synchronization cycle, separate from TMA mbarriers:
| Builtin | IDs | Handler | LLVM Intrinsic |
|---|---|---|---|
| __wgmma_fence | 745–750 | sub_12B1C20 | 9062 (wgmma.fence.aligned, 3 type overloads) |
| __wgmma_commit_group | (same range) | sub_12B1C20 | (same dispatch) |
| __wgmma_wait_group | (same range) | sub_12B1C20 | (same dispatch) |
WGMMA fences synchronize the tensor core accumulator pipeline; TMA mbarriers synchronize the DMA engine. A typical Hopper kernel pipelines both: TMA loads data into shared memory (mbarrier-synchronized), then WGMMA consumes the data from shared memory (fence-synchronized). The two synchronization domains must not be confused in a reimplementation.
Feature Flag Configuration
The master feature configurator sub_60E7C0 sets the following flags at the sm_90+ threshold (qword_4F077A8 > 89999):
| Flag | Source |
|---|---|
| unk_4D043D0 | sub_60E7C0 |
| unk_4D041B0 | sub_60E7C0 |
| unk_4D04814 | sub_60E7C0 |
| unk_4D0486C | sub_60E7C0 (with C++ version check) |
| dword_4F07760 | sub_60E530 |
| dword_4D043F8 | sub_60E530 (at > 99999) |
| dword_4D041E8 | sub_60E530 (at > 99999) |
Key Binary Locations
| Function | Address | Role |
|---|---|---|
| sub_CD09E0 | 0xCD09E0 | NVVM arch enum (NVVM_ARCH_HOPPER_9_0) |
| ctor_356 | 0x50C890 | Subtarget registration (sm_90 enum 38, sm_90a enum 39) |
| sub_214DA90 | 0x214DA90 | Kernel attribute emitter (cluster PTX directives) |
| sub_21E9060 | 0x21E9060 | Cluster special register PTX emission |
| sub_21E8EA0 | 0x21E8EA0 | Cluster barrier instruction emission |
| sub_21E94F0 | 0x21E94F0 | Membar/fence printer (fence.sc.cluster) |
| sub_BFC6A0 | 0xBFC6A0 | setmaxnreg NVVM IR validation |
| sub_FCDCB0 | 0xFCDCB0 | setmaxnreg inline asm pattern matching |
| sub_955A70 | 0x955A70 | WGMMA lowering (M-dimension switch) |
| sub_90AEE0 | 0x90AEE0 | Builtin registration (WGMMA, cluster barriers/queries) |
| sub_A8E250 | 0xA8E250 | TMA intrinsic name parsing (52 KB) |
| sub_33AD3D0 | 0x33AD3D0 | TMA SelectionDAG lowering handler (modes 2/3/5/7) |
| sub_33AB690 | 0x33AB690 | cp.async.bulk non-tensor handler |
| sub_33AC2B0 | 0x33AC2B0 | cp.async handler |
| sub_33AC130 | 0x33AC130 | cp.async.wait handler |
| sub_36EC510 | 0x36EC510 | CpAsyncBulkTensor G2S lowering (27 KB, 1185 lines) |
| sub_9483E0 | 0x9483E0 | TMA descriptor extraction |
| sub_12AA280 | 0x12AA280 | TMA descriptor builder (EDG -> LLVM IR) |
| sub_12A7070 | 0x12A7070 | TMA scatter/store builtin handler |
| sub_8D4C10 | 0x8D4C10 | edg::get_tma_descriptor_flags |
| sub_35F4B50 | 0x35F4B50 | DSMEM qualifier emission |
| sub_35F4E30 | 0x35F4E30 | Commit modifier emission (mbarrier, multicast) |
| sub_35F4AD0 | 0x35F4AD0 | .mbarrier_init emission |
| sub_35F4080 | 0x35F4080 | Secondary .shared::cluster emission |
Blackwell Datacenter (sm_100, sm_100a, sm_103, sm_103a)
The Blackwell datacenter family introduces the fifth-generation tensor core instruction set (tcgen05), new floating-point formats (FP4, FP6, MX formats), and a sophisticated arch-conditional versus family-conditional feature gating system. sm_100/sm_100a targets the NVIDIA B200, while sm_103/sm_103a targets Blackwell Ultra (GB300 system). Both share the tcgen05 ISA but differ in __CUDA_ARCH values and minor tensor core configuration.
Architecture Identity
Six Blackwell arch constants are defined in sub_CD09E0:
| NVVM Enum | Numeric Value | Implied SM |
|---|---|---|
| NVVM_ARCH_BLACKWELL_10_0 | 1000 | sm_100 |
| NVVM_ARCH_BLACKWELL_10_1 | 1010 | sm_101 |
| NVVM_ARCH_BLACKWELL_10_3 | 1030 | sm_103 |
| NVVM_ARCH_BLACKWELL_11_0 | 1100 | sm_110 (Jetson Thor) |
| NVVM_ARCH_BLACKWELL_12_0 | 1200 | sm_120 |
| NVVM_ARCH_BLACKWELL_12_1 | 1210 | sm_121 |
Notable: sm_110 (Jetson Thor) was originally designated sm_101 before being renumbered to its own 11.x line. Despite the rename, both remain in the Blackwell family (NVVM_ARCH_BLACKWELL_*). The numeric encoding follows the standard major*100 + minor*10 formula: 11*100 + 0*10 = 1100.
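The encoding formula is trivial but worth making explicit, since the enum values above all follow it:

```c
#include <assert.h>

/* SM numeric encoding used by the NVVM arch enums: major*100 + minor*10.
   e.g. sm_110 -> 11*100 + 0*10 = 1100, sm_103 -> 10*100 + 3*10 = 1030. */
static int sm_encode(int major, int minor) { return major * 100 + minor * 10; }
```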
SM Variant Table
Each Blackwell datacenter target has base, accelerated (a), and forward-compatible (f) sub-variants:
| Variant | __CUDA_ARCH | PTX Version | Product |
|---|---|---|---|
| sm_100 | 1000 | 6 | B200 base |
| sm_100a | 1000 | 7 | B200 accelerated |
| sm_100f | 1000 | 7 | B200 forward-compatible |
| sm_103 | 1030 | 6 | Blackwell Ultra / GB300 base |
| sm_103a | 1030 | 7 | Blackwell Ultra / GB300 accelerated |
| sm_103f | 1030 | 7 | Blackwell Ultra / GB300 forward-compatible |
The undocumented sm_101 and sm_102 targets also exist in the processor table (ctor_605) with their own a/f variants. sm_101 maps to __CUDA_ARCH=1010 and sm_102 to __CUDA_ARCH=1020. No unique feature gates differentiate them from sm_100 in cicc.
Suffix Semantics
The sub-variant flags are stored in EDG frontend globals:
- unk_4D045E8 — Major SM number (100, 103)
- unk_4D045E4 — Accelerated flag; set for both a and f variants
- unk_4D045E0 — Forward-compatible flag; set only for f variants
The f suffix implies a — whenever the forward-compatible flag is set, the accelerated flag is also set. In cicc v13.0, the f flag is set during CLI parsing and reset in sub_615CB0 but is never read by any compiler logic. It exists for future-proofing and potential ptxas-level differentiation.
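The "f implies a" relation can be captured in a small sketch. The struct and function names are ours; only the flag semantics come from the recovered globals above.

```c
#include <assert.h>

/* Illustrative model of the EDG suffix flags: 'f' always sets the
   accelerated flag as well, matching the unk_4D045E4/unk_4D045E0 behavior. */
typedef struct { int accelerated; int forward_compat; } sm_suffix;

static sm_suffix parse_suffix(char s) {
    sm_suffix r = {0, 0};
    if (s == 'a') r.accelerated = 1;
    if (s == 'f') { r.accelerated = 1; r.forward_compat = 1; }
    return r;
}
```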
Arch-Conditional vs. Family-Conditional Gating
Blackwell introduces a two-tier feature gating system that distinguishes between "arch-conditional" and "family-conditional" access to instructions. This pattern repeats across every tcgen05 handler.
The gate check at sub_30462A0, sub_304E6C0, and sub_36E9630 uses a complex encoding:
```
v = arch_version                  // offset +340 of the arch struct
if (v > 0x408) {                  // 0x408 = 1032 = sm_103f
    if (v - 1101 > 1)             // allows {1101, 1102} — sm_110a/sm_110f (Jetson Thor)
        goto ERROR;
} else if (v <= 0x3E8 || ((1LL << ((v & 0xFF) + 23)) & 0xC0000C03) == 0) {
    goto ERROR;                   // 0x3E8 = 1000 = sm_100 base
}
```
The bitmask 0xC0000C03 is ANDed against 1 shifted left by (v & 0xFF) + 23 — an x86 shift, so the count is taken modulo 64 — selecting specific a/f sub-variants. PTX version gates further refine access: family-conditional features require PTX >= 86, while arch-conditional features require PTX >= 88.
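Putting the pieces together, the gate can be re-expressed as a single predicate. This is our reconstruction, not recovered code; the explicit `& 63` models the x86 shift's modulo-64 behavior that the decompiled expression implicitly relies on (e.g. for v = 1001 the nominal shift count is 256, which wraps to bit 0 — matching the acceptance table later in this page).

```c
#include <assert.h>
#include <stdint.h>

/* Reconstruction (ours) of the tcgen05 arch gate from sub_30462A0 /
   sub_304E6C0 / sub_36E9630. Returns 1 if tcgen05 is allowed for v. */
static int tcgen05_allowed(uint32_t v) {
    if (v > 0x408)                    /* above sm_103f (1032) */
        return v - 1101 <= 1;         /* only sm_110a (1101) / sm_110f (1102) */
    if (v <= 0x3E8)                   /* sm_100 base (1000) and below */
        return 0;
    /* x86 64-bit shifts take the count modulo 64; model that explicitly */
    unsigned shift = ((v & 0xFF) + 23) & 63;
    return ((1ULL << shift) & 0xC0000C03ULL) != 0;
}
```

Under this reconstruction the a/f sub-variants of sm_100/101/103 pass via the bitmask, sm_110a/f pass via the range check, and all base variants plus the entire sm_120 family are rejected.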
Features gated by both arch-conditional and family-conditional (broader access): tcgen05.fence, tcgen05.wait, tcgen05.relinquish.alloc, tcgen05.cp, tcgen05.commit, tcgen05.alloc, tcgen05.mma, and the ue8m0x2 type in cvt_packfloat.
Features gated by arch-conditional only (stricter): {fp6/fp4}x2 types in cvt_packfloat, INT8 type in tcgen05.mma, MXF4/MXF4NVF4 with sparsity, and explicit scale vector size.
tcgen05 — Tensor Core Generation 5
The tcgen05 instruction family is the primary new ISA extension for Blackwell datacenter. All tcgen05 instructions are handled in sub_30462A0 and sub_304E6C0.
Lifecycle Instructions
| Instruction | Opcode | ISD | Operands | Purpose |
|---|---|---|---|---|
| tcgen05.alloc | 10080 | 4765 | Basic allocation | Allocate tensor core accumulator memory |
| tcgen05.alloc (multicast) | 10083 | 4770/4771 | 32-bit flag variant | Multicast allocation |
| tcgen05.dealloc | 10140 | 4827 | 4 operands | Deallocate tensor core memory |
| tcgen05.commit | 10090/10091 | 4772–4777 | Mask variants | Commit pending operations |
| tcgen05.fence | 10143 | 4830 | 2 operands | Memory fence for tensor ops |
| tcgen05.wait | 10351 | 5020 | 2 operands | Wait for tensor ops to complete |
| tcgen05.relinquish.alloc | 10311 | 4941 | 2 operands | Relinquish allocated tensor memory |
| tcgen05.cp.* | 10101 | 4790 | 4 operands | Copy operations for tensor data |
The commit instruction has multiple variants based on multicast mask size. Only 16-bit and 32-bit masks are valid; other sizes produce an error.
tcgen05.mma — Matrix Multiply-Accumulate
The main MMA instruction is handled in sub_304E6C0 (opcodes 10299–10309) and validated in sub_36E9630. The operand encoding packs configuration into bitfields:
Data types (bits 8–6 of operand):
| Value | Kind | Notes |
|---|---|---|
| 0 | kind::mxf4nvf4 | MX FP4 with NV FP4 |
| 1 | kind::f8f6f4 | Standard FP8/FP6/FP4 |
| 2 | kind::mxf8f6f4 | MX variant of f8f6f4 |
| 3 | kind::f16 | Half precision |
| 4 | kind::i8 | 8-bit integer (arch-conditional only) |
| 5 | kind::tf32 | TensorFloat-32 |
| 7 | kind::mxf4 | MX FP4 |
Scale vector sizes (bits 3–2):
| Value | Modifier | Constraints |
|---|---|---|
| default | .scale_vec::1X | Not for mxf4nvf4 or mxf4 |
| 2 | .scale_vec::2X | Not for mxf8f6f4 |
| 3 | .scale_vec::4X | Not for mxf8f6f4 or mxf4 |
Block scale (bits 10–9): .block16 (16-element block scaling) or .block32 (32-element block scaling). Not supported for f16, tf32, f8f6f4, or i8.
Weight stationary (bit 0): .ws flag. Incompatible with cta_group::2, mxf8f6f4, and FP4 types.
Sparsity (bit 5): Restricted for MXF4 and MXF4NVF4 types on arch-conditional variants only.
Scale input accumulator (bit 4): Scales the accumulator input. Only usable with f16 and tf32 types. Notably, this is NOT supported on the a sub-variants (sm_100a at v=1001, sm_103a at v=1033) but IS supported on base variants (sm_100 at v=1000, sm_103 at v=1030) and sm_120+.
CTA group (bit 1): cta_group::1 (clear) or cta_group::2 (set).
Collector modes (from sub_35F38B0): .collector::a::fill, .collector::a::use, .collector::a::lastuse, and .collector::b with ::ws sub-variants. Constraint: cannot use collector::a::use or collector::a::fill with the ashift modifier.
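As a sketch, the bitfield positions above can be unpacked like this. The struct and field names are ours; only the bit positions come from the analysis, and the encoding is as-recovered, not an official format.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative unpacking of the tcgen05.mma operand bitfields described
   above. Field names are ours; bit positions follow the recovered layout. */
typedef struct {
    unsigned ws;               /* bit 0: weight stationary */
    unsigned cta_group2;       /* bit 1: cta_group::2 when set */
    unsigned scale_vec;        /* bits 3-2: scale vector size selector */
    unsigned scale_input_acc;  /* bit 4: scale input accumulator */
    unsigned sparse;           /* bit 5: sparsity */
    unsigned kind;             /* bits 8-6: data type kind */
    unsigned block_scale;      /* bits 10-9: block16/block32 */
} mma_cfg;

static mma_cfg decode_mma(uint32_t op) {
    mma_cfg c;
    c.ws              =  op        & 1;
    c.cta_group2      = (op >> 1)  & 1;
    c.scale_vec       = (op >> 2)  & 3;
    c.scale_input_acc = (op >> 4)  & 1;
    c.sparse          = (op >> 5)  & 1;
    c.kind            = (op >> 6)  & 7;
    c.block_scale     = (op >> 9)  & 3;
    return c;
}
```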
tcgen05.cp Copy Shapes
The copy instruction shape emission at sub_35F5090 supports:
| Shape | Bits 3–1 Value |
|---|---|
| .128x256b | 0 |
| .4x256b | 1 |
| .128x128b | 2 |
| .64x128b | 3 |
| .32x128b | 4 |
Destination format modifiers: .b8x16 (base), .b6x16_p32 (6-bit with 32-bit padding), .b4x16_p64 (4-bit with 64-bit padding).
Multicast modes: .warpx2::02_13 (warp pairs 0,2 and 1,3), .warpx2::01_23 (warp pairs 0,1 and 2,3), .warpx4 (all 4 warps).
cvt_packfloat — Extended Numeric Formats
The cvt_packfloat intrinsic (sub_304FBD0 for validation, sub_35ED820 for emission) has a base requirement of SM >= 90 and PTX >= 78. Blackwell adds four new types:
| Case | Type | Generation |
|---|---|---|
| 0 | .f32 | sm_90+ |
| 1 | .f16x2 | sm_90+ |
| 2 | .e4m3x2 (FP8 E4M3) | sm_90+ |
| 3 | .e5m2x2 (FP8 E5M2) | sm_90+ |
| 4 | .bf16x2 (BFloat16) | sm_90+ |
| 5 | .e2m1x2 (FP4 E2M1) | sm_100+ |
| 6 | .e2m3x2 (FP6 E2M3) | sm_100+ |
| 7 | .e3m2x2 (FP6 E3M2) | sm_100+ |
| 8 | .ue8m0x2 (UE8M0 scale) | sm_100+ |
The ue8m0x2 type is gated by both arch-conditional and family-conditional paths, while {fp6/fp4}x2 types (e2m1x2, e2m3x2, e3m2x2) are arch-conditional only.
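For reference, the new FP4 e2m1x2 element packs two E2M1 nibbles; a single nibble decodes as follows. This is our decoder based on the standard E2M1 definition (1 sign, 2 exponent, 1 mantissa bit, exponent bias 1, no inf/NaN encodings), not code recovered from cicc.

```c
#include <assert.h>

/* Decode one FP4 E2M1 nibble. Representable magnitudes:
   0, 0.5, 1, 1.5, 2, 3, 4, 6 (exponent 0 is subnormal). */
static float e2m1_to_float(unsigned nib) {
    float sign = (nib & 8) ? -1.0f : 1.0f;
    unsigned e = (nib >> 1) & 3, m = nib & 1;
    float mag = e ? (1.0f + 0.5f * m) * (float)(1 << (e - 1))
                  : 0.5f * m;
    return sign * mag;
}
```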
tcgen05 Commit with Mbarrier
The commit modifier emission at sub_35F4E30 combines tensor core commit with mbarrier synchronization:
- .cta_group::1 / .cta_group::2 — Group selection
- .mbarrier::arrive::one — Mbarrier arrive modifier
- .shared::cluster — Shared memory cluster scope
- .multicast::cluster — Multicast cluster scope
sm_100 vs. sm_103 Differences
Both families share the full tcgen05 ISA. Observable differences in cicc:
- __CUDA_ARCH: 1000 vs. 1030
- Tensor core operand range: sm_103 may handle wider operand loops (offset 760 vs. 600 for simpler variants in cases 10303/10308)
- Scale input accumulator: Not available on a sub-variants of either family
No sm_103-specific feature gates exist beyond the __CUDA_ARCH value. Hardware differences between B200 and GB300 are resolved at the ptxas level.
Feature Flag Configuration
At the sm_100+ threshold (qword_4F077A8 > 109999), the master configurator sub_60E7C0 enables:
| Flag | Condition |
|---|---|
| unk_4D04184 | Unconditional |
| unk_4D04800 | Requires CUDA mode + C++20 |
| dword_4D041AC | Guarded by byte_4CF8172 |
Key Binary Locations
| Function | Address | Role |
|---|---|---|
| sub_CD09E0 | 0xCD09E0 | NVVM arch enum (all Blackwell constants) |
| sub_1C1B150 | 0x1C1B150 | Second arch enum copy (LLVM module metadata) |
| sub_30462A0 | 0x30462A0 | tcgen05 intrinsic handler (alloc/dealloc/commit/fence/wait/cp) |
| sub_304E6C0 | 0x304E6C0 | tcgen05.mma intrinsic handler + SelectionDAG lowering |
| sub_36E9630 | 0x36E9630 | tcgen05.mma validation + ISD opcode selection |
| sub_304FBD0 | 0x304FBD0 | cvt_packfloat intrinsic handler |
| sub_35ED820 | 0x35ED820 | cvt_packfloat type string emission |
| sub_35F3330 | 0x35F3330 | tcgen05.mma modifier emission (kind, scale, cta_group) |
| sub_35F38B0 | 0x35F38B0 | tcgen05.mma modifier emission (ashift, collector) |
| sub_35F4E30 | 0x35F4E30 | tcgen05 commit modifier emission |
| sub_35F5090 | 0x35F5090 | tcgen05.cp shape/format emission |
| sub_95EB40 | 0x95EB40 | CLI arch string mapping |
| sub_617BD0 | 0x617BD0 | compute_NNN string parsing |
| ctor_605 | 0x584510 | Processor variant string table |
| ctor_356 | 0x50C890 | LLVM processor description table |
Blackwell (sm120) — Consumer and Enterprise (sm_120, sm_121)
The sm_120 family targets the consumer RTX 50-series and enterprise RTX Blackwell Pro GPUs. Despite sharing the "Blackwell" marketing name with sm_100, the sm_120 microarchitecture is a distinct design — a chimera of Hopper and Ada Lovelace silicon, with fundamentally different tensor core hardware. sm_121 targets DGX Spark.
Critical architectural difference: sm_120 does NOT have tcgen05 tensor core instructions. The tcgen05 arch-conditional gate in cicc (sub_30462A0, sub_304E6C0, sub_36E9630) reads SmVersion at offset +0x154 and performs:
```
if (SmVersion > 1032):            // above sm_103f
    if (SmVersion - 1101 > 1):    // only 1101 (sm_110a) and 1102 (sm_110f) pass
        → ERROR "tcgen05 supported only on arch-conditional..."
```
sm_120's SmVersion is 1200 → 1200 - 1101 = 99 > 1 → rejected by cicc itself, not by ptxas. The values 1101/1102 correspond to sm_110a/sm_110f (Jetson Thor), confirming that Jetson Thor retains tcgen05/TMEM hardware while consumer Blackwell does not.
The upstream LLVM 22 NVPTX backend (NVPTXSubtarget.h) independently confirms this: hasTcgen05InstSupport() lists only {100, 110}, and hasMMABlockScale() lists only {120}.
The complete tcgen05 acceptance list from cicc's binary (all three gate functions use identical logic):
| SmVersion | Target | tcgen05 |
|---|---|---|
| 1001 | sm_100a | Allowed (bitmask bit 0) |
| 1002 | sm_100f | Allowed (bitmask bit 1) |
| 1011 | sm_101a | Allowed (bitmask bit 10) |
| 1012 | sm_101f | Allowed (bitmask bit 11) |
| 1031 | sm_103a | Allowed (bitmask bit 30) |
| 1032 | sm_103f | Allowed (bitmask bit 31) |
| 1101 | sm_110a | Allowed ((v-1101) <= 1) |
| 1102 | sm_110f | Allowed ((v-1101) <= 1) |
| 1000, 1010, 1030, 1100 | base variants | Blocked (no suffix) |
| 1200–1212 | all sm_120/121 | Blocked (v-1101 > 1) |
From the user-visible feature perspective in cicc v13.0, sm_120 adds exactly two compiler-visible features beyond the shared Blackwell base: .offset.bindless texture intrinsics and 16-bit texture element type support.
Architecture Identity
NVIDIA's internal naming places sm_120/sm_121 squarely in the Blackwell family:
| NVVM Enum | Numeric Value | __CUDA_ARCH | Product |
|---|---|---|---|
| NVVM_ARCH_BLACKWELL_12_0 | 1200 | 1200 | RTX 50xx / RTX Blackwell Pro |
| NVVM_ARCH_BLACKWELL_12_1 | 1210 | 1210 | DGX Spark |
The hardware SM enum NVVM_ARCH_HW_SM_10_4 maps to value 1200, revealing that NVIDIA internally considers sm_120 as "SM 10.4" — a continuation of the Blackwell 10.x line rather than a distinct generation.
SM Variant Table
| Variant | __CUDA_ARCH | PTX Version | a flag | f flag |
|---|---|---|---|---|
| sm_120 | 1200 | 6 | 0 | 0 |
| sm_120a | 1200 | 7 | 1 | 0 |
| sm_120f | 1200 | 7 | 1 | 1 |
| sm_121 | 1210 | 6 | 0 | 0 |
| sm_121a | 1210 | 7 | 1 | 0 |
| sm_121f | 1210 | 7 | 1 | 1 |
The PTX version pattern is identical to sm_100: base variants use PTX 6, accelerated and forward-compatible variants use PTX 7. sm_120 does not require a higher PTX version than sm_100.
Suffix Behavior
For the sm_120 family, the a and f suffixes have no behavioral impact on compiler internals in cicc v13.0:
- unk_4D045E4 (accelerated flag): Read in exactly one location (sub_6C4D80 line 167), but only for unk_4D045E8 == 90 — the sm_90a gate. The flag is never checked for sm_120.
- unk_4D045E0 (forward-compatible flag): Set during CLI parsing, reset in sub_615CB0, but never read anywhere in the compiler logic.
The suffixes exist for forward-proofing, __CUDA_ARCH macro consistency (all sub-variants share the same value), and potential ptxas-level differentiation not visible in cicc.
SM 120 Exclusive Feature Gates
The entire cicc codebase contains exactly two locations gated on sm_120. Both check __CUDA_ARCH >= 1200 (i.e., the arch value field at offset +8 must exceed 1199).
Feature 1: .offset.bindless Texture Intrinsics
Frontend gate: sub_1C36530 line 2724
Backend gate: sub_2C7B6A0 line 2160
When *(int*)(a1 + 8) <= 1199, the compiler emits: ".offset.bindless intrinsics are not supported on pre-Blackwell architectures". The error message is misleading — sm_100 IS Blackwell, yet .offset.bindless requires sm_120+. The message likely reflects an earlier internal naming convention or considers sm_120 the "true" consumer Blackwell.
The .offset.bindless intrinsics provide texture and surface operations using bindless handles with an additional offset parameter. This enables runtime-flexible texture resource indexing, indirect texture access via descriptor heaps, and offset-based resource aliasing within a descriptor pool.
68 intrinsic variants are classified by two functions:
- Frontend: sub_1C303A0 — checks three ID ranges:
  - Range 1: IDs 4419–4469 (26 IDs, odd numbers only)
  - Range 2: IDs 4722, 4725, 4726, 4731, 4734, 4736, 4739 (7 IDs)
  - Range 3: IDs 5085–5153 (35 IDs, odd numbers only)
- Backend: sub_CEA320 — checks corresponding backend intrinsic IDs
These 68 intrinsics cover the full matrix of texture dimensions (1D, 2D, 3D, cube, array variants), data types (i32, f32, and others), and operation types (sample, fetch, gather). The sm_120 gate means these intrinsics physically require sm_120 hardware — the texture unit changes needed for offset-based bindless addressing are not present on sm_100 silicon.
Feature 2: 16-bit Texture Element Types
Frontend gate: sub_1C36530 line 3381
Backend gate: sub_2C7B6A0 line 2386
When *(int*)(a1 + 8) > 1199, 16-bit (f16) element types become legal for most texture intrinsics. The legalization logic at frontend line 3397:
```
type_legal = (elem_is_i8_or_i16_raw) || is_32bit(type) ||
             (is_16bit(type) && tex16_allowed_flag)
```
The tex16_allowed_flag differs by architecture:
- sm < 120: True only for builtin ID 3811 (checked by sub_1C30390)
- sm >= 120: True for all texture intrinsics except IDs 5116–5131 (checked by sub_1C30470 on frontend, sub_CEA3F0 for backend IDs 10462–10477)
This change reduces memory bandwidth requirements for texture operations on sm_120 by enabling native f16 texture reads without promotion to 32-bit.
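The legalization predicate and the per-architecture flag can be sketched together. All function and parameter names below are ours; only the boolean structure, the ID 3811 exception, and the 5116–5131 exclusion range come from the decompilation.

```c
#include <assert.h>

/* Sketch (ours) of the f16 texture legalization from frontend line 3397. */
static int tex16_allowed(int sm, int builtin_id) {
    if (sm < 120)
        return builtin_id == 3811;                      /* sole pre-sm_120 case */
    return !(builtin_id >= 5116 && builtin_id <= 5131); /* exclusion list */
}

static int type_legal(int is_raw_i8_i16, int is_32bit, int is_16bit,
                      int sm, int builtin_id) {
    return is_raw_i8_i16 || is_32bit ||
           (is_16bit && tex16_allowed(sm, builtin_id));
}
```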
sm_120 vs. sm_121
Both variants pass the same > 1199 gate. In cicc v13.0, there is no code path that differentiates sm_121 from sm_120. The only distinction is the __CUDA_ARCH macro value (1200 vs. 1210), which affects user-level #ifdef checks in CUDA source code.
sm_121 is a minor revision of sm_120, analogous to how sm_103 relates to sm_100 — both have different __CUDA_ARCH values but no compiler-internal behavioral difference beyond the macro.
Relationship to sm_100
What sm_120 Inherits from sm_100
sm_120 shares the Blackwell family identity and inherits most non-tensor-core features: Hopper cluster operations, TMA bulk copy, setmaxnreg, narrow FP conversion support (e2m3/e3m2/e2m1/ue8m0), tensormap.replace, and Blackwell ldstmatrix instructions.
What sm_120 Does NOT Have
sm_120 lacks the entire tcgen05 instruction family and its prerequisite Tensor Memory (TMEM) hardware:
- No tcgen05.alloc / tcgen05.dealloc (no TMEM to allocate)
- No tcgen05.mma (the async TMEM-based tensor core path)
- No tcgen05.cp / tcgen05.commit / tcgen05.fence / tcgen05.wait
- No tcgen05.relinquish.alloc
What sm_120 Has Instead
The sm_120 hardware extends the existing mma.sync instruction family (which has been the standard tensor core interface since Volta/sm_70) with new block_scale qualifiers and MX-format data types:
```
mma.sync.aligned.kind::mxf8f6f4.block_scale.scale_vec::1X.m16n8k32.row.col.f32.e4m3.e4m3.f32.ue8m0
```
This adds per-block MX-format scaling to the synchronous register-based MMA, supporting FP8 (e4m3, e5m2), FP6 (e3m2, e2m3), and FP4 (e2m1) operand types with ue8m0 scale factors. The tile shape is m16n8k32. Upstream LLVM 22 confirms this with hasMMABlockScale() returning true only for {120} and hasMMASparseBlockScaleF4() for {120, 121}.
The block_scale variant is restricted to TN layout (.row.col is hardcoded as a string literal in LLVM's tablegen — not parameterized, no NN/NT/TT variants exist). This is consistent with the broader mma.sync family where all post-Volta shapes are effectively TN-only (only the original m8n8k4 f16 from Volta supports all four layout combinations). By contrast, tcgen05.mma on sm_100/103/110 has no layout qualifier at all — data layout is implicit in the tensor memory descriptor (idesc).
cicc v13.0 does not yet emit mma.sync.block_scale for sm_120. The binary contains the string "nvvm.mma.blockscale currently supports non-sync aligned variants only!", confirming that block-scaled MMA is only available through the tcgen05 (async) path in this release — which sm_120 doesn't have access to. The mma.sync.block_scale support for sm_120 is present in upstream LLVM 22 and presumably coming in a future CUDA release (13.1+).
In cicc v13.0, sm_120 falls back to the standard HMMA/IMMA tensor core codegen inherited from sm_70–sm_90. The new Blackwell-generation tensor features (tcgen05 async path OR block_scale sync path) are both unavailable for sm_120 in this compiler version.
Tensor Core Instruction Timeline
| Generation | SM | Instruction | Memory Model |
|---|---|---|---|
| Volta/Turing | sm_70/75 | mma.sync (HMMA) | Register-to-register, synchronous |
| Ampere | sm_80 | mma.sync (extended shapes) | Register-to-register, synchronous |
| Hopper | sm_90 | wgmma.mma_async | Shared memory → registers, async warpgroup |
| Blackwell datacenter | sm_100/103/110 | tcgen05.mma | Tensor Memory (TMEM), fully async |
| Blackwell consumer | sm_120/121 | mma.sync.block_scale (LLVM 22+) | Register-to-register, synchronous + MX scaling |
sm_110 — Jetson Thor
sm_110 (Jetson Thor, for automotive and robotics SoCs) sits between sm_100 and sm_120 in the architecture numbering. Despite the higher SM number, sm_110 is architecturally a datacenter Blackwell derivative (originally sm_101 before rename) and retains tcgen05/TMEM support — the tcgen05 gate explicitly allows sm_110a (SmVersion 1101) and sm_110f (1102). It lacks sm_120's .offset.bindless and f16 texture features but has full tensor core parity with sm_100/sm_103.
| Variant | __CUDA_ARCH | PTX Version |
|---|---|---|
| sm_110 | 1100 | 6 |
| sm_110a | 1100 | 7 |
| sm_110f | 1100 | 7 |
Feature Flag Configuration
At the sm_120+ threshold (qword_4F077A8 > 119999), the master configurator sub_60E7C0 enables:
| Flag | Purpose |
|---|---|
| unk_4D047BC | Disabled (set to 0) for sm_120+; enabled for all lower architectures |
| unk_4D0428C | Enabled at sm_120+ |
The unk_4D047BC flag is unconditionally assigned based on SM <= 119999, making it the only flag that is actively disabled at sm_120+. This likely controls a legacy optimization or codegen path that is incompatible with sm_120 hardware.
Key Binary Locations
| Function | Address | Role |
|---|---|---|
| sub_CD09E0 | 0xCD09E0 | NVVM arch enum (NVVM_ARCH_BLACKWELL_12_0/12_1) |
| sub_95EB40 | 0x95EB40 | CLI arch string mapping |
| sub_617BD0 | 0x617BD0 | compute_NNN string parsing |
| ctor_605 | 0x584510 | Processor variant table (PTX versions) |
| ctor_356 | 0x50C890 | LLVM processor description table |
| sub_1C36530 | 0x1C36530 | Frontend verifier (.offset.bindless + f16 texture gates) |
| sub_2C7B6A0 | 0x2C7B6A0 | Backend verifier (.offset.bindless + f16 texture gates) |
| sub_1C303A0 | 0x1C303A0 | .offset.bindless intrinsic classifier (frontend) |
| sub_CEA320 | 0xCEA320 | .offset.bindless intrinsic classifier (backend) |
| sub_1C30470 | 0x1C30470 | f16 texture exclusion list (frontend) |
| sub_CEA3F0 | 0xCEA3F0 | f16 texture exclusion list (backend) |
| sub_6C4D80 | 0x6C4D80 | Accelerated flag reader (sm_90a only, not sm_120) |
| sub_615CB0 | 0x615CB0 | Forward-compatible flag reset |
NVVM IR Node Layout
The NVVM frontend in cicc v13.0 uses a custom intermediate representation distinct from LLVM's native IR. Each IR node is a variable-length structure allocated from a bump allocator, with operands stored backward from the node header pointer. The node uniquing infrastructure lives in sub_162D4F0 (49KB), which routes each opcode to a dedicated DenseMap inside the NVVM context object.
Node Header Layout
The pointer a1 returned from allocation points to the start of the fixed header. Operands are at negative offsets behind it.
| Offset | Size | Type | Field | Notes |
|---|---|---|---|---|
| +0 | 1B | uint8_t | opcode | Switch key in sub_162D4F0; values 0x04..0x22+ |
| +2 | 2B | uint16_t | subopcode | Intrinsic ID; read for opcodes 0x1C, 0x1D, 0x1E |
| +4 | 4B | -- | (padding) | Not accessed directly |
| +8 | 4B | uint32_t | num_operands | Controls operand access range |
| +16 | 8B | tagged_ptr | context_ptr | Low 3 bits are tag; mask with & ~7 for pointer |
| +24 | 8B | varies | extra_A | DWORD for opcodes 0x1A/0x1B; pointer for 0x10/0x22 |
| +28 | 4B | uint32_t | extra_B | Present for opcode 0x1B |
| +32 | 8B | varies | extra_C | Present for opcode 0x10 |
| +40 | 1B | uint8_t | extra_flag | Present for opcode 0x10 |
Minimum header size is 24 bytes. Total node allocation: 24 + 8 * num_operands bytes minimum, though opcode-specific extra fields extend the header region for certain node types.
Operand Storage
Operands are stored as 8-byte QWORD pointers at negative offsets from the header. The stride is exactly 8 bytes per operand. Access follows this pattern (decompiled from sub_162D4F0):
```
operand[k] = *(_QWORD *)(a1 + 8 * (k - num_ops))
```
For a node with num_operands = 3:
- operand[0] is at a1 - 24
- operand[1] is at a1 - 16
- operand[2] is at a1 - 8
A 2-operand node occupies 40 bytes total (16 operand bytes + 24 header bytes). A node with opcode 0x1B and 5 operands requires approximately 88 bytes (40 operand bytes + ~48 header bytes including extra fields).
Tagged Pointer Semantics
The context_ptr at offset +16 uses low-bit tagging to encode indirection:
- Bit [2] = 0: pointer is a direct reference to the context object.
- Bit [2] = 1: pointer is an indirect reference (pointer-to-pointer); one extra dereference recovers the context object.

Bits [1:0] are masked off together with bit 2, but the decompiled paths only ever test bit 2.
The decompiled dereferencing pattern:
v = *(a1 + 16) & 0xFFFFFFFFFFFFFFF8; // mask off tag bits
if (*(a1 + 16) & 4) // bit 2 set = indirect
v = *v; // one extra dereference
This technique saves a field by encoding the indirection flag inside the pointer itself, relying on 8-byte alignment guarantees.
Opcode Dispatch Table
The uniquing function sub_162D4F0 performs a byte-level switch on *(_BYTE *)a1. Each case extracts the tagged context pointer, dereferences it, then probes an opcode-specific DenseMap for a structurally identical node.
Uniquing Opcode Dispatch (sub_162D4F0, 49KB)
The opcodes fall into two categories: "simple" opcodes that use sub-function tables at fixed stride, and "complex" opcodes that use dedicated DenseMap instances at individually-known offsets.
Simple opcodes (0x04--0x15) -- These 18 opcodes (all but 0x10, which is special-cased to a DenseMap) share a uniform dispatch pattern. Each routes to a sub-function table at a fixed byte offset within the context object, spaced 32 bytes apart:
| Opcode | Context Byte Offset | Semantic Category |
|---|---|---|
| 0x04 | +496 | Type / value constant |
| 0x05 | +528 | Binary operation |
| 0x06 | +560 | (simple node) |
| 0x07 | +592 | (simple node) |
| 0x08 | +624 | (simple node) |
| 0x09 | +656 | Undef / poison |
| 0x0A | +688 | (simple node) |
| 0x0B | +720 | (simple node) |
| 0x0C | +752 | (simple node) |
| 0x0D | +784 | Integer constant |
| 0x0E | +816 | FP constant |
| 0x0F | +848 | Constant expression |
| 0x10 | -- | Special: uses DenseMap at qw[178] |
| 0x11 | +912 | (simple node) |
| 0x12 | +944 | (simple node) |
| 0x13 | +976 | Struct / aggregate type |
| 0x14 | +1008 | (simple node) |
| 0x15 | +1072 | (simple node) |
Each sub-function table entry at these offsets is a 32-byte structure containing the callback address and metadata for hash-table probing.
Complex opcodes (0x16--0x22) -- These opcodes each own a full DenseMap within the context object. Each DenseMap occupies 4 qwords at the indicated base, plus associated dword counters:
| Opcode | QWord Base | Byte Offset | DenseMap Dwords | Identified Semantic |
|---|---|---|---|---|
| 0x16 | qw[130] | +1040 | dw[264..266] | Metadata node |
| 0x17 | -- | +1104 | -- | (simple-table path at +1104) |
| 0x18 | -- | +1136 | -- | Alloca (bitcode 0x18/0x58) |
| 0x19 | -- | -- | -- | Load |
| 0x1A | qw[146] | +1168 | dw[296..298] | Branch (br) |
| 0x1B | qw[150] | +1200 | dw[304..306] | Switch |
| 0x1C | qw[154] | +1232 | dw[312..314] | Invoke (reads subopcode) |
| 0x1D | qw[158] | +1264 | dw[320..322] | Unreachable / resume (reads subopcode) |
| 0x1E | qw[162] | +1296 | dw[328..330] | LandingPad (reads subopcode) |
| 0x1F | qw[166] | +1328 | dw[336..338] | Call instruction |
| 0x20 | -- | -- | -- | PHI node |
| 0x21 | -- | -- | -- | IndirectBr |
| 0x22 | qw[178] | +1424 | dw[360..362] | Special (extra_A = ptr) |
Opcodes 0x1C, 0x1D, and 0x1E read the subopcode field at *(unsigned __int16 *)(a1 + 2) as part of the hash key, because these node types require the intrinsic ID to distinguish structurally identical nodes with different semantic meaning.
Hash Function
Every DenseMap in the uniquing tables uses the same hash:
hash(ptr) = (ptr >> 9) ^ (ptr >> 4)
Hash computation for multi-operand nodes (sub_15B3480) extends this by combining the hash of each operand pointer with a mixing step. The hash seed is the opcode byte, then each operand is folded in:
seed ^= hash(operand[i]) + 0x9E3779B9 + (seed << 6) + (seed >> 2);
Sentinel values: empty = -8 (0xFFFFFFFFFFFFFFF8), tombstone = -16 (0xFFFFFFFFFFFFFFF0).
Node Erasure (sub_1621740, 14KB)
The mirror of insertion. Dispatches by the same opcode byte, finds the node in the corresponding DenseMap, overwrites the bucket with the tombstone sentinel (-16), and decrements NumItems while incrementing NumTombstones. When tombstone count exceeds NumBuckets >> 3, a rehash at the same capacity is triggered to reclaim tombstone slots.
Bitcode Instruction Opcode Table
NVIDIA uses LLVM's standard instruction opcode numbering with minor adjustments. The bitcode reader sub_166A310 / sub_151B070 (parseFunctionBody, 60KB/123KB) dispatches on a contiguous range. The NVVM verifier sub_2C80C90 confirms the mapping via its per-opcode validation switch:
| Opcode | Hex | LLVM Instruction | Verifier Checks |
|---|---|---|---|
| 0x0B | 11 | ret | -- |
| 0x0E | 14 | br | -- |
| 0x0F | 15 | switch | -- |
| 0x15 | 21 | invoke | "invoke" unsupported via sub_2C76F10 |
| 0x18 | 24 | alloca | Alignment <= 2^23; AS must be Generic |
| 0x19 | 25 | load | -- |
| 0x1A | 26 | br (cond) | Validates "Branch condition is not 'i1' type!" |
| 0x1B | 27 | switch (extended) | -- |
| 0x1C | 28 | invoke (extended) | -- |
| 0x1D | 29 | unreachable | -- |
| 0x1E | 30 | resume | -- |
| 0x1F | 31 | call | Pragma metadata validation |
| 0x20 | 32 | phi | -- |
| 0x21 | 33 | indirectbr | "indirectbr" unsupported |
| 0x22 | 34 | call (variant) | Validates callee type signature |
| 0x23 | 35 | resume (verifier) | "resume" unsupported |
| 0x23--0x34 | 35--52 | Binary ops (add/sub/mul/div/rem/shift/logic) | -- |
| 0x35--0x38 | 53--56 | Casts (trunc/zext/sext/fpcast) | -- |
| 0x3C | 60 | alloca | Alignment and address-space checks |
| 0x3D | 61 | load | Atomic loads rejected; tensor memory AS rejected |
| 0x3E | 62 | store | Atomic stores rejected; tensor memory AS rejected |
| 0x40 | 64 | fence | Only acq_rel/seq_cst in UnifiedNVVMIR mode |
| 0x41 | 65 | cmpxchg | Only i32/i64/i128; must be generic/global/shared AS |
| 0x42 | 66 | atomicrmw | Address space validation |
| 0x4F | 79 | addrspacecast | "Cannot cast non-generic to different non-generic" |
| 0x55 | 85 | Intrinsic call | Routes to sub_2C7B6A0 (143KB verifier) |
| 0x58 | 88 | alloca (inalloca) | Same as 0x18 |
| 0x5F | 95 | landingpad | "landingpad" unsupported |
The binary opcodes in the 0x23--0x34 range follow LLVM's BinaryOperator numbering:
| Opcode | Hex | Operation | IRBuilder Helper |
|---|---|---|---|
| 0x23 | 35 | add | -- |
| 0x24 | 36 | fadd | -- |
| 0x25 | 37 | sub | -- |
| 0x26 | 38 | fsub | -- |
| 0x27 | 39 | mul | -- |
| 0x28 | 40 | fmul | -- |
| 0x29 | 41 | udiv | -- |
| 0x2A | 42 | sdiv | -- |
| 0x2B | 43 | fdiv | -- |
| 0x2C | 44 | urem | -- |
| 0x2D | 45 | srem | -- |
| 0x2E | 46 | frem | -- |
| 0x2F | 47 | shl | -- |
| 0x30 | 48 | lshr | -- |
| 0x31 | 49 | ashr | -- |
| 0x32 | 50 | and | -- |
| 0x33 | 51 | or | -- |
| 0x34 | 52 | xor | -- |
InstCombine Internal Opcode Table
The InstCombine mega-visitor sub_10EE7A0 (405KB, the single largest function in cicc) uses a different opcode numbering -- the full LLVM Instruction::getOpcode() values rather than the bitcode record codes. These are accessed via sub_987FE0 (getOpcode equivalent). Key ranges observed:
| Opcode Range | LLVM Instructions |
|---|---|
| 0x0B | Ret |
| 0x0E | Br |
| 0x0F | Switch |
| 0x15 | Invoke |
| 0x1A | Unreachable |
| 0x3F | FNeg |
| 0x41--0x43 | Add, FAdd, Sub |
| 0x99 | GetElementPtr |
| 0xAA | Trunc |
| 0xAC--0xAE | ZExt, SExt, FPToUI |
| 0xB4--0xB5 | PtrToInt, IntToPtr |
| 0xCF--0xD2 | ICmp, FCmp, PHI, Call |
| 0xE3--0xEB | VAArg, ExtractElement, InsertElement, ShuffleVector, ExtractValue, InsertValue |
| 0x11A | Fence |
| 0x11D | AtomicCmpXchg |
| 0x125 | AtomicRMW |
| 0x134--0x174 | FPTrunc, FPExt, UIToFP, Alloca, Load, Store, FMul, UDiv, SDiv, ... |
| 0x17D--0x192 | BitCast, Freeze, LandingPad, CatchSwitch, CatchRet, CallBr, ... |
| 0x2551, 0x255F, 0x254D | NVIDIA custom intrinsic operations |
The NVIDIA custom opcodes (0x2551, 0x255F, 0x254D) are in a range far above standard LLVM and handle CUDA-specific operations (texture, surface, or warp-level ops encoded as custom IR nodes) that have no upstream LLVM equivalent.
NVVM Context Object
The context object referenced by context_ptr is a large structure (~3,656 bytes, a size confirmed by the 97KB destructor sub_B76CB0) containing uniquing tables for every NVVM opcode, plus type caches, metadata interning tables, and allocator state.
Context Layout Overview
| Byte Offset | Size | Field | Description |
|---|---|---|---|
| +0..+200 | 200B | Core state | Module pointer, allocator, flags |
| +200 | 8B | vtable_0 | Points to unk_49ED3E0 |
| +224 | 8B | vtable_1 | Points to unk_49ED440 |
| +248 | 8B | vtable_2 | Points to unk_49ED4A0 |
| +272..+792 | 520B | Hash table array | 16 DenseMaps freed at stride 32 by destructor |
| +496..+1136 | 640B | Simple opcode tables | 18 sub-function tables, 32B each (opcodes 0x04..0x15) |
| +1040..+1424 | 384B | Complex opcode DenseMaps | Dedicated DenseMaps for opcodes 0x16..0x22 |
| +1424..+2800 | ~1376B | Extended tables | Additional hash tables, type caches, metadata maps |
| +2800..+3656 | ~856B | Allocator state | Bump allocator slabs, counters, statistics |
Simple Opcode Table Region (+496..+1136)
The 18 entries for opcodes 0x04 through 0x15 (plus a few extras) are 32-byte structures at fixed offsets:
struct SimpleOpcodeTable {
void *buckets; // +0: heap-allocated bucket array
int32 num_items; // +8: live entry count
int32 num_tombstones; // +12: tombstone count
int32 num_buckets; // +16: always power-of-2
int32 reserved; // +20: padding
void *callback; // +24: hash-insert function pointer (or NULL)
};
Byte offsets increase monotonically: +496, +528, +560, +592, +624, +656, +688, +720, +752, +784, +816, +848, +880, +912, +944, +976, +1008, +1072, +1104, +1136.
Complex Opcode DenseMap Region (+1040..+1424)
Each DenseMap for a complex opcode occupies 4 qwords plus associated dword counters:
struct OpcodeUniqueMap {
int64 num_entries; // qw[N]: includes tombstones
void *buckets; // qw[N+1]: heap-allocated bucket array
int32 num_items; // dw[2*N + offset]: live entries
int32 num_tombstones; // dw[2*N + offset + 1]: tombstone count
int32 num_buckets; // dw[2*N + offset + 2]: capacity (power-of-2)
};
Complete mapping:
| Opcode | qw Base | Byte Offset (qw) | dw Counters | Byte Offset (dw) |
|---|---|---|---|---|
| 0x16 | qw[130] | +1040 | dw[264..266] | +2112..+2120 |
| 0x1A | qw[146] | +1168 | dw[296..298] | +2368..+2376 |
| 0x1B | qw[150] | +1200 | dw[304..306] | +2432..+2440 |
| 0x1C | qw[154] | +1232 | dw[312..314] | +2496..+2504 |
| 0x1D | qw[158] | +1264 | dw[320..322] | +2560..+2568 |
| 0x1E | qw[162] | +1296 | dw[328..330] | +2624..+2632 |
| 0x1F | qw[166] | +1328 | dw[336..338] | +2688..+2696 |
| 0x10 | qw[178] | +1424 | dw[360..362] | +2880..+2888 |
Destructor (sub_1608300, 90KB)
The context destructor confirms the layout by freeing resources in order:
- Calls j___libc_free_0 on bucket pointers at offsets +272 through +792 (stride 32) -- frees all 16 simple opcode hash tables.
- Destroys sub-objects via sub_16BD9D0, sub_1605960, and sub_16060D0 -- these tear down the complex DenseMap instances and any heap-allocated overflow chains.
- Releases vtable-referenced objects at offsets +200, +224, +248.
The separate LLVMContext destructor (sub_B76CB0, 97KB) frees 28+ hash tables from the full ~3,656-byte context structure, confirming that the uniquing tables are only part of the overall context.
Type Tag System
The context object's hash tables also serve as uniquing tables for type nodes. The byte at offset +16 in each IR node encodes the type tag (distinct from the opcode byte at +0):
| Type Tag | Meaning | Notes |
|---|---|---|
| 5 | Instruction / expression | Binary ops, comparisons |
| 8 | Constant aggregate | ConstantArray, ConstantStruct |
| 9 | Undef / poison | UndefValue |
| 13 | Integer constant | APInt at +24, bitwidth at +32 |
| 14 | FP constant | APFloat storage |
| 15 | Constant expression | ConstantExpr (GEP, cast, etc.) |
| 16 | Struct / aggregate type | Element list at +32 |
| 17 | MDTuple / metadata node | Metadata tuple |
| 37 | Comparison instruction | ICmp / FCmp predicate |
The type tag at +16 is used by InstCombine (sub_1743DA0) and many other passes to quickly classify nodes without reading the full opcode. The observed range is 5--75, considerably denser than standard LLVM's Value subclass IDs.
Instruction Creation Helpers
NVIDIA's LLVM fork provides a set of instruction creation functions that allocate nodes from the bump allocator, insert them into the appropriate uniquing table, and update use-lists. These are the core IR mutation API:
Primary Instruction Factories
| Address | Size | Signature | LLVM Equivalent |
|---|---|---|---|
| sub_B504D0 | -- | (opcode, op0, op1, state, 0, 0) | BinaryOperator::Create / IRBuilder::CreateBinOp |
| sub_B50640 | -- | (val, state, 0, 0) | Result-typed instruction / CreateNeg wrapper |
| sub_B51BF0 | -- | (inst, src, destTy, state, 0, 0) | IRBuilder::CreateZExtOrBitCast |
| sub_B51D30 | -- | (opcode, src, destTy, state, 0, 0) | CmpInst::Create / IRBuilder::CreateCast |
| sub_B52190 | -- | (...) | BitCastInst::Create |
| sub_B52260 | -- | (...) | GetElementPtrInst::Create (single-index) |
| sub_B52500 | -- | (...) | CastInst::Create with predicate |
| sub_B33D10 | -- | (ctx, intrinsicID, args, numArgs, ...) | IRBuilder::CreateIntrinsicCall |
| sub_BD2DA0 | -- | (80) | Instruction::Create (allocates 80-byte IR node) |
| sub_BD2C40 | -- | (72, N) | Instruction::Create (72-byte base, N operands) |
Opcode Constants for Creation
These numeric opcode values are passed as the first argument to sub_B504D0:
| Value | Operation | Example |
|---|---|---|
| 13 | Sub | sub_B504D0(13, a, b, ...) |
| 15 | FNeg / FSub variant | sub_B504D0(15, ...) |
| 18 | SDiv | sub_B504D0(18, ...) |
| 21 | FMul | sub_B504D0(21, ...) |
| 25 | Or | sub_B504D0(25, ...) |
| 26 | And | sub_B504D0(26, a, mask, ...) |
| 28 | Xor | sub_B504D0(28, ...) |
| 29 | Add | sub_B504D0(29, ...) |
| 30 | Sub | sub_B504D0(30, zero, operand) |
| 32 | Shl | sub_B504D0(32, ...) |
| 33 | AShr | sub_B504D0(33, ...) |
| 38 | And (FP context) | sub_B504D0(38, ...) |
| 40 | ZExt (via sub_B51D30) | sub_B51D30(40, source, resultType) |
| 49 | CastInst | sub_B51D30(49, src, destTy, ...) |
Node Builder / Cloner (sub_16275A0, 21KB)
The IR builder at sub_16275A0 creates new nodes by cloning operand lists from a source node, using the tagged pointer Use-list encoding described above. It dispatches to three specialized constructors:
| Address | Role |
|---|---|
| sub_1627350 | Multi-operand node create (MDTuple::get equivalent). Takes (ctx, operand_array, count, flag0, flag1). Called 463+ times from func-attrs and metadata passes. |
| sub_15B9E00 | Binary node create. Fixed 2-operand layout, minimal header. |
| sub_15C4420 | Variadic node create. Variable operand count, allocates backward operand storage. |
All three ultimately route through the uniquing function sub_162D4F0 to deduplicate structurally-identical nodes.
Infrastructure Functions
| Address | Call Count | Role |
|---|---|---|
| sub_1623A60 | 349x | IRBuilder::CreateBinOp or SCEV type extension |
| sub_1623210 | 337x | IRBuilder::CreateUnaryOp or SCEV use registration |
| sub_15FB440 | 276x | Create node with 5 args: (opcode, type, op1, op2, flags) |
| sub_161E7C0 | 463x | Node accessor / property query (most-called IR function) |
| sub_164B780 | 336x | Use-chain linked list manipulation |
| sub_1648A60 | 406x | Memory allocator: (size, alignment) |
Allocation
NVVM IR nodes are allocated from a slab-based bump allocator:
- Slab growth: 4096 << (slab_index >> 7) -- exponential, capped at 4 TB.
- Alignment: 8 bytes (pointer aligned via (ptr + 7) & ~7).
- Deallocation: no individual free; entire slabs are released at once.
- Overflow: triggers a new slab via malloc().
This is the standard LLVM BumpPtrAllocator pattern, consistent with how upstream LLVM manages IR node lifetimes. The lack of per-node deallocation means the NVVM frontend cannot reclaim memory for dead nodes until the entire context is destroyed.
Cross-References
- DenseMap / Hash Infrastructure -- universal hash function and DenseMap layout
- DAG Node -- SelectionDAG-level node layout (104-byte SDNode)
- NVVM Container -- the NVVMPassOptions/container that wraps the context
- Bitcode I/O -- bitcode opcode encoding and parseFunctionBody
- InstCombine -- the 405KB mega-visitor that consumes these nodes
- NVVM Verifier -- per-opcode validation rules
Function Map
| Role | Address | Size | Notes |
|---|---|---|---|
| Node uniquing: lookup-or-insert, opcode dispatch | sub_162D4F0 | 49KB | -- |
| Node erase from uniquing tables (tombstone writer) | sub_1621740 | 14KB | -- |
| IR builder / node cloner | sub_16275A0 | 21KB | -- |
| Multi-operand node create (MDTuple::get) | sub_1627350 | -- | -- |
| Binary node create | sub_15B9E00 | -- | -- |
| Variadic node create | sub_15C4420 | -- | -- |
| Hash computation for multi-operand nodes | sub_15B3480 | -- | -- |
| Context destructor (frees 20+ hash tables) | sub_1608300 | 90KB | -- |
| LLVMContext destructor (~3,656-byte object) | sub_B76CB0 | 97KB | -- |
| BinaryOperator::Create / IRBuilder::CreateBinOp | sub_B504D0 | -- | -- |
| Result-typed instruction create / CreateNeg | sub_B50640 | -- | -- |
| IRBuilder::CreateZExtOrBitCast | sub_B51BF0 | -- | -- |
| CmpInst::Create / IRBuilder::CreateCast | sub_B51D30 | -- | -- |
| BitCastInst::Create | sub_B52190 | -- | -- |
| GetElementPtrInst::Create (single-index) | sub_B52260 | -- | -- |
| CastInst::Create with predicate | sub_B52500 | -- | -- |
| IRBuilder::CreateIntrinsicCall | sub_B33D10 | -- | -- |
| Instruction::Create (80-byte allocation) | sub_BD2DA0 | -- | -- |
| Instruction::Create (variable-size) | sub_BD2C40 | -- | -- |
| create_empty_ir_node (204 callers, EDG front-end) | sub_72C9A0 | -- | -- |
| IR builder / node constructor (349x calls) | sub_1623A60 | -- | -- |
| IR builder / node constructor variant (337x calls) | sub_1623210 | -- | -- |
| Create node with 5 args (276x calls) | sub_15FB440 | -- | -- |
| Node accessor / property query (463x calls) | sub_161E7C0 | -- | -- |
| BitcodeReader::parseFunctionBody (stock LLVM) | sub_166A310 | 60KB | -- |
| parseFunctionBody (two-phase compilation path) | sub_151B070 | 123KB | -- |
| parseFunctionBody (standalone libNVVM path) | sub_9F2A40 | 185KB | -- |
| InstCombinerImpl::visitInstruction (full opcode switch) | sub_10EE7A0 | 405KB | -- |
| InstCombine master visit dispatcher | sub_F2CFA0 | -- | -- |
| NVVMModuleVerifier (per-opcode validation) | sub_2C80C90 | 51KB | -- |
Instruction Constraint Table (Pattern Database)
The instruction selection backend in cicc v13.0 uses a global constraint table to map target opcodes to their operand requirements. This table drives the sub_B612D0 constraint emission function (104KB), which consults a packed 16-bit word array to determine register classes and constraint patterns for each machine instruction. The constraint table is the single authoritative source of truth for every NVPTX MachineInstr's register requirements -- any reimplementation of the backend codegen must reproduce it exactly.
Global Table: word_3F3E6C0
The constraint table is a statically allocated array of 16-bit words in the .data section at address 0x3F3E6C0, indexed by (opcode - 1). Each entry packs two pieces of information into a single 16-bit word:
| Bits | Field | Meaning |
|---|---|---|
| Low byte (bits 0..7) | constraint_class | Index into the constraint switch (0x00..0xB2) |
| High byte (bits 8..15) | register_class_id | Target register class for the result |
The access pattern from sub_B612D0:
// sub_B612D0(a1, a2) where a2 = MachineInstr opcode
v4 = HIBYTE(word_3F3E6C0[a2 - 1]); // register class for output
switch (LOBYTE(word_3F3E6C0[a2 - 1])) // constraint class -> switch case
There are exactly 179 distinct constraint classes (0x00 through 0xB2), each encoding a specific operand pattern for a category of instructions. Multiple opcodes can share the same constraint class if they have identical operand signatures.
Constraint Descriptor Layout
Each constraint descriptor is a stack-allocated array of 16-byte entries built within sub_B612D0's frame. The frame is approximately 0x160 bytes deep. Stack slots span [rsp-0x158] through [rsp-0x20]:
| Offset | Size | Field |
|---|---|---|
| +0 | 4B | constraint_kind (int32) |
| +4 | 4B | (padding / alignment) |
| +8 | 8B | value (int64: register class ID or operand reference) |
Entry stride: 16 bytes (8-byte aligned pairs of {int32 kind, int32 pad, int64 value}).
The constraint_kind values determine the role of each entry in the descriptor array:
| Kind | Meaning |
|---|---|
| -1 | Output/result operand (always the last entry in the array) |
| 0 | Input operand at position 0 |
| 1 | Input operand at position 1 |
| 2 | Input operand at position 2 |
| 3..N | Input operands at higher positions |
The output entry (kind = -1) carries the result register class. Input entries carry the register class constraint for each source operand. The maximum observed operand count is 17 (constraint class 0xB0, corresponding to opcode 176 in the table), requiring 18 descriptor entries = 288 bytes of stack space.
Register Class IDs
The register_class_id in the high byte maps to NVIDIA GPU register files. Values recovered from sub_A778C0 (register class constraint creator), sub_B5BA00 (register class set builder, 111 cases), and sub_2163730 (PTX emission naming):
These IDs are specific to the pattern database constraint system and differ from the 4-bit class tags used in register encoding (see Register Classes for vtable addresses, PTX types, prefixes, and encoded IDs).
| ID | Register Class | Width |
|---|---|---|
| 14 | Int32Regs (%r) | 32 bits |
| 22 | Int16Regs (%rs) | 16 bits |
| 24 | Int16HalfRegs (%h) | 16 bits (f16/bf16) |
| 27 | Int32HalfRegs (%hh) | 32 bits (v2f16/v2bf16) |
| 29 | (unidentified) | -- |
| 32 | (unidentified) | -- |
| 36 | (unidentified) | -- |
| 39 | (unidentified) | -- |
| 40 | Float32Regs (%f) | 32 bits |
| 41 | (unidentified) | -- |
| 43 | Float16Regs (%h, alias of Int16HalfRegs) | 16 bits |
| 50 | Int64Regs (%rd) | 64 bits |
| 51 | Float64Regs (%fd) | 64 bits |
| 52 | Int128Regs (%rq) | 128 bits |
| 67 | (unidentified) | -- |
| 72 | (unidentified) | -- |
| 76 | (unidentified) | -- |
| 78 | Int1Regs (%p) | 1 bit |
| 86 | SpecialRegs (internal-only, off_4A026E0) | varies |
IDs 29, 32, 36, 39, 41, 67, 72, 76 appear in the sub_B612D0 table but have not been definitively mapped to named register classes. They likely correspond to sub-register classes, tied-operand classes, or WMMA accumulator classes that cicc defines beyond the 9 primary classes documented in reference/register-classes.md.
Constraint Type Classification
A secondary classification table at byte_3F252E0 categorizes constraint entries into four families (recovered from sub_A7A6D0 constraint merge/intersection logic at 0xA78000):
| Classification Byte | Family | Applies To |
|---|---|---|
| 0x00 | Simple/scalar | Single-register operands; the vast majority of ALU constraints |
| 0x08 | Ordered | Operands with fixed positional requirements (tied operands) |
| 0x10 | Sized/ranged | Operands with explicit bit-width requirements (sub-register extracts) |
| 0x18 | Compound | Multi-register operands; types 86-97 in the classification table |
The merge function sub_A7A6D0 (7KB) performs set intersection across constraint families when two constraint sets must be unified (e.g., during register coalescing or inline asm constraint resolution). The "compound" family (0x18) covers instructions that require register pairs or wider groupings -- tensor core MMA instructions fall into this category.
Key Sub-Functions
The constraint emission pipeline involves these collaborating functions:
| Address | Size | Function | Purpose |
|---|---|---|---|
| sub_A778C0 | -- | createRegClassConstraint(a1, regclass, flags) | Build a register-class constraint entry; stores class ID in value field |
| sub_A77AD0 | -- | createAnyRegConstraint(a1, flags) | Build an "any register" constraint (unconstrained operand) |
| sub_A79C90 | -- | composeConstraints(a1, &desc, N) | Compose N descriptor entries into a single constraint record |
| sub_A7A6D0 | 7KB | mergeConstraints(a1, a2) | Merge/intersect two constraint sets using byte_3F252E0 classification |
| sub_B5BA00 | 21KB | createOutputConstraint(a1, regclass_id) | Build the output register constraint; 111-case switch on class ID |
| sub_A78010 | -- | emitConstraint(a1, &desc_array, N) | Emit the final constraint with N entries to the instruction descriptor |
| sub_B612D0 | 104KB | emitInstrConstraint(a1, opcode) | Top-level: lookup word_3F3E6C0, dispatch on constraint class, build and emit |
The sub_B5BA00 function (21KB) is itself a 111-case switch that translates register class IDs into the internal constraint representation. It produces the value field for output constraint entries. Its size suggests that it handles not just the 9 primary register classes but also sub-register classes, paired classes, and special accumulator classes for tensor operations.
Constraint Switch Structure
The 179-case switch in sub_B612D0 is the heart of the pattern database. Each case constructs a fixed sequence of constraint descriptors on the stack, then calls sub_A78010 to emit them. The cases can be organized into major families based on operand count and register class patterns.
Family 1: Unary Instructions (1 input, 1 output)
These are the simplest constraints: one input operand and one result. Two descriptor entries (32 bytes on stack). Representative constraint classes:
// Constraint class 0x01 — Unary ALU, same type in/out
// Example: MOV, NEG, NOT, ABS for Int32Regs
// Opcode lookup: word_3F3E6C0[opcode - 1] = 0x0E01 (class=0x01, regclass=14=Int32)
case 0x01:
desc[0] = { kind=0, value=sub_A778C0(a1, v4, 0) } // input[0]: same class as output
desc[1] = { kind=-1, value=sub_B5BA00(a1, v4) } // output: regclass from high byte
sub_A78010(a1, desc, 2)
Constraint classes in this family include 0x01 through approximately 0x08, covering unary operations across all scalar register classes. The register class v4 (from the high byte) determines whether the instruction operates on Int32, Int64, Float32, Float64, Pred, or another class. The same constraint class is reused for multiple opcodes that share the same operand signature.
Family 2: Binary ALU Instructions (2 inputs, 1 output)
The most common family. Three descriptor entries (48 bytes on stack). Covers all two-operand arithmetic and logic instructions:
// Constraint class 0x09 — Binary ALU, all same type
// Example: ADD, SUB, MUL, AND, OR, XOR for Int32Regs
// Opcode lookup: word_3F3E6C0[opcode - 1] = 0x0E09 (class=0x09, regclass=14=Int32)
case 0x09:
desc[0] = { kind=0, value=sub_A778C0(a1, v4, 0) } // input[0]: Int32
desc[1] = { kind=1, value=sub_A778C0(a1, v4, 0) } // input[1]: Int32
desc[2] = { kind=-1, value=sub_B5BA00(a1, v4) } // output: Int32
sub_A78010(a1, desc, 3)
Variants within this family differ in whether inputs are constrained to the same class as the output or to a different class. For instance, shift instructions constrain the shift amount (input[1]) to Int32 regardless of the data type of input[0]:
// Constraint class 0x0C — Binary with mixed types (shift-like)
// Example: SHL.b64, SHR.b64 (data=Int64, shift_amount=Int32)
// Opcode lookup: word_3F3E6C0[opcode - 1] = 0x320C (class=0x0C, regclass=50=Int64)
case 0x0C:
desc[0] = { kind=0, value=sub_A778C0(a1, v4, 0) } // input[0]: Int64 (data)
desc[1] = { kind=1, value=sub_A778C0(a1, 14, 0) } // input[1]: Int32 (shift amount)
desc[2] = { kind=-1, value=sub_B5BA00(a1, v4) } // output: Int64
sub_A78010(a1, desc, 3)
Family 3: Comparison / Predicate-Producing Instructions (2 inputs, predicate output)
Comparison instructions produce a predicate register result regardless of the input type. Three descriptor entries:
// Constraint class 0x10 — Compare, predicate output
// Example: SETP.EQ.s32, SETP.LT.f32
// Opcode lookup: word_3F3E6C0[opcode - 1] = 0x4E10 (class=0x10, regclass=78=Pred)
case 0x10:
desc[0] = { kind=0, value=sub_A778C0(a1, <input_class>, 0) } // input[0]: operand type
desc[1] = { kind=1, value=sub_A778C0(a1, <input_class>, 0) } // input[1]: operand type
desc[2] = { kind=-1, value=sub_B5BA00(a1, 78) } // output: Pred (%p)
sub_A78010(a1, desc, 3)
The input register class is determined by the instruction variant (integer comparison vs. float comparison), while the output is always predicate register class 78.
Family 4: Ternary / FMA Instructions (3 inputs, 1 output)
Fused multiply-add and select instructions require four descriptor entries (64 bytes on stack):
// Constraint class 0x18 — Ternary FMA, all same float type
// Example: FMA.RN.f32 (a * b + c)
// Opcode lookup: word_3F3E6C0[opcode - 1] = 0x2818 (class=0x18, regclass=40=Float32)
case 0x18:
desc[0] = { kind=0, value=sub_A778C0(a1, v4, 0) } // input[0]: Float32 (a)
desc[1] = { kind=1, value=sub_A778C0(a1, v4, 0) } // input[1]: Float32 (b)
desc[2] = { kind=2, value=sub_A778C0(a1, v4, 0) } // input[2]: Float32 (c)
desc[3] = { kind=-1, value=sub_B5BA00(a1, v4) } // output: Float32 (result)
sub_A78010(a1, desc, 4)
Select/conditional-move instructions also fall here, with one predicate input and two data inputs:
// Constraint class 0x1A — Select (pred, trueval, falseval)
// Example: SELP.b32 (predicated select)
case 0x1A:
desc[0] = { kind=0, value=sub_A778C0(a1, 78, 0) } // input[0]: Pred (condition)
desc[1] = { kind=1, value=sub_A778C0(a1, v4, 0) } // input[1]: data (true value)
desc[2] = { kind=2, value=sub_A778C0(a1, v4, 0) } // input[2]: data (false value)
desc[3] = { kind=-1, value=sub_B5BA00(a1, v4) } // output: data (selected)
sub_A78010(a1, desc, 4)
Family 5: Memory Instructions (load/store with address operands)
Load instructions produce a data result from an address operand. Store instructions consume both data and address. These constraint classes handle the different address space qualifiers and vector widths:
// Constraint class 0x20 — Scalar load from address
// Example: LD.GLOBAL.b32 (global memory load)
// Opcode lookup: word_3F3E6C0[opcode - 1] = 0x0E20 (class=0x20, regclass=14=Int32)
case 0x20:
desc[0] = { kind=0, value=sub_A778C0(a1, 50, 0) } // input[0]: Int64 (address pointer)
desc[1] = { kind=-1, value=sub_B5BA00(a1, v4) } // output: Int32 (loaded data)
sub_A78010(a1, desc, 2)
Vector load variants (LoadV2, LoadV4) use additional output entries for each vector lane:
// Constraint class 0x22 — Vector load V2 (two-element)
// Example: LD.GLOBAL.V2.b32 (load 2x Int32)
case 0x22:
desc[0] = { kind=0, value=sub_A778C0(a1, 50, 0) } // input[0]: Int64 (address)
desc[1] = { kind=1, value=sub_A778C0(a1, v4, 0) } // input[1]: (offset/predicate)
desc[2] = { kind=-1, value=sub_B5BA00(a1, v4) } // output: data element 0
// Second output encoded separately via sub_A79C90 composition
sub_A78010(a1, desc, 3)
Store instructions have no result output (kind = -1 carries a sentinel value or void class):
// Constraint class 0x28 — Scalar store
// Example: ST.GLOBAL.b32 (global memory store)
case 0x28:
desc[0] = { kind=0, value=sub_A778C0(a1, v4, 0) } // input[0]: data to store
desc[1] = { kind=1, value=sub_A778C0(a1, 50, 0) } // input[1]: Int64 (address)
desc[2] = { kind=-1, value=sub_B5BA00(a1, 86) } // output: SpecialRegs (chain/token)
sub_A78010(a1, desc, 3)
Family 6: Type Conversion Instructions (input and output differ)
Conversion instructions have an input class that differs from the output class. The constraint class encodes the specific pair:
// Constraint class 0x30 — CVT from Int32 to Float32
// Example: CVT.RN.f32.s32
// Opcode lookup: word_3F3E6C0[opcode - 1] = 0x2830 (class=0x30, regclass=40=Float32)
case 0x30:
desc[0] = { kind=0, value=sub_A778C0(a1, 14, 0) } // input[0]: Int32 (source)
desc[1] = { kind=-1, value=sub_B5BA00(a1, 40) } // output: Float32 (result)
sub_A78010(a1, desc, 2)
// Constraint class 0x32 — CVT from Float64 to Int64
// Example: CVT.RTZ.s64.f64
// Opcode lookup: word_3F3E6C0[opcode - 1] = 0x3232 (class=0x32, regclass=50=Int64)
case 0x32:
desc[0] = { kind=0, value=sub_A778C0(a1, 51, 0) } // input[0]: Float64 (source)
desc[1] = { kind=-1, value=sub_B5BA00(a1, 50) } // output: Int64 (result)
sub_A78010(a1, desc, 2)
Widening/narrowing conversions between integer sizes and float-to-half conversions each have their own constraint class.
Family 7: Copy / Move Instructions (register transfer)
The copy family (opcodes 440-503) maps to constraint classes that encode same-class and cross-class register transfers:
// Constraint class 0x40 — Same-class copy
// Example: MOV.b32 (Int32 -> Int32)
// Used by opcodes 440-443 (type-preserving moves)
case 0x40:
desc[0] = { kind=0, value=sub_A778C0(a1, v4, 0) } // input[0]: same class
desc[1] = { kind=-1, value=sub_B5BA00(a1, v4) } // output: same class
sub_A78010(a1, desc, 2)
// Constraint class 0x42 — Cross-class copy (Int32 <-> Float32)
// Example: MOV from Int32Regs to Float32Regs (bitcast-level move)
// Used by opcodes 444+ (cross-class moves)
case 0x42:
desc[0] = { kind=0, value=sub_A778C0(a1, <source_class>, 0) } // input: source class
desc[1] = { kind=-1, value=sub_B5BA00(a1, <dest_class>) } // output: dest class
sub_A78010(a1, desc, 2)
Cross-class copies are never coalesced by the register coalescer (they remain as explicit mov instructions in PTX output). The constraint table enforces this by assigning distinct source and destination classes.
Family 8: Call ABI Instructions (parameter declaration and passing)
The NVPTX calling convention uses special opcodes for .param space management. These have unique constraint classes with no data register operands:
// Constraint class 0x50 — DeclareParam (opcode 505)
// Declares a .param space allocation for function argument passing
case 0x50:
desc[0] = { kind=0, value=sub_A77AD0(a1, 0) } // input[0]: "any" (chain token)
desc[1] = { kind=-1, value=sub_B5BA00(a1, 86) } // output: SpecialRegs (chain)
sub_A78010(a1, desc, 2)
Call sequence opcodes (315=CallSeqBegin, 514=CallStart, 517=CallSeqEnd, 518=CallProto) all use constraint classes that operate on chain tokens rather than data registers. Their inputs and outputs are in the SpecialRegs class (ID 86).
Family 9: Atomic Instructions (address + data + result)
Atomic operations require an address, a data operand, and produce a result of the same data type:
// Constraint class 0x60 — Atomic RMW (read-modify-write)
// Example: ATOM.ADD.s32 (atomic add on Int32)
// Opcodes 294-297 (atom.add family)
case 0x60:
desc[0] = { kind=0, value=sub_A778C0(a1, 50, 0) } // input[0]: Int64 (address)
desc[1] = { kind=1, value=sub_A778C0(a1, v4, 0) } // input[1]: data (value to add)
desc[2] = { kind=-1, value=sub_B5BA00(a1, v4) } // output: data (old value)
sub_A78010(a1, desc, 3)
Atomic compare-and-swap (opcode 462 = atom.cas) requires four operands (address, expected, desired, result):
// Constraint class 0x62 — Atomic CAS
// Example: ATOM.CAS.b32 (compare-and-swap)
case 0x62:
desc[0] = { kind=0, value=sub_A778C0(a1, 50, 0) } // input[0]: Int64 (address)
desc[1] = { kind=1, value=sub_A778C0(a1, v4, 0) } // input[1]: data (expected)
desc[2] = { kind=2, value=sub_A778C0(a1, v4, 0) } // input[2]: data (desired)
desc[3] = { kind=-1, value=sub_B5BA00(a1, v4) } // output: data (old value)
sub_A78010(a1, desc, 4)
Family 10: Tensor Core / MMA Instructions (many inputs, many outputs)
The most complex constraint classes handle tensor core matrix operations. These instructions consume multiple register-pair or register-quad operands and produce multiple results. Constraint class 0xB0 is the extreme case with 17 input operands:
// Constraint class 0xB0 — Complex MMA (17 inputs, 1+ outputs)
// Example: tcgen05.mma variants (Blackwell, opcodes 4905-4940)
// This is the maximum-operand constraint class.
case 0xB0:
for (i = 0; i < 17; i++) {
desc[i] = { kind=i, value=sub_A778C0(a1, <operand_class[i]>, 0) }
}
desc[17] = { kind=-1, value=sub_B5BA00(a1, v4) }
sub_A78010(a1, desc, 18)
HMMA/IMMA/BMMA instructions (the SM70+ tensor core families at sub_21E0360-sub_21E2280) use constraint classes in the 0x90-0xAF range, typically with 4-8 register inputs (accumulator fragments) and 4-8 register outputs. The operand classes include Int32HalfRegs (ID 27) for packed f16 pairs and Int128Regs (ID 52) for wide accumulator state.
Family 11: Predicated Instructions (extra predicate input)
Many NVPTX instructions support predication, where execution is conditional on a predicate register. Predicated variants append an extra Pred-class input:
// Constraint class 0x70 — Predicated binary ALU
// Example: @%p0 ADD.s32 %r1, %r2, %r3 (conditional add)
case 0x70:
desc[0] = { kind=0, value=sub_A778C0(a1, 78, 0) } // input[0]: Pred (guard)
desc[1] = { kind=1, value=sub_A778C0(a1, v4, 0) } // input[1]: data (src0)
desc[2] = { kind=2, value=sub_A778C0(a1, v4, 0) } // input[2]: data (src1)
desc[3] = { kind=-1, value=sub_B5BA00(a1, v4) } // output: data (result)
sub_A78010(a1, desc, 4)
Family 12: Special / Barrier Instructions (chain-only)
Barrier and synchronization instructions have no data operands. They operate purely on the chain token for ordering:
// Constraint class 0x80 — Barrier/Fence (chain-only)
// Example: BAR.SYNC (opcodes 287-290)
case 0x80:
desc[0] = { kind=0, value=sub_A77AD0(a1, 0) } // input[0]: "any" (chain in)
desc[1] = { kind=-1, value=sub_B5BA00(a1, 86) } // output: SpecialRegs (chain out)
sub_A78010(a1, desc, 2)
Pattern Matching Dispatch
The constraint table is consumed during instruction selection by the three-level dispatch hierarchy:
- Driver (sub_3090F90, 91KB): Builds a cost table for function arguments via hash(key*37), uses a min-heap priority queue for topological-order traversal, iterates with budget = 4 * numInstructions * maxBlockSize.
- Matcher (sub_308FEE0): Called per-SDNode from the driver. Dispatches to the hand-written selector or the TableGen-generated selector.
- Hand-written selector (sub_347A8D0, 309KB): Giant switch on ISD/NVPTXISD opcodes. Calls sub_969240 (SDNode accessor) 263 times. Recursive with 42 self-calls. Handles tex/surf, wmma, atomics, barriers.
- TableGen-generated selector (sub_348D3E0, 256KB): Auto-generated from NVPTX .td instruction pattern definitions. Calls sub_969240 45 times, sub_32889F0 38 times.
- Complex addressing mode selector (sub_33D4EF0, 114KB): Handles NVPTX load/store addressing with address space qualifiers. Calls sub_969240 399 times -- the single function with the most SDNode accesses in the entire binary.
After pattern matching selects a MachineInstr opcode, the constraint table is queried via sub_B612D0 to determine register requirements. The selected opcode is the index into word_3F3E6C0.
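The 16-bit table entry packs both dispatch keys, as the worked examples above show (0x0E20, 0x2830, 0x3232). A minimal sketch of the decoding, with illustrative names (the binary has no recovered symbols for these fields):

```python
# Model of a word_3F3E6C0 entry: low byte selects the constraint-class
# switch case, high byte is the register class ID (held in v4).
def decode_constraint_entry(entry: int) -> tuple[int, int]:
    constraint_class = entry & 0xFF          # switch selector
    register_class_id = (entry >> 8) & 0xFF  # stored as v4
    return constraint_class, register_class_id

# 0x0E20 -> class 0x20 (scalar load), regclass 14 (Int32)
assert decode_constraint_entry(0x0E20) == (0x20, 14)
# 0x2830 -> class 0x30 (CVT s32->f32), regclass 40 (Float32)
assert decode_constraint_entry(0x2830) == (0x30, 40)
```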
Operand Binding
When the constraint emission function sub_B612D0 builds the descriptor array, operand binding follows this protocol:
1. Lookup: Read word_3F3E6C0[opcode - 1]. Extract constraint_class (low byte) and register_class_id (high byte, stored as v4).
2. Switch dispatch: Branch to the case for constraint_class.
3. Input construction: For each input operand position i:
   - Call sub_A778C0(a1, class_id, flags) to create a register-class constraint entry.
   - The class_id is either v4 (same class as output) or a hardcoded value (different class for mixed-type instructions).
   - The flags parameter encodes operand modifiers (tied, early-clobber, etc.).
   - Store the result in desc[i] with kind = i.
4. Output construction: Call sub_B5BA00(a1, v4) to create the output constraint. sub_B5BA00 is a 21KB function with 111 switch cases that translates the register class ID into the internal output representation. Store in desc[N] with kind = -1.
5. Emission: Call sub_A78010(a1, desc, N+1) to finalize. This function walks the descriptor array, validates constraint consistency, and writes the constraint record into the instruction's operand descriptor table.
For instructions that use sub_A77AD0 ("any register" constraint), the operand accepts any register class. This is used for chain tokens, inline asm operands with unconstrained registers, and certain special-purpose slots.
For composition of multi-output instructions, sub_A79C90 merges multiple descriptor sub-arrays into a single compound constraint. This is needed for vector loads (LoadV2, LoadV4) and MMA instructions that produce multiple result registers.
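The binding protocol above can be modeled in a few lines. This is an illustrative sketch, not recovered code: build_descriptors stands in for the sub_A778C0/sub_B5BA00 calls, and the dict entries mirror the {kind, value} pairs in the case bodies:

```python
# Inputs get kind = 0..N-1, the output entry gets kind = -1, and emission
# (sub_A78010 in the binary) receives N+1 descriptors.
def build_descriptors(input_classes: list[int], output_class: int) -> list[dict]:
    desc = [{"kind": i, "cls": c} for i, c in enumerate(input_classes)]
    desc.append({"kind": -1, "cls": output_class})
    return desc

# Constraint class 0x60 (atomic RMW): Int64 address (50), data v4, result v4
v4 = 14  # Int32
desc = build_descriptors([50, v4], v4)
assert len(desc) == 3 and desc[-1]["kind"] == -1
```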
Allocation
The global table word_3F3E6C0 is in the .data section, allocated at link time. It is read-only after cicc process startup. Constraint descriptors are purely stack-allocated within sub_B612D0's frame (approximately 0x160 bytes deep). No heap allocation occurs during constraint emission. This makes the constraint emission path allocation-free and safe for use in concurrent compilation (the function is reentrant as long as each thread has its own stack frame).
Cross-References
- reference/register-classes.md -- Authoritative register class table with encoding scheme
- reference/nvptx-opcodes.md -- NVPTX MachineInstr opcode inventory (consumers of this constraint table)
- llvm/isel-patterns.md -- ISel pattern matching that feeds opcodes into this table
- llvm/selectiondag.md -- SelectionDAG construction that precedes constraint emission
- llvm/register-allocation.md -- Greedy RA that consumes the emitted constraints
- llvm/register-coalescing.md -- Copy family opcodes 440-503 and coalescing constraints
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
createRegClassConstraint | sub_A778C0 | -- | Build register-class input constraint entry |
createAnyRegConstraint | sub_A77AD0 | -- | Build unconstrained ("any") input constraint |
composeConstraints | sub_A79C90 | -- | Merge N descriptor entries into compound constraint |
mergeConstraints | sub_A7A6D0 | 7KB | Set-intersection of constraints using byte_3F252E0 |
emitConstraint | sub_A78010 | -- | Finalize and emit constraint record |
createOutputConstraint | sub_B5BA00 | 21KB | 111-case switch: class ID to output representation |
emitInstrConstraint | sub_B612D0 | 104KB | Top-level: 179-case constraint class dispatch |
decodeOperandType | sub_B6B200 | 44KB | 101-case operand type decoder from bytecode stream |
SelectionDAG Node Structure
The SelectionDAG (SDNode) is the central data structure in cicc's code generation backend. Nodes represent operations in the target-independent DAG before instruction selection lowers them to machine instructions. The DAG builder (sub_2081F00, 267KB) converts LLVM IR into an initial DAG by visiting each IR instruction through a dispatch chain rooted at sub_2065D30. Nodes are deduplicated via a CSE hash table (sub_F4CEE0, 41KB) and allocated from a bump allocator embedded in the builder context object. The complete SelectionDAG pipeline then runs type legalization, operation legalization, DAG combining, and instruction selection over this graph before emitting PTX machine instructions.
SDNode Layout (104 Bytes, Two Views)
Every SDNode is allocated as exactly 104 bytes, hardcoded in sub_163D530. After allocation, all fields are zeroed. Two complementary views of the layout have been recovered: the "allocator view" from the zeroing pattern in sub_163D530, and the "accessor view" from field access patterns across the combiner (sub_F20C20), legalization (sub_1FFB890), and known-bits engine (sub_33D4EF0).
Allocator View (from sub_163D530)
The raw 104 bytes are zeroed via a combination of qword and dword stores:
qw[0..5] = 0, dw[6] = 0, qw[8..10] = 0, dw[11] = 0, byte[96] = 0
The statistics counter at context offset +96 is incremented by 104 for every allocation: *(_QWORD *)(v4 + 96) += 104LL.
Accessor View (Composite from Combiner, Legalizer, KnownBits)
The following table reconciles field accesses across sub_F20C20 (DAG combiner visitor), sub_1FFB890 (LegalizeOp), sub_33D4EF0 (computeKnownBits, 114KB), and sub_1FCE100 (LegalizeOp dispatcher):
| Offset | Size | Type | Field | Evidence |
|---|---|---|---|---|
| +0 | 8B | SDNode* | chain_next / first operand value | D03: *(qword*)(N+0) used as first operand in single-operand patterns |
| +4 | 4B | uint32_t | NumOperands_packed | D03: *(dword*)(N+4) & 0x7FFFFFF = NumOperands (low 27 bits); bits 27--30 = flags; bit 30 (0x40 in byte +7) = hasChainOps |
| +7 | 1B | uint8_t | node_flags_byte | D03: bit 4 = hasDebugLoc; bit 6 = hasChainPtr (operand list at N-8) |
| +8 | 8B | SDVTList* | VTList / ValueType pointer | D03: *(qword*)(N+8) = result value type descriptor; D05: read for MVT extraction |
| +16 | 8B | SDUse* | UseList | D03: head of use-def chain (doubly-linked list) |
| +24 | 4B | uint16_t | opcode | D02: *(uint16_t*)(node+24) = SDNode::getOpcode(); D05: *(a3+24) switched upon |
| +28 | 4B | uint32_t | opcode_flags | D05: *(a3+28) = sub-flags (nsw/nuw/exact bits) |
| +32 | 8B | SDUse* | operand_list | D02: *(node+32) = pointer to first operand SDUse; operand stride = 40 bytes |
| +33 | 1B | uint8_t | extension_mode | D05: *(a3+33) bits[2:3] = load extension mode (0=none, 1=anyext, 2=sext, 3=zext, matching LLVM's ISD::LoadExtType ordering) |
| +40 | 8B | ptr | value_list / operand[0] type | D02: *(node+40) = SDValue type info; D01: result type descriptor |
| +48 | 8B | EVT | result_VT | D05: *(a3+48) = result VT list, 16-byte entries {u16 MVT, pad, u64 ext} |
| +60 | 4B | uint32_t | num_values | D02: number of result values |
| +64 | 4B | uint32_t | flags / num_operands_alt | D05: *(a3+64) = operand count (alternate access path in KnownBits) |
| +72 | 8B | SDValue | chain_operand / result EVT | D03: *(qword*)(N+72) = result value type; D01: chain operand for memory ops |
| +80 | 8B | ptr | metadata / mem operand | D01: *(node+80) = predicate for CAS; extra metadata |
| +88 | 4B | uint32_t | address_space / ordering | D01: *(node+88) = memory operand / address-space descriptor |
| +96 | 8B | uint64_t | immediate_value | D05: *(a3+96) = constant value for ConstantSDNode (width <= 64) |
| +104 | 8B | ptr | extended_data | D05: *(a3+104) = second immediate, type info for wide constants |
| +112 | 8B | ptr | mem_chain / alignment | D05: *(a3+112) = MemSDNode chain / alignment info |
Note on dual access patterns. The combiner accesses opcodes at N+24 as a 4-byte field with flags, while the legalizer reads *(uint16_t*)(node+24) for a clean 16-bit opcode. The KnownBits engine (sub_33D4EF0) accesses fields at offsets up to +112, confirming that ConstantSDNode and MemSDNode subclasses extend beyond the base 104-byte allocation. These extended nodes are allocated via sub_BD2DA0 (80 bytes for lightweight variants) or sub_22077B0 (128 bytes for MemSDNode), while the base SDNode remains 104 bytes.
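The packed dword at node+4 can be modeled directly from the accessor table. A sketch under the recovered bit assignments (field names are illustrative):

```python
# Packed operand-count dword at node+4: low 27 bits = NumOperands,
# bits 27-30 = flags, bit 30 = hasChainOps. Bit 30 of this dword is the
# same bit seen as (byte at node+7) & 0x40 in the flag-byte access path.
def unpack_numops(dw: int) -> tuple[int, int, bool]:
    num_operands = dw & 0x7FFFFFF
    flag_bits = (dw >> 27) & 0xF
    has_chain_ops = bool(dw & (1 << 30))
    return num_operands, flag_bits, has_chain_ops

dw = (1 << 30) | 3                    # 3 operands, chain flag set
assert unpack_numops(dw) == (3, 8, True)
assert ((dw >> 24) & 0xFF) & 0x40     # visible as byte[+7] & 0x40
```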
Operand Storage
Operands are stored in a contiguous array of SDUse structures. Two storage modes exist:
Mode A -- backward inline (common for small operand counts). Operands are stored before the node in memory, growing toward lower addresses:
operand[i] = *(qword*)(N + 32*(i - NumOps))
// or equivalently: N - 32*NumOps = first operand address
This 32-byte operand stride is confirmed across sub_F3D570, sub_F20C20, and sub_F5A610.
Mode B -- indirect pointer (when node_flags_byte bit 6 is set). An 8-byte pointer at N-8 points to a separately allocated operand array:
if (*(byte*)(N+7) & 0x40):
operand_base = *(qword*)(N - 8)
The SDUse structure (each operand slot) has a 40-byte stride in the legalizer view (sub_1FFB890) and a 32-byte stride in the combiner view. The 40-byte stride includes use-chain forward/backward pointers:
| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 8B | Val | Pointer to the SDNode this use points to |
| +8 | 4B | ResNo | Result number within the pointed-to node |
| +16 | 8B | Next | Next SDUse in the use-list of the defining node |
| +24 | 8B | Prev | Previous SDUse (for doubly-linked list) |
| +32 | 8B | User | Back-pointer to the node that owns this operand |
Use-list traversal functions: sub_B43C20 (add to use list), sub_B43D60 (remove from use list).
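The two storage modes reduce to a small address computation (shown here with the 32-byte combiner-view stride). A sketch with a fake memory read standing in for dereferencing the decompiled image:

```python
# Mode A: operands stored backward inline before the node (N - 32*NumOps).
# Mode B: flag bit 0x40 at node+7 means an 8-byte pointer at N-8 holds
# the base of a separately allocated operand array.
def operand_addr(node: int, flags_byte7: int, num_ops: int, i: int,
                 read_qword) -> int:
    if flags_byte7 & 0x40:                    # Mode B: indirect
        return read_qword(node - 8) + 32 * i
    return node + 32 * (i - num_ops)          # Mode A: backward inline

mem = {0x1000 - 8: 0x9000}                    # fake image: pointer at N-8
assert operand_addr(0x1000, 0x40, 2, 1, mem.__getitem__) == 0x9000 + 32
assert operand_addr(0x1000, 0x00, 2, 0, mem.__getitem__) == 0x1000 - 64
```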
SDValue
An SDValue is a lightweight {SDNode*, unsigned ResNo} pair identifying a specific result of a specific DAG node. In the decompiled code, SDValues appear as 16-byte pairs at various points:
struct SDValue {
SDNode *Node; // +0: pointer to the defining node
uint32_t ResNo; // +8: which result of that node (0-based)
};
SDValues are passed by value in registers (packed into __m128i in many decompiled signatures) and stored in operand arrays. The SDUse structure wraps an SDValue with use-chain linkage for the def-use graph.
SelectionDAG Builder Context
The builder context is the a1/v4 parameter to sub_163D530. It holds the function being compiled, target information, the bump allocator state, and several DenseMaps for node deduplication.
| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 8B | func_ptr | The LLVM function being compiled (a2) |
| +8 | 8B | target_ptr | Target machine info (a4) |
| +16 | 8B | alloc_cursor | Bump allocator current position |
| +24 | 8B | alloc_end | Bump allocator end boundary |
| +32 | 8B | slab_array | Pointer to array of slab pointers |
| +40 | 4B | slab_index | Current slab number (dword) |
| +44 | 4B | slab_capacity | Max slabs in array (dword) |
| +48 | var | inline_slab | Start of first allocation region |
| +80 | 8B | bb_list_head | Basic block list sentinel (points to +96) |
| +88 | 8B | bb_list_count | Number of basic blocks (init 0) |
Embedded DenseMaps
Three DenseMap/DenseSet instances are embedded inline in the context for node deduplication and worklist tracking. All use the standard DenseMap infrastructure with NVVM-layer sentinels (-8 / -16); see Hash Table and Collection Infrastructure for the hash function, probing strategy, and growth policy.
Map A (CSE node mapping) at offsets +120..+148:
| Offset | Size | Field |
|---|---|---|
| +120 | 8B | NumEntries |
| +128 | 8B | Buckets pointer |
| +136 | 4B | NumItems |
| +140 | 4B | NumTombstones |
| +144 | 4B | NumBuckets |
Map B (secondary set) at offsets +152..+176, same layout.
Set C (worklist) at offsets +184..+208, same layout.
Total minimum context size: 212 bytes.
Map A uses 16-byte bucket stride (key + value pairs), confirmed by the decompiled access pattern:
v30 = (_QWORD *)(v28 + 16LL * v29); // 16-byte stride
*v30 = v11; // key
v30[1] = v19; // value
DAG Builder Algorithm (SelectionDAGBuilder)
The SelectionDAGBuilder converts LLVM IR to an initial SelectionDAG. The main entry is sub_2081F00 (267KB, ~9,000 lines), with the visit dispatcher at sub_2065D30 (25KB). The builder processes one basic block at a time, walking the IR instruction list and emitting corresponding SDNode subgraphs.
Entry and Dispatch
sub_2081F00(SelectionDAGBuilder *this, BasicBlock *BB):
// this+552 = SelectionDAG pointer
// this+560 = DataLayout pointer
// Walk BB instruction list via linked list at BB+40/+48
for each instruction I in BB:
sub_2065D30(this, I) // main visit dispatch
The visit dispatcher (sub_2065D30) contains a DenseMap for node deduplication (hash function: (key >> 9) ^ (key >> 4)). It switches on the IR opcode and delegates to per-instruction visitors:
| IR Instruction | Visitor Function | Size | Notes |
|---|---|---|---|
| Binary ops | sub_206E5B0--sub_206F0D0 | 2.3KB each | 8 identical template instantiations for different ISD opcodes |
| Call | sub_208CF60 | 56KB | Calls sub_20C7CE0 (NVPTX ComputeCalleeInfo) |
| Load | sub_209B000 | 15KB | Chains via sub_2051C20 |
| Store | sub_2090780 | 14KB | Alignment, volatile, chain tokens |
| Switch/Br | sub_20912B0 | 18KB | Jump tables, range checks |
| PHI | sub_20920A0 | 13KB | Block ordering, vreg setup |
| GEP | sub_209FCA0 | 13KB | Recursive address building |
| Intrinsic | sub_208C8A0 | 9KB | Dispatches to intrinsic handlers |
| Debug | sub_208C270 | 7KB | Debug value/location handling |
| Inline Asm | sub_2079C70 | 83KB | Full constraint parsing |
| NVVM Tex/Surf | sub_2077400 | 20KB | "nvvm_texsurf_handle" metadata, NVIDIA custom |
| NVVM Args | sub_2072590 | 38KB | CUDA argument coercion, NVIDIA custom |
Chain Management
Every memory-touching SDNode carries a chain operand (token type) that enforces memory ordering. The chain is a linked sequence of token-typed SDValues threading through all memory operations in program order.
Chain creation. The builder maintains a "current chain" (PendingChain) that is updated after every memory operation. When a load or store is emitted, the current chain becomes its chain input, and the node's token result becomes the new current chain.
TokenFactor merging. When multiple independent memory operations can be reordered (e.g., independent loads), the builder creates a TokenFactor (opcode 2/55 depending on context) node that merges multiple chains into one:
// sub_F429C0: merge node creation
TokenFactor = getNode(ISD::TokenFactor, dl, MVT::Other, chains[])
Chain handling utilities in the builder:
- sub_20993A0 (11KB) -- chain/token helper for load/store sequences
- sub_2098400 -- chain token node creator
- sub_20989A0 -- memory scheduling chain builder
- sub_F6C1B0 (16KB) -- chain management in combining, uses sub_B46970 (isTokenFactor)
Glue (flag) chains. Certain node pairs must be scheduled adjacently (e.g., CopyToReg + CALL). These use a "glue" value type (MVT::Glue) as an additional operand/result. The call lowering in sub_3040BF0 threads glue through the entire call sequence: CallSeqBegin -> DeclareParam* -> Store* -> CallProto -> CallStart -> LoadRetParam* -> CallSeqEnd.
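The PendingChain discipline described above can be sketched with plain tuples standing in for SDNodes (names and shapes are illustrative, not the recovered representation):

```python
# Each memory op takes the current chain as input; its token result
# becomes the new current chain, enforcing program order.
def emit_memop(kind: str, pending_chain):
    node = (kind, pending_chain)   # chain input = current chain
    return node, node              # token result = new current chain

def token_factor(chains):
    # Merge independent chains, as the builder does via ISD::TokenFactor.
    return ("TokenFactor", tuple(chains))

entry = ("EntryToken",)
ld, chain = emit_memop("load", entry)
st, chain = emit_memop("store", chain)
assert st[1] is ld                 # store is ordered after the load
merged = token_factor([ld, st])
assert merged[0] == "TokenFactor"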
Per-Node Analysis Structure
During DAG construction, sub_163D530 creates per-node analysis objects (accessed via v381) with the following layout:
| Offset | Size | Field |
|---|---|---|
| +8 | 8B | array_ptr |
| +16 | 4B | array_count |
| +24 | 4B | array_capacity |
| +72 | 8B | set.Buckets |
| +80 | 4B | set.NumItems |
| +84 | 4B | set.NumTombstones |
| +88 | 4B | set.NumBuckets |
Operations: sub_163BE40(v381, ptr) inserts into the +8 array; sub_163BBF0(context, key) looks up the analysis structure for a node in the context's DenseMap.
CSE (Common Subexpression Elimination) Hash Table
The getNode() family of functions deduplicates SDNodes via a CSE hash table. The primary implementation is sub_F4CEE0 (41KB):
sub_F4CEE0(SelectionDAG *DAG, unsigned Opcode, SDVTList VTs, SDValue *Ops, unsigned NumOps):
// 1. Compute profile hash via sub_F4B360 (SDNode::Profile)
// Hash combines: opcode, VTs, all operand node pointers
// 2. Lookup in CSE hash table:
// hash = ((profile >> 4) ^ (profile >> 9)) & (capacity - 1)
// Quadratic probing: step 1, 2, 3, ...
// Sentinels: -4096 (empty), -8192 (tombstone)
// 3. If found: return existing node
// 4. If not found:
// Allocate via sub_BD2C40 (bump allocator)
// Initialize via sub_B44260 (SDNode constructor)
// Insert into hash table
// Add to AllNodes list (global sentinel: qword_4F81430)
// Return new node
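The probe sequence above (profile hash, quadratic probing, -4096/-8192 sentinels) can be sketched as follows. Integers stand in for SDNode pointers, and the tombstone handling is simplified to "keep probing":

```python
# Sentinel values observed in the binary's CSE table.
EMPTY, TOMBSTONE = -4096, -8192

def cse_find_or_insert(table: list[int], profile: int) -> tuple[int, bool]:
    cap = len(table)                                   # power of two
    idx = ((profile >> 4) ^ (profile >> 9)) & (cap - 1)
    step = 1
    while True:
        slot = table[idx]
        if slot == profile:                            # hit: reuse node
            return idx, False
        if slot == EMPTY:                              # miss: insert
            table[idx] = profile
            return idx, True
        idx = (idx + step) & (cap - 1)                 # quadratic: 1, 2, 3...
        step += 1

table = [EMPTY] * 8
i1, inserted = cse_find_or_insert(table, 0x12345)
assert inserted
i2, inserted = cse_find_or_insert(table, 0x12345)
assert not inserted and i1 == i2                       # deduplicated
```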
Node builder variants handle different operand counts:
- sub_F49030 (38KB) -- complex node construction with operand/result type setup
- sub_F429C0 (34KB) -- merge/TokenFactor/indexed node creation
- sub_F44160 (22KB) -- CSE rebuild after modification
- sub_F40FD0 (16KB) -- node construction with chain initialization
The AllNodes list (qword_4F81430) is a doubly-linked intrusive list of all SDNodes in the current DAG, used for iteration during combining and legalization passes.
NVPTX-Specific Node Types (NVPTXISD)
NVPTX target-specific ISD opcodes begin at ISD::BUILTIN_OP_END = 0x1DC9 (confirmed by sub_2095B00 delegation threshold for getTargetNodeName()). In the decompiled code, target opcodes are referenced by small integers (the NVPTXISD enum value minus BUILTIN_OP_END). The following table consolidates all NVPTXISD opcodes discovered across sub_3040BF0, sub_32E3060, sub_33B0210, and the legalization infrastructure:
Call ABI Nodes
| Opcode | Name | Operands | Description |
|---|---|---|---|
| 315 | CallSeqBegin | chain, seqId, frameSize | Mark start of call frame |
| 316 | CallSeqEnd_Outer | chain, ... | Outer call-sequence-end wrapper |
| 505 | DeclareParam | chain, align, idx, size | Declare .param (byval/aggregate) |
| 506 | DeclareScalarParam | chain, align, idx, size | Declare .param (scalar, widened) |
| 507 | DeclareRetParam | chain, ... | Declare .param for return (byval callee) |
| 508 | DeclareRetScalarParam | chain, ... | Declare .param for return (scalar callee) |
| 510 | CallDirect | chain, callee, ... | Direct call (callee not extern) |
| 511 | CallDirectNoProto | chain, callee, ... | Direct call without prototype |
| 512 | CallIndirect | chain, ptr, ... | Indirect call via function pointer |
| 513 | CallIndirectNoProto | chain, ptr, ... | Indirect call without prototype |
| 514 | CallStart | chain, ... | Actual call instruction emission |
| 515 | LoadRetParam | chain, offset | Load return value from .param (not last) |
| 516 | LoadRetParamLast | chain, offset | Load last return value from .param |
| 517 | CallSeqEnd | chain, seqId, ... | End of call sequence (inner chain) |
| 518 | CallProto | chain, paramCount | Declare call prototype (.callprototype) |
| 521 | DeclareRetParam_Ext | chain, ... | Declare .param for return (extended path) |
| 527 | StoreCalleeRetAddr | chain, ... | Store callee return address in .param |
| 528 | StoreRetValToParam | chain, ... | Store return value to .param (return path) |
Memory / Vector Nodes
| Opcode | Name | Operands | Description |
|---|---|---|---|
| 568 | LoadV1 | chain, ptr, offset | Load 1-element from .param (scalar return) |
| 569 | LoadV2 | chain, ptr, offset | Load 2-element vector from .param |
| 570 | LoadV4 | chain, ptr, offset | Load 4-element vector from .param |
| 571 | StoreV1 | chain, val, ptr, offset | Store 1-element to .param (st.param) |
| 572 | StoreV2 | chain, val, ptr, offset | Store 2-element vector to .param |
| 573 | StoreV4 | chain, val, ptr, offset | Store 4-element vector to .param |
Math / Rounding-Mode Nodes
| Opcode | Name | Description |
|---|---|---|
| 245 | ADD_RM | Add, round toward -inf |
| 246 | SQRT_RP | Sqrt, round toward +inf |
| 248 | SQRT_RZ | Sqrt, round toward zero |
| 249 | ADD_RZ | Add, round toward zero |
| 250 | DIV_RZ | Div, round toward zero |
| 251 | MUL_RN | Mul, round to nearest |
| 252 | ADD_RN | Add, round to nearest |
| 253 | FMA_RN | FMA, round to nearest |
| 254 | SQRT_RM | Sqrt, round toward -inf |
| 255 | MUL_RZ | Mul, round toward zero |
| 256 | DIV_RM | Div, round toward -inf |
| 267 | FMA_RZ | FMA, round toward zero |
| 268 | DIV_RN | Div, round to nearest |
| 269 | DIV_RP | Div, round toward +inf |
| 270 | ADD_RP | Add, round toward +inf |
| 271 | FMA_RM | FMA, round toward -inf |
| 272 | MUL_RP | Mul, round toward +inf |
| 273 | FMA_RP | FMA, round toward +inf |
| 274 | MUL_RM | Mul, round toward -inf |
Address Space / Miscellaneous Nodes
| Opcode | Name | Description |
|---|---|---|
| 22 | TargetAddr | Target address computation |
| 24 | Wrapper | Global address wrapping |
| 149 | ATOMIC_LOAD | Atomic load with scope |
| 152 | SELECT_CC | Ternary select on condition code |
| 154 | SQRT_RN | Sqrt, round to nearest |
| 189 | MoveParam | Read thread index / special register |
| 193--196 | MIN/MAX | Integer min/max variants |
| 197 | CTPOP | Population count |
| 198--204 | ConstPool* | Constant pool variants by size |
| 208 | CMPXCHG | Compare-and-exchange atomic |
| 230 | DeclareLocal | Declare local .param / address of param |
| 233--234 | AddrSpaceCast | Bidirectional address space cast pair |
| 287--290 | Barrier/Fence | Memory barrier/fence variants |
| 310 | Annotation | Annotation metadata node |
| 321 | StackRestore | Restore stack pointer |
| 322 | StackAlloc | Dynamic stack allocation |
| 330 | FunctionAddr | Function address |
| 335 | BinaryArith | Generic binary arithmetic |
| 371 | DynAreaOffset | Dynamic alloca offset |
| 499 | ConditionalBranch | Conditional branch with chain |
Atomic Opcodes (from sub_20BED60)
| Opcode Range | Operation | Widths |
|---|---|---|
| 294--297 | atom.add | f32/f64/i32/i64 |
| 302--305 | atom.min | s32/s64/u32/u64 |
| 314--317 | atom.max | s32/s64/u32/u64 |
| 462 | atom.cas | generic |
DAG Legalization Flow
After the initial DAG is built, three legalization phases transform it into a form the NVPTX backend can select:
Phase 1: Type Legalization (sub_20019C0, 348KB)
The DAGTypeLegalizer iterates to fixpoint. For each node, it reads the result/operand types and checks the legality table at TLI + 259 * VT + opcode + 2422. If illegal, it applies one of: promote, expand, soften, scalarize, or split-vector. The worklist iterates until no node has an illegal type.
NVPTX legal vector types are extremely limited (only v2f16, v2bf16, v2i16, v4i8 -- all packing into 32-bit registers via Int32HalfRegs). This means virtually all LLVM-IR vector operations pass through the split/scalarize paths.
Type legalization workers:
- sub_201E5F0 (81KB) -- promote/expand secondary dispatch (441 case labels, 6 switches)
- sub_201BB90 (75KB) -- ExpandIntegerResult (632 case labels)
- sub_2029C10 -- SplitVectorResult dispatcher (reads opcode at node+24)
- sub_202E5A0 -- SplitVectorOperand dispatcher
- sub_2036110 -- ScalarizeVectorResult
- sub_2035F80 -- ScalarizeVectorOperand
Phase 2: Operation Legalization (sub_1FFB890, 169KB)
After types are legal, the operation legalizer checks whether each operation at its now-legal type is supported. The action lookup:
action = *(uint8_t*)(TLI + 259*VT + opcode + 2422)
Actions dispatch through a five-way switch:
| Action | Code | Behavior |
|---|---|---|
| Legal | 0 | Return immediately |
| Custom | 1 | Call TLI->LowerOperation() via vtable slot #164 (offset 1312) |
| Expand | 2 | Try sub_20019C0 (LegalizeTypes), then sub_1FF6F70 (ExpandNode) |
| LibCall | 3 | Call sub_1FF6F70 directly |
| Promote | 4 | Find next legal type, rebuild at promoted type |
Custom lowering invokes NVPTXTargetLowering::LowerOperation() (sub_32E3060, 111KB) through the vtable. This is where all NVPTX-specific operation lowering happens: BUILD_VECTOR splat detection, VECTOR_SHUFFLE three-level lowering, EXTRACT_VECTOR_ELT three-path dispatch, and the .param-space calling convention.
Additional action tables:
- Second table at TLI + opcode + 2681 -- for BSWAP/CTLZ/CTTZ/BITREVERSE (opcodes 43--45, 199)
- Third table at TLI + opcode + 3976 -- for FSINCOS (opcode 211)
- Fourth table at TLI + 18112 -- packed nibble format for FP_TO_SINT/FP_TO_UINT/SELECT_CC, indexed by (VT_id >> 3) + 15 * condcode_type
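The primary action lookup and five-way dispatch reduce to a flat byte-table index. A sketch under the recovered formula (the table here is a synthetic stand-in for the TLI object's embedded array):

```python
# Action codes from the five-way switch described above.
LEGAL, CUSTOM, EXPAND, LIBCALL, PROMOTE = range(5)

def get_action(tli: bytes, vt: int, opcode: int) -> int:
    # Primary table: action = *(uint8_t*)(TLI + 259*VT + opcode + 2422)
    return tli[259 * vt + opcode + 2422]

tli = bytearray(32768)                    # synthetic TLI image, all Legal
vt, opcode = 7, 100
tli[259 * vt + opcode + 2422] = CUSTOM
assert get_action(tli, vt, opcode) == CUSTOM   # would call LowerOperation()
assert get_action(tli, vt, opcode + 1) == LEGAL
```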
Phase 3: DAG Combining (Three Passes)
DAG combining runs after each legalization phase. The orchestrator (sub_F681E0, 65KB) manages a worklist of SDNodes and calls the per-node visitor (sub_F20C20, 64KB) for each. The visitor implements a six-phase combine algorithm:
1. Opcode-specific combine via sub_100E380 -- target-independent pattern matching
2. Known-bits narrowing -- for constants, calls sub_11A3F30 (computeKnownBits/SimplifyDemandedBits) and narrows if fewer bits demanded
3. Operand type-narrowing loop -- walks all operands, promotes/truncates to legal types, creates SIGN_EXTEND/TRUNCATE casts
4. All-constant-operand fold -- 4x-unrolled check via sub_1028510 (ConstantFold)
5. Division-by-constant strength reduction -- shift+mask replacement for power-of-2 divisors
6. Vector stride / reassociation -- sub_F15770 (shift-fold), sub_F17ED0 (stride patterns)
NVPTX-specific combines run as a post-legalize pass:
- sub_33C0CA0 (62KB) -- PerformDAGCombine, the NVPTX target hook
- sub_32EC4F0 (92KB) -- post-legalize combine
- sub_3425710 (142KB) -- the NVIDIA DAGCombiner with internal "COVERED"/"INCLUDED" debug tracing strings (not present in upstream LLVM)
The worklist uses the same DenseMap infrastructure as the builder context, with the hash at DAG+2072 (capacity at DAG+2088, count at DAG+2080). Node replacement goes through sub_F162A0 (CombineTo/ReplaceAllUsesWith), which walks the use-list, hashes each user into the worklist map, then calls sub_BD84D0 for the actual use-chain splice.
Bump Allocator
The builder context uses a slab-based bump allocator identical to the one used for NVVM IR nodes:
- Slab growth: 4096 << (slab_index >> 7) -- exponential, capped at 4TB.
- Alignment: 8 bytes.
- No per-node free: entire slabs are released when the DAG is destroyed.
- Overflow: allocates a new slab via malloc().
Since every base SDNode is exactly 104 bytes (13 qwords), a single 4096-byte initial slab holds approximately 39 nodes before overflow triggers slab growth. Extended node types (ConstantSDNode, MemSDNode) may be larger and are allocated via separate paths:
- `sub_BD2C40` -- standard SDNode allocation (bump allocator)
- `sub_BD2DA0` -- SDNode allocation variant (80 bytes, for lightweight nodes)
- `sub_22077B0` -- `operator new[]` (128 bytes, for MemSDNode with chain/alignment fields)
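The slab-growth and bump-allocation rules above can be sketched directly; the structure names here are illustrative, not recovered symbols:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Slab N is 4096 << (N >> 7) bytes: slabs double every 128 slabs,
 * capped at 4 TB. */
static size_t slab_size(unsigned slab_index) {
    size_t sz = (size_t)4096 << (slab_index >> 7);
    const size_t cap = (size_t)1 << 42; /* 4 TB */
    return sz > cap ? cap : sz;
}

typedef struct {
    uint8_t *base;
    size_t used, cap;
} Slab;

/* Bump-allocate n bytes at 8-byte alignment; NULL means the caller
 * must grow into a new slab. */
static void *bump_alloc(Slab *s, size_t n) {
    size_t aligned = (n + 7) & ~(size_t)7;
    if (s->used + aligned > s->cap) return NULL;
    void *p = s->base + s->used;
    s->used += aligned;
    return p;
}
```

With 104-byte base SDNodes, a fresh 4096-byte slab yields exactly 39 allocations before the first overflow, matching the estimate above.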
Basic Block Iteration
The builder iterates over the function's basic blocks via a linked list rooted at a2 + 72 (the function parameter). Each list node embeds the data pointer at offset -24 from the node:
bb_data = node_ptr - 24
Within each basic block, instructions are iterated via an inner list:
- Inner list sentinel at `bb_data + 40`
- Inner list head at `bb_data + 48`
This matches the LLVM ilist intrusive linked list pattern where the list hook is embedded at a fixed offset within the contained object.
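The `bb_data = node_ptr - 24` arithmetic is the standard container-of idiom for intrusive lists, sketched below with illustrative types (the real 24-byte prefix holds unrelated fields):

```c
#include <assert.h>
#include <stddef.h>

/* The list hook is embedded 24 bytes into the containing object, so the
 * object is recovered by subtracting the hook's offset. */
typedef struct ListHook {
    struct ListHook *next, *prev;
} ListHook;

typedef struct MiniBB {
    long pad[3];    /* 24 bytes of fields preceding the hook */
    ListHook hook;  /* embedded at offset 24 */
    int id;
} MiniBB;

static MiniBB *bb_from_hook(ListHook *h) {
    return (MiniBB *)((char *)h - offsetof(MiniBB, hook));
}
```

This is why the decompiled iteration reads a list node pointer and immediately applies a fixed negative displacement.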
Differences from Upstream LLVM
| Area | NVIDIA (cicc v13.0) | Upstream LLVM 20.0 |
|---|---|---|
| Type legalizer structure | Single 348KB monolithic function (sub_20019C0) | Split across 4 files (LegalizeIntegerTypes.cpp, etc.) |
| NVIDIA DAGCombiner | 142KB sub_3425710 with "COVERED"/"INCLUDED" internal tracing | No equivalent; target combines via PerformDAGCombine hook only |
| computeKnownBits | 114KB sub_33D4EF0, covers 112+ ISD opcodes including NVPTX target nodes | ~30 opcodes in generic computeKnownBits, target extends via hook |
| Inline asm | 162KB total (sub_2079C70 + sub_338BA40) | ~200 lines per target |
| Intrinsic lowering | 343KB switch covering 200+ intrinsic IDs up to 14196 | ~300 standard intrinsic IDs |
| Address spaces | AS 101 (param alt), AS 7 (.param), CTA/GPU/SYS scope atomics | No AS 101; no scope atomics |
| Libcall metadata | "nvptx-libcall-callee" metadata for custom libcall routing | Not present |
| Legal vector types | Only v2f16, v2bf16, v2i16, v4i8 (packed into 32-bit registers) | Varies by target; typically much wider vectors |
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| SelectionDAG builder context init | sub_163D530 | 73KB | Allocator, DenseMaps, BB iteration |
| SelectionDAGBuilder::visit | sub_2081F00 | 267KB | IR-to-DAG main lowering |
| SelectionDAGBuilder visit dispatch | sub_2065D30 | 25KB | Per-instruction routing |
| visitCall | sub_208CF60 | 56KB | Call lowering into DAG |
| visitLoad | sub_209B000 | 15KB | Load chain emission |
| visitStore | sub_2090780 | 14KB | Store alignment/chain |
| visitSwitch/Br | sub_20912B0 | 18KB | Control flow lowering |
| visitPHI | sub_20920A0 | 13KB | PHI node handling |
| visitGEP | sub_209FCA0 | 13KB | Address computation |
| visitInlineAsm | sub_2079C70 | 83KB | Inline asm constraint parsing |
| visitNVVMTexSurf | sub_2077400 | 20KB | NVIDIA tex/surf handle lowering |
| NVPTX argument coercion | sub_2072590 | 38KB | CUDA kernel argument lowering |
| getNode / CSE hash table | sub_F4CEE0 | 41KB | Node deduplication |
| SelectionDAG node builder | sub_F49030 | 38KB | Complex node construction |
| Merge/TokenFactor creation | sub_F429C0 | 34KB | Chain merging, indexed nodes |
| DAG combiner orchestrator | sub_F681E0 | 65KB | Worklist management |
| DAG combiner visitor | sub_F20C20 | 64KB | Per-node combine algorithm |
| combine() opcode dispatch | sub_100E380 | -- | Target-independent combines |
| CombineTo / RAUW | sub_F162A0 | -- | Use-chain replacement + worklist push |
| SDNode allocation | sub_BD2C40 | -- | Bump allocator |
| SDNode constructor | sub_B44260 | -- | Initialization |
| SDUse add to use list | sub_B43C20 | -- | Use-chain linkage |
| SDUse remove from use list | sub_B43D60 | -- | Use-chain unlinkage |
| ReplaceAllUsesWith | sub_BD84D0 | -- | Raw use-chain splice |
| transferDbgValues | sub_BD6B90 | -- | Debug info transfer |
| setOperand | sub_B91C10 | -- | Operand mutation |
| replaceOperand | sub_B99FD0 | -- | Single operand swap |
| DAGTypeLegalizer::run | sub_20019C0 | 348KB | Type legalization master dispatch |
| LegalizeOp | sub_1FFB890 | 169KB | Operation legalization |
| ExpandNode | sub_1FF6F70 | -- | Full node expansion fallback |
| NVPTXTargetLowering::LowerOperation | sub_32E3060 | 111KB | NVPTX custom operation lowering |
| NVPTXTargetLowering::LowerCall | sub_3040BF0 | 88KB | .param calling convention |
| Intrinsic lowering switch | sub_33B0210 | 343KB | 200+ CUDA intrinsic IDs |
| PerformDAGCombine (NVPTX) | sub_33C0CA0 | 62KB | Post-legalize NVPTX combines |
| NVIDIA DAGCombiner | sub_3425710 | 142KB | NVIDIA-specific combine engine |
| computeKnownBits (NVPTX) | sub_33D4EF0 | 114KB | 112-opcode known-bits transfer |
| ISel::Select driver | sub_3090F90 | 91KB | Pattern matching entry |
| getOperationName | sub_2095B00 | 35KB | ISD opcode -> string mapping |
Cross-References
- SelectionDAG & Instruction Selection -- pipeline overview, NVPTX lowering, combine detail
- Type Legalization -- 348KB type legalizer deep-dive
- ISel Patterns -- instruction selection pattern database
- Register Classes -- NVPTX register class constraints
- Address Spaces -- address space encoding
- Hash Infrastructure -- universal DenseMap documentation
- IR Node Structure -- NVVM IR node layout (pre-SelectionDAG)
- Pattern Database -- ISel pattern constraint classes
DenseMap, Symbol Table, and EDG Frontend Structures
The EDG 6.6 frontend, layered on LLVM's DenseMap infrastructure, maintains its own declaration nodes, type nodes, and scope stack for C/C++/CUDA semantic analysis. This page documents the EDG-level structures that ride on top of the DenseMap. For the DenseMap implementation itself -- layout, hash function, probing, sentinel values, and growth policy -- see Hash Table and Collection Infrastructure.
The EDG symbol tables in this subsystem use the NVVM-layer sentinel pair (-8 / -16) and the pointer hash (ptr >> 9) ^ (ptr >> 4). See the sentinel reference table for other subsystems.
EDG Declaration Node Layout
The EDG 6.6 frontend represents every C/C++ declaration as a variable-length structure. The canonical declaration node layout was recovered from the top-level declarator parser sub_662DE0 and the declaration-specifier resolver sub_7C0F00.
Declaration Node (a_decl_node) -- 456+ bytes
| Offset | Size | Type | Field | Evidence |
|---|---|---|---|---|
| +0 | 8B | ptr | decl_id / entity pointer | *v31 in sub_662DE0 |
| +8 | 8B | uint64_t | decl_flags bitfield (see below) | v31[1] |
| +16 | 8B | uint64_t | decl_extra_flags | v31[2] |
| +24 | 16B | -- | name / identifier info | v31[3..4] |
| +40 | 8B | -- | name string for "main" check | strcmp target |
| +72 | 4B | uint32_t | saved_specifier_word1 | v239 in sub_662DE0 |
| +76 | 2B | uint16_t | saved_specifier_word2 | v240 |
| +80 | 1B | uint8_t | entity_kind (for scope dispatch) | checked in sub_860B80 |
| +120 | 1B | uint8_t | accessibility (bits 0-6, bit 7 reserved) | v241 = *(a1+120) & 0x7F |
| +124 | 1B | uint8_t | context_flags_124 | bit 5=explicit_spec, bit 6=class_member |
| +125 | 1B | uint8_t | context_flags_125 | bit 5=was_friend, bit 6=in_class_body, bit 7=template_decl_head |
| +126 | 1B | uint8_t | state_flags (see below) | mask tests throughout sub_662DE0 |
| +127 | 1B | uint8_t | extra_state | bit 0=class_scope_pushed, bit 1=needs_deferred_parse |
| +128 | 8B | ptr | entity_ptr / scope pointer | compared early in sub_739430 |
| +130 | 1B | uint8_t | modifier_flags | bit 5=deferred_parse, bit 6=virtual_specifier |
| +131 | 1B | uint8_t | inline/constexpr flag | bit 4 |
| +132 | 1B | uint8_t | needs_semicolon_check | bit 1 |
| +140 | 1B | uint8_t | type_kind (for type_def nodes) | switch discriminant in sub_766570 case 6 |
| +160 | 8B | ptr | underlying_type (for typedef) | typedef unwrap chain |
| +168 | 8B | ptr | flags_ptr | bit 3 checked for fn-pointer |
| +173 | 1B | uint8_t | elaborate_kind | primary switch in sub_739430 |
| +176 | var | -- | elaborate_sub_kind / secondary | sub-switch in case 12 |
| +184 | 8B | ptr | parm_list | v31[23] via sub_5CC190(1) |
| +224 | 4B | uint32_t | init_kind | bit 0 = brace-init |
| +256 | 4B | uint32_t | additional_flags | |
| +268 | 1B | uint8_t | decl_kind_enum | 0=variable, 4=function, 6=namespace |
| +269 | 1B | uint8_t | storage_class_kind | 0=none, 1=extern, 2=static |
| +272 | 8B | ptr | decl_type | v31[34] |
| +280 | 8B | ptr | result_type | v31[35] |
| +288 | 8B | ptr | entity_type / return_type | v31[36] |
| +304 | 8B | ptr | template_info | |
| +352 | 8B | ptr | body_ptr | v31[44] |
| +360 | 8B | ptr | scope_or_context | v31[45] |
| +368 | 8B | ptr | forward_decl_chain | linked list |
| +416 | 8B | ptr | pending_list | v31[52] |
| +456 | 8B | ptr | extra_entity | v31[57] |
decl_flags (+8) Bit Definitions
| Bit | Mask | Meaning |
|---|---|---|
| 0 | 0x1 | is_definition / linkage related |
| 1 | 0x2 | has_initializer / needs init check |
| 4 | 0x10 | is_typedef |
| 5 | 0x20 | is_template_decl / friend declaration |
| 6 | 0x40 | is_inline |
| 7 | 0x80 | is_extern |
| 14 | 0x4000 | structured_binding / decomposition decl |
state_flags (+126) Bit Definitions
| Bit | Mask | Meaning |
|---|---|---|
| 0 | 0x1 | has_saved_tokens |
| 1 | 0x2 | abstract_declarator_mode |
| 2 | 0x4 | has_leading_attributes |
| 3 | 0x8 | no_declarator_needed (typedef etc.) |
| 4 | 0x10 | suppress_error_recovery |
| 5 | 0x20 | in_declarator_parsing (set on entry) |
| 6 | 0x40 | in_multi_declarator_loop |
| 7 | 0x80 | scope_pushed |
entity_kind (+80) Dispatch Values
Used by sub_860B80 and sub_7C0F00 phase 3:
| Value | Entity Kind |
|---|---|
| 3 | class |
| 4 | enum (variant A) |
| 5 | enum (variant B) |
| 6 | namespace |
| 10 | function |
| 11 | variable |
| 16 | typedef |
| 17 | template |
| 19 | class template |
| 22 | dependent name |
| 23 | using-declaration |
| 24 | injected-class-name |
Declaration Node Allocation
sub_84DCB0 allocates 152-byte declaration entries from a free-list at qword_4D03C68, with fallback to the global allocator sub_823970(152). The full node size table at qword_4B6D500 provides per-tag sizes for all 87 IL node types; the declaration tag (6) indexes into this table for memcpy during template instantiation.
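A minimal sketch of that free-list-with-fallback pattern (names here are illustrative; the real free-list head lives at `qword_4D03C68` and the fallback is `sub_823970(152)`):

```c
#include <assert.h>
#include <stdlib.h>

#define DECL_NODE_SIZE 152

/* A freed entry's first 8 bytes are reused as the free-list link. */
typedef struct FreeEntry { struct FreeEntry *next; } FreeEntry;

static FreeEntry *g_decl_free_list = NULL;

static void *alloc_decl_node(void) {
    if (g_decl_free_list) {
        FreeEntry *e = g_decl_free_list;
        g_decl_free_list = e->next;   /* pop the head */
        return e;
    }
    return malloc(DECL_NODE_SIZE);    /* fallback to the general allocator */
}

static void free_decl_node(void *p) {
    FreeEntry *e = (FreeEntry *)p;    /* recycle: push onto the free list */
    e->next = g_decl_free_list;
    g_decl_free_list = e;
}
```

The pattern makes allocation of hot, fixed-size parser nodes O(1) with no header overhead, at the cost of never returning memory to the system until process exit.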
EDG Type Node Layout
Type nodes are the central representation for C/C++ types throughout the EDG frontend. Two distinct layouts exist: the IL-level type node used by the tree walker (sub_7506E0) and the semantic type node used by the type comparison engine (sub_7386E0). The type translation system (sub_91AED0) bridges between these and LLVM types.
IL-Level Type Node (from sub_7506E0 tree walker)
The IL tree walker addresses fields as a1[N] (8-byte indexed), with byte-level sub-kind tags at specific offsets:
| Offset | Size | Type | Field | Evidence |
|---|---|---|---|---|
| -16 | 8B | ptr | parent_ptr / owner | shared-node check path |
| -8 | 1B | uint8_t | flags_byte | bit 0=shared, bit 2=visit-mark |
| +0..+N*8 | var | ptr[] | child pointers (typed per kind) | a1[0]..a1[N] |
| +24 | 1B | uint8_t | expression sub-kind (case 13) | switch discriminant |
| +28 | 1B | uint8_t | scope sub-kind (case 23) | 18 sub-kinds |
| +40 | 1B | uint8_t | declaration sub-kind (case 21) | 25 sub-kinds |
| +48 | 1B | uint8_t | template_arg sub-kind (case 30) | 9 sub-kinds |
| +140 | 1B | uint8_t | type_def_sub_kind (case 6) | 17 sub-kinds |
| +161 | 1B | uint8_t | type_def_flags | |
| +168-177 | var | -- | type sub-kind / sub-sub-kind | |
| +173 | 1B | uint8_t | type_main_kind (case 2) | 14 sub-kinds by +173 |
| +176 | 1B | uint8_t | type_sub_sub_kind | case 6 elaborated |
Semantic Type Node (from sub_7386E0 comparison engine)
| Offset | Size | Type | Field | Evidence |
|---|---|---|---|---|
| +0 | 8B | ptr | associated_decl | *v10 == *v14 comparison |
| +24 | 1B | uint8_t | type_kind (0..37) | primary switch discriminant |
| +25 | 1B | uint8_t | cv_qualifiers | bits 0-1 = const/volatile, bit 6 = restrict |
| +26 | 1B | uint8_t | type_flags_1 | bit 2 compared |
| +27 | 1B | uint8_t | type_flags_2 | bit 1 compared (case 1) |
| +56 | 8B | -- | type_payload / sub_kind | case 1: char at +56 = base_type_kind |
| +58 | 1B | uint8_t | type_extra_flags | case 1: bits 0x3A compared |
| +64 | 8B | -- | varies per kind | case 30: word at +64 |
| +72 | 8B | ptr | type_child / pointer | case 1 integer path |
| +80 | 8B | ptr | linkage_chain | case 33: namespace list +80 = next |
EDG-to-LLVM Type Translation Node (from sub_918E50)
The type translation system reads a third view of the type node with offsets optimized for LLVM type construction:
| Offset | Size | Type | Field | Evidence |
|---|---|---|---|---|
| -72 | 8B | ptr | grandparent_type | nested lookups |
| -48 | 8B | ptr | parent_type_A | |
| -24 | 8B | ptr | parent_type_B / first child | |
| -8 | 8B | ptr | indirect_child_array | if flag 0x40 at +23 |
| +0 | 8B | ptr | llvm_type_descriptor | *node -> LLVM type info |
| +8 | 8B | ptr | member_chain_head | linked list of class members |
| +16 | 1B | uint8_t | type_kind | see kind table below |
| +18 | 2B | uint16_t | qualifier_word | bits 0-14: qualifier ID, bit 15: negation |
| +20 | 4B | uint32_t | child_count | low 28 bits masked & 0xFFFFFFF |
| +23 | 1B | uint8_t | flags | bit 6 (0x40) = indirect children |
| +24 | 8B | -- | type-specific data | varies by kind |
| +32 | 8B | -- | bitwidth | enum/integer types |
| +33 | 1B | uint8_t | additional_flags | bit 5 (0x20) = special treatment |
| +36 | 4B | uint32_t | sub_kind_discriminator | nested types |
| +40 | 8B | ptr | scope_linkage_ptr | |
| +48 | 8B | ptr | member_list_head | linked list |
type_kind Enumeration (semantic type comparison)
The full enumeration recovered from sub_7386E0:
| Value | Name | Comparison Strategy |
|---|---|---|
| 0 | tk_none / void | trivially equal |
| 1 | tk_fundamental | sub_kind + base type + class scope |
| 2 | tk_pointer | delegate to sub_739430 on pointee |
| 3 | tk_class | scope identity, unique_id, template args |
| 4 | tk_enum | scope identity, unique_id |
| 5 | tk_function | sub_73A280 pair compare |
| 6 | tk_bitfield | width + base compare |
| 7 | tk_member_pointer | multi-field descriptor |
| 8 | tk_reference | referent descriptor |
| 10 | tk_array | element type recursion |
| 11 | tk_qualified | child + qualifier bit |
| 12 | tk_elaborated | sub_kind switch (typedef/class/enum) |
| 13 | tk_pack_expansion | sub_kind switch |
| 14 | tk_typeof_expr | sub_kind switch |
| 15 | tk_decltype | sub_kind switch |
| 16 | tk_nullptr | trivially equal |
| 17 | tk_auto | identity on entity |
| 18 | tk_function_alt | sub_73A280 |
| 20 | tk_dependent_name | scope identity + unique_id |
| 22 | tk_unresolved | sub_8D97D0 decl compare |
| 23 | tk_attributed | attribute kind + child |
| 24 | tk_decltype_auto | identity on entity |
| 25 | tk_paren | child list compare |
| 26 | tk_adjusted | child type recursion |
| 27 | tk_typeof_decl | resolve decl -> type, recurse |
| 30 | tk_complex | element type(s) recursion |
| 32 | tk_template_template_param | identity + template args |
| 33 | tk_using_decl | child list + base class hash table |
| 34 | tk_atomic | child + qualifier bit |
| 35 | tk_vla | element type recursion |
| 37 | tk_concept_constraint | identity on entity |
EDG-to-LLVM Type Kind Encoding (byte at node+16)
| Value | Hex | Kind |
|---|---|---|
| 0-16 | 0x00-0x10 | Primitive / scalar types |
| 17 | 0x11 | Void (special) |
| 5 | 0x05 | Qualified type (const/volatile/restrict) |
| 13 | 0x0D | Enum type |
| 14 | 0x0E | Function type |
| 26 | 0x1A | Array type (subscript form) |
| 27 | 0x1B | Compound type (struct/union/class) |
| 50 | 0x32 | Union variant A |
| 51 | 0x33 | Union variant B |
| 54 | 0x36 | Typedef / using declaration |
| 55 | 0x37 | Using declaration variant |
| 75 | 0x4B | Pointer type |
| 76 | 0x4C | Reference type (lvalue or rvalue) |
| 77 | 0x4D | Member pointer type |
| 78 | 0x4E | Dependent / nested type |
Qualifier Word Values (node+18 & 0x7FFF)
| Value | CUDA Memory Space |
|---|---|
| 1 | Address space 1 (global memory) |
| 9 | Address space 9 (generic, gated by sub_5F3280) |
| 14 | Function / method qualifier |
| 26 | Array subscript context A |
| 27 | Array subscript context B |
| 32 | Address space 32 (shared memory) |
| 33 | Address space 33 (constant memory) |
Type Canonicalization -- sub_72EC50
Before any type comparison, both sides are canonicalized by stripping non-template typedef aliases:
fn edg_canonicalize_type(type) -> type:
while type.type_kind == 2: // tk_elaborated
scope = type.payload_at_56
if scope.elaborate_kind != 12: // not typedef_name
break
if scope.elaborate_sub_kind != 1: // not single-member typedef
break
if scope.class_flags & 0x10: // has template specialization
break
type = sub_72E9A0(type) // unwrap one layer
return type
This peels through chains like typedef int MyInt; typedef MyInt YourInt; down to the fundamental type. Template specialization aliases are never unwrapped.
EDG Scope Stack
The scope stack is a global array of 776-byte entries, indexed by a scope depth counter. It represents the C++ scope nesting at parse time (file scope -> namespace -> class -> function -> block).
Global State
| Address | Type | Name | Purpose |
|---|---|---|---|
| qword_4F04C68 | ptr | Scope stack base | heap-allocated array of 776B entries |
| dword_4F04C64 | int32_t | Current scope index | top of the scope stack |
| dword_4F04C5C | int32_t | Previous scope index | saved parent index |
| dword_4F04C44 | int32_t | Namespace scope index | deepest enclosing namespace |
| dword_4F04C34 | int32_t | Class scope index | deepest enclosing class |
| dword_4F04C40 | int32_t | Another scope index | auxiliary scope tracking |
| dword_4F04C3C | int32_t | Module linkage flag | C++20 module scope state |
| unk_4F04C48 | int32_t | Parent scope check | used by using-declaration handler |
Scope Stack Entry Layout (776 bytes)
Each entry at qword_4F04C68[0] + 776 * index:
| Offset | Size | Type | Field |
|---|---|---|---|
| +0 | 4B | uint32_t | scope_id |
| +4 | 2B | uint16_t | scope_kind (see table below) |
| +6 | 1B | uint8_t | flags_a |
| +7 | 1B | uint8_t | flags_b |
| +8 | 1B | uint8_t | flags_c |
| +9 | 1B | uint8_t | flags_d |
| +10 | 1B | uint8_t | flags_e |
| +24 | 8B | ptr | name_list_head |
| +32 | 8B | ptr | name_list_tail |
| +208 | 8B | ptr | class_type_ptr |
| +232 | 8B | ptr | deferred_list |
| +328 | 8B | ptr | template_info |
| +552 | 4B | int32_t | parent_scope_index |
| +624 | 8B | ptr | declaration_ptr |
| +680 | 8B | -- | field used by sub_7C0F00 |
| +688 | 4B | uint32_t | entity_number_counter (for mangling) |
| +696 | 4B | uint32_t | entity_number_counter_2 |
scope_kind Values
| Value | Scope Kind |
|---|---|
| 5 | namespace |
| 6 | class |
| 7 | function |
| 8 | block (compound statement) |
| 9 | enum |
| 12 | template parameter |
Push / Pop Operations
sub_854590(0) // push_scope -- increments dword_4F04C64, initializes new entry
sub_854430() // pop_scope -- decrements dword_4F04C64, restores parent
sub_854AB0(...) // pop_declarator_scope (context-specific cleanup)
sub_854B40() // push_declarator_scope (declarator-specific init)
The scope depth counter at qword_4F061C8 + 64 is bumped independently for declarator nesting depth tracking. Class scope depth lives at qword_4F061C8 + 81.
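The push/pop discipline and the parent-index chain can be sketched together; field names and sizes below are illustrative stand-ins for the real 776-byte entries, and `find_enclosing` models the outward walk that `sub_868D90` performs via the `parent_scope_index` field at +552:

```c
#include <assert.h>

#define MAX_SCOPES 64

typedef struct {
    int scope_id;
    int scope_kind;          /* 5=namespace, 6=class, 7=function, 8=block */
    int parent_scope_index;  /* mirrors the field at entry+552 */
} ScopeEntry;

static ScopeEntry g_scopes[MAX_SCOPES];
static int g_current = -1;   /* mirrors dword_4F04C64 */

static int push_scope(int kind) {
    int parent = g_current;
    g_current++;
    g_scopes[g_current].scope_id = g_current;
    g_scopes[g_current].scope_kind = kind;
    g_scopes[g_current].parent_scope_index = parent;
    return g_current;
}

static void pop_scope(void) {
    g_current = g_scopes[g_current].parent_scope_index;
}

/* Walk outward until a scope of the requested kind is found, -1 if none. */
static int find_enclosing(int kind) {
    for (int i = g_current; i >= 0; i = g_scopes[i].parent_scope_index)
        if (g_scopes[i].scope_kind == kind) return i;
    return -1;
}
```

The cached indices (`dword_4F04C44` for the deepest namespace, `dword_4F04C34` for the deepest class) exist precisely so hot paths can skip this linear walk.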
Scope Chain Traversal Algorithm
The declaration-specifier resolver sub_7C0F00 performs scope chain traversal to resolve qualified names. The algorithm was recovered from Phase 4 (lines 1197-1600) of that function.
Unqualified Name Lookup
fn lookup_unqualified(name, scope_index) -> entity:
// Phase 2 of sub_7C0F00
// Try each lookup strategy in priority order:
result = sub_7D5DD0(name) // unqualified lookup in current scope
if result:
return result
result = sub_7D2AC0(name, flags) // lookup with specific flags
if result:
return result
result = sub_7ACA80(name) // ADL / ambiguity resolution
return result
Qualified Name Lookup (A::B::C)
The scope iteration loop at LABEL_282/283/285/288 walks the scope chain:
fn lookup_qualified(base_entity, remaining_name) -> entity:
current = base_entity
while true:
// Check if "::" follows the current entity
if current_token != TK_SCOPE_RESOLUTION: // token 37
return current
consume_token() // sub_7B8B50
// Classify the current entity
kind = current.entity_kind // byte at +80
switch kind:
case 6: // namespace
result = sub_7D4A40(current, remaining_name) // namespace lookup
case 3: // class
case 19: // class template
result = sub_7D2AC0(current, remaining_name, MEMBER_FLAG)
case 17: // template
result = sub_830940(current, remaining_name) // class template lookup
default:
result = sub_7D4600(current, remaining_name) // generic qualified lookup
if !result:
// Error: member not found in scope
sub_6851C0(error_code, context)
return null
current = result
// Check member access, visibility, redeclaration
sub_8841F0(current, scope_entry) // access check for C++ members
Self-Recursive Qualified Resolution
When the declaration-specifier resolver encounters a :: after resolving a name, it recurses into itself at sub_7C0F00(20, a2) where flags=20 decodes as:
- bit 2 (0x04) = nested declarator sub-parse context
- bit 4 (0x10) = restrict parse to type-specifiers only
This handles arbitrarily deep qualified names like A::B::C::D. Recursion depth is bounded by the nesting depth of the qualified name.
Scope Chain Walking for Declaration Resolution
sub_868D90 (ADL / instantiation lookup) walks the scope chain upward:
fn walk_scope_chain(start_index) -> entity:
index = start_index
while index >= 0:
entry = scope_table_base + 776 * index
// Check if this scope contains the target declaration
// ... name lookup within the scope's name list ...
// Move to parent scope
index = entry.parent_scope_index // at offset +552
Type Comparison Engine
sub_7386E0 implements structural type comparison for the EDG frontend. It performs a parallel tree walk over two type nodes, comparing them field-by-field with mode-dependent strictness.
Calling Convention
sub_7386E0(packed_pair: __int128, flags: int) -> bool
packed_pair.low = type_A pointer
packed_pair.high = type_B pointer
flags bits:
0-1: cv_compare_mode (0=strict, 1=relaxed, 2=overload)
2: template_matching_mode
5: anonymous_class_structural_compare
Comparison Algorithm
fn compare_types(type_A, type_B, flags) -> bool:
// 1. Null handling
if both null: return true
if either null: return false
// 2. Canonicalize (strip non-template typedefs)
type_A = sub_72EC50(type_A)
type_B = sub_72EC50(type_B)
// 3. Quick-reject on header bytes
if type_A.type_kind != type_B.type_kind: return false
if (type_A.cv_quals ^ type_B.cv_quals) & 0x43: return false // const/volatile/restrict
if (type_A.flags_1 ^ type_B.flags_1) & 0x04: return false
// 4. Type-specific structural comparison
switch type_A.type_kind:
case 3 (class):
if type_A.scope == type_B.scope: return true // identity shortcut
if unique_id_enabled:
if scope_A.unique_id == scope_B.unique_id: return true
if template_mode:
return sub_89BAF0(...) // template arg list compare
if anonymous_mode && both_anonymous:
return sub_739430(member_list_A, member_list_B)
case 7 (member_pointer):
// Compare: flags, class ptr, scope ptr, return type,
// params, exception spec -- 6 sub-comparisons
case 33 (using_decl) in overload mode:
// Hash table lookup at qword_4D03BF8 for base class lists
// Element-by-element comparison of 24-byte triples
// ... 35 other cases ...
// 5. Post-switch: declaration pointer compare
if type_A.decl != type_B.decl:
if !sub_8D97D0(type_A.decl, type_B.decl): return false
return true
Helper Functions
| Address | Name | Purpose |
|---|---|---|
| sub_7386E0 | edg_compare_type_nodes | Top-level structural compare |
| sub_739370 | edg_compare_type_lists | Linked-list comparator (next at +16) |
| sub_739430 | edg_compare_decl_types | Declaration-level comparator (661 lines) |
| sub_73A280 | edg_compare_type_pair_triv | Trivial wrapper: null=equal |
| sub_72EC50 | edg_canonicalize_type | Strip typedef / elaborated aliases |
| sub_8D97D0 | edg_compare_decl_identity | Name/entity identity comparison |
| sub_8C7520 | edg_class_same_template | Same primary class template check |
| sub_89AB40 | edg_compare_template_args | Template argument list comparison |
| sub_89BAF0 | edg_compare_template_arg_lists_full | Full template context compare |
Key Global: dword_4F07588 -- unique_id optimization
When set, enables O(1) identity comparison via the unique_id field at scope+32. This avoids recursive structural comparison for named classes and enums. The field is compared as a non-null integer; matching non-null values prove the two types refer to the same entity.
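The shortcut can be sketched as a guard in front of the structural walk; all names are illustrative, and the stand-in deep comparison here only exists to show when the fallback fires:

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    unsigned long long unique_id; /* 0 = not assigned; real field at scope+32 */
} MiniScope;

static bool deep_compare_called;

/* Stand-in for the recursive structural walk. */
static bool deep_structural_compare(const MiniScope *a, const MiniScope *b) {
    deep_compare_called = true;
    return a == b;
}

static bool scopes_equal(const MiniScope *a, const MiniScope *b,
                         bool unique_id_enabled) {
    if (a == b) return true;
    if (unique_id_enabled && a->unique_id && b->unique_id)
        return a->unique_id == b->unique_id; /* O(1) identity path */
    return deep_structural_compare(a, b);    /* fallback */
}
```

Note the double non-null check: a zero id on either side forces the structural comparison, matching the "compared as a non-null integer" behavior above.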
IL Tree Walker and Copier
Tree Walker -- sub_7506E0 (190KB, 7283 lines)
The generic IL tree walker visits every node in the EDG intermediate representation. It dispatches on 83 node kinds (1-86 with gaps at 24-26) using a massive switch statement.
Callback table at .bss 0x4F08014..0x4F08040:
| Address | Type | Callback | Call Sites |
|---|---|---|---|
| dword_4F08014 | bool | skip_shared_nodes | flag |
| dword_4F08018 | bool | clear_back_pointers | 49 sites |
| qword_4F08020 | fn(node, kind) -> node | list_node_rewrite_fn | 206 sites |
| qword_4F08028 | fn(node, kind) -> node | child_rewrite_fn | 926 sites |
| qword_4F08030 | fn(node, kind) -> bool | pre_visit_fn | 2 sites |
| qword_4F08038 | fn(str, kind, len) | string_visitor_fn | 80 sites |
| qword_4F08040 | fn(node, kind) | post_visit_fn | 14 sites |
Visit-mark protocol: Each node has a flag byte at node[-8]. Bit 2 tracks "visited in current pass" with polarity toggled per walk pass via dword_4D03B64. This avoids clearing visited marks between walks.
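The polarity trick can be sketched as follows: instead of clearing bit 2 on every node between walks, the walker flips which value of the bit means "visited" (the role `dword_4D03B64` plays above). Names below are illustrative.

```c
#include <assert.h>
#include <stdbool.h>

#define VISIT_BIT 0x4  /* bit 2 of the flag byte at node[-8] */

static unsigned g_pass_polarity = 0; /* 0 or VISIT_BIT */

/* Flipping the polarity instantly turns every existing mark into
 * "unvisited" -- no O(n) clearing pass between walks. */
static void begin_new_pass(void) {
    g_pass_polarity ^= VISIT_BIT;
}

/* Returns true (and marks the node) on first visit this pass. */
static bool try_visit(unsigned char *flags) {
    if ((*flags & VISIT_BIT) == g_pass_polarity)
        return false;                          /* already seen this pass */
    *flags = (unsigned char)((*flags & ~VISIT_BIT) | g_pass_polarity);
    return true;
}
```

The cost is one global toggle per pass instead of a full sweep over 60+ node lists.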
Linked-list traversal pattern (60+ lists walked):
for cursor = node.field; cursor; cursor = cursor.next:
if list_node_rewrite_fn:
cursor = list_node_rewrite_fn(cursor, child_kind)
if cursor:
walk_il_node(cursor, child_kind)
cursor = node.field // re-read (rewrite may have changed it)
Next-pointer stride varies by node kind: +0, +16, +24, +32, +56, +112, +120 bytes.
Tree Copier -- sub_766570 (148KB, 5187 lines)
The copier is driven by template instantiation (sub_8C5CD0 -> sub_8C4EC0 -> sub_8C2C50 -> sub_766570). It uses the walker's callback infrastructure:
- `sub_8C38E0` = copy_ref callback: resolves pending copy destinations
- `sub_8C3810` = copy_scope callback: resolves scope-level copies
- Node sizes from `qword_4B6D500[tag]` (87+ entries, one per IL node type)
Copy protocol using flag bits at node[-8]:
| Bits | Meaning |
|---|---|
| 0x1 | needs copy, not yet started |
| 0x2 | copy in progress |
| 0x3 | pending copy (both bits) |
| 0x4 | copy destination allocated |
Copy destination stored at *(node - 24). When both bits 0 and 1 are set, sub_8C3650 forces the copy by allocating qword_4B6D500[tag] bytes and performing memcpy followed by pointer rewriting.
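The forced-copy path can be sketched as a single predicate plus allocate-and-memcpy; the size-table values below are invented for illustration (the real table is `qword_4B6D500`), and pointer rewriting is elided:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum {
    COPY_NEEDED      = 0x1, /* needs copy, not yet started */
    COPY_IN_PROGRESS = 0x2, /* copy in progress */
    COPY_PENDING     = 0x3, /* both bits set: pending copy */
};

/* Illustrative per-tag size table; real entries come from qword_4B6D500. */
static const size_t node_size_by_tag[] = { 0, 64, 88, 104, 120, 152, 456 };

/* Models sub_8C3650: when the flag byte shows the pending state,
 * allocate the per-tag size and memcpy the original node. */
static void *force_copy_if_pending(const void *node, unsigned char flags,
                                   unsigned tag) {
    if ((flags & COPY_PENDING) != COPY_PENDING)
        return NULL; /* not in the pending state */
    size_t sz = node_size_by_tag[tag];
    void *dest = malloc(sz);
    memcpy(dest, node, sz); /* pointer rewriting would follow here */
    return dest;
}
```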
EDG-to-LLVM Type Translation System
Entry: sub_91AED0 -> sub_91AB30. Uses a worklist-driven fixed-point iteration.
Translation Context Object (at a1+160)
| Offset | Size | Field |
|---|---|---|
| +0x000 | 8B | debug_logger |
| +0x008 | 8B | pass_list_ptr |
| +0x038 | 8B | edg_node_map (DenseMap: EDG -> LLVM values) |
| +0x058 | 8B | visited_set (DenseSet for dedup) |
| +0x060 | 4B | visited_count |
| +0x064 | 4B | visited_capacity |
| +0x068 | 4B | bucket_count |
| +0x090 | 8B | type_cache (DenseMap: EDG type -> LLVM Type*) |
| +0x168 | 4B | threshold |
| +0x2A0 | 8B | pending_replacements |
| +0x2A8 | 4B | pending_count |
Fixed-Point Algorithm
fn translate_all_types(ctx, module):
// Phase 1: iterate module members
for member in module.member_list:
sub_AA3700(member) // gather initial flags
// Phase 2: fixed-point iteration
do:
ordering = sub_919CD0(module) // topological sort (10-level BFS)
for type in ordering.reverse():
sub_913880(ctx, type) // invalidate stale cache entries
for type in ordering.reverse():
changed |= sub_9197C0(ctx, type) // process single declaration
while changed
// Phase 3: optional late fixup (byte_3C35480-gated)
if optimization_enabled:
do:
changed = sub_917E30(ctx)
while changed
// Phase 4: cleanup
sub_909590(ctx)
Bitmask for Scope-Tracking Types
The expression 0x100000100003FF >> (kind - 25) selects which type kinds in the range [25..78] require scope tracking during translation. This covers compound types, pointer types, and dependent types that carry CUDA address-space qualifiers.
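Evaluating the constant directly shows which kind values the mask selects: its set bits are 0-9, 28, and 52, i.e. kinds 25-34, 53, and 77 after the `kind - 25` shift. A minimal checker:

```c
#include <assert.h>
#include <stdint.h>

/* A kind in [25..78] needs scope tracking when its bit survives the
 * (kind - 25) shift of the recovered constant. */
static int needs_scope_tracking(int kind) {
    if (kind < 25 || kind > 78) return 0;
    return (int)((0x100000100003FFULL >> (kind - 25)) & 1);
}
```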
Usage Across the Compiler
DenseMap instances appear at these known locations:
- NVVM context object: 8+ tables for IR node uniquing (opcodes 0x10..0x1F), plus sub-function tables for opcodes 0x04..0x15.
- SelectionDAG builder context: Map A (+120), Map B (+152), Set C (+184) for node deduplication and worklist.
- Per-node analysis: embedded DenseSet at +72 inside analysis structures created during DAG construction.
- Instruction constraint table: the global `word_3F3E6C0` array is a flat table rather than a DenseMap, but the constraint emission functions use DenseMaps for lookup caching.
- EDG type translation: 5 distinct caches -- visited set, type cache, type-value map, scope table, and type index table.
- Base class comparison: `qword_4D03BF8` hash table for overload-resolution base class triple lookup.
The consistency of the hash function, sentinel values, and growth policy across all instances is documented in Hash Table and Collection Infrastructure.
Cross-References
- IR Node Layout -- NVVM IR node structure and operand access
- DAG Node -- SelectionDAG builder that consumes DenseMap instances
- Pattern Database -- instruction selection patterns indexed by DenseMap
- Address Spaces -- CUDA memory space qualifier values
- Hash Infrastructure -- comprehensive DenseMap documentation
- EDG Frontend -- EDG tokenizer and keyword dispatch
- IRGen Types -- EDG-to-LLVM type translation detail
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| edg_parse_declarator | sub_662DE0 | -- | Top-level declarator parser |
| edg_parse_decl_specifiers_core | sub_672A20 | -- | While/switch token dispatcher |
| edg_resolve_decl_specifiers | sub_7C0F00 | -- | Scope chain + qualified name resolver |
| edg_compare_type_nodes | sub_7386E0 | -- | Structural type tree comparison |
| edg_compare_type_lists | sub_739370 | -- | Linked-list type comparator |
| edg_compare_decl_types | sub_739430 | -- | Declaration-level type comparator |
| edg_canonicalize_type | sub_72EC50 | -- | Typedef / elaborated alias stripper |
| edg_type_to_string | sub_74A390 | -- | Type-to-string for diagnostics |
| edg_walk_il_node | sub_7506E0 | -- | 190KB IL tree walker (297 recursive calls) |
| edg_copy_il_node | sub_766570 | -- | 148KB IL tree copier |
| edg_push_scope | sub_854590 | -- | Push scope stack entry |
| edg_pop_scope | sub_854430 | -- | Pop scope stack entry |
| edg_emit_scope_chain | sub_82BDA0 | -- | Scope chain emission |
| edg_unqualified_lookup | sub_7D5DD0 | -- | Unqualified name lookup |
| edg_qualified_lookup | sub_7D4600 | -- | Qualified name lookup (after ::) |
| edg_lookup_with_flags | sub_7D2AC0 | -- | Lookup with specific mode flags |
| edg_namespace_lookup | sub_7D4A40 | -- | Lookup in namespace scope |
| edg_compare_decl_identity | sub_8D97D0 | -- | Entity identity comparison |
| edg_type_translation_entry | sub_91AED0 | -- | Top-level EDG-to-LLVM type translation |
| edg_type_translation_driver | sub_91AB30 | -- | Fixed-point iteration driver |
| edg_type_kind_dispatch | sub_918E50 | -- | Type-kind dispatch for translation |
| edg_type_pair_compare | sub_911D10 | -- | Core type-pair comparison + replacement |
| edg_alloc_decl_node | sub_84DCB0 | -- | 152-byte declaration node allocator |
NVVM Container Binary Format
The NVVM container is a proprietary binary envelope that wraps LLVM bitcode with compiler metadata for transport between pipeline stages in cicc v13.0. It carries target architecture, optimization options, fast-math flags, memory window configurations, per-kernel resource tables, and the IR payload itself -- all in a single serializable blob. Two serialization paths exist: a compact binary wire format used in production (nvcc / ptxas pipelines) and an XML-based format used for debugging and interchange. This page specifies the binary format in sufficient detail to write a conformant parser and serializer.
The format is implemented across 26 functions in the 0xCCBB10--0xCDD2D0 address range (Cluster C in the binary layout). The six top-level entry points:
| Function | Address | Size | Role |
|---|---|---|---|
| NvvmContainer_serialize | 0xCDD2D0 | 47,540 B | Binary + XML serializer |
| NvvmContainer_deserialize_options | 0xCD1D80 | 51,859 B | Binary tag/value decoder |
| NvvmContainer_parse_header | 0xCDCA30 | 10,206 B | XML path header parser |
| NvvmContainer_check_versions | 0xCD41B0 | 16,708 B | Version compatibility gate |
| NvvmContainer_validate_versions | 0xCCD5F0 | 8,987 B | Standalone version validator |
| NvvmContainer_init_options_struct | 0xCCBB10 | small | Zero-init 248-byte container struct |
Supporting parsers called from NvvmOptions_parse_compile_options (0xCDB4D0, 26,643 bytes):
| Function | Address | Size | Role |
|---|---|---|---|
| NvvmOptions_parse_arch_enum | 0xCD09E0 | 14,516 B | ArchVariant enum string-to-int |
| NvvmOptions_parse_fast_math | 0xCCF590 | 12,771 B | FastMathOptions sub-structure |
| NvvmOptions_parse_multi_view | 0xCD6D20 | 12,188 B | MultiViewOptions sub-structure |
| NvvmOptions_parse_cb_reserved_area | 0xCCE780 | 9,802 B | CB reserved area config |
| NvvmOptions_parse_reg_targets | 0xCD7CE0 | 9,542 B | Register target config |
| NvvmOptions_parse_serialize_helper | 0xCD58A0 | 9,579 B | Option serialization helper |
| NvvmOptions_parse_shader_const_iface | 0xCCEEA0 | 8,355 B | ShaderConstIface (DCI) |
| NvvmOptions_parse_align_entries | 0xCD8610 | 6,739 B | Alignment entry config |
| NvvmOptions_parse_pgo_section | 0xCD02C0 | 5,482 B | PGO configuration |
| NvvmOptions_parse_section | 0xCD5510 | 5,166 B | Nested YAML section parser |
| NvvmOptions_parse_memory_windows | 0xCCE100 | 5,042 B | Memory window config |
| NvvmOptions_parse_cbank_config | 0xCCE4B0 | 4,173 B | Constant bank config |
| NvvmOptions_parse_bool_or_int | 0xCCC4A0 | small | Boolean/int option parser |
| NvvmOptions_parse_tristate | 0xCCCFB0 | small | Tri-state option parser |
| NvvmOptions_parse_string | 0xCD5150 | small | String option parser |
The finalizer knobs parser (0xCD9990, 31,702 bytes) is called separately to ingest the full set of NVIDIA-specific backend knobs (see NVVMPassOptions).
Binary-level helpers:
| Function | Address | Role |
|---|---|---|
| NvvmContainer_write_tag_value | 0xCD17A0 | Write one tag/value pair (called 121 times from serializer) |
| NvvmContainer_write_blob | 0xCD1AB0 | Write blob data + tag reference |
| NvvmContainer_compute_crc | 0xCCD2B0 | CRC with seeds 0x8DF5D74C, 0xBAA56A96 |
Global state: qword_4F87148 holds the NVVM options global state pointer, checked by many downstream consumers.
Binary Header
Every binary container begins with a fixed 24-byte header. The header is self-describing: HeaderSize at offset 0x0E stores its own length (always 24), and two size fields partition the remainder into a scalar tag region and a blob data region.
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Magic (0x7F4E5C7D) | 0x00
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Ver.Major | Ver.Minor | NvvmIR.Major | NvvmIR.Minor | 0x04
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| NvvmDbg.Major | NvvmDbg.Minor | Llvm.Major | Llvm.Minor | 0x08
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| IRLevel (u16) | HeaderSize (u16) | 0x0C
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| ScalarFieldsEnd (u32) | 0x10
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| BlobDataEnd (u32) | 0x14
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
struct NvvmContainerBinaryHeader {
uint32_t magic; /* 0x00: must be 0x7F4E5C7D */
uint8_t version_major; /* 0x04: container format major (1) */
uint8_t version_minor; /* 0x05: container format minor (<=0x41) */
uint8_t nvvm_ir_major; /* 0x06: NVVM IR version major (2) */
uint8_t nvvm_ir_minor; /* 0x07: NVVM IR version minor (<=0x62) */
uint8_t nvvm_debug_major; /* 0x08: debug info version major (3) */
uint8_t nvvm_debug_minor; /* 0x09: debug info version minor (<=2) */
uint8_t llvm_major; /* 0x0A: LLVM version (see encoding) */
uint8_t llvm_minor; /* 0x0B: LLVM version (see encoding) */
uint16_t ir_level; /* 0x0C: IRLevel enum */
uint16_t header_size; /* 0x0E: always 24 (0x0018) */
uint32_t scalar_fields_end; /* 0x10: byte offset past scalar region */
uint32_t blob_data_end; /* 0x14: byte offset past blob region */
};
The three data regions in order:
[0 .. 24) -- Header (fixed)
[24 .. scalar_fields_end) -- Scalar tag/value pairs
[scalar_fields_end .. blob_data_end) -- Blob data region
The total container size is blob_data_end bytes. After the blob data region, the IR payload (LLVM bitcode, optionally compressed) follows immediately.
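A conformant header reader can be sketched directly from the layout above. This is an illustrative parser written for this page (`nvvm_parse_header` and `NvvmHeader` are not symbols recovered from the binary); it checks only the invariants stated here: the magic, the self-described 24-byte size, and region nesting.

```c
#include <stdint.h>
#include <stddef.h>

#define NVVM_CONTAINER_MAGIC 0x7F4E5C7Du
#define NVVM_HEADER_SIZE     24u

typedef struct {
    uint32_t magic;
    uint8_t  version_major, version_minor;
    uint8_t  nvvm_ir_major, nvvm_ir_minor;
    uint8_t  nvvm_debug_major, nvvm_debug_minor;
    uint8_t  llvm_major, llvm_minor;
    uint16_t ir_level;
    uint16_t header_size;
    uint32_t scalar_fields_end;
    uint32_t blob_data_end;
} NvvmHeader;

/* Parse the fixed 24-byte header from a little-endian byte buffer.
   Returns 0 on success, -1 on malformed input. */
static int nvvm_parse_header(const uint8_t *buf, size_t len, NvvmHeader *h)
{
    if (len < NVVM_HEADER_SIZE) return -1;
    h->magic = (uint32_t)buf[0] | (uint32_t)buf[1] << 8 |
               (uint32_t)buf[2] << 16 | (uint32_t)buf[3] << 24;
    if (h->magic != NVVM_CONTAINER_MAGIC) return -1;
    h->version_major    = buf[4];  h->version_minor    = buf[5];
    h->nvvm_ir_major    = buf[6];  h->nvvm_ir_minor    = buf[7];
    h->nvvm_debug_major = buf[8];  h->nvvm_debug_minor = buf[9];
    h->llvm_major = buf[10];       h->llvm_minor = buf[11];
    h->ir_level    = (uint16_t)(buf[12] | buf[13] << 8);
    h->header_size = (uint16_t)(buf[14] | buf[15] << 8);
    if (h->header_size != NVVM_HEADER_SIZE) return -1;   /* self-describing */
    h->scalar_fields_end = (uint32_t)buf[16] | (uint32_t)buf[17] << 8 |
                           (uint32_t)buf[18] << 16 | (uint32_t)buf[19] << 24;
    h->blob_data_end = (uint32_t)buf[20] | (uint32_t)buf[21] << 8 |
                       (uint32_t)buf[22] << 16 | (uint32_t)buf[23] << 24;
    /* The regions must nest: header <= scalar end <= blob end. */
    if (h->scalar_fields_end < NVVM_HEADER_SIZE ||
        h->blob_data_end < h->scalar_fields_end) return -1;
    return 0;
}
```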
LLVM Version Encoding
The llvm_major and llvm_minor bytes encode the LLVM version as a combined integer: llvm_major * 100 + llvm_minor. For cicc v13.0 (LLVM 20), this yields 20 * 100 + 0 = 2000. The version check compares the combined value, not the individual bytes.
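The combined-value check can be expressed as a one-liner (the helper name is this page's own):

```c
#include <stdint.h>

/* Combine llvm_major/llvm_minor as stored in the header into the value
   the compatibility check compares: major * 100 + minor. */
static uint32_t nvvm_llvm_version(uint8_t llvm_major, uint8_t llvm_minor)
{
    return (uint32_t)llvm_major * 100u + llvm_minor;
}
```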
IRLevel Enum
| Value | Name | Meaning |
|---|---|---|
| 0 | NVVM_IR_LEVEL_UNIFIED_AFTER_DCI | Default: IR after Device-Code-Interface unification |
| 1 | NVVM_IR_LEVEL_LTO | Link-Time Optimization IR (partially optimized) |
| 2 | NVVM_IR_LEVEL_OPTIX | OptiX pipeline IR |
Scalar Tag/Value Encoding
Immediately after the 24-byte header, a sequence of (tag, value) pairs encodes every container field that differs from its default value. The encoding is a variable-length scheme optimized for small values:
Case 1 -- value fits in 16 bits (0x0000..0xFFFE):
[tag : int16] [value : int16] -- 4 bytes total
Case 2 -- value needs 32 bits:
[tag : int16] [0xFFFF : int16] [value : int32] -- 8 bytes total
Terminator:
[0x0000 : int16] -- tag 0 ends the sequence
All multi-byte fields are little-endian. The sentinel value 0xFFFF in the value slot signals that a full 32-bit value follows. This means the maximum encodable 16-bit value is 0xFFFE (65534); values of exactly 0xFFFF or larger require the extended form.
The serializer (sub_CD17A0, called 121 times from NvvmContainer_serialize) writes each tag/value pair using this scheme. The deserializer enters a switch loop over tags 1--402, decoding each value and writing it to the appropriate offset in the deserialized container struct.
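The wire scheme round-trips as follows. These helpers are illustrative stand-ins for sub_CD17A0 and the deserializer's read loop, not recovered code:

```c
#include <stdint.h>
#include <stddef.h>

/* Append one (tag, value) pair in the variable-length scheme.
   Returns the number of bytes written (4 or 8). */
static size_t nvvm_write_tag_value(uint8_t *out, uint16_t tag, uint32_t value)
{
    out[0] = (uint8_t)tag; out[1] = (uint8_t)(tag >> 8);
    if (value < 0xFFFFu) {                     /* Case 1: fits in 16 bits */
        out[2] = (uint8_t)value; out[3] = (uint8_t)(value >> 8);
        return 4;
    }
    out[2] = 0xFF; out[3] = 0xFF;              /* Case 2: sentinel + int32 */
    out[4] = (uint8_t)value;         out[5] = (uint8_t)(value >> 8);
    out[6] = (uint8_t)(value >> 16); out[7] = (uint8_t)(value >> 24);
    return 8;
}

/* Decode one pair; returns bytes consumed, or 0 on the tag-0 terminator. */
static size_t nvvm_read_tag_value(const uint8_t *in, uint16_t *tag,
                                  uint32_t *value)
{
    *tag = (uint16_t)(in[0] | in[1] << 8);
    if (*tag == 0) return 0;                   /* end of scalar region */
    uint16_t v16 = (uint16_t)(in[2] | in[3] << 8);
    if (v16 != 0xFFFF) { *value = v16; return 4; }
    *value = (uint32_t)in[4] | (uint32_t)in[5] << 8 |
             (uint32_t)in[6] << 16 | (uint32_t)in[7] << 24;
    return 8;
}
```

Note that a value of exactly 0xFFFF must take the 8-byte form, since 0xFFFF in the value slot is the extension sentinel.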
Delta Encoding Strategy
The serializer allocates a default-initialized 440-byte Options struct and compares each field in the current Options against the corresponding default. Only fields that differ from the default are written as tag/value pairs. This makes typical containers very compact -- a standard compilation targeting SM 89 with -O2 might emit fewer than 20 tag/value pairs, covering just SmMajor, SmMinor, CompileMode, and a handful of target-specific flags.
The deserializer reverses this: it allocates a default Options struct first, then overwrites individual fields as tags are encountered. Unknown tags are silently skipped, which is the mechanism that provides forward compatibility -- a newer serializer can emit tags that an older deserializer simply ignores.
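The delta strategy can be sketched as a table-driven compare-and-emit loop. `MiniOptions` below is a hypothetical three-field stand-in for the real 440-byte struct, and the tag numbers are the real tags 1--3:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical miniature of the Options struct, covering tags 1-3. */
typedef struct { uint32_t sm_major, sm_minor, num_regs; } MiniOptions;

typedef struct { uint16_t tag; size_t offset; } FieldDesc;

static const FieldDesc kFields[] = {
    { 1, offsetof(MiniOptions, sm_major) },
    { 2, offsetof(MiniOptions, sm_minor) },
    { 3, offsetof(MiniOptions, num_regs) },
};

/* Emit a (tag, value) pair only where the current value differs from the
   default. Returns the number of pairs emitted; out[] holds them as plain
   uint32 pairs, ignoring the wire encoding for brevity. */
static int delta_serialize(const MiniOptions *cur, const MiniOptions *def,
                           uint32_t out[][2])
{
    int n = 0;
    for (size_t i = 0; i < sizeof kFields / sizeof kFields[0]; i++) {
        uint32_t c, d;
        memcpy(&c, (const uint8_t *)cur + kFields[i].offset, 4);
        memcpy(&d, (const uint8_t *)def + kFields[i].offset, 4);
        if (c != d) { out[n][0] = kFields[i].tag; out[n][1] = c; n++; }
    }
    return n;
}
```

An SM 89 compile with a default register budget would emit only the SmMajor and SmMinor pairs.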
Blob Data Region
Tags in the 200+ and 400+ ranges reference variable-length data stored in the blob region. The scalar value for a blob tag is the byte offset into the blob region where the data begins. The blob region starts at scalar_fields_end bytes from the container start.
To resolve a blob reference: blob_ptr = container_base + scalar_fields_end + offset_value.
Blob entries do not carry explicit length fields in the tag/value stream. The deserializer knows each blob type's expected size from the tag ID (e.g., tag 201 is always 24 bytes, tag 203 is always 40 bytes). Variable-length blobs like strings (tags 209, 210, 213, 216, 217) are null-terminated. Length-prefixed blobs (tag 218) carry a 4-byte length prefix.
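Blob resolution, with and without a bounds check against the blob region end, can be sketched as (helper names are this page's own):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Resolve a blob reference: the scalar value of a blob tag is an offset
   into the blob region, which begins at scalar_fields_end. */
static const uint8_t *nvvm_blob_ptr(const uint8_t *container_base,
                                    uint32_t scalar_fields_end,
                                    uint32_t offset_value)
{
    return container_base + scalar_fields_end + offset_value;
}

/* Bounds-checked variant for fixed-size blobs (size known from the tag ID,
   e.g. 24 for tag 201): NULL if the blob would run past the blob region. */
static const uint8_t *nvvm_blob_checked(const uint8_t *base,
                                        uint32_t scalar_end, uint32_t blob_end,
                                        uint32_t offset, uint32_t size)
{
    if (scalar_end + offset + size > blob_end) return NULL;
    return base + scalar_end + offset;
}
```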
Complete Tag Table
144 distinct tag IDs organized into six ranges, plus the standalone compression tag 99 (documented below). The "Offset" column refers to the byte position within the deserialized 440-byte Options struct.
Range 1--39: Core Scalar Options
| Tag | Type | Name | Options Offset | Notes |
|---|---|---|---|---|
| 1 | int32 | SmMajor | +0 (ArchVariant) | SM major version (e.g., 8 for SM 89) |
| 2 | int32 | SmMinor | +0 (ArchVariant) | SM minor version (e.g., 9 for SM 89) |
| 3 | int32 | NumRegs | +216 | Register count hint |
| 4 | int32 | NumBarriers | +220 | Barrier count |
| 5 | int32 | SharedMemorySize | +224 | Shared memory size in bytes |
| 6 | int32 | VertexMode | +72 | See VertexMode enum |
| 7 | bit | ReserveLocalAddressZero | +20 bit 0 | Reserve address 0 in local memory |
| 8 | bit | FastMath.IgnoreInf | +200 bit 0 | Treat infinities as NaN |
| 9 | bit | FastMath.IgnoreNaN | +200 bit 1 | Assume no NaN values present |
| 10 | bit | FastMath.IgnoreSignedZero | +200 bit 2 | Ignore sign of zero |
| 11 | bit | FastMath.ReorderFloat | +200 bit 3 | Allow float reordering |
| 12 | bit | FastMath.ReorderHalf | +200 bit 4 | Allow half-precision reordering |
| 13 | bit | FastMath.Ftz | +200 bit 5 | Flush denormals to zero |
| 14 | bit | FastMath.FastSqrt | +200 bit 6 | Use fast sqrt approximation |
| 15 | bit | FastMath.Fmad | +200 bit 7 | Allow fused multiply-add |
| 16 | bit | FastMath.AllowRcpRsqToSqrt | +201 bit 0 | Allow rcp(rsqrt(x)) to sqrt(x) |
| 17 | bit | FastMath.CanReorderFloatDistribute | +201 bit 1 | Allow distributive reordering |
| 18 | int32 | FastMath.Reserved | +204 | Reserved fast-math field |
| 19 | int32 | MaxRRegsAllowed | +216 | Maximum registers per thread (primary) |
| 20 | int32 | SchedRegTarget | +220 | Scheduling register pressure target |
| 21 | int32 | UnrollControl | +224 | Unroll factor control |
| 22 | bool | AcceleratedArch | +232 | True for sm_XXa variants |
| 23 | bool | StdELF | +233 | Use standard ELF output format |
| 24 | int32 | MaxRRegsAllowed2 | +216 | Secondary max-regs (override) |
| 25 | int32 | SchedRegTarget2 | +220 | Secondary sched target |
| 26 | bit | FastMath.ReassociateFloatAddOverMad | +201 bit 2 | Float add reassociation over MAD |
| 27 | bit | ForceImmediateConstants | +20 bit 1 | Force immediate constant loading |
| 28 | bit | HideFunctions | +20 bit 2 | Hide internal functions from output |
| 29 | bit | UseDX10AddressInRange | +20 bit 3 | DX10 address range mode |
| 30 | int32 | UnrollControl2 | +224 | Secondary unroll control |
| 31 | bit | FastMath.NoFloatMAD | +201 bit 3 | Disable float MAD formation |
| 32 | bool | AcceleratedArch2 | +232 | Secondary accelerated-arch flag |
| 33 | bit | FastMath.LaxFP16ApproximateDivision | +201 bit 4 | Lax FP16 approximate division |
| 34 | bool | StdELF2 | +233 | Secondary StdELF |
| 35 | int32 | ShaderCodegenSelMask | +236 | Shader codegen selection bitmask |
| 36 | bool | OmegaPtxErrorHandling | +240 | Enable Omega-style PTX error handling |
| 37 | int32 | FDLInsertMode | +244 | See FDLInsertMode enum |
| 38 | bit | IsPIC | +20 bit 4 | Position-independent code flag |
| 39 | bit | NoSpillsConstraint | +20 bit 5 | Hard constraint: no register spills |
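The deserializer's tag switch maps each decoded pair onto the offsets in this table. An illustrative subset (the function name is this page's own; `opts` stands in for the real 440-byte struct, and the offsets are taken from the rows above):

```c
#include <stdint.h>
#include <string.h>

/* Apply one decoded (tag, value) pair to the Options struct bytes. */
static void apply_scalar_tag(uint8_t *opts, uint16_t tag, uint32_t value)
{
    switch (tag) {
    case 3:  memcpy(opts + 216, &value, 4); break;   /* NumRegs -> +216 */
    case 6:  memcpy(opts + 72,  &value, 4); break;   /* VertexMode -> +72 */
    case 8:  /* FastMath.IgnoreInf -> +200 bit 0 */
        opts[200] = (uint8_t)((opts[200] & ~1u) | (value & 1));
        break;
    case 15: /* FastMath.Fmad -> +200 bit 7 */
        opts[200] = (uint8_t)((opts[200] & ~0x80u) | ((value & 1) << 7));
        break;
    default: break;                                  /* unknown tags: skip */
    }
}
```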
Tag 99: Compression Metadata
| Tag | Type | Name | Notes |
|---|---|---|---|
| 99 | int32 | CompressAlgoId | Compression algorithm selector for IR payload |
When present, the IR payload following the blob region is compressed. The value selects a codec via sub_16886D0(algo_id). If the value is 0, the runtime substitutes the default algorithm ID 0x75D49913. The codec is a pluggable compression/encryption layer accessed through four function pointers:
/* Compression codec API (addresses in the 0x1688xxx range) */
void *codec_acquire(uint32_t algo_id);                        /* sub_16886D0 */
int   codec_compress(void *codec, void *data, size_t size);   /* sub_1688730 */
int   codec_decompress(void *codec, void *data, size_t size); /* sub_16887A0 */
void  codec_release(void *codec);                             /* sub_1688720 */
The write path in NvvmContainer_serialize (0xCDD2D0) compresses the LLVM bitcode payload via sub_C8D290, then computes a CRC hash via NvvmContainer_compute_crc (0xCCD2B0) with the two seed values 0x8DF5D74C and 0xBAA56A96. The CRC value is stored as the tag 99 (CompressAlgoId) value, which doubles as an integrity-check token: the deserializer uses the same CRC seeds to verify the payload before decompression.
The compression subsystem lives outside the main container cluster at addresses 0x16886D0--0x16887A0, in the utility library region of the binary.
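The zero-substitution behavior described above reduces to a single guard (the helper name is this page's own):

```c
#include <stdint.h>

#define NVVM_DEFAULT_COMPRESS_ALGO 0x75D49913u

/* The runtime substitutes the default algorithm ID when tag 99 carries 0. */
static uint32_t nvvm_effective_algo_id(uint32_t compress_algo_id)
{
    return compress_algo_id ? compress_algo_id : NVVM_DEFAULT_COMPRESS_ALGO;
}
```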
Range 101--173: Extended Target Options
These tags configure per-kernel and target-specific hardware parameters. Most map into a sub-structure accessed through the Options struct. The "Location" column gives either a byte offset within the target options sub-structure or, for bit-typed tags, a packed bitfield position (byte, bit).
| Tag | Type | Name | Location | Notes |
|---|---|---|---|---|
| 101 | bool | HasTextureOps | offset 0 | Target supports texture operations |
| 102 | bool | HasSurfaceOps | offset 0 | Target supports surface operations |
| 103 | bool | HasAtomics | offset 0 | Target supports atomic operations |
| 104 | bool | HasVote | offset 0 | Target supports warp vote intrinsics |
| 105 | int32 | MaxThreadsPerBlock | offset 4 | Maximum CTA thread count |
| 106 | byte | PreferL1SizeFlag | offset 8 | L1 cache vs shared memory preference |
| 107 | bool | HasWarpShuffle | offset 0 | Target supports warp shuffle |
| 108 | bool | HasFunnelShift | offset 0 | Target supports funnel shift |
| 109 | int32 | CBankOfstLow | offset 12 | Constant bank offset lower bound |
| 110 | int32 | CBankOfstHi | offset 16 | Constant bank offset upper bound |
| 111 | int32 | CBankSize | offset 20 | Constant bank size in bytes |
| 112 | bit | Bit0_68 | byte 68, bit 0 | Target capability flag |
| 113 | bit | Bit1_68 | byte 68, bit 1 | Target capability flag |
| 114 | bit | Bit2_68 | byte 68, bit 2 | Target capability flag |
| 115 | bit | Bit3_68 | byte 68, bit 3 | Target capability flag |
| 116 | bit | Bit4_68 | byte 68, bit 4 | Target capability flag |
| 117 | bit | Bit5_68 | byte 68, bit 5 | Target capability flag |
| 118 | bit | Bit7_68 | byte 68, bit 7 | Target capability flag (bit 6 skipped) |
| 119 | bit | EnableCoalesce | byte 69, bit 0 | Enable memory coalescing optimization |
| 120 | bit | EnableVectorize | byte 69, bit 2 | Enable auto-vectorization |
| 121 | 2-bit | CompactionMode | byte 69, bits 3--4 | Thread compaction strategy (0--3) |
| 122 | int32 | StackFrameSize | offset 96 | Stack frame size in bytes |
| 123 | int32 | StackAlignment | offset 100 | Stack alignment requirement |
| 124 | int32 | ParamSpaceSize | offset 104 | Parameter space size |
| 125 | int32 | ParamAlignment | offset 108 | Parameter space alignment |
| 126 | int32 | LocalMemSize | offset 116 | Local memory size per thread |
| 127 | int32 | SharedBankConfig | offset 156 | Shared memory bank configuration |
| 128 | int32 | MinGridSize | offset 248 | Minimum grid size for occupancy |
| 129 | int32 | MaxGridDimX | offset 252 | Maximum X-dimension grid size |
| 130 | int32 | SharedMemPerBlock | offset 264 | Shared memory per block |
| 131 | 2-bit | WarpScheduleMode | byte 70, bits 0--1 | Warp scheduling strategy |
| 132 | bit | EnablePrefetch | byte 70, bit 2 | Enable memory prefetch instructions |
| 133 | bit | Bit4_70 | byte 70, bit 4 | Target capability flag |
| 134 | bit | Bit5_70 | byte 70, bit 5 | Target capability flag |
| 135 | bit | Bit6_70 | byte 70, bit 6 | Target capability flag |
| 136 | bit | Bit7_70 | byte 70, bit 7 | Target capability flag |
| 137 | int32 | MaxDynShared | offset 268 | Maximum dynamic shared memory |
| 138 | bool | HasLDG | offset 5 | Target supports LDG instruction |
| 139 | bit | Bit1_71 | byte 71, bit 1 | Target capability flag |
| 140 | bit | Bit2_71 | byte 71, bit 2 | Target capability flag |
| 141 | bool | HasBarrierReduce | offset 40 | Target supports barrier-reduce |
| 142 | int32 | CacheConfig | offset 280 | Cache configuration selector |
| 143 | bit | Bit6_68 | byte 68, bit 6 | Target capability flag |
| 144 | bit | Bit3_71 | byte 71, bit 3 | Target capability flag |
| 145 | bit | Bit0_71 | byte 71, bit 0 | Target capability flag |
| 146 | int32 | ConstBankSize | offset 256 | Constant bank total size |
| 147 | int32 | ShMemBankStride | offset 152 | Shared memory bank stride |
| 148 | 2-bit | ScheduleMode2 | byte 71, bits 4--5 | Secondary scheduling mode |
| 149 | bit | Bit6_71 | byte 71, bit 6 | Target capability flag |
| 150 | bit | Bit7_71 | byte 71, bit 7 | Target capability flag |
| 151 | int32 | LocalMemAlignment | offset 112 | Local memory alignment |
| 152 | bit | EnableBarrierOpt | byte 69, bit 5 | Enable barrier optimization |
| 153 | bit | Bit0_72 | byte 72, bit 0 | Target capability flag |
| 154 | bit | Bit6_69 | byte 69, bit 6 | Target capability flag |
| 155 | bit | Bit7_69 | byte 69, bit 7 | Target capability flag |
| 156 | bit | Bit1_72 | byte 72, bit 1 | Target capability flag |
| 157 | bool | HasDP4A | offset 1 | Target supports DP4A dot-product |
| 158 | bit | Bit3_72 | byte 72, bit 3 | Target capability flag |
| 159 | int32 | ConstBankSize2 | offset 260 | Secondary constant bank size |
| 160 | int32 | MaxRegsPerThread | offset 284 | Hard limit on registers per thread |
| 161 | int32 | ClusterSize | offset 276 | Thread block cluster size (SM 90+) |
| 162 | bit | Bit4_72 | byte 72, bit 4 | Target capability flag |
| 163 | bit | Bit5_72 | byte 72, bit 5 | Target capability flag |
| 164 | bit | Bit6_72 | byte 72, bit 6 | Target capability flag |
| 165 | bit | Bit7_72 | byte 72, bit 7 | Target capability flag |
| 166 | int32 | MaxCTAPerSM | offset 160 | Maximum CTAs per SM |
| 167 | int32 | TexIndirectLimit | offset 272 | Texture indirect access limit |
| 168 | bit | Bit0_432 | byte 432, bit 0 | Extended capability flag |
| 169 | bit | Bit1_432 | byte 432, bit 1 | Extended capability flag |
| 170 | bit | Bit2_432 | byte 432, bit 2 | Extended capability flag |
| 171 | bool | HasTMAOps | offset 289 | Target supports TMA operations (SM 90+) |
| 172 | bit | Bit3_70 | byte 70, bit 3 | Target capability flag |
| 173 | bool | HasTCGen05 | offset 290 | Target supports TCGen05 (SM 100+) |
Range 201--218: Blob Data Tags
| Tag | Size | Name | Description |
|---|---|---|---|
| 201 | 24 B | MemoryWindowCBank | 3 memory window entries for constant bank (see below) |
| 202 | 24 B | MemoryWindowLocal | 3 memory window entries for local memory |
| 203 | 40 B | MemoryWindowShared | 10 x uint32_t for shared memory windows + flags |
| 204 | 48 B | MultiViewOptions | Multi-view rendering header + typed arrays |
| 205 | var | TargetResourceTable | 24-byte header + 36 bytes per entry |
| 206 | var | PerKernelCBankOffsets | 4-byte count + 4 bytes per kernel |
| 207 | var | PerKernelStackSizes | 4-byte count + 4 bytes per kernel |
| 208 | var | PerKernelSMEMSizes | 8-byte count + 8 bytes per kernel |
| 209 | var | TargetFuncName | Null-terminated string |
| 210 | var | TargetEntryName | Null-terminated string |
| 211 | 8 B | PerKernelQWORD | 8-byte per-kernel datum |
| 212 | 12 B | ExtraMemParams | 8 + 4 bytes of memory parameters |
| 213 | var | AuxString1 | Null-terminated auxiliary string |
| 214 | var | PerKernelRegisters | 4-byte count + 4 bytes per kernel |
| 215 | var | PerKernelBarriers | 4-byte count + 4 bytes per kernel |
| 216 | var | AuxString2 | Null-terminated auxiliary string |
| 217 | var | AuxString3 | Null-terminated auxiliary string |
| 218 | var | AuxByteArray | 4-byte length prefix + raw bytes |
Range 301--309: Extended Int32 Fields
| Tag | Type | Name | Options Offset | Notes |
|---|---|---|---|---|
| 301 | int32 | ExtOpt.Field344 | +344 | Cluster/group configuration selector |
| 302 | int32 | ExtOpt.Field348 | +348 | Extended option |
| 303 | int32 | ExtOpt.Field352 | +352 | Extended option |
| 304 | int32 | ExtOpt.Field356 | +356 | Extended option |
| 305 | int32 | ExtOpt.Field360 | +360 | Extended option |
| 306 | int32 | ExtOpt.Field400 | +400 | Extended option |
| 307 | int32 | ExtOpt.Field364 | +364 | Extended option |
| 308 | int32 | ExtOpt.Field368 | +368 | Extended option |
| 309 | int32 | ExtOpt.Field372 | +372 | Extended option |
Range 351--353: Extended Int64 Blob References
| Tag | Size | Name | Options Offset |
|---|---|---|---|
| 351 | 8 B | ExtOpt.QWord376 | +376 |
| 352 | 8 B | ExtOpt.QWord384 | +384 |
| 353 | 8 B | ExtOpt.QWord392 | +392 |
Range 401--402: Structured Blob Data
These tags are conditionally parsed based on the value of tag 301 (ExtOpt.Field344):
| Tag | Condition | Size | Name | Notes |
|---|---|---|---|---|
| 401 | Field344 == 1 | 56+ B | TMADescriptor | SM 90 Hopper TMA bulk-copy descriptors. 44-byte fixed header + 16 bytes per entry. |
| 402 | Field344 == 4 | 40+ B | TCGen05Config | SM 100 Blackwell TCGen05 tensor configurations. 32-byte fixed header + 12 bytes per entry. |
The conditional parsing means a single container cannot carry both TMA and TCGen05 data -- the Field344 value selects which hardware generation's tensor memory interface is active.
TMADescriptor Layout (Tag 401, Field344 == 1)
TMA (Tensor Memory Access) descriptors configure cp.async.bulk operations on SM 90 Hopper. The TMA descriptor extraction is performed by sub_9483E0 during intrinsic lowering. The blob layout:
struct TMADescriptor {
/* +0 */ uint32_t num_entries; /* Number of TMA descriptors */
/* +4 */ uint32_t dimensionality; /* 1d..5d tensor rank */
/* +8 */ uint32_t element_size; /* Bytes per element */
/* +12 */ uint32_t interleave_layout; /* Memory interleave pattern */
/* +16 */ uint32_t swizzle_mode; /* Swizzle mode selector */
/* +20 */ uint32_t fill_mode; /* Out-of-bounds fill behavior */
/* +24 */ uint32_t global_dims[5]; /* Global tensor dimensions */
/* +44 */ /* --- 16 bytes per entry --- */
/* uint32_t box_dim; Per-entry box dimension */
/* uint32_t stride; Per-entry stride */
/* uint32_t elem_stride; Per-entry element stride */
/* uint32_t reserved; Reserved/padding */
};
See SM 90 Hopper for the TMA instruction format and the cp.async.bulk.tensor.g2s.tile.{1d,2d,3d,4d,5d} intrinsic family.
TCGen05Config Layout (Tag 402, Field344 == 4)
TCGen05 (Tensor Core Generation 5) configurations describe Blackwell SM 100 tensor memory operations. The TCGen05 instruction set includes tcgen05.alloc, tcgen05.dealloc, tcgen05.commit, tcgen05.fence, tcgen05.wait, and tcgen05.relinquish.alloc -- all gated by the SM 100 arch-conditional check at sub_30462A0. The blob layout:
struct TCGen05Config {
/* +0 */ uint32_t num_entries; /* Number of TCGen05 configs */
/* +4 */ uint32_t accumulator_size; /* Accumulator memory size */
/* +8 */ uint32_t commit_mode; /* Commit mode (multicast flags) */
/* +12 */ uint32_t fence_mode; /* Fence mode selector */
/* +16 */ uint32_t reserved[4]; /* Reserved fields */
/* +32 */ /* --- 12 bytes per entry --- */
/* uint32_t config_id; TCGen05 config identifier */
/* uint32_t fragment_count; Number of fragments */
/* uint32_t flags; Per-config flags */
};
See SM 100 Blackwell for the TCGen05 instruction set and the tcgen05.* intrinsic family.
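Given the fixed-header-plus-entries layouts above, the expected blob size for either structured tag follows directly (the helper name is this page's own):

```c
#include <stdint.h>

/* Expected blob sizes for the conditionally parsed structured tags:
   TMADescriptor  (tag 401): 44-byte fixed header + 16 bytes per entry.
   TCGen05Config  (tag 402): 32-byte fixed header + 12 bytes per entry. */
static uint32_t nvvm_structured_blob_size(uint16_t tag, uint32_t num_entries)
{
    switch (tag) {
    case 401: return 44 + 16 * num_entries;
    case 402: return 32 + 12 * num_entries;
    default:  return 0;   /* not a structured blob tag */
    }
}
```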
Deserialized Container Struct
After parsing, the container is represented as a 248-byte in-memory structure allocated by NvvmContainer_init_options_struct (0xCCBB10). This struct holds the container metadata plus a pointer to the full 440-byte Options struct.
struct NvvmContainerHeader { /* 248 bytes total */
/* 0x00 */ uint32_t sm_major; /* Tag 1: SM major version */
/* 0x04 */ uint32_t sm_minor; /* Tag 2: SM minor version */
/* 0x08 */ uint32_t num_regs; /* Tag 3 */
/* 0x0C */ uint32_t num_barriers; /* Tag 4 */
/* 0x10 */ uint32_t shared_mem_size; /* Tag 5 */
/* 0x14 */ uint8_t flags_14; /* Packed bits: tags 7,27,28,29,38,39*/
/* bit 0: ReserveLocalAddressZero (tag 7) */
/* bit 1: ForceImmediateConstants (tag 27) */
/* bit 2: HideFunctions (tag 28) */
/* bit 3: UseDX10AddressInRange (tag 29) */
/* bit 4: IsPIC (tag 38) */
/* bit 5: NoSpillsConstraint (tag 39) */
/* 0x15 */ uint8_t _pad15[3];
/* 0x18 */ uint8_t multi_view_options[48]; /* Tag 204 blob */
/* 0x48 */ uint32_t vertex_mode; /* Tag 6 */
/* 0x4C */ uint8_t _pad4c[4];
/* 0x50 */ uint32_t max_rregs; /* Tag 19 */
/* 0x54 */ uint32_t sched_reg_target; /* Tag 20 */
/* 0x58 */ uint32_t unroll_control; /* Tag 21 */
/* 0x5C */ uint8_t _pad5c[4];
/* 0x60 */ uint8_t mem_win_cbank[24]; /* Tag 201 blob */
/* 0x78 */ uint8_t mem_win_local[24]; /* Tag 202 blob */
/* 0x90 */ uint8_t mem_win_shared[40]; /* Tag 203 blob */
/* 0xB8 */ uint8_t _padb8[12];
/* 0xC4 */ uint8_t accelerated_arch; /* Tag 22 */
/* 0xC5 */ uint8_t std_elf; /* Tag 23 */
/* 0xC6 */ uint8_t _padc6[2];
/* 0xC8 */ uint8_t fast_math[8]; /* Tags 8-17,26,31,33 bitfields */
/* 0xD0 */ uint8_t _padd0[8];
/* 0xD8 */ uint32_t max_rregs_2; /* Tag 24 */
/* 0xDC */ uint32_t sched_reg_2; /* Tag 25 */
/* 0xE0 */ uint32_t unroll_ctl_2; /* Tag 30 */
/* 0xE4 */ uint32_t compress_algo_id; /* Tag 99 */
/* 0xE8 */ uint8_t omega_ptx_err; /* Tag 32 */
/* 0xE9 */ uint8_t std_elf_2; /* Tag 34 */
/* 0xEA */ uint8_t _padea[2];
/* 0xEC */ uint32_t shader_cg_sel; /* Tag 35 */
/* 0xF0 */ uint8_t fdl_bit; /* Tag 36 */
/* 0xF1 */ uint8_t _padf1[3];
/* 0xF4 */ uint32_t fdl_insert_mode; /* Tag 37 */
};
/* sizeof(NvvmContainerHeader) == 248 (0xF8) */
The Options pointer is stored at offset 208 (0xD0) of the container header during deserialization -- the container header acts as both a data holder and an index into the full Options struct.
Options Struct (440 bytes)
The full compiler options structure is allocated separately and linked from the container header. It is parsed by NvvmOptions_parse_compile_options (0xCDB4D0, 26,643 bytes) in the XML path, or populated field-by-field from tags in the binary path.
struct NvvmOptions { /* 440 bytes total */
/* +0 */ uint32_t arch_variant; /* ArchVariant enum */
/* +4 */ uint32_t compile_mode; /* CompileMode enum */
/* +8 */ uint32_t opt_level; /* OptLevel enum */
/* +12 */ uint32_t debug_info; /* DebugInfo enum */
/* +16 */ uint32_t client_version;
/* +20 */ uint8_t flags_20; /* Packed booleans: 6 bits */
/* bit 0: ReserveLocalAddressZero */
/* bit 1: ForceImmediateConstants */
/* bit 2: HideFunctions */
/* bit 3: UseDX10AddressInRange */
/* bit 4: IsPIC */
/* bit 5: NoSpillsConstraint */
/* +21 */ uint8_t _pad21[3];
/* +24 */ uint8_t multi_view[48]; /* MultiViewOptions sub-structure */
/* +72 */ uint32_t vertex_mode; /* VertexMode enum */
/* +76 */ uint8_t _pad76[4];
/* +80 */ uint8_t dci_info[120]; /* DCIInfo sub-structure */
/* +200 */ uint8_t fast_math_byte0; /* FastMath bits 0-7 */
/* +201 */ uint8_t fast_math_byte1; /* FastMath bits 8-12 */
/* +202 */ uint8_t _pad202[2];
/* +204 */ uint32_t fast_math_reserved;
/* +208 */ uint8_t _pad208[8];
/* +216 */ uint32_t max_rregs_allowed;
/* +220 */ uint32_t sched_reg_target;
/* +224 */ uint32_t unroll_control;
/* +228 */ uint32_t okey; /* CompressAlgoId / OKey */
/* +232 */ uint8_t accelerated_arch;
/* +233 */ uint8_t std_elf;
/* +234 */ uint8_t _pad234[2];
/* +236 */ uint32_t shader_codegen_sel_mask;
/* +240 */ uint8_t omega_ptx_error_handling;
/* +241 */ uint8_t _pad241[3];
/* +244 */ uint32_t fdl_insert_mode;
/* +248 */ uint8_t target_opts[192]; /* Extended target options (tags 101-173) */
};
/* sizeof(NvvmOptions) == 440 (0x1B8) */
DCIInfo Sub-Structure (Options +80, 120 bytes)
The Device-Code-Interface sub-structure at offset +80 contains the shader constant interface and constant bank reserved area configurations. Parsed by NvvmOptions_parse_shader_const_iface (0xCCEEA0, 8,355 bytes) and NvvmOptions_parse_cb_reserved_area (0xCCE780, 9,802 bytes).
ShaderConstIface XML fields (from sub_CCEEA0):
| Field | Type | Description |
|---|---|---|
| OptimizerConstBank | int32 | Constant bank index used by the optimizer |
| DriverConstBank | int32 | Constant bank index used by the driver |
| BindlessTextureBank | int32 | Constant bank for bindless texture handles |
| LocalMemoryWindow | struct | Memory window config for local memory |
| SharedMemoryWindow | struct | Memory window config for shared memory |
| VectorizeAndRemapTLD | bool | Enable vectorization and TLD remapping |
| ELFControlsDCI | bool | ELF controls DCI interface layout |
| DiscardDefaultValueOutputs | bool | Discard outputs that match default values |
CBReservedArea XML fields (from sub_CCE780):
| Field | Type | Description |
|---|---|---|
| ByteOffsetToEndOfReservedArea | int32 | End-of-reserved-area offset in constant bank |
| CbAddressBitsInReservedVABase | int32 | Address bits for reserved virtual address base |
| CbBankToReservedVABase | int32 | Constant bank index for reserved VA base |
| ForceHighLatencyConstExpr | bool | Force high-latency constant expression evaluation |
| ReservedCbReadBank | int32 | Reserved constant bank read bank index |
MultiViewOptions Sub-Structure (Options +24, 48 bytes)
The multi-view rendering options sub-structure at offset +24 carries graphics pipeline multi-view configuration. Parsed by NvvmOptions_parse_multi_view (0xCD6D20, 12,188 bytes). Serialized as blob tag 204.
| Field | Type | Description |
|---|---|---|
| NumViews | int32 | Number of rendering views |
| NominalViewIDs | int32[] | Array of nominal view identifiers |
| PerViewRTIndexConstants | int32[] | Per-view render target index constants |
| EnableViewInstanceMask | bool | Enable per-view instance masking |
| ComputePerPatchAttribsForViewZero | bool | Compute per-patch attributes for view 0 |
| IsImplicit | bool | Implicit multi-view mode |
CompileMode Enum
| Value | Name | Meaning |
|---|---|---|
| 0 | NVVM_COMPILE_MODE_WHOLE_PROGRAM_ABI | Whole-program with ABI compliance |
| 1 | NVVM_COMPILE_MODE_WHOLE_PROGRAM_NOABI | Whole-program without ABI (internal) |
| 2 | NVVM_COMPILE_MODE_SEPARATE_ABI | Separate compilation (relocatable, --device-c) |
| 3 | NVVM_COMPILE_MODE_EXTENSIBLE_WHOLE_PROGRAM_ABI | Extensible whole-program with ABI |
OptLevel Enum
| Value | Name |
|---|---|
| 0 | NVVM_OPT_LEVEL_NONE |
| 1 | NVVM_OPT_LEVEL_1 |
| 2 | NVVM_OPT_LEVEL_2 (default) |
| 3 | NVVM_OPT_LEVEL_3 |
DebugInfo Enum
| Value | Name |
|---|---|
| 0 | NVVM_DEBUG_INFO_NONE (default) |
| 1 | NVVM_DEBUG_INFO_LINE_INFO |
| 2 | NVVM_DEBUG_INFO_DWARF |
VertexMode Enum
| Value | Name |
|---|---|
| 0 | NVVM_VERTEX_MODE_SINGLE |
| 1 | NVVM_VERTEX_MODE_A |
| 2 | NVVM_VERTEX_MODE_B |
| 3 | NVVM_VERTEX_MODE_AB |
FDLInsertMode Enum
| Value | Name |
|---|---|
| 0 | NVVM_FDL_MODE_NONE |
| 1 | NVVM_FDL_MODE_ALL |
| 2 | NVVM_FDL_MODE_APP |
ArchVariant Enum
The architecture enum uses a numeric encoding where the value equals major * 10 + minor (e.g., 75 for SM 7.5; Blackwell's two-digit majors yield three-digit values such as 100 and 120). There are two parallel enum spaces: "virtual" architecture variants (used for compute_XX targets) and "HW" variants (used for sm_XX real silicon targets). The virtual variants are serialized by name in the XML format via NvvmOptions_parse_arch_enum (0xCD09E0, 14,516 bytes).
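The virtual encoding round-trips trivially; a sketch (helper names are this page's own):

```c
#include <stdint.h>

/* Virtual ArchVariant numeric encoding: value = major * 10 + minor. */
static uint32_t arch_variant_value(uint32_t major, uint32_t minor)
{
    return major * 10 + minor;
}

/* Inverse: split a variant value back into (major, minor). */
static void arch_variant_split(uint32_t value, uint32_t *major, uint32_t *minor)
{
    *major = value / 10;
    *minor = value % 10;
}
```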
Virtual Architecture Variants
| Enum Name | Numeric Value | Generation | SM |
|---|---|---|---|
| NVVM_ARCH_KEPLER_3_0 | 30 | Kepler | 3.0 |
| NVVM_ARCH_KEPLER_3_2 | 32 | Kepler | 3.2 |
| NVVM_ARCH_KEPLER_3_5 | 35 | Kepler | 3.5 |
| NVVM_ARCH_KEPLER_3_7 | 37 | Kepler | 3.7 |
| NVVM_ARCH_MAXWELL_5_0 | 50 | Maxwell | 5.0 |
| NVVM_ARCH_MAXWELL_5_2 | 52 | Maxwell | 5.2 |
| NVVM_ARCH_MAXWELL_5_3 | 53 | Maxwell | 5.3 |
| NVVM_ARCH_PASCAL_6_0 | 60 | Pascal | 6.0 |
| NVVM_ARCH_PASCAL_6_1 | 61 | Pascal | 6.1 |
| NVVM_ARCH_PASCAL_6_2 | 62 | Pascal | 6.2 |
| NVVM_ARCH_VOLTA_7_0 | 70 | Volta | 7.0 |
| NVVM_ARCH_VOLTA_7_2 | 72 | Volta | 7.2 |
| NVVM_ARCH_TURING_7_3 | 73 | Turing | 7.3 |
| NVVM_ARCH_TURING_7_5 | 75 | Turing | 7.5 |
| NVVM_ARCH_AMPERE_8_0 | 80 | Ampere | 8.0 |
| NVVM_ARCH_AMPERE_8_2 | 82 | Ampere | 8.2 |
| NVVM_ARCH_AMPERE_8_6 | 86 | Ampere | 8.6 |
| NVVM_ARCH_AMPERE_8_7 | 87 | Ampere | 8.7 |
| NVVM_ARCH_AMPERE_8_8 | 88 | Ampere | 8.8 |
| NVVM_ARCH_ADA_8_9 | 89 | Ada Lovelace | 8.9 |
| NVVM_ARCH_HOPPER_9_0 | 90 | Hopper | 9.0 |
| NVVM_ARCH_BLACKWELL_10_0 | 100 | Blackwell | 10.0 |
| NVVM_ARCH_BLACKWELL_10_1 | 101 | Blackwell | 10.1 |
| NVVM_ARCH_BLACKWELL_10_3 | 103 | Blackwell | 10.3 |
| NVVM_ARCH_BLACKWELL_11_0 | 110 | Blackwell (Jetson Thor) | 11.0 |
| NVVM_ARCH_BLACKWELL_12_0 | 120 | Blackwell (RTX 50xx / Pro) | 12.0 |
| NVVM_ARCH_BLACKWELL_12_1 | 121 | Blackwell (DGX Spark) | 12.1 |
Note: NVVM_ARCH_BLACKWELL_10_1 maps to __CUDA_ARCH__ 1010, while NVVM_ARCH_BLACKWELL_11_0 maps to __CUDA_ARCH__ 1100. Despite both being in the BLACKWELL family, they are distinct architectures with separate entries in the processor table. sm_110 (Jetson Thor) was originally designated sm_101 before being renumbered to its own 11.x line.
HW Architecture Variants
The HW variants use a major * 100 + minor * 10 encoding for their internal numeric values (5.0 → 500, 10.1 → 1010). These map to real silicon rather than virtual compute capabilities:
| Enum Name | Internal Value | Notes |
|---|---|---|
NVVM_ARCH_HW_SM_5_0 | 500 | Maxwell HW baseline |
| ... | ... | One entry per supported HW SM through 9.0 |
NVVM_ARCH_HW_SM_10_0 | 1000 | Blackwell datacenter |
NVVM_ARCH_HW_SM_10_1 | 1010 | Blackwell Ultra (GB300) |
NVVM_ARCH_HW_SM_10_3 | 1030 | Blackwell variant |
NVVM_ARCH_HW_SM_10_4 | 1200 | Maps to SM 120 value -- not publicly documented |
The HW_SM_10_4 = 1200 mapping is notable: SM 10.4 in the HW enum space corresponds to the SM 120 consumer architecture. This reveals that "SM 120" is internally considered a Blackwell 10.4 die variant, not a separate generation.
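The two encodings can be sketched directly from the tables above. This is a minimal illustration; the function names are descriptive, not recovered symbols, and the HW_SM_10_4 exception is hand-mapped rather than computed.

```python
# Sketch of the two ArchVariant numeric encodings implied by the tables above.
# Helper names here are illustrative, not recovered symbols.

def virtual_arch_value(major: int, minor: int) -> int:
    """Virtual (compute_XX) encoding: major * 10 + minor."""
    return major * 10 + minor

def hw_arch_value(major: int, minor: int) -> int:
    """HW (sm_XX silicon) encoding: major * 100 + minor * 10."""
    return major * 100 + minor * 10

assert virtual_arch_value(8, 9) == 89       # NVVM_ARCH_ADA_8_9
assert virtual_arch_value(12, 1) == 121     # NVVM_ARCH_BLACKWELL_12_1
assert hw_arch_value(5, 0) == 500           # NVVM_ARCH_HW_SM_5_0
assert hw_arch_value(10, 1) == 1010         # NVVM_ARCH_HW_SM_10_1
# Exception: NVVM_ARCH_HW_SM_10_4 is hand-mapped to 1200 (the SM 120 value),
# not hw_arch_value(10, 4) == 1040.
```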
FastMathOptions Bitfields
The fast-math configuration occupies two bytes at Options offset +200 and +201, with an additional int32 at +204. Each bit independently controls one floating-point relaxation.
Byte +200 (tags 8--15)
Bit 7 Bit 6 Bit 5 Bit 4 Bit 3 Bit 2 Bit 1 Bit 0
+-------+-------+-------+-------+-------+-------+-------+-------+
| Fmad | Fast | Ftz |Reorder|Reorder|Ignore | Ignore|Ignore |
| | Sqrt | | Half | Float | Sign0 | NaN | Inf |
+-------+-------+-------+-------+-------+-------+-------+-------+
tag 15 tag 14 tag 13 tag 12 tag 11 tag 10 tag 9 tag 8
Byte +201 (tags 16--17, 26, 31, 33)
Bit 7 Bit 6 Bit 5 Bit 4 Bit 3 Bit 2 Bit 1 Bit 0
+-------+-------+-------+-------+-------+-------+-------+-------+
| | | | Lax | No |Reassoc|CanReor| Allow |
| | | | FP16 | Float | Float |derDist| Rcp |
| | | | Div | MAD |AddMAD |ribute | Rsq |
+-------+-------+-------+-------+-------+-------+-------+-------+
tag 33 tag 31 tag 26 tag 17 tag 16
FastMath Divide Sub-Enum
The Divide field within FastMathOptions is a nested enum serialized by name in the XML path:
| Value | Name | Meaning |
|---|---|---|
| 0 | NVVM_FAST_MATH_DIVIDE_PRECISE_NO_FTZ | IEEE-compliant division, no flush-to-zero |
| 1 | NVVM_FAST_MATH_DIVIDE_PRECISE_ALLOW_FTZ | IEEE division with FTZ permitted |
| 2 | NVVM_FAST_MATH_DIVIDE_FULL_RANGE_APPROX | Full-range approximation |
| 3 | NVVM_FAST_MATH_DIVIDE_FAST_APPROX | Fast approximation (least precise) |
These correspond to the nvcc flags -prec-div=1 (precise) and -prec-div=0 (fast), with FTZ interaction determined by -ftz.
Complete FastMath XML Field Inventory
The full set of XML field names parsed by NvvmOptions_parse_fast_math (0xCCF590, 12,771 bytes):
| XML Field Name | Binary Tag | Type | Description |
|---|---|---|---|
IgnoreInf | 8 | bit | Treat infinities as NaN |
IgnoreNaN | 9 | bit | Assume no NaN values present |
IgnoreSignedZero | 10 | bit | Ignore sign of zero |
ReorderFloat | 11 | bit | Allow float reordering |
ReorderHalf | 12 | bit | Allow half-precision reordering |
Ftz | 13 | bit | Flush denormals to zero |
FastSqrt | 14 | bit | Use fast sqrt approximation |
Fmad | 15 | bit | Allow fused multiply-add |
AllowRcpRsqToSqrt | 16 | bit | Allow rcp(rsqrt(x)) to sqrt(x) |
CanReorderFloatDistribute | 17 | bit | Allow distributive reordering |
ReassociateFloatAddOverMad | 26 | bit | Float add reassociation over MAD |
NoFloatMAD | 31 | bit | Disable float MAD formation |
LaxFP16ApproximateDivision | 33 | bit | Lax FP16 approximate division |
Divide | -- | enum | Division precision sub-enum (above) |
The Divide field is serialized as a nested enum element in XML; in the binary format it is encoded as part of the fast-math reserved int32 at Options +204 (tag 18).
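The byte +200 layout diagrammed above (tag N occupying bit N - 8) can be decoded with a small lookup. This is a sketch against the documented bit positions only; the table and function names are illustrative.

```python
# Minimal decoder for the FastMathOptions byte at Options +200, following the
# bit layout diagram above: tags 8..15 occupy bits 0..7. Illustrative only.

FASTMATH_BYTE_200 = {   # bit position -> XML field name (binary tag = bit + 8)
    0: "IgnoreInf", 1: "IgnoreNaN", 2: "IgnoreSignedZero",
    3: "ReorderFloat", 4: "ReorderHalf", 5: "Ftz",
    6: "FastSqrt", 7: "Fmad",
}

def decode_fastmath_byte(value: int) -> dict:
    return {name: bool(value >> bit & 1) for bit, name in FASTMATH_BYTE_200.items()}

# Ftz (bit 5) and Fmad (bit 7) set, everything else clear -- the combination
# shown in the annotated hex dump later in this page:
flags = decode_fastmath_byte(0b10100000)
assert flags["Ftz"] and flags["Fmad"] and not flags["IgnoreInf"]
```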
Memory Window Configuration
Memory windows define how the compiler maps address spaces to hardware memory banks. Three window types are serialized as blobs via tags 201--203, parsed by NvvmOptions_parse_cbank_config (0xCCE4B0) and NvvmOptions_parse_memory_windows (0xCCE100).
MemoryWindow Type Enum
| Value | Name | Meaning |
|---|---|---|
| 0 | NVVM_MEMORY_WINDOW_SPECIAL_REGISTER | Accessed via special registers |
| 1 | NVVM_MEMORY_WINDOW_CBANK | Constant bank window |
| 2 | NVVM_MEMORY_WINDOW_IMMEDIATE | Immediate offset addressing |
Window Entry Layout (8 bytes)
struct MemoryWindowEntry {
uint32_t window_type; /* MemoryWindow type enum */
uint32_t cbank; /* Constant bank index */
/* The following are part of the containing blob: */
/* uint32_t cbank_ofst_low; -- lower bound of offset range */
/* uint32_t cbank_ofst_hi; -- upper bound of offset range */
};
- Tag 201 (`MemoryWindowCBank`): 24 bytes = 3 entries of `{window_type, cbank, low, hi}` truncated to fit, or 3 x 8 bytes depending on sub-field packing.
- Tag 202 (`MemoryWindowLocal`): 24 bytes, same structure.
- Tag 203 (`MemoryWindowShared`): 40 bytes = 10 x `uint32_t` values encoding shared memory bank strides, offsets, and configuration flags.
Version Compatibility Logic
Version checking is the first operation performed on a container buffer, implemented in NvvmContainer_check_versions (0xCD41B0). The logic is conservative on major versions and lenient on minor versions:
1. Verify magic == 0x7F4E5C7D
Fail: return NULL (not a container)
2. Version.Major must == 1
Fail: "NvvmContainer major version N not compatible" → return NULL
3. Version.Minor compared to 0x41 (65)
If container minor > tool minor:
Warning: "Linked container's NvvmContainer minor version N newer than tool"
Parse continues regardless.
4. NvvmIRVersion.Major must == 2
Fail: "NvvmIR major version N not compatible" → return NULL
5. NvvmIRVersion.Minor compared to 0x62 (98)
If container minor > tool minor: warning, parse continues.
6. NvvmDebugVersion.Major must == 3
Fail: "NvvmDebug major version N not compatible" → return NULL
7. NvvmDebugVersion.Minor compared to 2
If container minor > tool minor: warning, parse continues.
8. LlvmVersion (major*100 + minor) must be <= 2000
Fail: "LLVM version N not compatible" → return NULL
A separate standalone validator (0xCCD5F0) adds a mode-dependent check: in binary dump mode (a5=0), the LLVM version must be exactly 20; in normal mode (a5=1), it must be <= 20.
The philosophy is clear: major version bumps signal breaking format changes and are hard failures. Minor version bumps add new tags but never change existing tag semantics -- the delta encoding and unknown-tag-skipping design ensures forward compatibility.
Current Version Constants (cicc v13.0)
| Field | Major | Minor |
|---|---|---|
| Version (container format) | 1 | 0x41 (65) |
| NvvmIRVersion | 2 | 0x62 (98) |
| NvvmDebugVersion | 3 | 2 |
| LlvmVersion | 20 | 0 |
XML Serialization Format
The XML path (NvvmContainer_parse_header at 0xCDCA30) uses NVIDIA's YAML-based serialization framework with virtual dispatch. The top-level XML document contains these elements:
<NvvmContainer>
<Version major="1" minor="65"/>
<NvvmIRVersion major="2" minor="98"/>
<NvvmDebugVersion major="3" minor="2"/>
<LlvmVersion major="20" minor="0"/>
<IRLevel>NVVM_IR_LEVEL_UNIFIED_AFTER_DCI</IRLevel>
<Options>
<ArchVariant>NVVM_ARCH_ADA_8_9</ArchVariant>
<CompileMode>NVVM_COMPILE_MODE_WHOLE_PROGRAM_ABI</CompileMode>
<OptLevel>NVVM_OPT_LEVEL_2</OptLevel>
<DebugInfo>NVVM_DEBUG_INFO_NONE</DebugInfo>
<FastMathOptions>
<Ftz>1</Ftz>
<Fmad>1</Fmad>
<Divide>NVVM_FAST_MATH_DIVIDE_FAST_APPROX</Divide>
...
</FastMathOptions>
<MaxRRegsAllowed>255</MaxRRegsAllowed>
...
</Options>
<IsBinary>1</IsBinary>
<Module>... base64-encoded LLVM bitcode ...</Module>
</NvvmContainer>
All enum values are serialized by their full string names (e.g., NVVM_COMPILE_MODE_SEPARATE_ABI), not by numeric value. The XML format does not use delta encoding -- every field is written regardless of whether it matches the default, making XML containers significantly larger but human-readable.
Serialization Flow
The serializer (0xCDD2D0) has two modes controlled by parameter a3: binary (a3=1) and XML (a3=0).
Binary Serialization (a3=1)
1. Compute version fields (use defaults if not set):
Version = {1, 0x41}
NvvmIRVersion = {2, 0x62}
NvvmDebugVersion = {3, 2}
LlvmVersion = {20, 0}
2. Allocate 248-byte NvvmContainerHeader (zeroed)
3. Allocate 440-byte default Options struct
4. Allocate two growable arrays:
scalar_tags[] -- int32 entries for tag/value pairs
blob_data[] -- byte entries for blob payloads
5. For each field in current Options vs. default Options:
If field differs:
Scalar → sub_CD17A0(scalar_tags, tag_id, value)
Blob → sub_CD1AB0(blob_data, scalar_tags, tag_id, ptr, size)
6. Optional IR compression:
If a4 flag set:
Compress LLVM bitcode via sub_C8D290
Compute CRC via sub_CCD2B0 → store as tag 99
Compress via sub_1688730(codec, data, size)
7. Append terminator: tag 0 to scalar_tags
8. Write 24-byte header (with computed ScalarFieldsEnd, BlobDataEnd)
9. Write scalar_tags array
10. Write blob_data array
11. Write compressed or raw IR payload
Deserialization (0xCD1D80)
1. Verify magic == 0x7F4E5C7D
2. Allocate 248-byte NvvmContainerHeader
3. Allocate 440-byte Options struct with defaults
4. Store Options pointer at container header offset 208
5. Compute tag_ptr = buffer + header_size (from offset 0x0E)
6. Compute blob_base = buffer + scalar_fields_end (from offset 0x10)
7. Enter switch loop:
Read tag (int16), decode value (int16 or sentinel + int32)
Switch on tag (103 unique case labels):
Tags 1-39: → write scalar to Options field
Tag 99: → store compression algo ID
Tags 101-173: → write to extended target options
Tags 201-218: → resolve blob offset, copy blob data
Tags 301-309: → write to extended int32 fields
Tags 351-353: → copy 8-byte blob to extended fields
Tags 401-402: → conditionally parse structured blob
Tag 0 → exit loop
8. If tag 99 present: decompress IR payload
9. Return container pointer
Annotated Hex Dump
A minimal container targeting SM 89 (Ada Lovelace) with default options (only SmMajor and SmMinor differ from defaults):
Offset Hex Decoded
------ ----------------------------------------- ---------------------------------
0x0000 7D 5C 4E 7F Magic: 0x7F4E5C7D
0x0004 01 41 Version: 1.65
0x0006 02 62 NvvmIRVersion: 2.98
0x0008 03 02 NvvmDebugVersion: 3.2
0x000A 14 00 LlvmVersion: 20.0
0x000C 00 00 IRLevel: 0 (UNIFIED_AFTER_DCI)
0x000E 18 00 HeaderSize: 24
0x0010 2C 00 00 00 ScalarFieldsEnd: 44
0x0014 2C 00 00 00 BlobDataEnd: 44 (no blobs)
--- Scalar tag/value region ---
0x0018 01 00 08 00 Tag 1 (SmMajor) = 8
0x001C 02 00 09 00 Tag 2 (SmMinor) = 9
0x0020 0D 00 01 00 Tag 13 (Ftz) = 1
0x0024 0F 00 01 00 Tag 15 (Fmad) = 1
0x0028 00 00 Terminator (tag 0)
0x002A 00 00 Padding to alignment
--- Blob data region ---
(empty -- ScalarFieldsEnd == BlobDataEnd)
--- IR payload follows at offset 0x002C ---
0x002C DE C0 17 0B ... LLVM bitcode (0xDEC0170B magic)
This example shows the efficiency of delta encoding: only 4 tag/value pairs (16 bytes of tags) plus the 24-byte header produce a fully-specified container. All other fields (CompileMode, OptLevel, DebugInfo, all target options) inherit their defaults during deserialization.
A container with a 32-bit value would look like:
0x00XX 13 00 FF FF 00 04 00 00 Tag 19 (MaxRRegsAllowed) = 1024
(0xFFFF sentinel, then 0x0400 LE)
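The annotated dump above can be reconstructed and re-parsed byte for byte. The following sketch builds the minimal SM 89 container and walks its scalar tag region, including the 0xFFFF escape for 32-bit values; `read_scalar_tags` is illustrative, not a recovered symbol.

```python
import struct

# Rebuild the minimal SM 89 container from the annotated hex dump above.
header = struct.pack("<I", 0x7F4E5C7D)           # magic
header += bytes([1, 0x41, 2, 0x62, 3, 2])        # Version, NvvmIR, NvvmDebug
header += bytes([20, 0])                         # LlvmVersion 20.0
header += struct.pack("<HHII", 0, 24, 44, 44)    # IRLevel, HeaderSize,
assert len(header) == 24                         # ScalarFieldsEnd, BlobDataEnd

tags = struct.pack("<8H", 1, 8, 2, 9, 13, 1, 15, 1)  # SmMajor=8, SmMinor=9,
tags += struct.pack("<HH", 0, 0)                     # Ftz=1, Fmad=1, terminator+pad
container = header + tags

def read_scalar_tags(buf: bytes) -> dict:
    """Walk the tag/value region: int16 tag, int16 value; a 0xFFFF value is
    the sentinel promoting the payload to a trailing int32."""
    hdr_size, = struct.unpack_from("<H", buf, 0x0E)
    pos, out = hdr_size, {}
    while True:
        tag, val = struct.unpack_from("<HH", buf, pos)
        pos += 4
        if tag == 0:                             # terminator
            return out
        if val == 0xFFFF:                        # 32-bit escape
            val, = struct.unpack_from("<I", buf, pos)
            pos += 4
        out[tag] = val

assert read_scalar_tags(container) == {1: 8, 2: 9, 13: 1, 15: 1}

# Wide-value path, matching the MaxRRegsAllowed example above:
wide = header + struct.pack("<HH", 19, 0xFFFF) + struct.pack("<I", 1024) \
       + struct.pack("<HH", 0, 0)
assert read_scalar_tags(wide)[19] == 1024
```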
Pipeline Integration
The container serves as the inter-stage transport format within the cicc compilation pipeline. Two entry paths exist:
| Path | Entry Function | Address | Pipeline |
|---|---|---|---|
| Path A (LibNVVM) | nvvmCompileProgram dispatcher | 0x9047E0 | 3-phase: LNK -> OPT -> LLC |
| Path B (standalone) | cicc_main orchestrator | 0x12642A0 | 4-stage: LNK -> OPT -> OPTIXIR -> LLC |
Both paths deserialize the container at phase 1, then translate Options into per-stage compiler flags:
- `SmMajor`/`SmMinor` from tags 1--2 become `-mcpu=sm_XX`
- `FastMath.Ftz` from tag 13 becomes `-nvptx-f32ftz`
- `FastMath.Fmad` from tag 15 becomes the IEEE mode flag
- `OptLevel` becomes `-nvptx-opt-level=N`
- `CompileMode == 2` (SEPARATE_ABI) adds `--device-c`
- `IRLevel == 1` (LTO) enters the LTO pipeline with partially-optimized bitcode
- `IRLevel == 2` (OPTIX) activates the OptiX IR stage (bit 6 of pipeline bitmask) and disables LICM and IP-MSP
The container format is the single source of truth for all compilation parameters. When cicc is invoked by nvcc, the driver serializes its accumulated flags into a container, passes the container as input, and cicc deserializes it back into compiler options. This round-trip through binary serialization ensures that all pipeline stages see exactly the same configuration, eliminating the flag-parsing divergence that would otherwise arise from each stage having its own CLI parser.
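The Options-to-flags translation can be sketched as a pure function over the deserialized fields. This is illustrative only: keys use the XML field names because the text does not give tag IDs for every option, and the function itself is not a recovered symbol.

```python
# Illustrative sketch of the per-stage flag translation listed above.
# Flag spellings follow the text; the function is not a recovered symbol.

def options_to_flags(opts: dict) -> list:
    flags = [f"-mcpu=sm_{opts['SmMajor']}{opts['SmMinor']}"]
    if opts.get("Ftz"):                      # FastMath.Ftz, tag 13
        flags.append("-nvptx-f32ftz")
    flags.append(f"-nvptx-opt-level={opts.get('OptLevel', 2)}")
    if opts.get("CompileMode") == 2:         # SEPARATE_ABI
        flags.append("--device-c")
    return flags

assert options_to_flags({"SmMajor": 8, "SmMinor": 9, "Ftz": 1}) == \
    ["-mcpu=sm_89", "-nvptx-f32ftz", "-nvptx-opt-level=2"]
```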
YAML Serialization Framework
The XML/YAML path uses a generic serialization framework built on a bundled YAML parser/emitter library (Cluster A: 0xCB0000--0xCBFA60). The library provides:
| Function | Address | Role |
|---|---|---|
yaml_parser_main | 0xCB9640 | Top-level YAML parser (25,873 bytes) |
yaml_emitter_main_loop | 0xCBDA10 | Main YAML emitter loop (23,583 bytes) |
yaml_scanner_scan_tokens | 0xCB7E40 | Token scanner (17,924 bytes) |
yaml_parser_parse_flow | 0xCB8C00 | Flow-style parsing (15,188 bytes) |
yaml_parser_load_document | 0xCBA570 | Document loader/resolver (9,695 bytes) |
The serialization framework uses virtual dispatch: each serializable type registers a serialize/deserialize function pair, and the framework dispatches based on the YAML node type (scalar=1, sequence, mapping). All enum values are serialized by their full string names (NVVM_COMPILE_MODE_SEPARATE_ABI, NVVM_ARCH_ADA_8_9, etc.), not by numeric value.
Finalizer Knobs Integration
The container Options struct also feeds into the NVIDIA finalizer knobs system through NvvmOptions_parse_finalizer_knobs (0xCD9990, 31,702 bytes -- the 7th largest function in the binary). This parser ingests the complete set of NVIDIA-specific backend configuration knobs:
- Shader pipeline controls: `PromoteHalf`, `PromoteFixed`, `USePIXBAR`, `VSIsVREnabled`, `VSIsLastVTGStage`
- Codegen controls: `DisablePredication`, `DisableXBlockSched`, `EnableJumpTable`, `ScheduleKils`
- Memory controls: `DoMMACoalescing`, `AssumeConvertMemoryToRegProfitable`
- Barrier controls: `DisableERRBARAfterMEMBAR`, `GenConvBranchForWarpSync`
- PGO controls: `PGOEpoch`, `PGOBatchSize`, `PGOCounterMemBaseVAIndex`
- Per-CTA controls: `CTASizeX`, `CTASizeY`, `CTASizeZ`, `SharedMemorySize`, `SMemScratchBase`
- Register controls: `MaxActiveWarpsPerSM`, `NumReservedUReg`, `NumScratchURegs`
These knobs are distinct from the NVVMPassOptions system (see NVVMPassOptions) -- the finalizer knobs configure the backend code generator, while NVVMPassOptions configure the optimization pipeline.
Tag Summary Statistics
| Range | Count | Description |
|---|---|---|
| 1--39 | 38 | Core scalar options (SM version, fast-math, unroll, flags) |
| 99 | 1 | Compression metadata |
| 101--173 | 73 | Extended target options (hardware capabilities, memory config) |
| 201--218 | 18 | Blob data (memory windows, resource tables, strings) |
| 301--309 | 9 | Extended int32 fields (cluster config, extended options) |
| 351--353 | 3 | Extended int64 blob references |
| 401--402 | 2 | Structured conditional blobs (TMA / TCGen05) |
| Total | 144 | Distinct tag IDs across 6 ranges |
The deserializer switch statement has 103 unique case labels -- the remaining 41 tags share code paths with other tags (e.g., all single-bit tags in a byte share a case that reads the bit position from a secondary table).
Cross-References
- NVVMPassOptions -- 222-slot optimization pipeline configuration
- Pipeline Entry -- LibNVVM API and CLI entry points
- OptiX IR -- IRLevel=2 OptiX pipeline
- LTO Pipeline -- IRLevel=1 link-time optimization
- SM 90 Hopper -- TMA descriptor usage (tag 401)
- SM 100 Blackwell -- TCGen05 config usage (tag 402)
- Bitcode I/O -- LLVM bitcode reader/writer wrapping the IR payload
- nvcc Interface -- Driver-to-cicc container passing
NVPTX Target Infrastructure
The NVPTXTargetMachine, NVPTXSubtarget, and NVPTXTargetTransformInfo form the target description layer that the entire LLVM backend consults for every decision from type legality through instruction cost to vectorization factor selection. In upstream LLVM, these are three separate source files totaling roughly 1,500 lines; in cicc v13.0 they are spread across the 0xDF0000-0xE00000 address range (TTI hooks), the 0x330-0x35B range (NVPTXTargetLowering), the type legalization tables embedded in NVPTXSubtarget, and the pipeline assembler at 0x12EA000-0x12F0000 (TargetMachine construction). The NVIDIA delta relative to upstream is moderate -- the TTI hooks return GPU-specific constants rather than CPU ones, the SubtargetFeatures carry NVIDIA-proprietary math precision flags, and the TargetMachine creation path has a dual-path design that handles both the cicc standalone pipeline and the LibNVVM API pipeline.
Key Facts
| Property | Value |
|---|---|
| SM processor table | qword_502A920 (45 entries, stride-2, ctor_605 at 0x584510) |
| Target lookup | sub_12EA530 (4KB, calls sub_16D3AC0 = TargetRegistry::lookupTarget) |
| TargetMachine creation | sub_12F4060 (16KB, NVIDIA options) / sub_12E54A0 (50KB, pipeline path) |
| TTI wrapper pass | sub_1BFB520 (208-byte alloc, wraps sub_1BFB9A0) |
| Register bit width (Vector) | sub_DFE640 -- returns 32 (fixed) |
| Scalable vectors | sub_DFE610 -- returns false |
| Max interleave factor | sub_DFB120 (at TTI+448), sub_DFB730 (vectorized variant) |
| SubtargetFeatures | Offsets +2498, +2584, +2843, +2870, +2871 |
| Target triples | nvptx64-nvidia-cuda, nvptx-nvidia-cuda, nvsass-nvidia-* (6 total) |
NVPTXTargetMachine
Dual-Path Target Initialization
cicc constructs the TargetMachine through two independent code paths depending on whether compilation enters through the standalone cicc CLI or through the LibNVVM API. Both converge on TargetRegistry::lookupTarget (sub_16D3AC0) but assemble the target triple, feature string, and TargetOptions differently.
Path 1 -- cicc standalone (sub_12F7D90 -> sub_12F4060):
sub_12F7D90 — CLI parser:
parse "-arch=compute_XX" → SM version (multiplied by 10)
parse "-opt=N" → optimization level
parse "-ftz=N" → flush-to-zero mode
parse "-fma=N" → FMA contraction level
parse "-prec-div=N" → float division precision
parse "-prec-sqrt=N" → sqrt precision
parse "--device-c" → device compilation flag
sub_12F4060 — TargetMachine creation (16KB):
triple = (pointerWidth == 64) ? "nvptx64" : "nvptx"
features = ""
if (sharedmem32bit):
features += "+sharedmem32bitptr"
features += ",+fma-level=N,+prec-divf32=N,+prec-sqrtf32=N"
opts = TargetOptions {
flags: 0,
reloc: PIC (1),
codeModel: 8,
optLevel: from_cli,
threadModel: 1
}
TM = TargetRegistry::lookupTarget(triple, cpu_string)
if (!TM):
error "Error: Cannot specify multiple -llcO#\n"
return TM->createTargetMachine(triple, cpu, features, opts)
Path 2 -- pipeline assembler (sub_12E54A0):
The master pipeline assembly function (50KB, called from both Phase I and Phase II) constructs the target independently:
sub_12E54A0:
ptrSize = Module::getDataLayout().getPointerSizeInBits(0)
if (ptrSize == 64):
triple = "nvptx64" // 7 chars
else:
triple = "nvptx" // 5 chars
target = sub_16D3AC0(&triple, &cpu_string) // TargetRegistry::lookupTarget
if (!target):
error "Failed to locate nvptx target\n" // sub_1C3EFD0
// TargetOptions setup:
opts[0] = 0 // no flags
opts[1] = 1 // PIC relocation
opts[2] = 8 // code model
opts[3] = 1 // opt level indicator
opts[4] = 1 // thread model
opts[5] = 0 // reserved
sub_167F890(subtargetInfo) // initialize SubtargetInfo
TLI = sub_14A04B0(targetLibInfo, moduleName) // TargetLibraryInfo
sub_149CBC0(TLI) // finalize TLI
TTI = sub_1BFB9A0(DataLayout, a2, a3, v269) // TargetTransformInfo
optLevel = read qword_4FBB430 // cl::opt<int> value
PassManagerBuilder = sub_1611EE0(PM)
The pipeline assembler path also checks for an extension hook: if the target has a createExtendedTargetMachine vtable entry at offset +88, it calls that instead, enabling custom target backends. The returned TargetMachine pointer feeds into the 150+ pass registrations that follow.
TargetOptions
The TargetOptions struct passed to both paths uses LLVM's standard layout. The key NVIDIA-specific values:
| Field | Value | Meaning |
|---|---|---|
| Relocation model | 1 (PIC) | Position-independent code, always |
| Code model | 8 | Large code model (matches PTX's flat addressing) |
| Thread model | 1 | POSIX-style threading assumed |
| Optimization level | From CLI | Stored in qword_4FBB430, default from qword_4FBB430[2] |
NVIDIA-Specific Target Features
The feature string passed to createTargetMachine encodes math precision and shared memory configuration as subtarget features. These are not upstream LLVM features -- they are NVIDIA extensions:
| Feature | CLI Source | Subtarget Effect |
|---|---|---|
+sharedmem32bitptr | nvptx-short-ptr / nvptx-32-bit-smem | Enables 32-bit pointers for address space 3 (shared memory); adds p3:32:32:32 to data layout |
+fma-level=N | -fma=N | 0=off, 1=on, 2=aggressive FMA contraction |
+prec-divf32=N | -prec-div=N | 0=approx, 1=full, 2=IEEE+ftz, 3=IEEE compliant |
+prec-sqrtf32=N | -prec-sqrt=N | 0=approx (rsqrt.approx), 1=rn (sqrt.rn) |
Registered in ctor_607 (0x584B60, 14KB):
| Knob | Type | Default | Description |
|---|---|---|---|
nvptx-sched4reg | bool | -- | Schedule for register pressure |
nvptx-fma-level | int | -- | FMA contraction level |
nvptx-prec-divf32 | int | -- | F32 division precision |
nvptx-prec-sqrtf32 | int | -- | Sqrt precision |
nvptx-approx-log2f32 | bool | -- | Use lg2.approx for log2 |
nvptx-force-min-byval-param-align | bool | -- | Force 4-byte byval alignment |
nvptx-normalize-select | bool | -- | Override shouldNormalizeToSelectSequence |
enable-bfi64 | bool | -- | Enable 64-bit BFI instructions |
NVPTXSubtarget Feature Flags
The NVPTXSubtarget object carries the type legalization tables and architecture-specific feature flags that the SelectionDAG, register allocator, and type legalizer consult at every step. These are populated during target construction and indexed by the SM processor table.
Feature Flag Offsets
| Offset | Size | Purpose | Stride |
|---|---|---|---|
| +120 | ptr | Register class array (8-byte stride entries) | -- |
| +2498 | 259 | Type legality flags (indexed per MVT) | 259 bytes per type action |
| +2584 | 259 | Float legality flags (indexed per MVT) | 259 bytes per type action |
| +2843 | 1 | Integer type support flag | -- |
| +2870 | 1 | Branch distance flag | -- |
| +2871 | 1 | Jump table eligibility flag | -- |
The type legality arrays at +2498 and +2584 are the backbone of SelectionDAG's getTypeAction() and isTypeLegal() queries. Each entry covers one MVT (Machine Value Type) and stores the action: Legal, Promote, Expand, Scalarize, or SplitVector. For NVPTX, i32 and f32 are always Legal; i64 and f64 are Legal on all supported SM versions but with expanded arithmetic costs; vectors wider than 128 bits are always Split or Scalarized.
The function sub_201BB90 reads these offsets during type legalization to determine expansion strategy. The branch distance flags at +2870/+2871 control sub_20650A0, which decides jump table eligibility beyond the standard no-jump-tables flag.
Initialization Flow
The SubtargetFeatures initialization follows this path:
1. `ctor_605` (0x584510, 2.6KB) populates `qword_502A920` with the 45-entry SM processor table at static init time.
2. `sub_167F890` initializes the SubtargetInfo during pipeline setup.
3. `sub_982C80` initializes the 224-byte NVPTX feature flag table based on SM version and OS/ABI info.
4. `sub_97DEE0` performs initial population of the feature bitfield.
5. `sub_982B20` applies SM-version-specific refinements from the global table at `qword_4F7FCC8`.
The 224-byte feature table (sub_982C80) initializes bytes 0-127 to all-1s (0xFF), then selectively clears bits based on the target configuration. This "default-enabled, selectively-disabled" pattern means that features are assumed present unless explicitly turned off for a given target.
NVPTXTargetTransformInfo Hook Table
The TTI is the interface through which all LLVM optimization passes query target-specific costs and capabilities. For NVPTX, every hook returns a value calibrated for a scalar-register GPU architecture rather than a SIMD-register CPU.
| TTI Hook | Address | Return Value | Upstream Equivalent |
|---|---|---|---|
getRegisterBitWidth(Vector) | sub_DFE640 | TypeSize::getFixed(32) | AVX2 returns 256, AVX-512 returns 512 |
supportsScalableVectors() | sub_DFE610 | false | AArch64 SVE returns true |
getMaxInterleaveFactor() | sub_DFB120 | Register-pressure-bounded | CPU returns 2-4 based on uarch |
getMaxInterleaveFactor(vectorized) | sub_DFB730 | Separate limit for vectorized loops | -- |
getRegisterBitWidth(Scalar) | sub_DFB1B0 | 32 | Matches PTX 32-bit register file |
getInstructionCost() | sub_20E14F0 (32KB) | Per-opcode latency from sched model | -- |
hasAttribute(30) | sub_B2D610 | Checks noimplicitfloat | Standard LLVM |
hasAttribute(47) | sub_B2D610 | Checks alwaysvectorize | Standard LLVM |
hasAttribute(18) | sub_B2D610 | Checks optnone | Standard LLVM |
Impact on Loop Vectorization
The 32-bit register width return from sub_DFE640 is the single most consequential TTI hook for GPU compilation. The standard LLVM VF formula is:
VF = registerBitWidth / elementBitWidth
With registerBitWidth = 32:
- `float` (32-bit): VF = 1 -- no vectorization from the register-width formula alone
- `half` (16-bit): VF = 2
- `i8` (8-bit): VF = 4
This means that profitable vectorization of 32-bit types (the dominant case in CUDA) must come entirely from the cost model determining that ld.v2.f32 or ld.v4.f32 is cheaper than multiple scalar loads, not from the register-width heuristic. The LoopVectorize pass (sub_2AF1970) has an explicit override: when the VF formula produces VF <= 1 and the byte_500D208 knob is set, it forces VF = 4 for outer loops.
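The VF formula and the outer-loop override can be written out in a few lines. This is a sketch of the described behavior under the stated knob, not the recovered LoopVectorize code; the parameter names are illustrative.

```python
# The register-width VF formula described above, plus the byte_500D208
# outer-loop override. A behavioral sketch, not recovered code.

def vf_from_register_width(register_bits: int, element_bits: int,
                           force_outer_vf: bool = False) -> int:
    vf = max(register_bits // element_bits, 1)
    if vf <= 1 and force_outer_vf:      # byte_500D208 knob: force VF = 4
        return 4
    return vf

assert vf_from_register_width(32, 32) == 1   # float: no vectorization
assert vf_from_register_width(32, 16) == 2   # half
assert vf_from_register_width(32, 8) == 4    # i8
assert vf_from_register_width(32, 32, force_outer_vf=True) == 4
```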
Impact on SLP Vectorization
The SLP vectorizer (sub_2BD1C50) receives the target vector register width as parameter a3 and uses it to determine maximum bundle width. With 32 bits, SLP bundles are limited to:
- 2x i16 (32 bits total)
- 4x i8 (32 bits total)
- 1x i32 or f32 (degenerate -- no SLP benefit)
In practice, the SLP vectorizer's profitability model can override this limit when paired loads/stores demonstrate memory coalescing benefit, but the register width serves as the initial upper bound.
Impact on Interleave Count
The getMaxInterleaveFactor hook (sub_DFB120, queried at TTI+448) caps the interleave count (IC) for loop unroll-and-jam. The interleave selection algorithm in sub_2AED330 reads this value and combines it with scheduling info at TTI+56:
maxIC = TTI.getMaxInterleaveFactor(VF)
issueWidth = *(TTI + 56 + 32) // scheduling model: issue width
latency = *(TTI + 56 + 36) // scheduling model: latency
IC = IC / max(issueWidth, latency) // cap by pipeline throughput
This models the SM's instruction issue pipeline: even if register pressure allows IC=8, the warp scheduler may saturate at lower IC values, making additional interleaving waste register budget without throughput gain.
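The interleave cap above can be sketched as follows; the integer division and the lower clamp to 1 are assumptions about details the decompilation does not spell out.

```python
# Sketch of the interleave cap in sub_2AED330 as described above. The
# floor division and clamp-to-1 are assumptions, not recovered behavior.

def cap_interleave_count(ic: int, max_ic: int,
                         issue_width: int, latency: int) -> int:
    ic = min(ic, max_ic)                          # bound by getMaxInterleaveFactor
    return max(ic // max(issue_width, latency), 1)  # cap by pipeline throughput

assert cap_interleave_count(8, 8, 2, 4) == 2   # latency-bound: 8 / 4
assert cap_interleave_count(8, 4, 1, 1) == 4   # register-pressure-bound
```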
Arithmetic Cost for i64
NVPTX GPUs have 32-bit ALUs. All 64-bit integer arithmetic is emulated through pairs of 32-bit operations with carry propagation. The TTI getArithmeticInstrCost hook reflects this by returning approximately 2x the base cost for i64 operations:
| Operation | i32 Cost | i64 Cost | Ratio |
|---|---|---|---|
| ADD/SUB | 1 | 2 | 2x (add.cc + addc) |
| MUL | 1 | ~4 | 4x (mul.lo + mul.hi + add chain) |
| DIV/REM | high | very high | Library call on both |
| Shift | 1 | 2-3 | funnel shift pair |
This cost differential causes LLVM optimization passes (InstCombine, SCEV-based transformations, IV widening) to prefer i32 operations, which NVIDIA's custom IV Demotion pass (sub_18B1DE0) further exploits by narrowing 64-bit induction variables to 32-bit where the trip count permits.
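The 2x ADD cost models the carry-chain emulation (`add.cc` + `addc`) directly. The following sketch mimics that two-instruction sequence with 32-bit halves to show why one 64-bit add costs two 32-bit operations.

```python
# Why 64-bit ADD costs 2x: a model of the add.cc/addc carry-chain emulation
# the cost table reflects, computed on 32-bit halves.

MASK32 = 0xFFFFFFFF

def add64_via_32(a: int, b: int) -> int:
    lo = (a & MASK32) + (b & MASK32)                 # add.cc: low halves, set carry
    carry = lo >> 32
    hi = ((a >> 32) + (b >> 32) + carry) & MASK32    # addc: high halves + carry-in
    return (hi << 32) | (lo & MASK32)

assert add64_via_32(0xFFFFFFFF, 1) == 0x1_00000000   # carry propagates across halves
assert add64_via_32(2**63, 2**63) == 0               # wraps mod 2^64
```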
SM Processor Table
The processor table at qword_502A920 is a flat array of 90 qwords (45 SM variants x 2 fields each) with stride-2 layout: even indices hold the SM name string pointer, odd indices hold the PTX version code.
Populated by ctor_605 at 0x584510 (2.6KB), called during static initialization before main. The table is read-only after construction.
qword_502A920[2*i + 0] = const char* sm_name // e.g., "sm_100"
qword_502A920[2*i + 1] = uint64_t ptx_version // 5, 6, or 7
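A lookup over the stride-2 layout can be modeled with a flat list. The entries below are a sample subset chosen to match the version codes documented in the next table; the full 45-entry contents are not reproduced here.

```python
# Model of the stride-2 layout of qword_502A920: even slots hold the SM name,
# odd slots the PTX version code. Entries here are an illustrative sample.

SM_TABLE = ["sm_75", 5, "sm_90", 5, "sm_90a", 6, "sm_100", 6, "sm_100a", 7]

def ptx_version_for(sm: str):
    for i in range(0, len(SM_TABLE), 2):   # walk name slots (even indices)
        if SM_TABLE[i] == sm:
            return SM_TABLE[i + 1]         # paired version code (odd index)
    return None

assert ptx_version_for("sm_75") == 5
assert ptx_version_for("sm_90a") == 6      # only pre-Blackwell SM at version 6
assert ptx_version_for("sm_100a") == 7
```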
PTX Version Codes
| Code | Meaning | SM Range |
|---|---|---|
| 5 | Legacy PTX | sm_20 through sm_90 (all base variants) |
| 6 | Modern PTX | sm_90a, sm_100-sm_121 (base variants only) |
| 7 | Extended PTX | sm_100a/f through sm_121a/f (accelerated/forward-compatible) |
Notable observations:
- `sm_90a` is the only pre-Blackwell SM with PTX version 6.
- The `f` (forward-compatible) suffix uses the same PTX version as `a` (accelerated).
- No entries exist for sm_84, sm_85 (Ada Lovelace numbering gap).
- `sm_73` (Volta sub-variant) and `sm_88` (Ada sub-variant) are present but not publicly documented.
- The table contains 15 legacy architectures (sm_20 through sm_75) that are no longer accessible through the CLI mapping but remain in the backend's processor table.
Data Layout String
The NVPTX data layout string follows LLVM's standard format with three variants selected based on pointer width and shared memory pointer mode:
64-bit with shared memory specialization (most common)
e-p:64:64:64-p3:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-
i128:128:128-f16:16:16-f32:32:32-f64:64:64-v16:16:16-v32:32:32-n16:32:64
64-bit without shared memory specialization
e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-
i128:128:128-f16:16:16-f32:32:32-f64:64:64-v16:16:16-v32:32:32-n16:32:64
32-bit mode
e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-
i128:128:128-f16:16:16-f32:32:32-f64:64:64-v16:16:16-v32:32:32-n16:32:64
Key fields
| Field | Meaning | NVIDIA Note |
|---|---|---|
e | Little-endian | All NVIDIA GPUs |
p:64:64:64 | Generic pointers: 64-bit, 64-bit aligned | Default for 64-bit compilation |
p3:32:32:32 | Address space 3 (shared memory): 32-bit pointers | Controlled by nvptx-short-ptr / nvptx-32-bit-smem / unk_4D0461C |
n16:32:64 | Native integer widths: 16, 32, 64 | Tells LLVM that i16/i32/i64 are all hardware-supported |
v16:16:16 / v32:32:32 | Vector alignment: natural | 16-bit and 32-bit vectors aligned to their width |
The p3:32:32:32 entry is the NVIDIA delta: shared memory lives in a 48KB-228KB on-chip SRAM per SM, addressable with 32-bit pointers even in 64-bit mode. Using 32-bit pointers for shared memory saves register pressure and instruction count for every shared memory access.
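The per-address-space pointer widths can be read straight out of the layout string. This sketch extracts only the `p`/`pN` specifications to show the shared memory (address space 3) delta; it is not how LLVM's own DataLayout parser is structured.

```python
# Extract per-address-space pointer widths from an NVPTX data layout string,
# highlighting the p3:32:32:32 shared-memory delta. Illustrative parser only.

def pointer_widths(layout: str) -> dict:
    widths = {}
    for spec in layout.split("-"):
        if spec.startswith("p"):                 # p:<size>:... or p<as>:<size>:...
            head, size, *_ = spec.split(":")
            addrspace = int(head[1:]) if len(head) > 1 else 0
            widths[addrspace] = int(size)
    return widths

layout = ("e-p:64:64:64-p3:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-"
          "i64:64:64-f32:32:32-f64:64:64-n16:32:64")
assert pointer_widths(layout) == {0: 64, 3: 32}  # generic 64-bit, shared 32-bit
```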
A separate data layout string e-i64:64-v16:16-v32:32-n16:32:64 appears in the IR linker (sub_106AB30) as a compatibility check during module linking. This shortened form is used to validate that two modules being linked share the same NVPTX target data layout.
Data layout validation is performed at multiple points:
- `sub_2C74F70` in the NVVM verifier checks the layout string on every module
- If empty: `"Empty target data layout, must exist"`
- If invalid: prints `"Example valid data layout:"` with reference 32-bit and 64-bit strings from `off_4C5D0A0` / `off_4C5D0A8`
Target Triple Construction
The target triple is constructed at module creation time by checking the pointer width:
if (unk_4F06A68 == 8) // 64-bit data model
triple = "nvptx64-nvidia-cuda" // 19 chars
else
triple = "nvptx-nvidia-cuda" // 17 chars
Eight triples are valid in UnifiedNVVMIR mode:
| Triple | Width | Runtime |
|---|---|---|
nvptx-nvidia-cuda | 32-bit | CUDA |
nvptx64-nvidia-cuda | 64-bit | CUDA |
nvptx-nvidia-nvcl | 32-bit | OpenCL |
nvptx64-nvidia-nvcl | 64-bit | OpenCL |
nvsass-nvidia-cuda | SASS | CUDA native assembly |
nvsass-nvidia-nvcl | SASS | OpenCL native assembly |
nvsass-nvidia-directx | SASS | DirectX backend |
nvsass-nvidia-spirv | SASS | SPIR-V backend |
In non-UnifiedNVVMIR mode, validation is looser: the triple must start with nvptx- or nvptx64- and contain -cuda. The nvsass-nvidia-directx and nvsass-nvidia-spirv triples (discovered in sub_2C80C90) are notable evidence that NVIDIA's SASS-level backend supports DirectX and SPIR-V shader compilation alongside traditional CUDA/OpenCL.
Configuration Knobs
Backend Options (ctor_609_0, 0x585D30, 37KB)
| Knob | Type | Default | Description |
|---|---|---|---|
| nvptx-short-ptr | bool | -- | 32-bit pointers for const/local/shared |
| nvptx-32-bit-smem | bool | -- | 32-bit shared memory pointers |
| nvptx-enable-machine-sink | bool | -- | Enable Machine Sinking |
| enable-new-nvvm-remat | bool | true | Enable new rematerialization |
| nv-disable-remat | bool | false | Disable all remat passes |
| nv-disable-mem2reg | bool | false | Disable MI Mem2Reg pass |
| nv-disable-scev-cgp | bool | false | Disable SCEV address mode opt |
| disable-nvptx-load-store-vectorizer | bool | false | Disable load/store vectorizer |
| disable-nvptx-require-structured-cfg | bool | false | Turn off structured CFG requirement |
| nvptx-exit-on-unreachable | bool | true | Lower unreachable as exit |
| nvptx-early-byval-copy | bool | -- | Copy byval args early |
| enable-nvvm-peephole | bool | true | Enable NVVM Peephole Optimizer |
| lower-func-args | bool | true | Lower large aggregate params |
| enable-sink | bool | true | Enable Sinking |
| disable-post-opt | bool | false | Disable LLVM IR opts post-opt |
| usedessa | int | 2 | Select deSSA method |
| ldg | bool | true | Load Global Constant Transform |
| print-isel-input | bool | false | Print LLVM IR input to isel |
| no-reg-target-nvptxremat | bool | false | Only old remat without reg targets |
| disable-set-array-alignment | bool | false | Disable alignment enhancements |
| nvptx-lower-global-ctor-dtor | bool | -- | Lower GPU ctor/dtors to globals |
Register Pressure & FCA Options (ctor_074, 0x49AAB0)
| Knob | Type | Default | Description |
|---|---|---|---|
| fca-size | int | 8 | Max size of first-class aggregates (bytes) |
| reg-target-adjust | int | 0 (range -10..+10) | Register pressure target adjustment |
| pred-target-adjust | int | 0 (range -10..+10) | Predicate register target adjustment |
| remat-load-param | bool | -- | Support remating const ld.param not in NVVM IR |
| cta-reconfig-aware-rpa | bool | -- | CTA reconfiguration-aware register pressure analysis |
Extension Options (ctor_610, 0x5888A0)
| Knob | Type | Default | Description |
|---|---|---|---|
| unroll-assumed-size | int | 4 | Assumed size for unknown local array types |
| enable-loop-peeling | bool | -- | Enable loop peeling |
| enable-256-bit-load-store | bool | -- | Enable 256-bit vector loads/stores |
| ias-param-always-point-to-global | bool | -- | Parameters always point to global memory |
| ias-strong-global-assumptions | bool | -- | Strong global memory assumptions |
| ias-wmma-memory-space-opt | bool | -- | Memory Space Optimization for WMMA |
TTI Cost Model Options (ctor_061, 0x494D20)
| Knob | Type | Default | Description |
|---|---|---|---|
| costmodel-reduxcost | bool | -- | Recognize reduction patterns |
| cache-line-size | int | -- | Cache line size for cost model |
| min-page-size | int | -- | Minimum page size |
| predictable-branch-threshold | float | -- | Threshold for predictable branch cost |
Differences from Upstream LLVM
- Dual-path TargetMachine construction. Upstream LLVM has a single target creation path through LLVMTargetMachine::createPassConfig. NVIDIA has two independent paths (CLI and pipeline assembler) that converge at TargetRegistry::lookupTarget.
- NVIDIA-proprietary target features. The +sharedmem32bitptr, +fma-level=N, +prec-divf32=N, and +prec-sqrtf32=N features do not exist in upstream NVPTX, which uses +ptx75 / +sm_90 style features. NVIDIA's math precision features are passed through the target feature string to avoid adding a new cl::opt for each.
- 224-byte feature table. The sub_982C80 feature table, with its "default all-1s then selectively clear" initialization pattern, is unique to cicc. Upstream NVPTXSubtarget uses a much simpler feature set derived from +sm_XX and +ptx_YY features.
- Scheduling info at TTI+56. The issue-width and latency values stored in the TTI sub-structure at offset +56 are used by the interleave count selection algorithm. Upstream LLVM's NVPTX backend does not populate these scheduling parameters; it relies on the default "no scheduling model" behavior.
- Extension hook at vtable+88. The pipeline assembler checks for a createExtendedTargetMachine entry, enabling loadable target backend extensions. This is not present in upstream LLVM.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| NVPTX Target Lookup and Creation | sub_12EA530 | 4 KB | -- |
| TargetMachine Creation with NVIDIA Options | sub_12F4060 | 16 KB | -- |
| Master Pipeline Assembly (includes TM setup) | sub_12E54A0 | 50 KB | -- |
| CICC CLI Argument Parser | sub_12F7D90 | 14 KB | -- |
| TargetRegistry::lookupTarget() | sub_16D3AC0 | -- | -- |
| SubtargetInfo initialization | sub_167F890 | -- | -- |
| TTIWrapperPass allocation (208 bytes) | sub_1BFB520 | -- | -- |
| TargetTransformInfo / DataLayout creation | sub_1BFB9A0 | -- | -- |
| TargetLibraryInfo creation | sub_14A04B0 | -- | -- |
| TargetLibraryInfo finalization | sub_149CBC0 | -- | -- |
| TTI::getRegisterBitWidth(Vector) -- returns 32 | sub_DFE640 | -- | -- |
| TTI::supportsScalableVectors() -- returns false | sub_DFE610 | -- | -- |
| TTI::getMaxInterleaveFactor() (at TTI+448) | sub_DFB120 | -- | -- |
| TTI::getMaxInterleaveFactor(vectorized) | sub_DFB730 | -- | -- |
| TTI::getRegisterBitWidth(Scalar) or cache-line query | sub_DFB1B0 | -- | -- |
| TTI::getInstructionCost() / scheduling cost model | sub_20E14F0 | 33 KB | -- |
| TTI::hasAttribute(N) -- function attribute query | sub_B2D610 | -- | -- |
| TTI::getInstructionCost() (IR-level variant) | sub_B91420 | -- | -- |
| NVPTX feature flag table initializer (224 bytes) | sub_982C80 | -- | -- |
| Feature bitfield initial population | sub_97DEE0 | -- | -- |
| SM-version-specific feature refinements | sub_982B20 | -- | -- |
| SubtargetFeature reads at +2843, +2584, +2498 | sub_201BB90 | -- | -- |
| Branch distance / jump table checks at +2870, +2871 | sub_20650A0 | -- | -- |
| EDG SM architecture feature gating (38KB, ~60 flags) | sub_60E7C0 | -- | -- |
| Module initialization with triple and data layout | sub_908850 | -- | -- |
| SM processor table population (0x584510, 2.6KB) | ctor_605 | -- | -- |
| NVPTX backend math options (0x584B60, 14KB) | ctor_607 | -- | -- |
| NVPTX backend options (0x585D30, 37KB) | ctor_609_0 | -- | -- |
Cross-References
- GPU Target Architecture -- Full SM table, architecture gating thresholds, NVVM container arch enum
- LoopVectorize & VPlan -- TTI hook usage in VF selection and interleave count
- SLP Vectorizer -- TTI register width as SLP bundle width limit
- SelectionDAG -- NVPTXTargetLowering, type legality from SubtargetFeatures
- Memory Space Optimization -- Address space numbering convention
- IV Demotion -- Exploits i64 cost differential reported by TTI
- Register Allocation -- Register pressure budgets bounded by TTI
- Instruction Scheduling -- Scheduling model data at TTI+56
- CLI Flags -- -arch, -ftz, -fma, -prec-div, -prec-sqrt routing
- Optimization Levels -- qword_4FBB430 optimization level storage
- Pipeline & Ordering -- Where TTI is registered in the pass pipeline
Alias Analysis & NVVM AA
cicc ships a custom alias analysis pass (NVVM AA, registered as nvptx-aa) that exploits GPU address space disjointness to prove pointer pairs cannot alias. On a GPU, each hardware memory partition -- global DRAM, shared scratchpad, local stack, constant cache, kernel parameter window -- occupies a physically separate address range. Pointers into different address spaces can never reference the same byte, a property that does not hold on any mainstream CPU ISA. NVVM AA encodes this hardware invariant into the LLVM AA pipeline, returning NoAlias for any cross-address-space pointer pair. This single fact unlocks aggressive dead-store elimination, load-store motion, GVN load forwarding, and MemorySSA precision that would be impossible on a flat-memory machine. The pass is stateless, trivially cheap, and runs first in the AA chain so that more expensive analyses (BasicAA, TBAA) can skip pairs that NVVM AA already resolved.
Beyond pure address-space disjointness, cicc augments the standard LLVM AA infrastructure in three further ways: (1) a process-restrict pass that propagates noalias attributes from __restrict__ kernel parameters, (2) !noalias.addrspace metadata (metadata kind 42) that tags pointers with the set of address spaces they provably do not alias with, and (3) NVIDIA-specific knobs controlling traversal depth, TBAA strictness, and fence relaxation.
Key Facts
| Property | Value |
|---|---|
| Pass name (legacy PM) | nvptx-aa |
| Pass name (new PM) | Registered via NVPTXTargetMachine::registerEarlyDefaultAliasAnalyses |
| Legacy wrapper | NVPTXAAWrapperPass (ImmutablePass, char ID) |
| External wrapper | NVPTXExternalAAWrapper (hooks into ExternalAAWrapperPass, RunEarly=true) |
| Result class | NVPTXAAResult : AAResultBase |
| State | Stateless -- invalidate() always returns false |
| AA chain position | First (before BasicAA) |
| Address traversal depth | Controlled by nvptx-traverse-address-aliasing-limit (default 6) |
| AA evaluator pass | aa-eval at sub_13549C0 (11,038 bytes) |
| AA query entry point | sub_134CB50 -- AAResults::alias(MemoryLocation, MemoryLocation) |
| ModRef query (call, loc) | sub_134F0E0 -- AAResults::getModRefInfo(CallBase, MemoryLocation) |
| ModRef query (call, call) | sub_134F530 -- AAResults::getModRefInfo(CallBase, CallBase) |
GPU Address Space Table
NVPTX defines six logically disjoint address spaces plus a generic (flat) umbrella. See Address Spaces for the complete master table with hardware mapping, pointer widths, latency numbers, and data layout strings.
The critical property exploited by NVVM AA: any (AS_x, AS_y) pair with x != y returns NoAlias, unless either space is 0 (generic), the pair is shared/shared-cluster (AS 3 vs AS 7), or the pair is global/param (AS 1 vs AS 101), since cvta.param on SM 70+ makes param addressable as global. See the Aliasing Rules section for the complete cross-space aliasing specification and the MemorySpaceOpt Internal Bitmask section for the dataflow bitmask encoding used during address space resolution.
The NVVM AA Algorithm
The core alias function follows upstream NVPTXAliasAnalysis.cpp in structure, enhanced with cicc-specific extensions. The pseudocode:
// NVPTXAAResult::alias -- the heart of NVVM AA
AliasResult alias(const MemoryLocation &Loc1,
const MemoryLocation &Loc2,
AAQueryInfo &AAQI) {
unsigned AS1 = getAddressSpace(Loc1.Ptr, TraverseLimit);
unsigned AS2 = getAddressSpace(Loc2.Ptr, TraverseLimit);
// If either pointer is in generic (flat) space, we cannot disambiguate.
// Generic pointers can point to any physical memory at runtime.
if (AS1 == ADDRESS_SPACE_GENERIC || AS2 == ADDRESS_SPACE_GENERIC)
return AliasResult::MayAlias;
// Distributed shared memory (AS 7) overlaps with regular shared (AS 3).
if ((AS1 == 3 && AS2 == 7) || (AS1 == 7 && AS2 == 3))
return AliasResult::MayAlias;
// Same address space: cannot determine from space alone.
// Fall through to BasicAA / TBAA for further analysis.
if (AS1 == AS2)
return AliasResult::MayAlias;
// Different non-generic, non-overlapping spaces: provably disjoint.
return AliasResult::NoAlias;
}
// getAddressSpace -- walk through casts to find the underlying space.
// Traverses up to MaxLookup levels of getUnderlyingObject().
unsigned getAddressSpace(const Value *V, unsigned MaxLookup) {
while (MaxLookup-- > 0) {
unsigned AS = V->getType()->getPointerAddressSpace();
if (AS != ADDRESS_SPACE_GENERIC)
return AS;
const Value *Next = getUnderlyingObject(V, /*MaxLookup=*/1);
if (Next == V)
break; // Reached a root (alloca, argument, global)
V = Next;
}
return V->getType()->getPointerAddressSpace();
}
The getAddressSpace helper is the key difference from a naive check. A pointer may be in generic address space (AS 0) at its use site but was produced by an addrspacecast from a specific space. The traversal walks backward through getUnderlyingObject (which strips GEPs, bitcasts, PHIs) to find the original non-generic space. The depth limit (nvptx-traverse-address-aliasing-limit, default 6) prevents exponential blowup on deeply nested pointer chains.
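The traversal can be modeled in a few lines. This is an assumption-level Python sketch of the walk described above, not cicc's code: each value carries its type's address space, and one getUnderlyingObject step is simulated by an `underlying` link.

```python
# Minimal model of the getAddressSpace traversal: walk backward through
# casts until a non-generic space is found or the depth limit runs out.
GENERIC = 0

class Val:
    def __init__(self, addrspace, underlying=None):
        self.addrspace = addrspace    # address space of this value's type
        self.underlying = underlying  # one getUnderlyingObject() step back

def get_address_space(v, max_lookup=6):  # default matches the traversal limit
    while max_lookup > 0:
        max_lookup -= 1
        if v.addrspace != GENERIC:
            return v.addrspace
        nxt = v.underlying or v       # roots (alloca, argument) return self
        if nxt is v:
            break
        v = nxt
    return v.addrspace

# A generic-space use produced by addrspacecast from shared (AS 3):
shared_root = Val(3)
generic_use = Val(GENERIC, underlying=shared_root)
assert get_address_space(generic_use) == 3
# A root that was always generic stays generic:
assert get_address_space(Val(GENERIC)) == GENERIC
```

A chain deeper than the lookup limit would stop early and report generic, which is exactly the conservative fallback the depth limit is designed to produce.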
The getModRefInfoMask method adds a further optimization: pointers into constant memory (AS 4) or parameter memory (AS 101) are read-only, so it returns NoModRef -- the pointer's memory is never modified. This allows DSE to skip analysis of stores that might alias with const/param loads, and lets LICM hoist loads from constant memory without checking for intervening stores.
The getMemoryEffects method handles inline assembly: PTX inline asm without side-effects or {memory} clobbers is treated as having no memory effects, which prevents it from blocking optimizations.
The Generic Address Space Problem
The generic (flat, AS 0) address space is the fundamental obstacle to alias precision on GPUs. When the frontend cannot determine which physical memory a pointer targets, it emits the pointer in AS 0. The hardware resolves generic addresses at runtime using address range checks -- a pointer into the shared memory window maps to shared, otherwise it maps to global.
For NVVM AA, a generic pointer forces MayAlias against every other pointer, destroying the disjointness guarantee. This is why MemorySpaceOpt is so critical: it runs before the main optimization pipeline and converts generic pointers to specific address spaces wherever possible, feeding precise AS information into NVVM AA.
Three mechanisms address the generic pointer problem:
1. MemorySpaceOpt (pre-optimization conversion). The two-phase interprocedural pass at sub_1C70910 resolves generic pointers by tracing them back to their allocation sites. If a generic pointer is always derived from a __shared__ variable, the pass inserts an addrspacecast to AS 3 and rewrites all uses. When different call sites pass different address spaces for the same argument, the pass clones the function into space-specialized versions. This is the most impactful optimization: every generic pointer that MemorySpaceOpt resolves gives NVVM AA an additional NoAlias edge.
2. Address space traversal in AA. Even without MemorySpaceOpt, the getAddressSpace helper in NVVM AA walks through addrspacecast chains. If a generic pointer %p was produced by addrspacecast i8 addrspace(3)* %s to i8*, the traversal discovers AS 3. The traversal depth limit (default 6) controls how far back the walk goes.
3. !noalias.addrspace metadata (kind 42). cicc attaches this metadata to instructions when address space information is known but the pointer itself remains generic. The AA evaluator (sub_13549C0) detects this metadata via opcode byte 0x4E ('N') and sets bit 2 in a pointer-tagged value (OR with 4), propagating the address-space disambiguation information through to AAResults::alias. This is a cicc-specific extension not found in upstream LLVM.
AA Pipeline Ordering
cicc configures the AA chain with NVVM AA running first, as confirmed by the NVPTXExternalAAWrapper which passes RunEarly=true to ExternalAAWrapperPass. The full chain:
NVVM AA --> BasicAA --> TBAA --> ScopedNoAliasAA --> GlobalsAA
| | | | |
| | | | +-- Module-level: which globals
| | | | escape? (enable-unsafe-
| | | | globalsmodref-alias-results)
| | | |
| | | +-- !noalias / !alias.scope metadata
| | | (enable-scoped-noalias, default true)
| | |
| | +-- Type-based: !tbaa metadata tree
| | (enable-tbaa, default true)
| |
| +-- Stateless: GEP decomposition, alloca vs argument,
| capture analysis (basic-aa-recphi, default true;
| basic-aa-separate-storage, default true)
|
+-- Address space disjointness (stateless, O(depth) per query)
The chain is queried through AAResults::alias() (sub_134CB50), which dispatches through the registered AA providers in order. Each provider returns NoAlias, MayAlias, PartialAlias, or MustAlias. If any provider returns NoAlias, the chain short-circuits -- subsequent providers are not consulted. This is why NVVM AA runs first: cross-address-space pairs are resolved in O(1) without invoking the more expensive BasicAA GEP decomposition.
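The short-circuit dispatch can be sketched as follows. This is an illustrative Python model (the provider functions and call log are hypothetical), showing why an early NoAlias from NVVM AA means BasicAA never runs for cross-space pairs.

```python
# Model of chained AA dispatch with first-NoAlias short-circuit.
calls = []  # records which providers were actually consulted

def nvvm_aa(as1, as2):
    calls.append("nvvm")
    # Cross-space disjointness, excluding generic and the shared/cluster pair.
    if as1 != 0 and as2 != 0 and as1 != as2 and {as1, as2} != {3, 7}:
        return "NoAlias"
    return "MayAlias"

def basic_aa(as1, as2):
    calls.append("basic")
    return "MayAlias"  # stand-in for the expensive GEP decomposition

def alias(as1, as2, providers=(nvvm_aa, basic_aa)):
    for provider in providers:
        if provider(as1, as2) == "NoAlias":
            return "NoAlias"   # short-circuit: later providers never run
    return "MayAlias"

assert alias(1, 3) == "NoAlias"  # global vs shared: resolved by NVVM AA
assert calls == ["nvvm"]         # BasicAA was never consulted
```

A same-space query (e.g. two global pointers) would fall through NVVM AA and reach BasicAA, which is where the real cost of the chain lives.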
The AAResults object consumed by MemorySSA, GVN, DSE, and LICM is the same chained result. All memory-aware passes benefit transparently from NVVM AA without any code changes.
Integration with Memory Optimization Passes
NVVM AA's impact flows through every pass that queries alias information:
MemorySSA (sub_1A6A260) builds its memory SSA graph using AAResults at [this+0xB8] (retrieved via tag unk_4F9D3C0). When NVVM AA proves that a store to shared memory and a load from global memory are NoAlias, MemorySSA does not create a dependency edge between them, resulting in a sparser -- and more precise -- memory graph. This precision propagates to every consumer of MemorySSA.
GVN (sub_1900BB0) uses AA for load elimination and store forwarding. With NVVM AA, a load from %p_global can be forwarded past a store to %q_shared because they provably do not alias. Without NVVM AA, GVN would conservatively assume they might alias and abandon the forwarding. The GVN implementation queries sub_134CB50 indirectly through MemoryDependenceResults, which itself consults AAResults.
DSE (sub_19DD1D0 and related functions) eliminates dead stores by proving that no subsequent load reads the stored value. DSE requires AAResults at unk_4F9D3C0. The DSE report confirms: "The alias analysis that DSE consumes already handles address-space separation. CUDA address spaces (shared=3, global=1, local=5, constant=4) are handled by the underlying NVVM alias analysis which knows that different address spaces cannot alias." DSE does NOT implement its own address-space checks -- it relies entirely on NVVM AA.
LICM uses AA to determine whether a load inside a loop can be hoisted out. If NVVM AA proves a loop-invariant load from constant memory (AS 4, getModRefInfoMask returns NoModRef) cannot be modified by any store in the loop, LICM hoists it. This is especially impactful for __constant__ kernel arguments accessed repeatedly in hot loops.
noalias Metadata and __restrict__ Handling
cicc provides two mechanisms for marking kernel pointer parameters as non-aliasing:
1. -restrict / --kernel-params-are-restrict (frontend flag, offset +1096). When the user passes -restrict to nvcc (or --kernel-params-are-restrict to cicc), it routes to the LLVM knob -nvptx-kernel-params-restrict via the llc argument vector. This causes cicc to add the noalias attribute to all pointer-typed kernel parameters, asserting that the programmer guarantees no two kernel pointer arguments alias. The process-restrict pass (ProcessRestrictPass, registered as a function pass in the new PM at position 419 in the pipeline parser, parameter parser at sub_233A330) then propagates this attribute through the call graph. The propagate-only mode restricts the pass to propagation without inserting new restrict annotations.
2. -allow-restrict-in-struct (flag at offset +1128). Extends __restrict__ handling to pointer fields inside struct arguments. When enabled, the process-restrict pass annotates struct-member pointers with noalias scope metadata, enabling AA to disambiguate pointers extracted from different struct fields. This flag routes to both the opt and llc argument vectors as -allow-restrict-in-struct.
Supporting knobs:
- apply-multi-level-restrict -- apply __restrict__ to all pointer indirection levels (not just the outermost pointer)
- dump-process-restrict -- debug dump during restrict processing
The noalias attribute interacts with the AA chain through ScopedNoAliasAA, which reads !noalias and !alias.scope metadata attached to instructions. cicc's frontend emits these metadata nodes when __restrict__ qualifiers are present in the CUDA source.
The !noalias.addrspace metadata (kind 42, registered in sub_B6EEA0) is a separate mechanism specific to address-space disambiguation. It is attached by MemorySpaceOpt or IR generation when a pointer is known to not alias with pointers in specific address spaces, even if the pointer itself remains in generic AS 0. The AA evaluator detects this metadata and tags the pointer with bit 2 (OR with 4) for disambiguation during alias queries.
The ProcessRestrict Propagation Algorithm
ProcessRestrictPass is NVIDIA's interprocedural restrict propagation pass, registered as pipeline entry 419 with class name ProcessRestrictPass. It runs as a function pass but has interprocedural effects: it reads the noalias attribute from kernel entry points and propagates equivalent information to callees by attaching !noalias and !alias.scope metadata to memory instructions. The knobs controlling its behavior are grouped in ctor_534 (address range 0x560000--0x5CFFFF), alongside allow-restrict-in-struct and apply-multi-level-restrict, and independently in ctor_270 (address range 0x4F0000--0x51FFFF) alongside process-restrict.
Activation and Flag Routing
The restrict pipeline activates through a chain of flag translations:
User: nvcc --restrict kernel.cu
|
nvcc: cicc -restrict (offset +1096 in flag struct)
|
cicc: llc -nvptx-kernel-params-restrict (routes to llc args only)
opt -allow-restrict-in-struct (if -allow-restrict-in-struct set)
opt -apply-multi-level-restrict (if set)
The critical distinction: -restrict routes exclusively to the llc argument vector (not opt), meaning the noalias attribute injection happens during code generation, not during the optimization pipeline. The process-restrict pass in the opt pipeline then reads these attributes and propagates their implications as metadata. The -allow-restrict-in-struct flag routes to both opt and llc, enabling struct-member restrict handling on both sides.
Propagation Algorithm
The pass operates in two modes controlled by the propagate-only parameter:
Full mode (default). The pass performs both annotation and propagation:
ProcessRestrictPass::run(Function &F):
// Phase 1: Identify restrict-qualified pointer arguments
for each Argument &A in F:
if A.hasNoAliasAttr() and A.getType()->isPointerTy():
RestrictArgs.push_back(&A)
// Phase 1b: Struct member extraction (if allow-restrict-in-struct)
if AllowRestrictInStruct:
for each Argument &A in F:
if A.getType() is StructType containing pointer fields:
for each pointer field P extracted via extractvalue/GEP:
RestrictArgs.push_back(P)
// Phase 1c: Multi-level restrict (if apply-multi-level-restrict)
if ApplyMultiLevelRestrict:
for each pointer in RestrictArgs:
if pointer points to pointer (T**):
add the inner pointer dereference to RestrictArgs
if RestrictArgs.empty():
return PreservedAnalyses::all()
// Phase 2: Create alias scope domain and per-argument scopes
MDNode *Domain = createAliasScopeDomain(F.getName())
for each pointer P in RestrictArgs:
MDNode *Scope = createAliasScope(Domain, P->getName())
ScopeMap[P] = Scope
// Phase 3: Attach !alias.scope and !noalias metadata to memory ops
for each Instruction &I in F:
if I is load, store, call, or memcpy/memmove/memset:
Value *Ptr = getPointerOperand(I)
Value *Underlying = getUnderlyingObject(Ptr)
// Which restrict argument does this pointer derive from?
if ScopeMap.count(Underlying):
MDNode *MyScope = ScopeMap[Underlying]
I.setMetadata(!alias.scope, MyScope)
// Build noalias set: all OTHER restrict arguments
SmallVector<Metadata*> NoAliasScopes
for each (P, S) in ScopeMap:
if P != Underlying:
NoAliasScopes.push_back(S)
I.setMetadata(!noalias, MDNode::get(NoAliasScopes))
// Phase 4: Debug dump (if dump-process-restrict)
if DumpProcessRestrict:
print annotated IR to dbgs()
Propagate-only mode. Skips Phase 1 annotation -- does not create new noalias attributes or scopes. Instead, it only reads existing !alias.scope and !noalias metadata from callers and propagates them through inlined call chains. This mode is used in later pipeline stages where new restrict annotations would be unsound (the interprocedural calling context has changed due to inlining).
How ScopedNoAliasAA Consumes the Metadata
The ScopedNoAliasAA provider (registered as scoped-noalias-aa in sub_233BD40, enabled by default via enable-scoped-noalias at ctor_060, global at 0x4B0000) processes the metadata as follows:
ScopedNoAliasAA::alias(LocA, LocB):
// Extract !noalias sets from the instructions that produced LocA and LocB
MDNode *NoAliasA = InstA->getMetadata(!noalias) // set of scopes A does NOT alias
MDNode *ScopeB = InstB->getMetadata(!alias.scope) // B's own scope
MDNode *NoAliasB = InstB->getMetadata(!noalias)
MDNode *ScopeA = InstA->getMetadata(!alias.scope)
// If A's noalias set contains B's scope, or vice versa: NoAlias
if NoAliasA contains any scope in ScopeB:
return NoAlias
if NoAliasB contains any scope in ScopeA:
return NoAlias
return MayAlias // fall through to next AA provider
This means that after ProcessRestrictPass annotates a load from __restrict__ float *a with !alias.scope !{!scope_a} and !noalias !{!scope_b, !scope_c}, any load from __restrict__ float *b (with !alias.scope !{!scope_b}) will be proven NoAlias by ScopedNoAliasAA because scope_b appears in the first instruction's !noalias set. This is the standard LLVM scoped-noalias mechanism; cicc's contribution is the ProcessRestrictPass that generates these metadata nodes from CUDA __restrict__ annotations.
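The scope check above reduces to set membership. Here is a small Python sketch of that reduction (Python sets stand in for MDNode scope lists; the scope names are hypothetical):

```python
# A and B are NoAlias if either instruction's !noalias set intersects
# the other's !alias.scope set; otherwise fall through to the next AA.
def scoped_noalias(scope_a, noalias_a, scope_b, noalias_b):
    if noalias_a & scope_b or noalias_b & scope_a:
        return "NoAlias"
    return "MayAlias"  # next provider in the chain decides

# After ProcessRestrictPass annotates loads from __restrict__ a and b:
load_a = {"scope": {"scope_a"}, "noalias": {"scope_b", "scope_c"}}
load_b = {"scope": {"scope_b"}, "noalias": {"scope_a", "scope_c"}}
assert scoped_noalias(load_a["scope"], load_a["noalias"],
                      load_b["scope"], load_b["noalias"]) == "NoAlias"

# A load with no restrict provenance carries empty sets and stays MayAlias:
assert scoped_noalias(set(), set(),
                      load_b["scope"], load_b["noalias"]) == "MayAlias"
```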
Restrict and Struct Members
When -allow-restrict-in-struct is active, the pass handles a common CUDA pattern where kernel parameters are passed through a struct:
struct Args {
float * __restrict__ a;
float * __restrict__ b;
int n;
};
__global__ void kernel(Args args) {
// Without allow-restrict-in-struct: a and b are NOT marked noalias
// because the struct argument itself is not __restrict__
// With allow-restrict-in-struct: process-restrict extracts the
// pointer fields and creates per-field alias scopes
args.a[i] = args.b[i] * 2.0f; // DSE/LICM can now prove no alias
}
The pass identifies pointer-typed fields within struct arguments by walking extractvalue and getelementptr chains from the struct argument. Each extracted pointer receives its own alias scope, identical to what a top-level __restrict__ parameter would receive.
Multi-Level Restrict
When -apply-multi-level-restrict is active, the pass handles pointer-to-pointer arguments:
__global__ void kernel(float ** __restrict__ ptrs) {
// Level 0: ptrs itself is restrict (different ptrs args don't alias)
// Level 1: *ptrs (the pointed-to pointer) is also restrict
// meaning ptrs[i] and ptrs[j] point to non-aliasing memory
float *a = ptrs[0];
float *b = ptrs[1];
a[x] = b[x]; // Proven NoAlias with multi-level restrict
}
Without this flag, only the outermost pointer level receives noalias treatment. With it, the pass follows dereference chains and creates scopes for each indirection level.
NVVM AA Query Logic -- Internal Detail
The AA chain in cicc is queried through AAResults::alias() at sub_134CB50. This function dispatches through the registered AA providers in registration order. The chain ordering observed in cicc v13.0 is:
NVVM AA -> BasicAA -> TBAA -> ScopedNoAliasAA -> GlobalsAA
This ordering is confirmed by sub_233BD40 (the AA chain builder, 4.8KB) which constructs the pipeline from names: globals-aa, basic-aa, objc-arc-aa, scev-aa, scoped-noalias-aa, tbaa. NVVM AA is injected at the front via NVPTXExternalAAWrapper with RunEarly=true, so it executes before all others.
The Query Dispatch Path
User pass (GVN, DSE, LICM, MemorySSA)
|
v
AAResults::alias(MemoryLocation &A, MemoryLocation &B) [sub_134CB50]
|
+-- (1) NVPTXAAResult::alias()
| Check address spaces: cross-space pairs -> NoAlias
| If NoAlias: short-circuit, return immediately
|
+-- (2) BasicAA
| GEP decomposition, alloca vs argument, capture analysis
| basic-aa-recphi (default true): recursive PHI analysis
| basic-aa-separate-storage (default true): separate underlying objects
|
+-- (3) TBAA (Type-Based Alias Analysis)
| !tbaa metadata tree comparison
| enable-tbaa (default true)
|
+-- (4) ScopedNoAliasAA
| !noalias / !alias.scope metadata (from ProcessRestrict or frontend)
| enable-scoped-noalias (default true, ctor_060 at ~0x494CC1)
|
+-- (5) GlobalsAA [sub_13C7380, 35.7KB]
| Module-level: which globals escape?
| enable-unsafe-globalsmodref-alias-results (default false)
|
v
Final AliasResult (NoAlias / MayAlias / PartialAlias / MustAlias)
Any provider returning NoAlias short-circuits the chain -- subsequent providers are never consulted. This is why NVVM AA runs first: cross-address-space pairs are resolved with zero overhead from BasicAA's GEP decomposition.
ModRef Queries
Two additional entry points handle call-site interactions:
sub_134F0E0 -- AAResults::getModRefInfo(CallBase, MemoryLocation). Returns a ModRefInfo encoding that combines Mod/Ref bits with MustAlias information (8 values, 0--7). This is used by DSE and LICM to determine whether a call can read or write a specific memory location.
sub_134F530 -- AAResults::getModRefInfo(CallBase, CallBase). Same encoding but for two call sites. Used by MemorySSA to build dependencies between calls.
The getModRefInfoMask method in NVVM AA adds a key optimization: pointers into constant memory (AS 4) or parameter memory (AS 101) return NoModRef because these memories are read-only from the kernel's perspective. This lets DSE skip alias analysis entirely for constant/param loads and lets LICM hoist them unconditionally.
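The read-only-space shortcut can be expressed as a two-line mask function. The sketch below is an assumption-level model (the bit layout NoModRef=0 / Ref=1 / Mod=2 / ModRef=3 follows upstream LLVM convention, not a recovered cicc encoding):

```python
# Model of the getModRefInfoMask shortcut: constant (AS 4) and parameter
# (AS 101) memory are read-only from the kernel, so nothing can modify them.
NO_MODREF, REF, MOD, MODREF = 0, 1, 2, 3

READ_ONLY_SPACES = {4, 101}  # constant and parameter memory

def mod_ref_mask(addrspace):
    # Read-only memories: no store can touch them, so DSE/LICM may skip
    # alias analysis entirely for loads from these spaces.
    return NO_MODREF if addrspace in READ_ONLY_SPACES else MODREF

assert mod_ref_mask(4) == NO_MODREF    # constant load: hoistable by LICM
assert mod_ref_mask(101) == NO_MODREF  # param load: same
assert mod_ref_mask(1) == MODREF       # global memory: full analysis needed
```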
getMemoryEffects for Inline Assembly
NVVM AA's getMemoryEffects method inspects PTX inline assembly blocks. An inline asm statement without the sideeffect flag and without a {memory} clobber constraint is classified as having no memory effects (MemoryEffects::none()). This prevents innocent inline asm (register manipulation, warp votes) from blocking load motion, store elimination, and CSE across the asm block.
Address-Space-Based NoAlias Rules -- Complete Matrix
The cross-address-space NoAlias decision is the cheapest and most impactful alias analysis in cicc. The full decision matrix for all pairs:
| | AS 0 (generic) | AS 1 (global) | AS 3 (shared) | AS 4 (const) | AS 5 (local) | AS 6 (tensor) | AS 7 (shmem cluster) | AS 101 (param) |
|---|---|---|---|---|---|---|---|---|
| AS 0 | MayAlias | MayAlias | MayAlias | MayAlias | MayAlias | MayAlias | MayAlias | MayAlias |
| AS 1 | MayAlias | MayAlias | NoAlias | NoAlias | NoAlias | NoAlias | NoAlias | MayAlias* |
| AS 3 | MayAlias | NoAlias | MayAlias | NoAlias | NoAlias | NoAlias | MayAlias | NoAlias |
| AS 4 | MayAlias | NoAlias | NoAlias | MayAlias | NoAlias | NoAlias | NoAlias | NoAlias |
| AS 5 | MayAlias | NoAlias | NoAlias | NoAlias | MayAlias | NoAlias | NoAlias | NoAlias |
| AS 6 | MayAlias | NoAlias | NoAlias | NoAlias | NoAlias | MayAlias | NoAlias | NoAlias |
| AS 7 | MayAlias | NoAlias | MayAlias | NoAlias | NoAlias | NoAlias | MayAlias | NoAlias |
| AS 101 | MayAlias | MayAlias* | NoAlias | NoAlias | NoAlias | NoAlias | NoAlias | MayAlias |
* AS 1 (global) vs AS 101 (param) returns MayAlias because cvta.param (SM 70+) converts parameter pointers to global-space addresses. A parameter-space pointer and a global-space pointer may reference the same physical byte after conversion. This is a conservative choice; upstream LLVM has a commented TODO noting that cvta.param support is not yet implemented, and cicc matches this conservatism.
The decision algorithm implemented in NVPTXAAResult::alias:
if AS1 == 0 or AS2 == 0: -> MayAlias (generic escapes all reasoning)
if AS1 == AS2: -> MayAlias (same space, need deeper AA)
if {AS1,AS2} == {3,7}: -> MayAlias (shared/cluster overlap)
if {AS1,AS2} == {1,101}: -> MayAlias (global/param overlap via cvta.param)
otherwise: -> NoAlias (hardware disjointness)
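The five-rule algorithm above is directly executable; this Python transcription reproduces the matrix cell-for-cell (the function and constant names are illustrative, not recovered symbols):

```python
# Executable transcription of the NVPTXAAResult::alias decision rules.
GENERIC, GLOBAL, SHARED, CONST = 0, 1, 3, 4
LOCAL, TENSOR, CLUSTER, PARAM = 5, 6, 7, 101

def as_alias(a, b):
    if a == GENERIC or b == GENERIC:
        return "MayAlias"        # generic escapes all reasoning
    if a == b:
        return "MayAlias"        # same space: needs deeper AA
    if {a, b} == {SHARED, CLUSTER}:
        return "MayAlias"        # shared / shared-cluster overlap
    if {a, b} == {GLOBAL, PARAM}:
        return "MayAlias"        # global / param overlap via cvta.param
    return "NoAlias"             # hardware disjointness

# Spot-check against the matrix:
assert as_alias(GLOBAL, SHARED) == "NoAlias"
assert as_alias(SHARED, CLUSTER) == "MayAlias"
assert as_alias(GLOBAL, PARAM) == "MayAlias"
assert as_alias(GENERIC, CONST) == "MayAlias"
assert as_alias(LOCAL, TENSOR) == "NoAlias"
```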
The !noalias.addrspace Metadata Mechanism
When MemorySpaceOpt or IR generation determines that a generic-space pointer provably does not alias with a specific address space, but cannot convert the pointer itself to that space (for example, because other uses require it to remain generic), cicc attaches !noalias.addrspace metadata (kind 42) to the instruction. This is registered in sub_B6EEA0 alongside the 41 standard LLVM metadata kinds (dbg=1, tbaa=2, prof=3, ..., noalias.addrspace=42).
The AA evaluator at sub_13549C0 detects this metadata during pointer collection (Phase 2 of the evaluator). When it encounters an instruction with opcode byte 0x4E (78, ASCII 'N'), it tags the pointer value with bit 2 set (OR with 4):
// At 0x1356170, 0x1356180, 0x1356190 in the AA evaluator:
if opcode_byte == 0x4E: // noalias.addrspace annotation
tagged_ptr = raw_ptr | 4 // set bit 2 as disambiguation flag
This tagged pointer propagates through to AAResults::alias() (sub_134CB50), AAResults::getModRefInfo(CallBase, MemoryLocation) (sub_134F0E0), and AAResults::getModRefInfo(CallBase, CallBase) (sub_134F530). The AA providers detect bit 2 and use the associated metadata to return NoAlias for the tagged pointer against pointers in the excluded address spaces.
Similarly, opcode byte 0x1D (29) identifies addrspacecast instructions. The evaluator captures the pre-cast value via cmovz, allowing the AA to trace back to the original non-generic address space even when the instruction itself operates on generic pointers.
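The bit-2 tagging trick relies on Value objects being at least 8-byte aligned, leaving the low three pointer bits free. A sketch of the mechanism (not the decompiled code; the alignment guarantee is an assumption about the allocator):

```cpp
#include <cassert>
#include <cstdint>

// Model of the aa-eval pointer tagging: bit 2 of an aligned pointer is
// free, so the evaluator ORs in 4 to mark "has !noalias.addrspace".
using TaggedPtr = std::uintptr_t;

TaggedPtr tagNoAliasAS(const void *v) {
    return reinterpret_cast<std::uintptr_t>(v) | 4;   // set bit 2
}
bool hasNoAliasAS(TaggedPtr p) {
    return (p & 4) != 0;                              // AA providers test this bit
}
const void *stripTag(TaggedPtr p) {
    return reinterpret_cast<const void *>(p & ~std::uintptr_t(7));
}
```

Downstream AA code that receives a tagged pointer strips the tag before dereferencing and uses the flag to consult the attached metadata.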
The three opcode values that trigger special handling in the AA evaluator:
| Opcode byte | Decimal | Meaning | AA evaluator action |
|---|---|---|---|
| 0x4E | 78 ('N') | !noalias.addrspace annotated | OR pointer with 4 (set bit 2) |
| 0x1D | 29 | addrspacecast | Capture pre-cast value for AS lookup |
| 0x36, 0x37 | 54, 55 | llvm.noalias.scope.decl intrinsic results | Insert into separate scope pointer sets |
Comparison with Upstream LLVM NVPTX
Upstream LLVM (as of LLVM 19/20) includes NVPTXAliasAnalysis.cpp in llvm/lib/Target/NVPTX/, which implements the same core address-space disjointness logic. cicc's version is functionally equivalent to upstream for the basic alias query but differs in several ways:
| Aspect | Upstream LLVM | cicc v13.0 |
|---|---|---|
| Core alias check | Same: cross-AS = NoAlias, generic = MayAlias | Same |
| Shared cluster handling | AS 3 vs AS 7 = MayAlias | Present (SM 90+ targets) |
| Param aliasing with global | Commented TODO: "cvta.param not yet supported" | Same conservative treatment |
| getModRefInfoMask | Const/param = NoModRef | Same |
| Inline asm analysis | Checks side-effects + {memory} clobber | Same |
| Traversal depth knob | nvptx-traverse-address-aliasing-limit (default 6) | Same knob present |
| !noalias.addrspace metadata | Not used upstream | cicc-specific extension (metadata kind 42) |
| strict-aliasing knob | Not in upstream NVPTX | cicc adds "Datatype based strict alias" |
| nvptxaa-relax-fences | Not in upstream | cicc-specific: ordering relaxation for fences |
| process-restrict pass | Not in upstream NVPTX backend | cicc-specific interprocedural restrict propagation |
| Integration with MemorySpaceOpt | No upstream equivalent | cicc's address space inference feeds NVVM AA |
The most significant delta is the ecosystem: upstream NVPTX has the AA pass but lacks the interprocedural MemorySpaceOpt pipeline that resolves generic pointers, the process-restrict pass that propagates noalias, and the !noalias.addrspace metadata that bridges partial address-space knowledge into the AA chain. These three components working together give cicc far more NoAlias results than upstream LLVM achieves on the same IR.
Configuration Knobs
NVVM AA Knobs
| Knob | Type | Default | Description |
|---|---|---|---|
| nvptx-traverse-address-aliasing-limit | unsigned | 6 | Maximum depth for getAddressSpace traversal through getUnderlyingObject |
| nvptxaa-relax-fences | bool | (unknown) | Enable ordering relaxation for fence instructions in AA |
| strict-aliasing | bool | (unknown) | "Datatype based strict alias" -- NVIDIA extension for type-based disambiguation |
| traverse-address-aliasing | bool | (unknown) | "Find address space through traversal" -- master enable for the traversal in getAddressSpace |
| assume-default-is-flat-addrspace | bool | false | Treat default address space (0) as flat/generic (testing knob) |
Standard LLVM AA Knobs (present in cicc)
| Knob | Type | Default | Description |
|---|---|---|---|
| disable-basic-aa / disable-basicaa | bool | false | Disable BasicAA entirely |
| basic-aa-recphi | bool | true | Enable recursive PHI analysis in BasicAA |
| basic-aa-separate-storage | bool | true | Enable separate-storage analysis in BasicAA |
| enable-tbaa | bool | true | Enable Type-Based Alias Analysis |
| enable-scoped-noalias | bool | true | Enable ScopedNoAlias AA (processes !noalias / !alias.scope) |
| enable-unsafe-globalsmodref-alias-results | bool | false | Enable GlobalsModRef (requires unsafe assumption about global escapes) |
| alias-set-saturation-threshold | int | (default) | Maximum pointers in an AliasSet before it saturates |
| aa-pipeline | string | (default) | Override the AA pipeline configuration |
Restrict Processing Knobs
| Knob | Type | Default | Description |
|---|---|---|---|
| nvptx-kernel-params-restrict | bool | false | Mark all kernel pointer params as noalias (activated by -restrict flag) |
| allow-restrict-in-struct | bool | false | Propagate __restrict__ into struct pointer members |
| apply-multi-level-restrict | bool | (unknown) | Apply __restrict__ through all pointer indirection levels |
| dump-process-restrict | bool | false | Debug dump during restrict processing |
AA Evaluator Debug Flags
The aa-eval diagnostic pass (sub_13549C0) uses 14 independent boolean flags for selective output:
| Address | Flag | Controls |
|---|---|---|
| byte_4F97AA0 | print-all-alias-modref-info | Master enable for all AA debug output |
| byte_4F979C0 | print-all-alias-no | Print NoAlias pointer pairs |
| byte_4F978E0 | print-all-alias-may | Print MayAlias pointer pairs |
| byte_4F97800 | print-all-alias-partial | Print PartialAlias pointer pairs |
| byte_4F97720 | print-all-alias-mustalias | Print MustAlias pointer pairs |
| byte_4F97640 | print-all-modref-none | Print NoModRef results |
| byte_4F97560 | print-all-modref-ref | Print JustRef results |
| byte_4F97480 | print-all-modref-mod | Print JustMod results |
| byte_4F973A0 | print-all-modref-both | Print BothModRef results |
| byte_4F96F40 | aa-eval-callsite-modref | Enable call-site ModRef evaluation (Phase 5) |
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| AAResults::alias(MemoryLocation, MemoryLocation) -- main alias query entry | sub_134CB50 | -- | -- |
| AAResults::getModRefInfo(CallBase, MemoryLocation) | sub_134F0E0 | -- | -- |
| AAResults::getModRefInfo(CallBase, CallBase) | sub_134F530 | -- | -- |
| AAEvaluator::runOnFunction -- the aa-eval diagnostic pass | sub_13549C0 | 11,038 B | -- |
| SmallPtrSet::insert (pointer collection in aa-eval) | sub_13540B0 | -- | -- |
| Pointer-pair result printer (aa-eval) | sub_1352080 | -- | -- |
| Call-site pair result printer (aa-eval) | sub_1351E00 | -- | -- |
| Formatted alias result printer (aa-eval) | sub_13523B0 | -- | -- |
| GlobalsAA main analysis function | sub_13C7380 | 35.7 KB | -- |
| GlobalsAA helper (per-function analysis) | sub_13C5530 | 21 KB | -- |
| GlobalsAA call-site analysis | sub_13C4410 | 6.7 KB | -- |
| GlobalsAA alias query | sub_13C34D0 | 12.6 KB | -- |
| AA iteration / chaining logic | sub_FD1250 | 23.4 KB | -- |
| Dominator-tree-based AA query setup (used by MemorySSA) | sub_14A4050 | -- | -- |
| Metadata kind registration (including noalias.addrspace = kind 42) | sub_B6EEA0 | 9 KB | -- |
| MemorySpaceOpt pass entry (IP-MSP worklist driver) | sub_1C70910 | ~2,427 lines | -- |
| MemorySpaceOpt per-BB scanner + address-space bitmask builder | sub_1CA8CD0 | ~898 lines | -- |
Cross-References
- MemorySpaceOpt -- the interprocedural pass that resolves generic pointers to specific address spaces, directly feeding NVVM AA
- IP Memory Space Propagation -- the interprocedural wrapper around MemorySpaceOpt
- GVN -- consumes AA for load elimination and store forwarding
- DSE -- relies on AA for dead store detection; confirmed to have no internal address-space checks
- LICM -- uses AA to hoist/sink memory operations across loops
- Pipeline & Ordering -- where NVVM AA fits in the overall pass schedule
- LLVM Knobs -- complete knob inventory including AA-related knobs
- Optimization Levels -- how NVVMAliasAnalysis appears in the tier 2+ pipeline
MemorySSA Builder for GPU
MemorySSA constructs a sparse SSA form over memory operations, giving every instruction that reads or writes memory a position in a use-def chain that tracks the flow of memory state through a function. In upstream LLVM, MemorySSA already delivers significant speedups over the older MemoryDependenceResults analysis by avoiding per-query linear scans. In cicc v13.0, the payoff is amplified because the underlying alias analysis pipeline includes NVVM AA, which returns NoAlias for any cross-address-space pointer pair. A store to shared memory (addrspace(3)) and a load from global memory (addrspace(1)) will never produce a dependency edge in the MemorySSA graph, yielding a dramatically sparser representation than would be possible on a flat-memory architecture. Every pass that consumes MemorySSA -- LICM, EarlyCSE, DSE, GVN, SimpleLoopUnswitch -- benefits from this precision without containing any GPU-specific logic itself.
Key Facts
| Property | Value |
|---|---|
| Builder entry wrapper | sub_1A6CAD0 (48 bytes -- skipFunction guard + tail call) |
| Builder core function | sub_1A6A260 (10,344 bytes) |
| MemoryAccess allocator | sub_1A69110 (1,245 bytes) |
| Pass registration string | "memoryssa" (analysis #179 in pipeline parser) |
| Pipeline parser entry | "print<memoryssa>" -> MemorySSAPrinterPass |
| Required analyses | AliasAnalysis (tag unk_4F9D3C0), DominatorTree (tag unk_4F9E06C), LoopInfo (tag unk_4F9A488) |
| Stack frame size | 0x3F8 = 1,016 bytes |
| MemoryAccess node size | 0x40 = 64 bytes (bump-allocated) |
| Walker check limit | memssa-check-limit = 100 (max stores/phis to walk past) |
| Verification flag | verify-memoryssa (off by default, on under EXPENSIVE_CHECKS) |
| DOT graph output | dot-cfg-mssa (filename for CFG + MemorySSA visualization) |
MemorySSA Node Types
MemorySSA represents memory state with three node types, all stored in 64-byte heap-allocated objects:
MemoryDef (kind=2) -- Created for every instruction that may write memory: stores, calls with side effects, atomics, memcpy/memmove intrinsics. Each MemoryDef takes the previous memory state as its operand and produces a new version of memory state.
MemoryUse (kind=1) -- Created for every instruction that reads memory but does not modify it: loads, calls to readonly/readnone functions. A MemoryUse points to the MemoryDef (or MemoryPhi) that represents the most recent memory state it depends on.
MemoryPhi (kind=3) -- Inserted at control flow join points where predecessors have different reaching memory definitions, exactly like an SSA phi node for scalar values. A MemoryPhi merges the memory states from each predecessor into a single version.
All three types share a common layout:
| Offset | Size | Field |
|---|---|---|
| +0x00 | 8 | vtable / next pointer (intrusive list) |
| +0x08 | 8 | prev pointer (intrusive list) |
| +0x10 | 4 | kind (1=MemoryUse, 2=MemoryDef, 3=MemoryPhi) |
| +0x14 | 4 | operand_count (bits 0-27) |
| +0x17 | 1 | flags byte (bit 6 = 0x40 = "has inline operands") |
| +0x18 | 8 | defining instruction / accessed Value* |
| +0x20 | 8 | type/size descriptor (APInt or pointer to APInt) |
| +0x28 | 8 | operand/predecessor pointer |
| +0x30 | 8 | current reaching definition (MemoryAccess*) |
| +0x38 | 8 | associated BasicBlock* (or null) |
The sentinel value 1 stored in the reaching-definition field (+0x30) represents LiveOnEntry -- the implicit MemoryDef that dominates the entire function and represents the initial state of memory at function entry.
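The recovered layout can be expressed as a C++ struct with compile-time offset checks. Field names here are descriptive guesses; note that the flags byte at +0x17 is simply the top byte of the dword at +0x14, so the sketch models both as one 32-bit field:

```cpp
#include <cstddef>
#include <cstdint>

// 64-byte MemoryAccess node as recovered from the binary (names illustrative).
struct MemoryAccessNode {
    void     *next;          // +0x00 intrusive list next (vtable slot in the binary)
    void     *prev;          // +0x08 intrusive list prev
    uint32_t  kind;          // +0x10 1=MemoryUse, 2=MemoryDef, 3=MemoryPhi
    uint32_t  count_flags;   // +0x14 bits 0-27 operand count; byte +0x17 = flags
    void     *inst;          // +0x18 defining instruction / accessed Value*
    void     *size_desc;     // +0x20 APInt (or pointer to one) holding access size
    void     *operand;       // +0x28 operand / predecessor pointer
    void     *reaching_def;  // +0x30 current reaching def; sentinel 1 = LiveOnEntry
    void     *block;         // +0x38 associated BasicBlock* (or null)
};
static_assert(sizeof(MemoryAccessNode) == 0x40, "bump-allocated in 64-byte slots");
static_assert(offsetof(MemoryAccessNode, kind) == 0x10, "kind at +0x10");
static_assert(offsetof(MemoryAccessNode, reaching_def) == 0x30, "reaching def at +0x30");

// Flags byte at +0x17, bit 6 (0x40) = "has inline operands".
inline bool hasInlineOperands(const MemoryAccessNode &n) {
    return (n.count_flags >> 24) & 0x40;
}
inline uint32_t operandCount(const MemoryAccessNode &n) {
    return n.count_flags & 0x0FFFFFFF;
}
```

The static_asserts encode exactly the offsets in the table, so a reimplementation that drifts from the recovered layout fails to compile.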
Construction Algorithm
The builder at sub_1A6A260 follows the standard LLVM MemorySSA construction algorithm, implemented as a dominator-tree DFS rename pass. The implementation is split into eight phases.
Phase 1 -- Prerequisite Retrieval (0x1A6A260 - 0x1A6A3A0)
The builder queries the analysis manager for three required results via a vtable-tagged vector. Each registered analysis is identified by a unique tag pointer:
- unk_4F9D3C0 -> calls virtual method [rax+0x68] -> sub_14A4050 -- retrieves AAResults, stored at [this+0xB8]
- unk_4F9E06C -> retrieves DominatorTree result, stored at [this+0xA8] (offset +0xA0 within the wrapper)
- unk_4F9A488 -> retrieves LoopInfo, stored at [this+0xB0]
If any tag is not found in the registered analysis vector, control jumps to terminal handlers at 0x1A6CAAF-0x1A6CABE (assertion / unreachable).
Phase 2 -- Worklist Initialization (0x1A6A3A0 - 0x1A6A6B0)
The builder allocates a 1,016-byte stack frame and initializes four layers of SmallVector-based renaming stacks:
- Level 0: DFS traversal order over the dominator tree (computed by sub_13B8390)
- Level 1: Per-block instruction iterator
- Level 2: Per-block incoming MemoryPhi operand buffer (SmallVector at rbp-0x330, inline capacity 8)
- Level 3: Memory state stack (current reaching definition per DFS depth)
Each layer is initialized by sub_16CCEE0 (SmallVector move-assign). Temporary intermediate buffers are freed before the main walk begins.
Phase 3 -- Dominator Tree Walk (0x1A6A88C - 0x1A6B070)
The main loop visits each basic block in DFS order over the dominator tree. For every instruction, the builder reads the opcode byte at [instruction-8] and classifies it:
opcode_tag = *(uint8_t*)(instr - 8);
switch (opcode_tag) {
case 0x18..0x38: // Memory instructions (load/store range)
type_tag = *(uint8_t*)(*(instr - 0x18) + 8);
if (type_tag == 0x10) // PointerType result -> this is a Load
createMemoryUse(instr);
else
createMemoryDef(instr); // Store
break;
case 0x0B: // CallInst
classifyCall(instr); // -> sub_1A69C30
break;
case 0x27: // PHINode
if (predecessors_disagree_on_memory_state())
createMemoryPhi(block);
break;
}
Type-size computation. For each memory access, a three-level nested switch computes the byte-size of the accessed region. The switch handles all LLVM Type IDs:
| Type ID | Type | Size computation |
|---|---|---|
| 1 | HalfTy | 16 bits |
| 2 | FloatTy | 32 bits |
| 3 | DoubleTy | 64 bits |
| 4 | FP80 | 80 bits |
| 5 | FP128 | 128 bits |
| 6 | PPC_FP128 | 128 bits |
| 7 | PointerTy | getPointerSizeInBits() via sub_15A9520 |
| 11 | IntegerTy | [type+8] >> 8 (raw bit width) |
| 14 | StructTy | getStructLayout() via sub_15A9FE0 |
| 0, 8, 10, 12, 16 | Array/Vector | element_count * element_size |
When the computed access size differs from the store size ([rax+8] >> 8), the builder routes through sub_1A69690 to create a partial-store MemoryDef, capturing the precise overlap region.
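The scalar rows of the size table reduce to a small switch over Type IDs. A sketch (the pointer width is passed as a parameter here in place of the sub_15A9520 helper, and the integer/struct/vector cases, which need extra type data, are elided):

```cpp
// Bit-size of a scalar access keyed by the LLVM Type ID column above.
unsigned accessSizeInBits(unsigned typeID, unsigned pointerBits = 64) {
    switch (typeID) {
    case 1:  return 16;          // HalfTy
    case 2:  return 32;          // FloatTy
    case 3:  return 64;          // DoubleTy
    case 4:  return 80;          // FP80
    case 5:                      // FP128
    case 6:  return 128;         // PPC_FP128
    case 7:  return pointerBits; // PointerTy: getPointerSizeInBits()
    default: return 0;           // IntegerTy/StructTy/vector need extra type data
    }
}
```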
Phase 4 -- Call and Intrinsic Classification
Call instructions (opcode 0x0B) are dispatched through sub_1A69C30 (call-instruction MemoryDef handler), which classifies intrinsics by ID:
- ID 0x0F (lifetime.start) and ID 0x17 (lifetime.end) -- no memory effect, skipped
- ID 0x27 -- memcpy/memmove-like intrinsics, create MemoryDef
- ID 0x2F -- atomic intrinsics (checks [rdx-0x30] for ordering)
- ID 0x33 -- NVIDIA-specific intrinsics (surface/texture operations, NVVM builtins)
Phase 5 -- MemoryAccess Allocation (sub_1A69110)
The core allocator creates all three node types. Parameters:
| Register | Meaning |
|---|---|
| rdi | MemorySSA this |
| esi | kind: 1=MemoryUse, 2=MemoryDef, 3=MemoryPhi |
| rdx | defining value / access value |
| rcx | type descriptor (APInt holding access size) |
| r8 | instruction pointer |
| r9 | predecessor block (for MemoryPhi) |
Each allocation calls sub_22077B0 (BumpPtrAllocator::Allocate) for 0x40 bytes, populates all fields, inserts the node into the intrusive list via sub_2208C80, and increments the node counter at [this+0xD0].
For kind==1, sub_16A57B0 (countLeadingZeros) determines whether the access is a full or partial def. For kind==3 (MemoryPhi), the operand list is populated by iterating predecessor blocks through sub_146F1B0 (AA-driven reaching-definition lookup).
Phase 6 -- Trivial Phi Optimization (0x1A6B280 - 0x1A6B9BD)
After the DFS walk, the builder post-processes all MemoryPhi nodes. Any MemoryPhi whose operands all resolve to the same MemoryDef is trivial -- it can be replaced with that single reaching definition. The loop at 0x1A6B9DE iterates the result vector [this+0xD8..this+0xE0]:
for (auto *Phi : result_vector) {
unsigned count = Phi->operand_count & 0x0FFFFFFF;
if (all_operands_identical(Phi)) {
Phi->replaceAllUsesWith(single_reaching_def); // sub_164B780
Phi->eraseFromParent(); // sub_1AEB370
destroy(Phi); // sub_164BEC0
}
}
This cleanup is critical for GPU code. Because NVVM AA proves so many memory operations are independent, many join points that would require MemoryPhis on a flat-memory machine will have all predecessors carrying the same memory state. The trivial-phi elimination pass removes these, reducing the graph to only the essential dependencies.
GPU-Specific Precision Gains
The MemorySSA builder itself contains no explicit GPU logic. The GPU awareness comes entirely through the AA pipeline at [this+0xB8], which chains BasicAA -> TBAA -> ScopedNoAliasAA -> NVVM AA. The critical interaction points are:
Cross-address-space independence. When sub_146F1B0 queries the AA for a (store to addrspace(3), load from addrspace(1)) pair, NVVM AA returns NoAlias before BasicAA or TBAA are even consulted. The MemorySSA builder then skips creating a dependency edge. This means a MemoryUse for a global load will not depend on a MemoryDef for a shared store -- they exist in parallel chains.
Partial-alias precision. The builder at 0x1A6AFB3 creates MemoryDefs even for partial overlaps, then calls sub_1A69690 to register the precise overlap region. Standard LLVM would conservatively treat partial alias as MayAlias and create a full dependency. cicc's more aggressive approach uses the partial overlap information downstream for finer-grained DSE and LICM decisions.
Address-space check on volatile access. The call to sub_15FA300 at 0x1A6B88E performs what appears to be a volatile-access or address-space check specific to CUDA memory spaces. This gate prevents the builder from creating false dependencies between volatile shared memory operations (used for inter-warp communication) and non-volatile global operations.
NVIDIA custom intrinsic handling. Type ID 0x33 in sub_1A69990 is not a standard LLVM type ID. It appears to be cicc's custom type for CUDA-specific memory operations (surface/texture references, NVVM-specific typed pointers). These are classified as memory-clobbering conservatively unless the AA can prove otherwise.
Practical effect. Consider a kernel that loads from global memory, operates on shared memory, and stores back to global memory:
__global__ void kernel(float *out, float *in) {
__shared__ float smem[256];
smem[threadIdx.x] = in[threadIdx.x]; // global load + shared store
__syncthreads();
float val = smem[threadIdx.x] * 2.0f; // shared load
out[threadIdx.x] = val; // global store
}
On a flat-memory machine, the MemorySSA graph would have a single linear chain: every memory operation depends on the previous one. With NVVM AA feeding MemorySSA, the graph splits into two parallel chains -- one for shared memory and one for global memory -- connected only at the __syncthreads() barrier (which is modeled as a MemoryDef that clobbers all address spaces).
The MemorySSA Walker
Passes do not directly traverse the MemorySSA def-use chains. Instead, they query the CachingWalker, which answers the fundamental question: "What is the nearest MemoryDef that actually clobbers this memory location?"
The walker performs an optimized upward walk along the def chain, testing each MemoryDef against the query location using the full AA pipeline. The walk terminates when:
- A MemoryDef that clobbers the query location is found (instructionClobbersQuery returns true)
- LiveOnEntry is reached (the location was never written in this function)
- The walk budget (memssa-check-limit = 100 steps) is exhausted, in which case the current MemoryDef is returned conservatively as a clobber
When a MemoryPhi is encountered, the walker splits into multiple paths (one per predecessor) and tracks them using a DefPath worklist. Each path records a (MemoryLocation, First, Last, Previous) tuple, enabling the walker to reconstruct the full path from any clobber back to the query origin.
Caching. The CachingWalker memoizes results per (MemoryAccess, MemoryLocation) pair. Once a clobber query is resolved, subsequent queries for the same access return the cached result immediately. The SkipSelfWalker variant (used by DSE) additionally skips the MemoryDef that is the query origin itself, answering "what did this store overwrite?" rather than "what clobbers this store?"
On GPU, the walker's budget is rarely exhausted for shared-memory operations because NVVM AA prunes so many false dependencies that the def chain is short. For global memory operations in loops with many stores, the 100-step limit can be hit; increasing memssa-check-limit trades compilation time for precision in these cases.
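The walk-with-budget behavior can be modeled over a toy def chain. This is a sketch only: `clobbers` stands in for the AA-backed instructionClobbersQuery, and MemoryPhi path splitting is omitted:

```cpp
#include <functional>

// Toy MemoryDef chain; id == -1 plays the role of LiveOnEntry.
struct Def { const Def *prev; int id; };

// Walk upward past at most `limit` (memssa-check-limit) defs. Returns the
// first clobbering def, LiveOnEntry if the chain ends, or the current def
// conservatively (treated as a clobber) when the budget is exhausted.
const Def *findClobber(const Def *start, unsigned limit,
                       const std::function<bool(const Def &)> &clobbers) {
    const Def *cur = start;
    unsigned steps = 0;
    while (cur && cur->id != -1) {
        if (clobbers(*cur)) return cur;   // real clobber found
        if (++steps >= limit) return cur; // budget exhausted: conservative answer
        cur = cur->prev;
    }
    return cur; // LiveOnEntry: location never written in this function
}
```

With a small `limit` the walker gives up early and reports a non-clobbering def as a clobber, which is exactly the precision/compile-time trade controlled by memssa-check-limit.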
Consumer Passes
Five major passes consume MemorySSA in cicc:
| Pass | How it uses MemorySSA |
|---|---|
| LICM | Queries the walker to determine whether a load inside a loop is clobbered by any store in the loop body. If no clobber is found, the load is hoisted. NVVM AA makes shared-memory loads trivially hoistable past global stores. |
| EarlyCSE (early-cse-memssa variant, sub_27783D0) | Uses MemorySSA to find redundant loads -- two loads from the same location with no intervening clobber are CSE'd. The MemorySSA variant avoids the O(n^2) scanning of the non-MSSA EarlyCSE. |
| DSE | Walks the MemorySSA graph backwards from a store to find earlier stores to the same location with no intervening loads. Dead stores are eliminated. DSE has its own extensive set of MemorySSA walk limits (see knobs below). |
| GVN | Can optionally use MemorySSA instead of MemoryDependenceResults (controlled by enable-gvn-memoryssa). When enabled, GVN uses the walker for load-value forwarding and PRE. |
| SimpleLoopUnswitch | Queries MemorySSA to determine whether a condition inside a loop depends on memory modified in the loop. The simple-loop-unswitch-memoryssa-threshold knob controls the walk limit. |
Knobs and Thresholds
MemorySSA Core
| Knob | Default | Effect |
|---|---|---|
| memssa-check-limit | 100 | Maximum stores/phis the walker will walk past before giving up. Higher values improve precision at the cost of compilation time. |
| verify-memoryssa | false | Enables expensive verification of MemorySSA invariants after every modification. |
| dot-cfg-mssa | "" | If set, dumps the CFG annotated with MemorySSA information to the named DOT file. |
DSE MemorySSA Walk Limits
| Knob | Default | Effect |
|---|---|---|
| dse-memoryssa | true | Master switch enabling MemorySSA-based DSE. |
| dse-memoryssa-scanlimit | 150 | Max memory accesses DSE will scan for a redundant store. |
| dse-memoryssa-walklimit | 90 | Max MemorySSA walk steps per DSE query. |
| dse-memoryssa-partial-store-limit | 5 | Max partial stores DSE will try to merge. |
| dse-memoryssa-defs-per-block-limit | 5000 | Skip blocks with more defs than this limit. |
| dse-memoryssa-samebb-cost | 1 | Walk cost weight for same-block MemoryDefs. |
| dse-memoryssa-otherbb-cost | 5 | Walk cost weight for cross-block MemoryDefs. |
| dse-memoryssa-path-check-limit | 50 | Max paths DSE will check for nontrivial reachability. |
| dse-optimize-memoryssa | true | Enables DSE's own MemorySSA optimization (trivial phi removal during DSE). |
GVN / MemoryDependence
| Knob | Default | Effect |
|---|---|---|
| enable-gvn-memoryssa | varies | Switches GVN from MemDep to MemorySSA. |
| memdep-block-scan-limit | 100 (legacy) | Legacy MemDep per-block scan limit. |
| memdep-block-number-limit | 200 (legacy) / 1000 (NewPM) | Max blocks MemDep will search. Note: the NewPM variant defaults to 1,000, a 5x increase. |
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| Pass entry wrapper (skipFunction guard + tail call to builder) | sub_1A6CAD0 | 48 | -- |
| MemorySSA builder core (DFS rename walk) | sub_1A6A260 | 10,344 | -- |
| MemoryAccess node allocator (Def/Use/Phi) | sub_1A69110 | 1,245 | -- |
| MemoryDef creation dispatcher (routes to sub_1A69110) | sub_1A695F0 | -- | -- |
| Store-instruction MemoryDef handler (partial store support) | sub_1A69690 | 754 | -- |
| MemoryPhi operand insertion handler (bidirectional edge setup) | sub_1A69990 | 664 | -- |
| Call-instruction handler (intrinsic classification) | sub_1A69C30 | -- | -- |
| MemorySSA::getMemoryAccess or walker lookup | sub_1643330 | -- | -- |
| MemoryAccess::getDefiningAccess | sub_1643D30 | -- | -- |
| MemoryLocation::get or getForDest | sub_1644900 | -- | -- |
| Value::replaceAllUsesWith (def substitution during trivial phi removal) | sub_164B780 | -- | -- |
| MemoryAccess::~MemoryAccess (destructor) | sub_164BEC0 | -- | -- |
| MemoryAccess::eraseFromParent | sub_1AEB370 | -- | -- |
| BumpPtrAllocator::Allocate (64-byte node allocation) | sub_22077B0 | -- | -- |
| AA query: getModRefInfo / reaching-def resolution | sub_146F1B0 | -- | -- |
| AA query: may-alias check (two-pointer comparison) | sub_145CF80 | -- | -- |
| AA query: isNoAlias / clobber check | sub_1487400 | -- | -- |
| DominatorTree DFS order computation | sub_13B8390 | -- | -- |
| skipFunction guard (checks isDeclaration) | sub_1636880 | -- | -- |
Diagnostic Strings
Diagnostic strings recovered from p2-J04-memoryssa.txt and the pipeline parser (p2c.1-01-pipeline-parser.txt). MemorySSA itself emits no optimization remarks; its diagnostics are configuration knobs and the verification/dump infrastructure.
| String | Source | Category | Trigger |
|---|---|---|---|
| "memoryssa" | Pipeline parser analysis #179 | Registration | Analysis registration name in the pass pipeline |
| "print<memoryssa>" | Pipeline parser #406 | Registration | Printer pass registration; params: no-ensure-optimized-uses |
| "memssa-check-limit" | Knob (default 100) | Knob | Maximum stores/phis the CachingWalker will walk past before returning a conservative clobber |
| "verify-memoryssa" | Knob (default false) | Knob | Enables expensive verification of MemorySSA invariants after every modification; on under EXPENSIVE_CHECKS |
| "dot-cfg-mssa" | Knob (default "") | Knob | If set, dumps the CFG annotated with MemorySSA information to the named DOT file for visualization |
| "dse-memoryssa" | Knob (default true) | Knob | Master switch enabling MemorySSA-based DSE |
| "dse-memoryssa-scanlimit" | Knob (default 150) | Knob | Max memory accesses DSE will scan for a redundant store |
| "dse-memoryssa-walklimit" | Knob (default 90) | Knob | Max MemorySSA walk steps per DSE query |
| "dse-memoryssa-partial-store-limit" | Knob (default 5) | Knob | Max partial stores DSE will try to merge |
| "dse-memoryssa-defs-per-block-limit" | Knob (default 5000) | Knob | Skip blocks with more defs than this limit |
| "dse-memoryssa-samebb-cost" | Knob (default 1) | Knob | Walk cost weight for same-block MemoryDefs |
| "dse-memoryssa-otherbb-cost" | Knob (default 5) | Knob | Walk cost weight for cross-block MemoryDefs |
| "dse-memoryssa-path-check-limit" | Knob (default 50) | Knob | Max paths DSE will check for nontrivial reachability |
| "dse-optimize-memoryssa" | Knob (default true) | Knob | Enables DSE's own MemorySSA optimization (trivial phi removal during DSE) |
| "enable-gvn-memoryssa" | Knob (varies) | Knob | Switches GVN from MemDep to MemorySSA |
| "memdep-block-scan-limit" | Knob (default 100 legacy) | Knob | Legacy MemDep per-block scan limit |
| "memdep-block-number-limit" | Knob (default 200 legacy / 1000 NewPM) | Knob | Max blocks MemDep will search; NewPM variant defaults to 1,000 (5x increase) |
| "print<memoryssa-walker>" | Pipeline parser | Registration | MemorySSA walker printer pass |
| "early-cse-memssa" | Pipeline parser | Registration | EarlyCSE variant that uses MemorySSA |
Cross-References
- Alias Analysis & NVVM AA -- the AA pipeline that feeds MemorySSA with GPU-aware NoAlias results
- LICM -- primary consumer; NVVM AA-enhanced MemorySSA enables aggressive hoisting of shared-memory loads past global stores
- DSE -- walks MemorySSA backwards to find dead stores; extensive set of MemorySSA-specific knobs
- GVN -- optional MemorySSA backend via enable-gvn-memoryssa
- EarlyCSE -- EarlyCSE's memssa variant uses MemorySSA for redundant load elimination
LazyCallGraph & CGSCC Pass Manager
The LazyCallGraph (LCG) is the data structure that represents which functions call or reference which other functions, built on demand rather than up front. It drives the CGSCC (Call Graph Strongly Connected Components) pass manager, which walks the call graph in bottom-up order so that interprocedural passes -- the inliner, argument promotion, devirtualization, function attribute inference -- process callees before callers. This ordering is essential: the inliner must have finished optimizing a callee's body before it decides whether to inline that callee into a caller. cicc v13.0 uses LLVM's stock LazyCallGraph implementation without NVIDIA-specific modifications to the graph itself. The GPU-specific behavior comes entirely from how the pipeline configures the CGSCC framework: kernels serve as call graph roots, device functions are internal nodes, recursion is rare, and the inline cost model is radically different from any CPU target.
The LCG cluster occupies approximately 220KB of code at 0xD230A0--0xD2F8A0, containing the graph construction logic, Tarjan's SCC algorithm, incremental SCC mutation operations, and the DOT/text graph printers. A separate 69KB function at sub_2613930 implements the New PM CGSCC inliner that runs inside this framework.
Key Facts
| Property | Value |
|---|---|
| Binary cluster | 0xD230A0 -- 0xD2F8A0 (~220KB, ~25 functions) |
| LLVM source | llvm/lib/Analysis/LazyCallGraph.cpp |
| CGSCC pass manager | sub_1A62BF0 (the InlinerWrapper/standard pipeline factory) |
| CGSCC pipeline parser | sub_2377300 (103KB) |
| CGSCC-to-function adaptor | sub_2362FB0 (6.7KB) |
| New PM CGSCC inliner | sub_2613930 (69KB) |
| NVIDIA custom inliner | sub_1864060 (75KB, the old CGSCC SCC-walk inliner) |
| Inliner core loop | sub_186CA00 (61KB, Inliner::inlineCallsImpl) |
| DevirtSCCRepeatedPass | sub_2284BC0 (16KB, "Max devirtualization iterations reached") |
| SCC object size | 136 bytes (0x88) |
| Edge encoding | Pointer with tag bits: bit 2 = call edge, bit 2 clear = ref edge |
| DenseMap hash | hash(ptr) = (ptr >> 4) ^ (ptr >> 9), bucket size = 16 bytes |
| DenseMap sentinels | Empty = 0xFFFFFFFFFFFFF000, Tombstone = 0xFFFFFFFFFFFFE000 |
| CGSCC invocations per O1/O2/O3 | 4 passes of sub_1A62BF0(1,...), 1 iteration each |
| CGSCC invocations at tier 3 | sub_1A62BF0(5,...) -- 5 iterations |
| BumpPtrAllocator | [LCG+0x150] cursor, [LCG+0x158] slab end |
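The DenseMap details in the table (hash function and sentinel keys) can be reproduced directly. A sketch; bucket probing and rehashing are omitted:

```cpp
#include <cstdint>

// Pointer hash used by the LCG's DenseMaps, as recovered from the binary.
uint64_t lcgHash(uint64_t ptr) { return (ptr >> 4) ^ (ptr >> 9); }

// Reserved key values that can never be real (page-aligned-tail) pointers:
// all-ones with the low 12 bits cleared, and the same minus one page.
constexpr uint64_t kEmptyKey     = 0xFFFFFFFFFFFFF000ULL;
constexpr uint64_t kTombstoneKey = 0xFFFFFFFFFFFFE000ULL;

bool isSpecialKey(uint64_t k) { return k == kEmptyKey || k == kTombstoneKey; }
```

Shifting by 4 discards the alignment zeros of heap pointers before mixing, which is why the hash starts at bit 4 rather than bit 0.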
Lazy Call Graph Construction
The graph is not built all at once. When the CGSCC pass manager begins, the LCG starts with just the module's externally visible functions and kernel entry points as root nodes. Each node's edges are populated only when first visited by the SCC traversal -- the Node::populateSlow() method (sub_D23BF0 returns the edge iterator range) scans all instructions in the function, recording two kinds of edges:
Call edges (bit 2 set in pointer tag): direct CallBase instructions whose callee resolves to a defined function. These form the strong connectivity that defines SCCs.
Ref edges (bit 2 clear): any other reference to a defined function -- a function pointer stored in a global, passed as a callback argument, taken address of. These contribute to RefSCC grouping but do not create call-graph cycles.
Node layout (deduced from binary):
+0x00: Function* (LLVM IR function)
+0x08: Edge array pointer (populated lazily)
+0x10: Edge count / DFSNumber (int32, -1 = completed)
+0x14: LowLink (int32, repurposed as SCC index after Tarjan)
+0x18: Callee edge list (second array for call edges)
+0x20: Callee edge count
Edge encoding (single qword):
Bits 63..3: pointer to target Node
Bit 2: 1 = call edge, 0 = ref edge
Bits 1..0: reserved (alignment)
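The tagged-pointer edge encoding above can be written out as mask operations (a sketch of the scheme, assuming Node objects are at least 8-byte aligned so the low three bits are free):

```cpp
#include <cstdint>

// LazyCallGraph edge: target Node* in the high bits, bit 2 = call edge.
constexpr std::uintptr_t kCallBit = 0x4;

std::uintptr_t makeEdge(const void *node, bool isCall) {
    return reinterpret_cast<std::uintptr_t>(node) | (isCall ? kCallBit : 0);
}
bool isCallEdge(std::uintptr_t e) { return (e & kCallBit) != 0; }
const void *edgeTarget(std::uintptr_t e) {
    return reinterpret_cast<const void *>(e & ~std::uintptr_t(0x7));
}
```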
Population is the only lazy step. Once a node is populated, its edges are cached. Subsequent visits reuse the cached edge list at [node+0x08]. The scan checks [rsi] != 0 to skip unresolvable edges (declaration-only functions with no body).
For a reimplementation: scan every instruction in the function. For each CallBase, if the callee is a defined function, add a call edge. Then walk all non-call operands recursively through constants (including BlockAddress, GlobalAlias, ConstantExpr) collecting any additional function references as ref edges. This matches upstream populateSlow() exactly.
SCC and RefSCC: The Two-Level Hierarchy
The LCG maintains a two-level SCC decomposition:
- SCC (Call SCC): a maximal set of functions connected by call edges such that every function is reachable from every other through calls. This is the unit of work for the CGSCC pass manager.
- RefSCC (Reference SCC): a maximal set of SCCs connected by ref edges. A RefSCC contains one or more SCCs. SCCs within a RefSCC can reference each other (e.g., mutually store each other's function pointers) but do not necessarily call each other.
RefSCC layout (from [r15] in sub_D25FD0):
+0x00: LazyCallGraph* (parent graph)
+0x08: SCC array pointer (SmallVector data)
+0x10: SCC array size
+0x14: SCC array capacity
+0x38: DenseMap #1 (SCC* -> index)
+0x38: qword - bucket base pointer (or inline start)
+0x40: byte - flags (bit 0 = active map selector)
+0x44: dword - tombstone count / generation
+0x48: DenseMap #2 (alternate map for lazy rehashing)
+0x48: qword - bucket base pointer
+0x50: dword - bucket count
SCC layout (136 bytes = 0x88):
+0x00: qword - parent pointer / metadata
+0x08: qword - node member array pointer
+0x10: dword - member count
+0x14: dword - capacity
+0x18: Edge list / callee info
+0x38: DenseMap - node-to-index or similar
The bottom-up SCC ordering is computed using Tarjan's algorithm, implemented in sub_D2C610. The algorithm uses the standard DFS stack with 24-byte entries ({Node*, EdgeIter, EdgeEnd}) and the classic DFSNumber / LowLink fields at node offsets +0x10 and +0x14. When LowLink == DFSNumber, the node is an SCC root -- all nodes above it on the DFS result stack are popped into a new SCC, their DFSNumber set to -1 (completed), and the SCC index written into the LowLink field for reuse.
The Tarjan inner loop at 0xD2CD90--0xD2CEA4 and the SCC member popping at 0xD2CF61--0xD2CFD0 are both 4x unrolled, indicating these are hot paths in the CGSCC pipeline.
Tarjan's SCC Algorithm: Binary-Level Pseudocode
Complexity. Tarjan's SCC algorithm is O(V + E), where V = number of nodes (functions) and E = number of call edges among those nodes; the 4x-unrolled inner loop is a constant-factor optimization, not an algorithmic change. Per operation:
- The initial buildSCCs (sub_D2BEB0) runs Tarjan once over the entire call graph: O(V_total + E_total).
- The incremental switchInternalEdgeToRef runs Tarjan only over the affected SCC's members: O(V_scc + E_scc), typically O(1) since most GPU SCCs contain a single function.
- switchInternalEdgeToCall is O(V_scc + E_scc) for the same-SCC fast path (bit flip only), or O(M * V_scc) for the slow merge path, where M = number of SCCs being merged.
- switchOutgoingEdgeToCall/Ref (sub_D27A10, 29KB) is O(R * S), where R = number of RefSCCs involved and S = total SCCs in those RefSCCs.
- The DenseMap operations throughout use (ptr >> 4) ^ (ptr >> 9) hashing with O(1) amortized insert/lookup.
- Graph verification (sub_D29180) is O(V + E) for the entire graph.
- The CGSCC pass manager's outer loop processes each SCC once in post-order, re-visiting at most max_devirt_iterations times (default 1, tier 3: 5), giving O(max_iter * V) passes over the SCCs.
The Tarjan implementation lives inside sub_D2C610 (switchInternalEdgeToRef) at address range 0xD2CC66--0xD2D0BC. It recomputes SCCs within a single RefSCC after a call edge is demoted to a ref edge, which may split the original SCC into multiple smaller SCCs.
The following pseudocode is reconstructed directly from the binary. Every variable name corresponds to a register or stack slot; every offset corresponds to a binary address.
// Address: 0xD2CC66 -- 0xD2D0BC (inside sub_D2C610)
// Input: RefSCC containing one SCC whose internal call-edge structure changed
// Output: zero or more new SCCs replacing the original
struct StackEntry { // 24 bytes (0x18)
Node* node; // +0x00
Edge* edge_iter; // +0x08
Edge* edge_end; // +0x10
};
fn tarjan_recompute_scc(old_scc: &SCC, allocator: &BumpPtrAllocator) -> Vec<SCC> {
// --- Phase 0: Initialize ---
let mut dfs_counter: i32 = 1; // r13d, starts at 1
let mut worklist: SmallVector<StackEntry, 4>; // [rbp-0xA0], 24-byte entries
let mut result_stack: SmallVector<*Node, 8>; // [rbp-0x120]
let mut new_scc_count: i32 = 0; // r14d, incremented per SCC found
// --- Phase 1: Push all nodes of old_scc as unvisited roots ---
for node in old_scc.members() {
node.DFSNumber = 0; // [node+0x10] = 0 (unvisited marker)
node.LowLink = 0; // [node+0x14] = 0
}
// --- Phase 2: Outer loop -- pick next unvisited root (0xD2CCF7) ---
for root in old_scc.members() {
if root.DFSNumber != 0 { continue; } // already visited
// Assign DFS number and LowLink to root
root.DFSNumber = dfs_counter; // [rbx+0x10] = r12d
root.LowLink = dfs_counter; // [rbx+0x14] = r12d
dfs_counter += 1; // r13d++
// Lazy-populate edges if not yet done
let (edge_begin, edge_end) = sub_D23BF0(&root.edge_list); // 0xD2CD0E
worklist.push(StackEntry { node: root, edge_iter: edge_begin, edge_end });
// --- Phase 3: DFS inner loop (0xD2CD90 -- 0xD2CEA4, 4x unrolled) ---
while let Some(top) = worklist.last_mut() {
if top.edge_iter == top.edge_end {
// All edges of current node exhausted -- backtrack
let finished = top.node;
worklist.pop(); // 0xD2CE80
// LowLink propagation to parent
if let Some(parent) = worklist.last_mut() {
// 0xD2CDF5: min(parent.LowLink, finished.LowLink)
let child_low = finished.LowLink; // [rbx+0x14]
if child_low >= 0 && child_low < parent.node.LowLink {
parent.node.LowLink = child_low; // [r15+0x14] = edx
}
}
// --- Phase 4: SCC root detection (0xD2CF01) ---
if finished.DFSNumber == finished.LowLink {
// This node is an SCC root. Pop members from result_stack.
// (0xD2CF30 -- 0xD2CFD2, 4x unrolled)
let scc_dfs = finished.DFSNumber; // [r15+0x10]
loop {
// Unrolled: processes 4 nodes per iteration.
// Peek before popping: a node with a smaller DFSNumber belongs to
// an enclosing SCC and must remain on the result stack.
if result_stack.is_empty() { break; }
let member = result_stack.last();
if member.DFSNumber < scc_dfs { break; } // 0xD2CF61
result_stack.pop();
member.DFSNumber = -1; // 0xFFFFFFFF = completed
member.LowLink = new_scc_count; // assign SCC index
}
// The root itself
finished.DFSNumber = -1;
finished.LowLink = new_scc_count;
new_scc_count += 1; // r14d++
} else {
// Not a root -- push onto result stack for later popping
result_stack.push(finished);
}
continue;
}
// Advance to next edge
let edge_raw = *top.edge_iter; // load qword
top.edge_iter += 1; // advance by 8
let target_node = edge_raw & 0xFFFFFFFFFFFFFFF8; // mask off tag bits
let is_call = (edge_raw & 0x4) != 0; // bit 2 = call edge
// Only follow CALL edges for SCC computation (ref edges ignored)
if !is_call { continue; }
if target_node == 0 { continue; } // skip null targets
let target_dfs = target_node.DFSNumber; // [target+0x10]
if target_dfs == 0 {
// Unvisited: assign DFS number, push onto worklist
target_node.DFSNumber = dfs_counter; // 0xD2CD78
target_node.LowLink = dfs_counter;
dfs_counter += 1;
let (eb, ee) = sub_D23BF0(&target_node.edge_list);
worklist.push(StackEntry { node: target_node, edge_iter: eb, edge_end: ee });
} else if target_dfs == -1 {
// Already in a completed SCC -- skip entirely
continue;
} else {
// On the stack (tree/back edge): update LowLink
// 0xD2CDF5: min(current.LowLink, target.DFSNumber)
if target_dfs < top.node.LowLink {
top.node.LowLink = target_dfs;
}
}
}
}
}
Key binary details:
- The DFS counter is split between r12d and r13d, alternating roles. In practice r13d holds the next available DFS number, starting at 2 (the root gets 1 via the 0x100000001 packed initialization at 0xD2CD0E).
- The 4x-unrolled inner loop at 0xD2CD90 processes four edge entries per iteration before branching back, reducing loop overhead on this hot path.
- The SCC member popping at 0xD2CF61--0xD2CFD0 is likewise 4x unrolled: it pops at offsets -8, -0x10, -0x18, -0x20 relative to the result stack top, then subtracts 0x20 from the stack pointer per iteration.
- The completed marker -1 (0xFFFFFFFF) is written to [node+0x10] (DFSNumber), and the SCC identifier (the r14d counter) is written to [node+0x14] (LowLink). After Tarjan completes, the LowLink field holds the SCC index for every node -- the DFSNumber/LowLink fields are repurposed, not preserved.
- Only call edges (bit 2 set) are followed during Tarjan. Ref edges (bit 2 clear) are skipped. This is what makes the SCC decomposition "call-SCC" rather than "reference-SCC."
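The variant above is easy to model executably. The following Python sketch follows the binary's conventions -- DFSNumber 0 = unvisited, -1 = completed, LowLink repurposed as the SCC index, only call edges followed -- but the graph representation (dict of adjacency lists) is an illustrative stand-in for the tagged edge arrays:

```python
def tarjan_call_sccs(nodes, call_edges):
    """nodes: iterable of hashable ids; call_edges: dict id -> list of callee ids.
    Returns dict id -> SCC index. Indices are assigned as SCC roots complete,
    so callee SCCs get smaller indices than caller SCCs (bottom-up post-order)."""
    dfs_num = {n: 0 for n in nodes}   # models [node+0x10]; 0 = unvisited
    low_link = {n: 0 for n in nodes}  # models [node+0x14]
    counter = 1                       # r13d: next DFS number
    scc_index = {}                    # final (repurposed) LowLink contents
    next_scc = 0                      # r14d: SCC counter
    result_stack = []

    for root in nodes:
        if dfs_num[root] != 0:
            continue
        dfs_num[root] = low_link[root] = counter; counter += 1
        worklist = [(root, iter(call_edges.get(root, ())))]  # StackEntry pairs
        while worklist:
            node, edges = worklist[-1]
            target = next(edges, None)
            if target is None:
                worklist.pop()                            # edges exhausted: backtrack
                if worklist:                              # propagate LowLink upward
                    parent = worklist[-1][0]
                    low_link[parent] = min(low_link[parent], low_link[node])
                if dfs_num[node] == low_link[node]:       # SCC root detection
                    while result_stack and dfs_num[result_stack[-1]] >= dfs_num[node]:
                        m = result_stack.pop()
                        dfs_num[m] = -1                   # completed marker
                        scc_index[m] = next_scc
                    dfs_num[node] = -1
                    scc_index[node] = next_scc
                    next_scc += 1
                else:
                    result_stack.append(node)
                continue
            if dfs_num[target] == 0:                      # unvisited: descend
                dfs_num[target] = low_link[target] = counter; counter += 1
                worklist.append((target, iter(call_edges.get(target, ()))))
            elif dfs_num[target] != -1:                   # on stack: back edge
                low_link[node] = min(low_link[node], dfs_num[target])
    return scc_index
```

On a mutual-recursion pair plus one caller (a <-> b, c calls a), this assigns {a, b} to SCC 0 and c to SCC 1, matching the bottom-up ordering the pass manager consumes.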
Incremental SCC Mutation Operations
When a pass modifies the call graph, the SCC structure must be updated without recomputing the entire graph. The LCG provides six mutation operations, each handling a specific kind of edge change. The two most complex are switchInternalEdgeToCall and switchInternalEdgeToRef; the others handle cross-RefSCC edges and bulk operations.
switchInternalEdgeToCall -- sub_D25FD0 (5,526 bytes)
Called when a ref edge within the same RefSCC becomes a call edge (the inliner or devirtualization resolves an indirect call to a direct call). This may merge previously separate SCCs into one.
// Address: 0xD25FD0 -- 0xD27566
// Signature (deduced):
// RefSCC::switchInternalEdgeToCall(
// Node& SourceN, // rsi
// Node& TargetN, // rdx
// function_ref<void(ArrayRef<SCC*>)> MergeCB // rcx (nullable), r8 (data)
// ) -> bool
fn switchInternalEdgeToCall(source: &Node, target: &Node, merge_cb: Option<Fn>) -> bool {
let source_scc = sub_D23C40(lcg, source); // lookupSCC at 0xD26025
let target_scc = sub_D23C40(lcg, target); // lookupSCC at 0xD2604E
// FAST PATH 1: Same SCC -- edge type flip only, no structural change
if source_scc == target_scc { // 0xD26B5B
// Mark the edge as a call edge (flip bit 2) via sub_D23E00
return false; // no SCC change
}
// Look up SCC indices within the RefSCC's ordered list
let source_idx = sub_D25BD0(refscc.map, source_scc); // 0xD26055
let target_idx = sub_D25BD0(refscc.map, target_scc); // 0xD260A0
// FAST PATH 2: Source already appears after target in post-order
// (the new call edge doesn't create a cycle in the SCC DAG)
if source_idx > target_idx { // 0xD260B4
// Mark edge as call, no SCC restructuring needed
return false;
}
// SLOW PATH: The new call edge creates a cycle between SCCs.
// Must merge all SCCs in the range [target_idx .. source_idx].
// Phase A: DFS reachability within the RefSCC (0xD26C92 -- 0xD26DAB)
// Walk call edges from target, collecting all SCCs reachable
// back to source. Uses SmallVector worklist (cap 4) and
// DenseMap visited set at [r15+0x48].
let mut merge_set: SmallVector<SCC*, 4>;
let mut visited: DenseSet<SCC*>;
// ... DFS marks all SCCs on the cycle ...
// Phase B: Merge SCCs (0xD26335 -- 0xD263E1)
let merge_range = &refscc.scc_array[target_idx..=source_idx];
let merge_count = merge_range.len();
// Allocate temp buffer for std::rotate
let tmp = sub_2207800(merge_count * 8); // operator new
// sub_D23910 rotates the SCC array to consolidate merged entries
sub_D23910(refscc.scc_array, target_idx, source_idx);
// Move all nodes from secondary SCCs into the primary SCC
for scc in &merge_range[1..] {
primary_scc.members.extend(scc.members);
scc.members.clear();
}
// Update the SCC-to-index DenseMap with double-buffered rehashing
// Toggle flags byte at [RefSCC+0x40], tombstone old entries,
// insert new entries into the alternate map via sub_D24C50
// Phase C: Invoke merge callback (0xD26480)
if let Some(cb) = merge_cb {
cb(ArrayRef { ptr: merge_range.as_ptr(), len: merge_count });
}
// Phase D: Reindex remaining SCCs (0xD267A2)
for scc in &refscc.scc_array[target_idx + 1..] {
scc_index_map[scc] -= merge_count - 1; // "sub [rax], ebx" at 0xD267B9
}
// Notify the graph of structural change
sub_D23D60(lcg, 1); // notifyRefSCCChange
return true; // SCC structure changed
}
Allocation fallback: The temporary buffer allocation at 0xD27447 has a halving fallback (sar rbx, 1): if operator new fails for the full size, it retries with half the size. This handles the case where the merge set is unexpectedly large.
DenseMap double-buffering: The RefSCC maintains two DenseMaps at offsets +0x38 and +0x48. The flags byte at +0x40 (bit 0) selects which map is "current." When entries are migrated during SCC merging, old entries are tombstoned (0xFFFFFFFFFFFFE000) in the departing map and inserted fresh into the other map via sub_D24C50. This avoids a full rehash on every merge -- the tombstone count at +0x44 is incremented, and the map is only rehashed (via sub_D25CB0) when the tombstone ratio crosses a threshold.
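A minimal sketch of the open-addressing scheme described above, using the recovered hash function and sentinel constants. The probing strategy and class shape here are simplified assumptions for illustration, not recovered logic:

```python
EMPTY     = 0xFFFFFFFFFFFFF000  # empty bucket sentinel
TOMBSTONE = 0xFFFFFFFFFFFFE000  # deleted bucket sentinel

def ptr_hash(ptr: int) -> int:
    """Recovered hash: (ptr >> 4) ^ (ptr >> 9)."""
    return (ptr >> 4) ^ (ptr >> 9)

class PtrMap:
    """Open-addressing pointer-keyed map; power-of-two bucket count."""
    def __init__(self, nbuckets=16):
        self.keys = [EMPTY] * nbuckets
        self.vals = [None] * nbuckets

    def _probe(self, key):
        n = len(self.keys)
        i = ptr_hash(key) & (n - 1)
        while self.keys[i] not in (key, EMPTY):  # skips tombstones too
            i = (i + 1) & (n - 1)                # linear probing (assumed)
        return i

    def insert(self, key, val):
        i = self._probe(key)
        self.keys[i], self.vals[i] = key, val

    def lookup(self, key):
        i = self._probe(key)
        return self.vals[i] if self.keys[i] == key else None

    def erase(self, key):
        i = self._probe(key)
        if self.keys[i] == key:
            # Tombstone, not EMPTY, so later probes continue past this slot.
            self.keys[i] = TOMBSTONE
            self.vals[i] = None
```

The key point the sentinels encode: erasing must leave a tombstone rather than an empty bucket, or lookups for colliding keys would terminate early. The double-buffered variant in the binary amortizes tombstone cleanup by migrating live entries into the alternate map instead of rehashing in place.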
switchInternalEdgeToRef -- sub_D2C610 (5,236 bytes)
Called when a call edge within a RefSCC is demoted to a ref edge (a direct call is deleted or replaced with an indirect reference). This may split a single SCC into multiple smaller SCCs.
// Address: 0xD2C610 -- 0xD2DA84
// Signature (deduced):
// RefSCC::switchInternalEdgeToRef(
// RefSCC& Result, // rdi (output -- new RefSCC or self)
// ArrayRef<pair<Node*, Node*>> Pairs // rdx (edge mutations), rcx (byte count)
// ) -> RefSCC&
fn switchInternalEdgeToRef(pairs: &[(Node, Node)]) -> Vec<SCC> {
// Phase 0: Flip all edge types from call to ref (0xD2C6A2)
for (source, target) in pairs {
sub_D23E00(&source.edge_list, target); // clear bit 2 in edge pointer
}
// Phase 1: Check which pairs actually cross SCC boundaries (0xD2C6A2 -- 0xD2CA2B)
// Processes pairs 4 at a time (4x unrolled loop).
// For each pair: DenseMap lookup of source's SCC and target's SCC.
// If same SCC: the call-to-ref demotion might break the SCC.
// If different SCCs: no structural impact (they were already separated).
let mut needs_recompute = false;
for (source, target) in pairs { // 4x unrolled at 0xD2C6D0
let src_scc = densemap_lookup(source);
let tgt_scc = densemap_lookup(target);
if src_scc == tgt_scc {
needs_recompute = true;
}
}
if !needs_recompute { return vec![old_scc]; }
// Phase 2: Run Tarjan's algorithm on the affected SCC (0xD2CC66 -- 0xD2D0BC)
// (See "Tarjan's SCC Algorithm" section above for full pseudocode.)
let new_sccs = tarjan_recompute_scc(old_scc, &lcg.allocator);
if new_sccs.len() == 1 {
// The SCC survived intact -- no split occurred
return vec![old_scc];
}
// Phase 3: Allocate new SCC objects (0xD2D0BC -- 0xD2D12E)
for i in 1..new_sccs.len() {
// BumpPtrAllocator at [LCG+0x150]:
let cursor = lcg.alloc_cursor; // [r12+0x150]
let aligned = (cursor + 7) & !7; // align to 8
let new_end = aligned + 0x88; // 0x88 = 136 bytes per SCC
if new_end > lcg.alloc_slab_end { // [r12+0x158]
sub_9D1E70(allocator, 0x88, 8); // slow path: allocate new slab
}
lcg.alloc_cursor = new_end;
let scc = aligned as *mut SCC;
sub_D23F30(scc, lcg); // SCC constructor
}
// Phase 4: Distribute nodes among new SCCs (0xD2D1F2 -- 0xD2D309)
// Each node's LowLink field (set by Tarjan to its SCC index) determines
// which new SCC it belongs to.
for node in old_scc.members() {
let scc_idx = node.LowLink; // [node+0x14]
new_sccs[scc_idx].members.push(node);
}
// Phase 5: Update ownership maps (0xD2D168 -- 0xD2D1DC)
// Register new SCCs in the RefSCC's SCC list via sub_D248B0
for scc in &new_sccs[1..] {
sub_D248B0(lcg, refscc, scc); // insertRefSCC
}
// Update Node -> SCC DenseMap entries
// Update SCC -> RefSCC back-pointers via sub_D27750
// Phase 6: Clean up old SCC (0xD2D3D6 -- 0xD2D49A)
// Reset all DFS/LowLink fields to -1 (completed state)
// Zero out old SCC's member list
// Clear old SCC's internal DenseMap via sub_D24EE0
return new_sccs;
}
Batch processing optimization: The pair-processing loop at 0xD2C6A2 is 4x unrolled: it processes four (Node*, Node*) pairs per iteration, with explicit remainder handling (1, 2, or 3 leftover pairs) at 0xD2CA2B. Each pair occupies 16 bytes (0x10), so the loop advances by 64 bytes per iteration.
SCC object allocation: New SCC objects (136 bytes each) are allocated from the LCG's BumpPtrAllocator at [LCG+0x150]. The allocator maintains a cursor/end pair for the current slab. When the slab is exhausted, sub_9D1E70 allocates a new slab (the slow path). The alignment requirement is 8 bytes, enforced by the (cursor + 7) & ~7 round-up at 0xD2D0F0.
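The bump-allocation fast path reduces to two arithmetic operations plus a slab-end comparison. A sketch, with arbitrary slab size and fake addresses (the real allocator's slab policy in sub_9D1E70 is not fully recovered):

```python
class BumpPtrAllocator:
    """Models the cursor/end pair at [LCG+0x150]/[LCG+0x158]."""
    SLAB_SIZE = 4096  # illustrative; the real slab size is not recovered

    def __init__(self):
        self.cursor = 0x10000               # current slab cursor (fake address)
        self.slab_end = self.cursor + self.SLAB_SIZE

    def allocate(self, size: int, align: int = 8) -> int:
        aligned = (self.cursor + align - 1) & ~(align - 1)  # round up, cf. 0xD2D0F0
        new_end = aligned + size
        if new_end > self.slab_end:          # slow path: new slab (cf. sub_9D1E70)
            aligned = self.slab_end          # pretend the next slab starts here
            self.slab_end = aligned + max(self.SLAB_SIZE, size)
            new_end = aligned + size
        self.cursor = new_end                # bump: no per-object free, no headers
        return aligned

alloc = BumpPtrAllocator()
scc0 = alloc.allocate(0x88)  # one 136-byte SCC object
scc1 = alloc.allocate(0x88)  # packed immediately after the first
```

Because SCC objects are never freed individually, the allocator needs no per-object metadata: consecutive 0x88-byte allocations land back to back in the slab.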
switchOutgoingEdgeToCall / switchOutgoingEdgeToRef -- sub_D27A10 (29,179 bytes)
Handles edges that cross RefSCC boundaries. When a ref edge from one RefSCC to another becomes a call edge (or vice versa), the RefSCC structure may need updating. If the new call edge creates a cycle between previously separate RefSCCs, they merge into one. This is the RefSCC-level analog of switchInternalEdgeToCall. The function at sub_D27A10 is 29KB -- the largest single function in the LCG cluster -- because it must handle both directions (to-call and to-ref) and the full RefSCC merge/split logic.
insertInternalRefEdge -- sub_D2A080 (15,253 bytes)
Adds a new ref edge within a RefSCC. Called when optimization introduces a new reference between functions that are already in the same RefSCC (e.g., a new constant expression referencing a sibling function). This does not affect SCC structure (only call edges define SCCs), but it updates the RefSCC's internal edge tracking.
computeRefSCC -- sub_D2AD40 (12,495 bytes)
Computes the RefSCC decomposition from scratch for a set of nodes. Used during initial graph construction (sub_D2BEB0) and when incremental updates are insufficient (e.g., after bulk edge insertion). This runs a second level of Tarjan's algorithm over the ref-edge graph, grouping SCCs into RefSCCs.
mergeRefSCC -- sub_D2DA90 (17,930 bytes)
Merges two or more RefSCCs into one. Called when a new ref edge or promoted call edge connects previously separate RefSCCs that are now mutually reachable. This involves relocating all SCCs from the source RefSCC into the target, updating the graph's RefSCC list at [LCG+0x240], and fixing all back-pointers.
CGSCC Pass Manager: Bottom-Up Interprocedural Optimization
The CGSCC pass manager (sub_1A62BF0) wraps the LCG traversal and runs a pipeline of CGSCC passes over each SCC in bottom-up (post-order) order. The pass manager is invoked multiple times at different points in the optimization pipeline, controlled by a pipelineID parameter.
In the O1/O2/O3 pipeline, it is invoked four times, each with 1 devirtualization iteration:
sub_1A62BF0(1,0,0,1,0,0,1) -- pass #2 (inliner framework, early)
sub_1A62BF0(1,0,0,1,0,0,1) -- pass #17 (after DSE/GVN/MemCpyOpt)
sub_1A62BF0(1,0,0,1,0,0,1) -- pass #21 (after ADCE/JumpThreading)
sub_1A62BF0(1,0,0,1,0,0,1) -- pass #38 (late, after Sink)
At higher tier levels (tier 3 aggressive optimization), a 5-iteration variant appears: sub_1A62BF0(5,0,0,1,0,0,1). The first parameter controls the maximum number of SCC re-visitation iterations when the call graph is mutated during optimization.
The pipeline IDs observed across all optimization levels are 1, 2, 4, 5, 7, and 8, likely corresponding to LLVM's PassBuilder extension points:
| Pipeline ID | Extension Point | Notes |
|---|---|---|
| 1 | EP_EarlyAsPossible / basic cleanup | Most common, 4x per O2 |
| 2 | EP_LoopOptimizerEnd | |
| 4 | EP_ScalarOptimizerLate | Sometimes with optFlag=1 |
| 5 | EP_VectorizerStart | Used at tier 3 (5 iterations) |
| 7 | EP_OptimizerLast | |
| 8 | EP_CGSCCOptimizerLate | With optFlag=1 for inlining |
The CGSCC Pass Manager Run Loop
The pass manager's run loop implements the DevirtSCCRepeatedPass pattern. For each SCC in post-order:
fn run_cgscc_pipeline(module: &Module, lcg: &mut LazyCallGraph, max_devirt_iterations: u32) {
// Build initial SCC post-order via sub_D2BEB0 (buildSCCs)
let post_order = lcg.build_sccs(); // sub_D2BEB0, 10KB
for refscc in post_order.bottom_up() { // sub_D2F8A0 / sub_D30800
for scc in refscc.sccs() { // sub_D2E510, 7KB
let mut iteration = 0;
let mut changed = true;
while changed && iteration < max_devirt_iterations {
changed = false;
iteration += 1;
// Run each registered CGSCC pass on this SCC
for pass in &cgscc_pipeline {
let result = pass.run(scc, lcg);
if result.invalidated_call_graph {
// The pass mutated the call graph.
// Update SCC structure via switchInternal* operations.
// If SCCs were merged or split, re-queue affected SCCs.
changed = true;
}
// Run the CGSCC-to-function adaptor (sub_2362FB0)
// to apply function-level passes to newly modified functions
if result.invalidated_functions {
for func in scc.functions() {
run_function_pipeline(func);
}
}
}
}
if iteration >= max_devirt_iterations && changed {
// sub_2284BC0: "Max devirtualization iterations reached"
// Controlled by abort-on-max-devirt-iterations-reached knob
}
}
}
}
Iteration semantics: The max_devirt_iterations parameter (argument 1 to sub_1A62BF0) controls how many times the pass manager will re-run the CGSCC pipeline on an SCC after the call graph mutates. At O1/O2/O3, this is 1 (single pass, no re-visitation). At tier 3, this is 5 (up to 5 re-runs if devirtualization keeps revealing new direct calls). The devirt iteration check at sub_2284BC0 emits "Max devirtualization iterations reached" when the limit is hit and the graph is still changing.
CGSCC-to-Function Adaptor -- sub_2362FB0 (6,700 bytes)
The adaptor at sub_2362FB0 wraps a function-level pass for execution inside the CGSCC framework. When the inliner inlines a callee, the callee's body is absorbed into the caller. The caller must then be re-optimized with function-level passes (SimplifyCFG, InstCombine, etc.) before the next CGSCC pass runs. The adaptor handles this by running the function pipeline on each function in the current SCC after each CGSCC pass that reports a change.
The adaptor constructor at sub_230AC20 (5.4KB) creates the module-to-function or CGSCC-to-function wrappers. The adaptor itself stores the inner pass pipeline as a nested FunctionPassManager and forwards run() calls to each function in the SCC.
Registered CGSCC Passes
The registered CGSCC passes (from the pipeline parser at sub_2377300):
| Pass name | Address/factory | Purpose |
|---|---|---|
| inline | sub_2613930 | New PM CGSCC inliner (69KB) |
| argpromotion | sub_2500970 | Promote pointer args to by-value |
| attributor-cgscc | sub_2582AC0 | CGSCC attribute deduction (39KB) |
| attributor-light-cgscc | -- | Lightweight variant |
| function-attrs | sub_1841180 | Infer readonly, nounwind, etc. |
| openmp-opt-cgscc | -- | OpenMP kernel optimization |
| coro-annotation-elide | -- | Coroutine elision |
| coro-split | -- | Coroutine splitting |
| nv-early-inliner | via sub_2342850 | NVIDIA early inliner (wraps InlinerWrapper) |
CGSCC analyses (3 registered):
| Analysis name | Purpose |
|---|---|
| no-op-cgscc | No-op analysis (placeholder) |
| fam-proxy | FunctionAnalysisManagerCGSCCProxy -- bridges function-level analyses into CGSCC |
| pass-instrumentation | Pass instrumentation callbacks (via sub_2342830) |
How the CGSCC Inliner Uses the Call Graph
The inliner is the most important consumer of the LazyCallGraph. The New PM inliner at sub_2613930 (69KB) and the NVIDIA custom inliner at sub_1864060 (75KB) both interact with the LCG through a specific protocol.
The core inlining loop (implemented at sub_186CA00, 61KB, Inliner::inlineCallsImpl) runs within the CGSCC framework:
fn inline_calls_in_scc(scc: &mut SCC, lcg: &mut LazyCallGraph) {
// Collect all call sites in the SCC
let mut worklist: Vec<CallSite> = collect_call_sites(scc);
for callsite in &worklist {
let callee = callsite.callee();
let caller = callsite.caller();
// Compute inline cost
let cost = compute_inline_cost(callee, caller); // sub_1864060
// Decision: inline if cost < threshold
// (emits optimization remarks: "Inlined", "NotInlined", "AlwaysInline",
// "NeverInline", "TooBig", etc.)
if should_inline(cost) {
// Perform inlining transformation
inline_function(callsite);
// CRITICAL: Update the call graph after inlining.
// The callee's body is now in the caller. New call edges
// may have appeared (callee's callees are now caller's callees).
// Old edges may have disappeared (the call to callee is gone).
// For each new direct call discovered in the inlined body:
// lcg.switchInternalEdgeToCall(caller_node, new_callee_node)
// -> may merge SCCs, triggering re-visitation
// For the removed call edge (caller -> callee):
// lcg.switchInternalEdgeToRef(caller_node, callee_node)
// -> may split SCCs, triggering re-visitation
// (or removeEdge entirely if callee has no other references)
// Run function-level cleanup on the caller
// via CGSCC-to-function adaptor (sub_2362FB0)
}
}
}
Call graph update protocol: After each inline transformation, the inliner must report all edge changes to the LazyCallGraph. The CGSCC pass manager provides an UpdateResult structure that the inliner fills in:
- New call edges: The inlined function body may contain direct calls that the caller did not previously have. Each creates a switchInternalEdgeToCall if the target is in the same RefSCC, or switchOutgoingEdgeToCall (sub_D27A10) if the target is in a different RefSCC.
- Removed call edges: The direct call from caller to callee is replaced by the inlined body. If the caller no longer references the callee at all, the edge is removed. If it still references the callee (e.g., another call site remains), the edge type may change.
- SCC merging: If the inlined body creates a new call cycle (e.g., A calls B, B's body contains a call to A), the affected SCCs merge. The merge callback re-queues the merged SCC for another pass of the CGSCC pipeline.
- SCC splitting: If removing the call edge from caller to callee breaks the only call-path cycle, the SCC splits. New SCCs are created and inserted into the post-order traversal at the correct position.
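The bookkeeping the inliner must report can be modeled as set arithmetic over the caller's direct-call targets before and after inlining. This is a model of the protocol, not the binary's data flow; the function name and parameters are illustrative:

```python
def edge_updates_after_inlining(caller_callees: set, callee: str,
                                callee_callees: set,
                                remaining_calls_to_callee: bool):
    """Compute which edge notifications the inliner must issue to the
    LazyCallGraph after inlining `callee` into a caller.

    caller_callees: direct-call targets of the caller before inlining
    callee_callees: direct-call targets of the inlined callee's body
    remaining_calls_to_callee: True if another call site to callee survives
    """
    after = set(caller_callees) | set(callee_callees)
    if not remaining_calls_to_callee:
        after.discard(callee)  # the inlined call site is gone
    # New targets -> switchInternalEdgeToCall / switchOutgoingEdgeToCall
    new_edges = after - caller_callees
    # Vanished targets -> switchInternalEdgeToRef (or removeEdge if no refs remain)
    removed_edges = caller_callees - after
    return new_edges, removed_edges
```

For example, inlining f (which calls g and h) into a caller whose only call was to f yields new edges to g and h and one removed edge to f, exactly the merge/split triggers described above.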
Initial Graph Construction: buildSCCs -- sub_D2BEB0 (9,782 bytes)
The initial call graph is built by sub_D2BEB0 when the CGSCC pass manager first runs. This function:
- Collects all module-level root functions (kernels, externally visible functions).
- For each root, lazily populates edges via sub_D23BF0.
- Runs Tarjan's algorithm to decompose the call graph into SCCs.
- Runs a second pass (sub_D2AD40, computeRefSCC) to group SCCs into RefSCCs based on ref edges.
- Stores the resulting post-order in the LCG's RefSCC list at [LCG+0x240].
The post-order traversal helpers (sub_D2F8A0 at 10KB, sub_D30800 at 8KB) implement the iterator that the CGSCC pass manager uses to walk RefSCCs and SCCs in bottom-up order. The SCC iteration logic at sub_D2E510 (7KB) handles advancing through SCCs within each RefSCC.
Graph Verification -- sub_D29180 (6,417 bytes)
The verifier at sub_D29180 checks the consistency of the entire LazyCallGraph after mutations. It validates:
- Every node's SCC assignment is correct (no node belongs to the wrong SCC).
- Every SCC's RefSCC assignment is correct.
- Call edges connect nodes that are reachable via calls (SCC invariant).
- Ref edges connect nodes within the same RefSCC.
- The post-order is valid: for every call edge A -> B, B's SCC appears before A's SCC in the traversal order.
- No dangling pointers (all edge targets are live nodes in the graph).
This verifier is expensive (O(V + E) for the whole graph) and is only enabled in debug builds or when explicitly requested.
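The post-order invariant among those checks is straightforward to state in code. A sketch over SCC indices (the graph representation is an illustrative stand-in for the binary's node/edge structures):

```python
def verify_postorder(scc_index, call_edges):
    """Check the traversal invariant: for every call edge a -> b that crosses
    SCC boundaries, b's SCC precedes a's SCC in the bottom-up order
    (callees get smaller indices than callers).

    scc_index: dict node -> SCC position in the post-order
    call_edges: dict node -> iterable of callee nodes
    """
    for a, callees in call_edges.items():
        for b in callees:
            if scc_index[a] != scc_index[b] and scc_index[b] > scc_index[a]:
                return False  # callee's SCC scheduled after its caller's SCC
    return True
```

Edges within one SCC are exempt (a cycle has no ordering), which is why the check skips the same-SCC case.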
LazyCallGraph Data Structure Layout
LazyCallGraph (pointed to by [RefSCC+0]):
+0x000: ...
+0x130: DenseMap<Node*, SCC*> (NodeToSCCMap)
+0x130: qword - bucket count tracking
+0x138: qword - bucket array pointer
+0x140: dword - num entries
+0x144: dword - num tombstones
+0x148: dword - num buckets
+0x150: BumpPtrAllocator
+0x150: qword - current slab cursor
+0x158: qword - current slab end
+0x1A0: qword - total allocated bytes
+0x1B0: SmallVector<SCC*> - SCC ownership list
+0x1B0: qword - data pointer
+0x1B8: dword - size
+0x1BC: dword - capacity
+0x240: SmallVector<RefSCC*> - RefSCC list (post-order)
GPU-Specific Call Graph Properties
The LCG implementation itself is GPU-agnostic, but the call graph shape on GPU differs fundamentally from CPU:
Kernels are roots. Functions annotated with nvvm.annotations kernel metadata are externally visible entry points. They are the roots of the call graph -- nothing calls a kernel (launches are host-side). In CGSCC ordering, kernels are processed last (they are the top of the bottom-up traversal).
Device functions are internal. Non-kernel __device__ functions are typically internal linkage. They appear in the call graph only as callees. This produces a characteristic tree-like (or DAG-like) call graph with very few cycles, meaning most SCCs contain a single function.
Recursion is rare. CUDA hardware historically did not support recursion (stack depth is bounded, and the compiler must statically allocate the call stack). Although modern architectures permit limited recursion, real-world CUDA code almost never uses it. This means SCC merging (switchInternalEdgeToCall) is rarely triggered -- most CGSCC processing is trivially single-function SCCs in a DAG.
Aggressive inlining collapses the graph. The NVIDIA inline budget (default 20,000, vs LLVM's 225) causes most device functions to be inlined into their callers. After the early inliner pass, the remaining call graph is typically flat: a handful of kernels with large bodies and very few un-inlined callees. Later CGSCC invocations mostly iterate over single-function SCCs.
ThinLTO Interaction
When ThinLTO imports functions from other modules, they appear in the call graph as available_externally definitions. The LCG treats them like any other defined function -- they get nodes, their edges are lazily populated, and they participate in SCC computation. The NVModuleSummary builder (sub_12E06D0) records call graph edges in the module summary, which the ThinLTO import pass uses to decide which cross-module functions to import. Once imported, those functions become candidates for inlining during the CGSCC traversal.
The function-inline-cost-multiplier knob (visible in sub_2613930's string table) penalizes recursive functions during ThinLTO inlining, since recursive inlining can explode code size without bound.
Knobs and Thresholds
| Knob | Default | Effect |
|---|---|---|
| inline-budget | 20,000 | Per-caller NVIDIA inline cost budget (89x LLVM default) |
| inline-threshold | 225 | LLVM default cost threshold (used by New PM inliner) |
| nv-inline-all | off | Bypass cost analysis, force-inline everything |
| -aggressive-inline | -- | CLI flag, routes to inline-budget=40000 |
| intra-scc-cost-multiplier | -- | Cost multiplier for inlining within the same SCC |
| function-inline-cost-multiplier | -- | Cost multiplier for recursive functions |
| abort-on-max-devirt-iterations-reached | false | Abort if devirt iteration limit is hit |
| cgscc-inline-replay | -- | Replay file for inline decisions (debugging) |
| cgscc-inline-replay-scope | Function | Replay scope: Function or Module |
| cgscc-inline-replay-fallback | Original | Fallback: Original, AlwaysInline, NeverInline |
| cgscc-inline-replay-format | Line | Replay format: Line, LineColumn, LineDiscriminator |
| CGSCC iteration count (arg 1 to sub_1A62BF0) | 1 (O1-O3), 5 (tier 3) | Max SCC re-visitation iterations after graph mutation |
Sentinel Values and Constants
| Value | Meaning |
|---|---|
| 0xFFFFFFFFFFFFF000 | DenseMap empty bucket sentinel |
| 0xFFFFFFFFFFFFE000 | DenseMap tombstone sentinel |
| 0x100000000 | Packed {size=0, cap=1} for SmallVector initialization |
| 0x100000001 | Packed {DFSNumber=1, LowLink=1} for Tarjan root init |
| 0x400000000 | Packed {size=0, cap=4} for SmallVector initialization |
| 0x800000000 | Packed {size=0, cap=8} for SmallVector initialization |
| 0x88 (136) | SCC object size in bytes |
| 0x18 (24) | Tarjan StackEntry size (Node*, EdgeIter, EdgeEnd) |
| 0x10 (16) | Edge mutation pair size (Node*, Node*) |
| 0xFFFFFFFF (-1) | DFSNumber value indicating "completed" / assigned to an SCC |
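The packed initializers are single 64-bit stores with the first field in the low 32 bits and the second in the high 32 bits. A quick decode confirms the table's {size, cap} / {DFSNumber, LowLink} reading:

```python
# Decode a packed 64-bit {low32, high32} initializer word.
def unpack_u64(word):
    return word & 0xFFFFFFFF, word >> 32   # (low field, high field)

assert unpack_u64(0x100000000) == (0, 1)   # SmallVector {size=0, cap=1}
assert unpack_u64(0x400000000) == (0, 4)   # SmallVector {size=0, cap=4}
assert unpack_u64(0x800000000) == (0, 8)   # SmallVector {size=0, cap=8}
assert unpack_u64(0x100000001) == (1, 1)   # Tarjan {DFSNumber=1, LowLink=1}
```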
Diagnostic Strings
The call graph printer at sub_D2B640 (12,287 bytes) emits these strings for debugging:
- "Printing the call graph for module:"
- "RefSCC with"
- "SCC with"
- "Edges in function:"
- "call SCCs:"
- "call"
- "ref"
- " -> "
The DOT dumper at sub_D29900 emits GraphViz format with "digraph", "[style=dashed" (for ref edges), and standard ";\n", "}\n" terminators.
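The DOT shape described above -- solid call edges, dashed ref edges, standard terminators -- can be sketched as follows. Node names and the graph label are invented; sub_D29900's real output differs in detail.

```python
# Minimal sketch of the GraphViz output shape attributed to sub_D29900:
# call edges solid, ref edges get [style=dashed].
def dump_callgraph_dot(edges):
    lines = ['digraph "callgraph" {']
    for src, dst, kind in edges:
        style = " [style=dashed]" if kind == "ref" else ""
        lines.append(f'  "{src}" -> "{dst}"{style};')
    lines.append("}")
    return "\n".join(lines) + "\n"

print(dump_callgraph_dot([("main", "foo", "call"), ("foo", "gv", "ref")]))
```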
The New PM inliner at sub_2613930 emits: "function-inline-cost-multiplier", "recursive", "recursive SCC split", "unavailable definition".
The devirtualization pass at sub_2284BC0 emits: "Max devirtualization iterations reached".
The old CGSCC inliner at sub_186CA00 emits: "inline", "NoDefinition", "NotInlined", "AlwaysInline", "Inlined", "Callee", "Caller", "cost=always", "cost=", "threshold=".
The call graph DOT writer cluster at 0x2280000--0x228A000 emits: "view-callgraph", "View call graph", "dot-callgraph", "Print call graph to 'dot' file", "Call graph: ", "external caller", "external callee", "external node", "Writing '", "error opening file for writing!".
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| LazyCallGraph cluster start | sub_D230A0 | -- | -- |
| std::rotate / SCC array reorder | sub_D23910 | -- | -- |
| SCC array splitting helper | sub_D23A60 | -- | -- |
| Node::populate() / edge iterator (lazy population point) | sub_D23BF0 | -- | -- |
| LazyCallGraph::lookupSCC(Node&) | sub_D23C40 | -- | -- |
| RefSCC::isAncestorOf() connectivity check | sub_D23CB0 | -- | -- |
| LazyCallGraph::notifyRefSCCChange() | sub_D23D60 | -- | -- |
| Edge::setKind() (flip call/ref tag bit) | sub_D23E00 | -- | -- |
| SCC constructor | sub_D23F30 | -- | -- |
| LazyCallGraph::insertRefSCC() | sub_D248B0 | -- | -- |
| Node edge list cleanup | sub_D24960 | -- | -- |
| DenseMap insert (Node-to-SCC) | sub_D24C50 | -- | -- |
| RefSCC::isPartOfRefSCC() check | sub_D24D10 | -- | -- |
| DenseMap clear (SCC internals) | sub_D24EE0 | -- | -- |
| RefSCC::find() / updateSCCIndex | sub_D25AF0 | -- | -- |
| RefSCC::SCCIndexMap::find() | sub_D25BD0 | -- | -- |
| DenseMap grow/rehash | sub_D25CB0 | -- | -- |
| switchInternalEdgeToCall() | sub_D25FD0 | 5,526 | -- |
| Node::setRefSCC() | sub_D27750 | -- | -- |
| switchOutgoingEdgeToCall/Ref() | sub_D27A10 | 29,179 | -- |
| Call graph verification | sub_D29180 | 6,417 | -- |
| DOT graph dumper | sub_D29900 | 8,235 | -- |
| insertInternalRefEdge() | sub_D2A080 | 15,253 | -- |
| computeRefSCC() | sub_D2AD40 | 12,495 | -- |
| Call graph text printer | sub_D2B640 | 12,287 | -- |
| buildSCCs() / initial construction | sub_D2BEB0 | 9,782 | -- |
| switchInternalEdgeToRef() | sub_D2C610 | 5,236 | -- |
| mergeRefSCC() | sub_D2DA90 | 17,930 | -- |
| SCC iteration logic | sub_D2E510 | 6,890 | -- |
| rebuildSCC() | sub_D2F240 | 6,141 | -- |
| Post-order SCC traversal helper | sub_D2F8A0 | 10,451 | -- |
| Post-order traversal | sub_D30800 | 7,796 | -- |
| Edge management helper | sub_D301A0 | 5,148 | -- |
| RefSCC-level operations | sub_D31270 | 7,696 | -- |
| CGSCC pass manager / InlinerWrapper factory | sub_1A62BF0 | -- | -- |
| NVIDIA custom inliner (old CGSCC) | sub_1864060 | 75,000 | -- |
| Inliner::inlineCallsImpl() (CGSCC core loop) | sub_186CA00 | 61,117 | -- |
| Call graph node visitor | sub_2280510 | 24,000 | -- |
| Call graph builder | sub_2282680 | 33,000 | -- |
| DevirtSCCRepeatedPass ("Max devirtualization iterations reached") | sub_2284BC0 | 16,000 | -- |
| InlinerWrapper factory (nv-early-inliner, inliner-wrapper) | sub_2342850 | -- | -- |
| CGSCC-to-function adaptor | sub_2362FB0 | 6,700 | -- |
| CGSCC pipeline text parser | sub_2377300 | 103,000 | -- |
| Attributor CGSCC pass | sub_2582AC0 | 39,000 | -- |
| New PM CGSCC inliner | sub_2613930 | 69,000 | -- |
Cross-References
- Inliner Cost Model -- the cost computation that the CGSCC inliner uses to decide whether to inline each call site
- ThinLTO Function Import -- how cross-module functions are imported into the call graph
- Pipeline & Ordering -- where the four CGSCC invocations sit in the overall pass sequence
- Optimization Levels -- how CGSCC iteration counts vary by O-level and tier
- Hash Infrastructure -- DenseMap internals, sentinel values, and probing strategy used throughout the LCG
AsmPrinter & PTX Body Emission
The NVPTXAsmPrinter is cicc's final code-generation stage: the component that converts the machine-level IR (MachineFunction, MachineBasicBlock, MachineInstr) into the textual PTX that ptxas consumes. Unlike a conventional LLVM AsmPrinter, which emits real machine assembly for a physical ISA, the NVPTX variant emits PTX -- a virtual ISA with its own declarative syntax for registers, parameters, address spaces, textures, and kernel launch metadata. The AsmPrinter is not merely "formatting instructions"; it is responsible for the entire PTX module structure: file header directives, global variable declarations with topological ordering, function signatures with .param space marshaling, register class declarations, the instruction body with debug annotations, and convergence-control pseudo-instructions required by the warp execution model. In cicc v13.0 the printer spans two address clusters -- the NVPTX-specific emission layer at 0x2140000-0x21FFFFF and the LLVM AsmPrinter override at 0x31E0000-0x3240000.
| Pass registration | sub_214ABE0 -- "NVPTX Assembly Printer" |
| emitFunctionBody | sub_31EC4F0 (12KB, 2565 asm lines) |
| Header emission (emitHeader) | sub_214F370 (7.2KB) |
| Function header orchestrator | sub_215A3C0 (10KB) |
| Kernel attribute emission | sub_214DA90 (8.7KB) |
| Parameter list emission | sub_21502D0 (22KB) |
| Stack frame + register decls | sub_2158E80 (17KB) |
| Global variable emission | sub_2156420 (20KB) |
| Call prototype emission | sub_21CF8D0 (29KB) |
| Inline asm handler | sub_31F26A0 / sub_397DF10 (30KB) |
| AsmPrinter::doFinalization | sub_3972F10 (24KB) |
PTX Output Structure
A complete PTX module emitted by cicc follows this exact structure. Every element in this layout corresponds to a specific emitter function:
// ← sub_214F370 (emitHeader)
// Generated by NVIDIA NVVM Compiler
// Compiler Build ID: ...
// Based on NVVM 7.0.1
//
.version 8.5 ← PTXVersion / 10 . PTXVersion % 10
.target sm_90, texmode_independent ← subtarget name + driver interface
.address_size 64 ← 64 or 32 from subtarget
// Start of file scope inline assembly ← sub_215ACD0 (doInitialization)
...inline asm...
// End of file scope inline assembly
.extern .func (.param .b32 _) _Z3foov ← sub_2151550 (forward declarations)
.global .texref my_tex; ← sub_2156420 (module-level globals)
.global .surfref my_surf;
.global .samplerref my_samp = { ... };
.global .align 4 .b8 data[1024];
.visible .entry _Z6kernelPf( ← sub_215A3C0 (function header)
.param .u64 _Z6kernelPf_param_0
)
.reqntid 256, 1, 1 ← sub_214DA90 (kernel attributes)
.maxnreg 32
{
.local .align 16 .b8 __local_depot0[64];← sub_2158E80 (frame + registers)
.reg .b64 %SP;
.reg .b64 %SPL;
.reg .pred %p<5>;
.reg .b32 %r<47>;
.reg .b64 %rd<8>;
.reg .f32 %f<20>;
// .loc 1 42 0 ← sub_31D55F0 (per-instruction debug)
ld.param.u64 %rd1, [_Z6kernelPf_param_0];
mov.u32 %r1, %tid.x;
...
}
// -- End function
Header Directive Emission -- sub_214F370
The header is emitted once during doInitialization (sub_215ACD0). The function builds the output into a SmallString<128> buffer, then flushes it via OutStreamer.EmitRawText. The emission order is fixed:
- Comment block. "// Generated by NVIDIA NVVM Compiler", followed by "// Compiler Build ID: " with the build identifier string, then "// Based on NVVM 7.0.1" (the version string is read from llvm.ident metadata via sub_216F7F0).
- .version X.Y -- the PTX ISA version. Computed as PTXVersion / 10 for the major digit and PTXVersion % 10 for the minor. In cicc v13.0 targeting SM 90, this is typically .version 8.5.
- .target sm_XX[, texmode_independent][, debug] -- the SM target name from NVPTXSubtarget::getTargetName(). The texmode_independent modifier is appended when the driver interface is NVCL (OpenCL). If the driver interface is CUDA and the subtarget lacks double-precision support, map_f64_to_f32 is appended instead. The , debug suffix is added when MCAsmInfo::doesSupportDebugInformation() returns true.
- .address_size 64 (or 32) -- from NVPTXSubtarget::is64Bit(). All modern CUDA compilation uses 64-bit.
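A sketch of how these four header lines fit together, using the divide/mod version encoding and the flag rules stated above. This is a simplified model, not a transcription of sub_214F370.

```python
# Assemble the fixed PTX header lines. PTXVersion is stored scaled by 10
# (85 -> ".version 8.5"); target modifiers follow the rules in the list above.
def emit_header(ptx_version, sm_name, *, nvcl=False, has_fp64=True,
                debug=False, is64bit=True):
    lines = [f".version {ptx_version // 10}.{ptx_version % 10}"]
    target = f".target {sm_name}"
    if nvcl:                         # NVCL driver interface
        target += ", texmode_independent"
    elif not has_fp64:               # CUDA interface, no double support
        target += ", map_f64_to_f32"
    if debug:                        # MCAsmInfo supports debug info
        target += ", debug"
    lines.append(target)
    lines.append(f".address_size {64 if is64bit else 32}")
    return "\n".join(lines)

print(emit_header(85, "sm_90", nvcl=True))
```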
The doInitialization function (sub_215ACD0) also performs two critical rejection checks: it looks up llvm.global_ctors and llvm.global_dtors named metadata. If either is a non-empty array, it issues a fatal error: "Module has a nontrivial global ctor, which NVPTX does not support." GPU kernels have no program startup phase where global constructors could execute.
Function Declaration: .entry vs .func
The function header orchestrator (sub_215A3C0) emits the complete prologue for each function definition. The emission sequence is:
Step (a): Coroutine pragma. Checks a linked list at this+792 for metadata nodes with type byte 'N' (0x4E) matching the current function. If found, emits .pragma "coroutine";.
Step (b): Linkage directive. Calls sub_214CAD0 which emits .visible, .extern, or .common depending on the function's linkage. CUDA kernel compilation mode is gated by *(this+232)->field_952 == 1.
Step (c): Entry vs function. Calls sub_1C2F070 (isKernelFunction). If the function is a kernel: emit .entry. Otherwise: emit .func.
Step (d): Return type. For .func only. Calls sub_1C2FA50 to check whether the function returns a value. If so, calls sub_214C940 to emit the return type specification (e.g., (.param .b32 retval0)). Kernels have no return values in PTX.
Step (e): Function name. sub_214D1D0 emits the mangled C++ name.
Step (f): Parameter list. sub_21502D0 (22KB) emits the complete .param declaration list. This is the most complex part of the header -- see the next section.
Step (g): Kernel attributes. Only for .entry functions. sub_214DA90 emits launch-bound and cluster directives.
Step (h): Additional attributes. sub_214E300 emits .local_maxnreg if set.
Step (i): Noreturn. If the function has metadata attribute 29 (noreturn) and is not a kernel, emits .noreturn.
Step (j): Open body. Emits {\n.
Step (k): Frame and registers. sub_2158E80 emits the local depot, stack pointer registers, and all virtual register declarations.
.param Space Marshaling
PTX uses .param space for all function arguments. The parameter emission function sub_21502D0 handles the full taxonomy of NVPTX parameter types. The emitted parameter name follows the pattern FUNCNAME_param_N where N is a monotonic index starting at 0.
Scalar parameters are emitted as .param .TYPE _param_N where TYPE is the PTX scalar type (.b32, .b64, .f32, .f64, .pred). Scalars smaller than 32 bits are widened to 32 bits; this is the PTX rule that all .param scalars must be at least 4 bytes. The widening logic: if bit-width <= 32, widen to .b32; if 32 < bit-width < 64, widen to .b64; otherwise keep as-is.
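The widening rule can be written down directly as a sketch:

```python
# Scalar .param widening, as stated above: sub-32-bit scalars widen to
# .b32 (all .param scalars must be at least 4 bytes); widths strictly
# between 32 and 64 widen to .b64; everything else is kept as-is.
def widen_param_bits(bits):
    if bits <= 32:
        return 32
    if bits < 64:
        return 64
    return bits

assert widen_param_bits(8) == 32
assert widen_param_bits(33) == 64
assert widen_param_bits(64) == 64
```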
Aggregate / byval parameters are emitted as .param .align ALIGN .b8 _param_N[SIZE] -- a byte array with explicit alignment. The alignment comes from the function's DataLayout and the parameter attribute.
Texture / surface / sampler parameters get special treatment:
- .param .texref _param_N -- texture reference (direct binding)
- .param .surfref _param_N -- surface reference
- .param .samplerref _param_N -- sampler reference
- .param .u64 .ptr .texref _param_N -- pointer to texture (indirect)
- .param .u64 .ptr .surfref _param_N -- pointer to surface
- .param .u64 .ptr .samplerref _param_N -- pointer to sampler
The distinction between direct references and pointer-to-references reflects whether the texture/surface handle is passed by value or by indirection through a 64-bit pointer.
Call prototypes (sub_21CF8D0, 29KB) are emitted for indirect calls. When a function pointer call occurs, the AsmPrinter generates a .callprototype declaration: prototype_N : .callprototype (.param .b32 _) _ (.param .b64 _, .param .b32 _). The prototype index N is monotonically increasing.
Register Declarations
Inside the function body, sub_2158E80 emits register declarations for every virtual register class used. The nine register classes, their vtable addresses, PTX type suffixes, prefixes, and encoded IDs are documented in Register Classes. The encoding scheme, declaration emission format, and the internal-only tenth class are covered in Register Encoding Scheme and Register Declaration Emission.
The emitted text for each class follows the pattern:
.reg .pred %p<5>; ← 5 predicate registers needed
.reg .b16 %rs<12>; ← 12 short integer registers
.reg .b32 %r<47>; ← 47 general-purpose 32-bit
.reg .b64 %rd<8>; ← 8 double-width integer
.reg .f32 %f<20>; ← 20 single-precision float
.reg .f64 %fd<3>; ← 3 double-precision float
The count for each class is max_register_index + 1. The emitter iterates the function's virtual register map at this+800, deduplicates register classes using a hash table at this+808..832, and tracks the maximum index per class.
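A toy model of that max-index bookkeeping (the prefix-to-type mapping comes from the listing above; the real emitter walks the virtual register map at this+800 and deduplicates classes through a hash table):

```python
# Track the maximum virtual register index per class and declare
# max_register_index + 1 registers for each class used.
def reg_decls(vregs):  # vregs: iterable of (class_prefix, index) pairs
    ptx_type = {"p": "pred", "rs": "b16", "r": "b32",
                "rd": "b64", "f": "f32", "fd": "f64"}
    max_idx = {}
    for prefix, idx in vregs:
        max_idx[prefix] = max(max_idx.get(prefix, -1), idx)
    return [f".reg .{ptx_type[p]} %{p}<{n + 1}>;" for p, n in max_idx.items()]

print(reg_decls([("r", 46), ("r", 3), ("p", 4), ("rd", 7), ("f", 19)]))
```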
The stack frame is emitted before registers when the function has a non-zero local frame:
.local .align 16 .b8 __local_depot0[512]; ← ALIGN from frame info, N = function index
.reg .b64 %SP; ← stack pointer (64-bit mode)
.reg .b64 %SPL; ← stack pointer local
The __local_depot name is a fixed prefix (#define DEPOTNAME "__local_depot" in the source). %SP is the global stack pointer; %SPL points into the local depot. In 32-bit mode these are .reg .b32.
Global Variable & Texture Emission -- sub_2156420
Module-level global variables are emitted by sub_2156420 (20KB), called from emitGlobals during doInitialization. Globals must be emitted in topological order because ptxas does not support forward references. The ordering is computed by sub_2157D50 which performs a DFS over global variable use-def chains, detecting circular dependencies (fatal: "Circular dependency found in global variable set").
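The ordering requirement amounts to a DFS post-order over use-def edges with cycle detection. A minimal sketch mirroring the behavior ascribed to sub_2157D50 (graph representation invented):

```python
# Topologically order globals so each one is emitted after the globals it
# references; a back-edge during DFS means a circular dependency (fatal).
def order_globals(deps):  # deps: {name: [names it references]}
    order, state = [], {}  # state: 1 = in progress, 2 = done
    def visit(g):
        if state.get(g) == 2:
            return
        if state.get(g) == 1:
            raise RuntimeError("Circular dependency found in global variable set")
        state[g] = 1
        for d in deps.get(g, []):
            visit(d)
        state[g] = 2
        order.append(g)
    for g in deps:
        visit(g)
    return order

print(order_globals({"table": ["data"], "data": []}))
```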
Texture references: .global .texref NAME; -- emitted when sub_1C2E830 classifies the global as a texture. Surface references: .global .surfref NAME;. Sampler references get an optional initializer block:
.global .samplerref my_sampler = {
addr_mode_0 = clamp_to_edge,
addr_mode_1 = wrap,
filter_mode = linear,
force_unnormalized_coords = 1
};
Address mode values: wrap, clamp_to_border, clamp_to_edge, mirror. Filter mode values: nearest, linear. The force_unnormalized_coords field is boolean.
Data globals receive an address-space qualifier from sub_214FA80: .global (addrspace 1), .shared (addrspace 3), .const (addrspace 4), .local (addrspace 5). Managed-memory globals get .attribute(.managed). Unified addressing gets .attribute(.unified) or .attribute(.unified(N)).
Skipped globals: Variables whose names start with "llvm.metadata", "llvm.", or "nvvm." are silently skipped.
Demoted globals (shared memory demotion, addrspace 3) emit a comment: "// NAME has been demoted".
Instruction Emission -- sub_31EC4F0
The core emission loop emitFunctionBody at sub_31EC4F0 (12KB) overrides llvm::AsmPrinter::emitFunctionBody. It allocates a 0xF28-byte stack frame (holding SmallString buffers, a DenseMap for instruction-mix statistics, and tracking structures) and proceeds through three phases:
Phase 1: Per-MBB Outer Loop
Iterates the MachineFunction's MBB linked list. The iteration strips tagged-pointer bits (AND ~7) from the ilist node pointers. For each MBB:
- Calls emitBasicBlockStart(MBB) via vtable dispatch.
- Enters the instruction inner loop.
- Calls emitBasicBlockEnd(MBB).
- Collects instruction-mix statistics when debug counters are active.
Phase 2: Per-Instruction Inner Loop
For each MachineInstr, reads the opcode at MI+0x44 (uint16) and dispatches through a 46-case jump table:
Default path (real instructions): Calls emitInstruction(MI) via [vtable+0x128], which dispatches to the tablegen-generated printInstruction(). This function uses the NVPTXGenAsmWriter.inc tables to format each instruction: printInstruction() calls NVPTXInstPrinter::printOperand for each operand, producing text like mov.u32 %r0, %r1 or add.f32 %f2, %f0, %f1. After emission, the instruction counter is incremented and, if debug info is present, sub_31D55F0 emits a .loc directive.
Inline assembly (opcodes 1, 2): Routed to sub_31F26A0 / sub_397DF10 (30KB). The inline asm handler parses ${} operand references, handles .att_syntax / .intel_syntax mode switching, and emits // begin inline asm / // end inline asm comment markers. PTX inline assembly is passed through essentially verbatim, with operand substitution.
Meta-instructions (opcodes 3-7, 10-18): These include STACKMAP, PATCHPOINT, EH_LABEL, GC_LABEL, KILL, CFI_INSTRUCTION, DBG_VALUE, DBG_VALUE_LIST, and DBG_LABEL. Most emit labels or debug comments rather than PTX instructions. The KILL pseudo emits a "kill:" comment listing each killed register with sub_2FF6320 (printReg). DBG_LABEL emits "DEBUG_LABEL: <label>".
Convergence control (opcodes 24, 33): CONVERGENCECTRL_ENTRY calls sub_31DB9B0 to mark the entry point of a convergent region. CONVERGENCECTRL_LOOP calls sub_31DB950 to mark a loop-back convergence point. These pseudo-instructions are critical for the PTX assembler to correctly track warp divergence and reconvergence. See the dedicated Convergence Control Framework section below for the full lowering pipeline.
FAKE_USE (opcode 43): Debug-only. Emits "fake_use:" followed by register operands.
MEMBARRIER (opcode 44): Emits "MEMBARRIER" as a raw comment.
Pre- and post-instruction hooks: Before each instruction, the Handlers vector at this+0x240 is iterated, calling beginInstruction(MI) on each handler. After each instruction, endInstruction() is called. The AsmPrinter maintains two handler lists (at +0x240 and +0x228) supporting both debug-info handlers and exception/unwind handlers.
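The dispatch structure of Phase 2 can be compressed into a sketch. Opcode numbers follow the text; the emitted strings are stand-ins, and the real loop dispatches through a 46-case jump table with per-handler hooks.

```python
# Model of the per-instruction dispatch: inline asm and meta-instructions
# take side paths; convergence pseudos (24, 33) are silent (no counter
# bump, no .loc); only the default path counts as a real instruction.
META_OPCODES = {3, 4, 5, 6, 7, 10, 11, 12, 13, 14, 15, 16, 17, 18, 43, 44}

def dispatch(opcodes):
    emitted, counter = [], 0
    for op in opcodes:
        if op in (1, 2):                     # inline assembly handler
            emitted.append("// begin inline asm")
        elif op in (24, 33):                 # convergence pseudos: silent
            emitted.append(f"<convergence pseudo {op}>")
        elif op in META_OPCODES:             # labels / debug comments
            emitted.append("// meta")
        else:                                # real instruction path
            emitted.append("mov.u32 %r0, %r1;")
            counter += 1
    return emitted, counter

_, real = dispatch([24, 100, 33, 100, 44])
print(real)
```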
Phase 3: Post-Function Processing
After all MBBs are emitted:
- Zero-length function avoidance. If no real instructions were emitted (tracked by var_F30 and var_ED1), inserts a NOP via sub_31DCBB0 with the comment "avoids zero-length function".
- Function-end label. Creates a "func_end" temp symbol via sub_31DCC50 and emits it for DWARF range tracking.
- DWARF line table finalization. Creates CIE/FDE symbols, binds them via emitAssignment, and inserts a debug-loc entry for the function-end symbol.
- Handler finalization. Calls endFunction(MF) on all handlers in both lists.
- PGO / BBAddrMap emission. If enabled via dword_50360A8, emits BB address maps for profile-guided optimization. Missing labels trigger the diagnostic: "pgo-analysis-map is enabled for function... but it does not have labels".
- End comment. Emits "-- End function\n" as a raw comment.
Debug Info Emission
Debug information in PTX is emitted as .loc and .file directives embedded in the instruction stream, not as separate DWARF sections (the PTX assembler ptxas constructs the actual DWARF from these directives).
The debug emission is layered:
| Layer | Function | Behavior |
|---|---|---|
| Per-instruction .loc | sub_31D55F0 | Emits .loc FileIndex Line Col for instructions with attached DebugLoc |
| Source-line comments | sub_31D89B0 | Emits source location as comments when asm-printer debug counter is active |
| Function-name + inlined-at | emitInlinedAtInfo (NVIDIA) | Appends , function_name LAB, inlined_at FILE LINE COL to .loc |
| Per-MBB boundary | sub_31E6100 | Maintains file/line-to-MCSymbol mapping for MBB boundaries |
| .file directives | emitDwarfFileEntries | Maps source filenames to file indices during doFinalization |
| DWARF line section | sub_E81A00 | Binds CIE/FDE symbols for line table construction |
The NVIDIA extension to .loc is the function_name and inlined_at attributes. Upstream LLVM's .loc only has file line column. cicc appends inlining context so that ptxas can reconstruct the full inline call stack in DWARF. The InlinedAtLocs set tracks which inlined-at locations have already been emitted, preventing duplicates. A work list (SmallVector<DebugLoc, 8>) is built by walking the inlined-at chain, then emitted in reverse order so that outer locations appear before inner ones.
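The chain walk and reverse emission can be sketched as follows, with a DebugLoc modeled as a (site, inlined-at) pair; the location names are invented.

```python
# Walk an inlined-at chain innermost-first, then emit in reverse so outer
# frames print before inner ones, skipping already-emitted locations
# (mirroring the InlinedAtLocs dedup set described above).
def inlined_at_order(loc, emitted=None):  # loc: (site, inlined_at or None)
    emitted = set() if emitted is None else emitted
    chain = []
    while loc is not None:
        site, loc = loc
        chain.append(site)
    out = [s for s in reversed(chain) if s not in emitted]
    emitted.update(out)
    return out

# kernel -> helper -> leaf inline stack (hypothetical locations):
loc = ("leaf.cu:3", ("helper.cu:9", ("kernel.cu:42", None)))
print(inlined_at_order(loc))
```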
When InterleaveSrcInPtx is enabled, the AsmPrinter reads source file lines and emits them as comments interleaved with the PTX.
Module-Level Metadata Directives
Kernel launch-bound metadata directives are emitted by sub_214DA90 in this order:
| Directive | Metadata Source | Notes |
|---|---|---|
| .blocksareclusters | nvvm.blocksareclusters | Fatal error if .reqntid not also set |
| .reqntid X, Y, Z | nvvm.reqntid (comma-separated strtol) | Unspecified dims default to 1 |
| .maxntid X, Y, Z | Structured metadata readers | Unspecified dims default to 1 |
| .minnctapersm N | sub_1C2EF70 | Min CTAs per SM |
| .explicitcluster | nvvm.cluster_dim | SM 90+ only (field_1212 > 0x59) |
| .reqnctapercluster X, Y, Z | Cluster dim readers | SM 90+ only |
| .maxclusterrank N | sub_1C2EF50 | SM 90+ only |
| .maxnreg N | sub_1C2EF90 | Register limit per thread |
The .pragma "nounroll" directive is emitted at MBB level by sub_3970E40 when llvm.loop.unroll.disable metadata is detected on a loop header. This is an NVIDIA modification to the MBB printer.
The .abi_preserve family of directives is emitted by sub_3937240: .abi_preserve, .abi_preserve_after, .abi_preserve_uniform, .abi_preserve_control. These are NVIDIA-specific PTX directives for register ABI preservation across function calls.
Convergence Control Framework
CUDA's SIMT execution model requires the compiler to track which threads in a warp must execute the same instruction simultaneously. When a conditional branch causes warp divergence (some threads take one path, others take the other), the hardware needs to know where threads reconverge. The convergence control framework propagates this information from LLVM IR intrinsics through MachineInstr pseudo-instructions to the final PTX output, where ptxas uses it to emit correct convergence/reconvergence barriers in SASS.
Three-Layer Architecture
Convergence information flows through three representation layers during compilation:
LLVM IR MachineInstr AsmPrinter
───────────────────── ────────────────────── ──────────────────
llvm.experimental CONVERGENCECTRL_ENTRY sub_31DB9B0
.convergence.entry → (opcode 24) → (emitConvergenceEntry)
llvm.experimental CONVERGENCECTRL_LOOP sub_31DB950
.convergence.loop → (opcode 33) → (emitConvergenceLoop)
llvm.experimental CONVERGENCECTRL_ANCHOR (no AsmPrinter case --
.convergence.anchor → (opcode 34) dropped before emission)
"convergencectrl" (operand bundle tag (verified at IR level,
operand bundle → preserved through ISel) consumed by pseudo-instrs)
Layer 1: IR intrinsics. Three llvm.experimental.convergence.* intrinsics define convergent regions at the LLVM IR level. Each returns an abstract "convergence token" (type token) that is consumed by calls carrying the convergencectrl operand bundle. The bundle ties a call to a specific convergence scope -- the verifier at sub_29ED7A0 enforces "convergent call needs convergencectrl operand" for any call marked with the convergent attribute (attribute kind 0x34 = 52).
Layer 2: MachineInstr pseudo-opcodes. During instruction selection (SelectionDAG lowering), the convergence intrinsics are lowered to target-independent MachineInstr pseudo-opcodes. These survive register allocation and all machine-level optimization passes unchanged -- they carry no register operands and produce no real instructions. Their sole purpose is to mark positions in the MBB instruction stream for the AsmPrinter.
Layer 3: AsmPrinter emission. The emitFunctionBody loop at sub_31EC4F0 dispatches opcodes 24 and 33 to dedicated emitter functions that translate the pseudo-instructions into whatever PTX annotation ptxas requires for reconvergence tracking. The CONVERGENCECTRL_ANCHOR pseudo (opcode 34) does not appear in the AsmPrinter's 46-case jump table, indicating it is either dropped during ISel or consumed by an earlier machine pass.
Convergence Token Semantics
The convergence token model enforces a strict dominance and nesting discipline:
- convergence.entry produces a token that represents the function's entry convergence scope. All threads that enter the function are converged at this point. The token must dominate all its uses.
- convergence.loop produces a token scoped to a natural loop. The token marks the point where loop-back-edge threads reconverge before the next iteration. The loop header must dominate all blocks in the cycle.
- convergence.anchor produces a token at an arbitrary program point, used for structured convergence within non-loop regions (e.g., structured if/else regions where reconvergence is needed at the join point).
- convergencectrl operand bundle attaches a convergence token to a call site. This tells the compiler "this call must execute with the set of threads defined by this token's scope." For example:
%tok = call token @llvm.experimental.convergence.entry()
%result = call float @__shfl_sync(i32 %mask, float %val, i32 %lane)
[ "convergencectrl"(token %tok) ]
The LLVM verifier (sub_BFC6A0, 211KB) checks that convergent calls carry the bundle; the convergence verifier (sub_E35A10, 14KB) checks the structural invariants.
ConvergenceVerifier -- sub_E35A10
The standalone convergence verification pass at sub_E35A10 (14KB) enforces five invariants on convergence token usage:
| Invariant | Diagnostic String |
|---|---|
| Token dominance | "Convergence control token must dominate all its uses." |
| Region nesting | "Convergence region is not well-nested." |
| Cycle heart dominance | "Cycle heart must dominate all blocks in the cycle." |
| Single token per cycle | "Two static convergence token uses in a cycle..." |
| Loop token type | Checks llvm.experimental.convergence.loop usage in cycles |
The verifier calls sub_B19720 for domination checks, sub_E342D0 for cycle detection (using the generic cycle info infrastructure), sub_E45390 for diagnostic emission, and sub_E348A0 for error reporting. It runs as part of the IR verification pipeline, not as a separate pass -- the convergence invariants are checked alongside other LLVM IR well-formedness rules.
NVIDIA Convergent Branch Intrinsics
In addition to the upstream llvm.experimental.convergence.* intrinsics, cicc defines two NVIDIA-specific convergent branch intrinsics that interact with the convergence framework:
| Intrinsic | Builtin ID | Minimum SM | Error on Violation |
|---|---|---|---|
llvm.nvvm.branch.if.all.convergent | 3755 / 8282 | sm_70+ (Volta) | "not supported on pre-Volta Architectures" |
llvm.nvvm.branch.if.convergent | 3754 / 8283 | sm_80+ (Ampere) | "not supported on pre-Ampere Architectures" |
These intrinsics produce a boolean result that must be consumed by exactly one branch instruction (enforced by sub_2C7B6A0 with diagnostic: "result of llvm.nvvm.branch.if.convergent and llvm.nvvm.branch.if.all.convergent can only be used by exactly one branch instruction"). The .all variant tests whether all threads in the warp are converged (equivalent to a "uniform predicate" test); the non-.all variant tests whether the current execution context is convergent (the thread set matches the convergence token's scope).
SM version gating is checked in both the NVVM verifier (sub_1C36530) and the lowering pass (sub_2C7B6A0). The SM version is stored as SM * 10 internally (so sm_70 = 700, sm_80 = 800), compared against thresholds at unk_4D045E8.
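The SM*10 encoding makes the gating a plain integer comparison. A sketch (error wording abbreviated; the real checks live in sub_1C36530 and sub_2C7B6A0):

```python
# SM versions are stored scaled by 10 (sm_70 -> 700, sm_80 -> 800), so a
# minimum-architecture gate is a single compare against the threshold.
def check_min_sm(sm_encoded, min_encoded, feature):
    if sm_encoded < min_encoded:
        raise ValueError(f"{feature} not supported on this architecture")
    return True

# branch.if.all.convergent needs sm_70+; branch.if.convergent needs sm_80+:
check_min_sm(900, 700, "llvm.nvvm.branch.if.all.convergent")
check_min_sm(800, 800, "llvm.nvvm.branch.if.convergent")
```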
The convergent Function Attribute (Kind 0x34)
The convergent function/call attribute (attribute kind 52, bit 0x20 at byte offset +33 in the function attribute flags) marks operations that have warp-synchronous semantics. This attribute affects multiple compilation stages:
Constant folding gate (sub_2C7B430). The NVIDIA intrinsic fold function checks hasAttribute(callee, -1, 0x34) before attempting any constant fold. If the callee is convergent, folding is rejected unconditionally -- even if all arguments are compile-time constants. This prevents __syncthreads(), __ballot_sync(), __shfl_sync(), and warp-vote operations from being eliminated.
Inline asm convergence flag. During SelectionDAG lowering of inline assembly (sub_1560260), the convergent attribute is tested via operand bundle or function attribute. If set, bit 5 of the inline asm flags word is set (isConvergent), encoding into the DAG node as: flags = hasSideEffects | (isAlignStack << 1) | (dialect << 2) | (convergent << 5).
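The quoted packing formula, written out as a sketch:

```python
# Pack the inline asm flags word exactly as the formula above states:
# flags = hasSideEffects | (isAlignStack << 1) | (dialect << 2) | (convergent << 5)
def pack_asm_flags(has_side_effects, is_align_stack, dialect, convergent):
    return (int(has_side_effects)
            | (int(is_align_stack) << 1)
            | (dialect << 2)
            | (int(convergent) << 5))

assert pack_asm_flags(True, False, 0, True) == 0b100001   # bit 5 = isConvergent
assert pack_asm_flags(False, True, 1, False) == 0b000110
```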
Loop unrolling epilog forcing. When a loop body contains convergent calls (hasCallInLoop check), the unroller forces epilog remainder style rather than prolog, because epilog preserves the property that all threads participate in each full iteration of the unrolled body.
StructurizeCFG skip. Functions carrying the convergent attribute (attribute ID 56 in the attribute check at sub_B2D610) are skipped by the StructurizeCFG pass -- they are assumed to already have correct convergence structure.
Dead barrier elimination gate. The dead sync elimination engine (sub_2C83D20) identifies barrier intrinsics by checking bit 0x20 at byte +33 (the convergent attribute flag) on the callee, combined with opcode 85 (the internal barrier opcode) and a barrier intrinsic ID confirmation via sub_CEA1A0.
Operand Bundle Registration
The convergencectrl operand bundle tag is registered during LLVMContext initialization at sub_B6EEA0 (9KB), alongside the other standard bundle tags:
Operand bundle tags registered at context creation:
- "funclet" -- EH funclet scope
- "gc-transition" -- GC state transition
- "ptrauth" -- pointer authentication
- "kcfi" -- kernel control flow integrity
- "convergencectrl" -- convergence token attachment
These tags are interned as string IDs in the context's operand bundle tag table. When the bitcode reader parses a call instruction with operand bundles (sub_14FCE40, 107KB), the convergencectrl bundle is reconstructed from the bitcode record and attached to the CallInst/InvokeInst. The inliner at sub_29ED7A0 (96KB) checks "convergent call needs convergencectrl operand" to verify that convergent calls in the callee carry appropriate bundles after inlining.
Pseudo-Instruction Lowering in emitFunctionBody
The emitFunctionBody loop at sub_31EC4F0 handles the two convergence pseudo-instructions as part of its 46-case opcode switch:
Case 24 -- CONVERGENCECTRL_ENTRY. Calls sub_31DB9B0 (emitConvergenceEntry). This function is positioned at address 0x31DB9B0, immediately after sub_31DB950 in the binary layout (the two functions are adjacent, separated by only 0x60 bytes: 0x31DB950 to 0x31DB9B0). The entry pseudo marks the function entry convergence point. It does not emit visible PTX text -- instead it updates internal state that the OutStreamer uses for reconvergence tracking in the generated object.
Case 33 -- CONVERGENCECTRL_LOOP. Calls sub_31DB950 (emitConvergenceLoop). This marks loop-back convergence points. Like the entry pseudo, it produces no visible PTX output but influences ptxas's reconvergence analysis.
Both pseudo-instructions are "silent" -- they do not increment the instruction counter (var_F30), do not trigger .loc emission, and do not invoke the beginInstruction/endInstruction handler callbacks. They fall through the switch without reaching the default path's instruction-counting logic.
Post-Function Convergence Close-Out
After all MBBs in a function are emitted, the emitFunctionBody function performs convergence-related cleanup in Phase 3a (0x31ECFFD-0x31ED0FA):
Phase 3a: Convergence control close-out
if (var_ED1 == true): // any real instructions seen?
OutStreamer->emitAlignment(MF->getAlignment())
for sym in MF->globalSymbolTable[0x48..0x50]:
if (sym[-0x16] & 0x7FFF) != 0: // visibility flags
sub_31E1750(sym) // resolveBlockAddress
if block_was_removed:
emit diagnostic "Address of block that was removed by Co..."
OutStreamer->emitLabel(fallback_sym)
The var_ED1 flag tracks whether any non-meta instructions appeared in the function body. When set, the close-out phase emits function alignment, resolves block-address symbols in the global symbol table (checking visibility flags at sym[-0x16] & 0x7FFF), and handles the edge case where a basic block was removed by CodeGen after a block-address was taken -- this would produce a dangling convergence reference, so a diagnostic is emitted and a fallback label is created.
Convergence and the StructurizeCFG Pass
The StructurizeCFG pass (documented in StructurizeCFG) is the primary consumer of convergence information during the CFG transformation phase. PTX requires reducible control flow: every back-edge must target a loop header that dominates all blocks in the cycle, and every divergent branch must reconverge at a post-dominator.
The pass performs a domtree-guided reconvergence insertion that stores head/tail pointers into function metadata at *(func_obj+672) and *(func_obj+680). These pointers are read by subsequent PTX emission passes to emit correct convergence annotations. Functions with the convergent attribute (or optnone) are skipped entirely -- they are assumed to already have correct structure.
When non-uniform divergent regions are identified, the pass creates new "reconvergence" basic blocks, copies phi entries, and reroutes edges so that all divergent paths merge at a single post-dominator. The sub_35CB4A0 uniformity check and sub_35C9ED0 NCA (nearest common ancestor) computation in the dominator tree determine where reconvergence points are inserted.
NVIDIA Extensions Beyond Upstream
cicc's AsmPrinter diverges from upstream LLVM's NVPTXAsmPrinter in several important ways:
Convergence control pseudo-instructions. Upstream LLVM (as of the LLVM 20 base) has llvm.experimental.convergence.* intrinsics, but the AsmPrinter handling of CONVERGENCECTRL_ENTRY and CONVERGENCECTRL_LOOP as dedicated opcode cases (24 and 33 in the jump table) with calls to sub_31DB9B0 / sub_31DB950 is cicc-specific. These ensure correct warp-level synchronization semantics in the emitted PTX. Additionally, cicc adds two NVIDIA-specific convergent branch intrinsics (llvm.nvvm.branch.if.convergent for sm_80+ and llvm.nvvm.branch.if.all.convergent for sm_70+) that have no upstream equivalent. See the Convergence Control Framework section for the full pipeline.
Enhanced .loc with inlined-at. The function_name and inlined_at extensions to .loc directives are NVIDIA additions. Upstream LLVM's NVPTX backend emits only standard .loc file line col. cicc's version walks the full inlining chain to produce richer debug information.
Cluster directives (SM 90+). The entire cluster attribute family (.blocksareclusters, .explicitcluster, .reqnctapercluster, .maxclusterrank) and the 15 cluster special registers are NVIDIA extensions to PTX not present in upstream LLVM's NVPTX backend.
.abi_preserve directives. The register ABI preservation annotations emitted by sub_3937240 have no upstream equivalent.
.pragma "coroutine". The coroutine pragma emission in the function header orchestrator is NVIDIA-specific, supporting CUDA coroutine execution.
PGO/BBAddrMap integration. The BBAddrMap and PGO analysis info structures (0x80 and 0x98 bytes respectively, dynamically allocated when analysis passes are absent) are LLVM 16+ features that cicc integrates into the PTX emission path.
Instruction-mix statistics. The per-MBB instruction-mix collection ("INST_<name>: <count>" format) under the "asm-printer" statistic group is significantly more elaborate than upstream's simple instruction counter.
Dual handler lists. cicc maintains two separate AsmPrinterHandler lists (at this+0x240 and this+0x228), iterated independently for beginInstruction/endInstruction/endFunction. Upstream uses a single handler list.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| NVPTXAsmPrinter pass registration | sub_214ABE0 | -- | -- |
| Return type / .attribute(.unified) emission | sub_214C940 | 1.9KB | -- |
| Linkage directive emission (.visible/.extern/.common) | sub_214CAD0 | 2.4KB | -- |
| Kernel attribute emission (.reqntid, .maxnreg, cluster) | sub_214DA90 | 8.7KB | -- |
| .local_maxnreg emission | sub_214E300 | 1.3KB | -- |
| emitHeader (.version, .target, .address_size) | sub_214F370 | 7.2KB | -- |
| Address space qualifier emission | sub_214FA80 | 1.9KB | -- |
| emitFunctionParamList (.param declarations) | sub_21502D0 | 22KB | -- |
| Parameter name generation (_param_N) | sub_2150230 | -- | -- |
| Function forward declaration emission | sub_2151550 | 3.9KB | -- |
| emitFunctionEntryLabel (.entry/.func) | sub_2151D30 | 7.0KB | -- |
| Function alias emission (.alias) | sub_21518E0 | 5.0KB | -- |
| Static initializer expression emission | sub_2153350 | 5.3KB | -- |
| Byte-level constant data emission | sub_2153AE0 | 9.9KB | -- |
| printModuleLevelGV (texref/surfref/samplerref/data) | sub_2156420 | 20KB | -- |
| Global variable topological sort | sub_2157D50 | 5.9KB | -- |
| Register class -> encoded ID | sub_21583D0 | 4.6KB | -- |
| Stack frame + register declaration emission | sub_2158E80 | 17KB | -- |
| Function header orchestrator | sub_215A3C0 | 10KB | -- |
| Module-level emission entry (ctor/dtor check, DWARF) | sub_215ACD0 | 8.1KB | -- |
| GenericToNVVM pass registration | sub_215DC20 | -- | -- |
| Register class -> PTX type suffix | sub_2163730 | 1.7KB | -- |
| Register class -> PTX register prefix | sub_21638D0 | 1.6KB | -- |
| llvm.ident / "Based on NVVM 7.0.1" reader | sub_216F7F0 | 5.7KB | -- |
| emitCallPrototype (.callprototype for indirect calls) | sub_21CF8D0 | 29KB | -- |
| Atomic opcode emission (13 operations) | sub_21E5E70 | -- | -- |
| L2-hinted atomic emission (SM 80+) | sub_21E6420 | -- | -- |
| Address space conversion (cvta) + MMA helpers | sub_21E7FE0 | -- | -- |
| Standard special register emission (%tid, %ctaid, etc.) | sub_21E86B0 | -- | -- |
| Cluster barrier emission (SM 90+) | sub_21E8EA0 | -- | -- |
| Cluster special register emission (SM 90+) | sub_21E9060 | -- | -- |
| Memory barrier emission (membar/fence) | sub_21E94F0 | -- | -- |
| printReg (register number -> %rN string) | sub_2FF6320 | -- | -- |
| Per-instruction .loc DWARF directive | sub_31D55F0 | -- | -- |
| Instruction-level debug comment emission | sub_31D89B0 | -- | -- |
| emitConvergenceEntry (CONVERGENCECTRL_ENTRY pseudo, opcode 24) | sub_31DB9B0 | -- | -- |
| emitConvergenceLoop (CONVERGENCECTRL_LOOP pseudo, opcode 33) | sub_31DB950 | -- | -- |
| ConvergenceVerifier::verify (token dominance/nesting checks) | sub_E35A10 | 14KB | -- |
| Cycle detection for convergence verification | sub_E342D0 | -- | -- |
| Convergence verification error reporting | sub_E348A0 | -- | -- |
| Inliner/verifier core ("convergent call needs convergencectrl operand") | sub_29ED7A0 | 96KB | -- |
| NVVM convergent branch intrinsic SM-version gating | sub_1C36530 | -- | -- |
| Convergent branch lowering + single-use enforcement | sub_2C7B6A0 | -- | -- |
| Metadata kind + operand bundle tag registration (incl. convergencectrl) | sub_B6EEA0 | 9KB | -- |
| emitNops (zero-length function avoidance) | sub_31DCBB0 | -- | -- |
| createTempSymbol ("func_end", "Ltmp") | sub_31DCC50 | -- | -- |
| emitFunctionBody (main loop) | sub_31EC4F0 | 12KB | -- |
| emitInlineAsm | sub_31F26A0 | -- | -- |
| .abi_preserve directive emission | sub_3937240 | 14KB | -- |
| MBB printer + .pragma "nounroll" | sub_3970E40 | 18KB | -- |
| doFinalization | sub_3972F10 | 24KB | -- |
| emitInlineAsm (parser/streamer) | sub_397DF10 | 30KB | -- |
Cross-References
- PTX Emission -- hub page for the emission stage with additional detail on atomic/barrier/special-register emission
- Code Generation -- the MachineInstr-producing stage that feeds the AsmPrinter
- SelectionDAG -- instruction selection that creates the MachineInstrs
- NVPTX Call ABI -- .param space calling convention detail
- Register Allocation -- determines which virtual registers exist for the register declaration phase
- Inliner Cost Model -- inlining decisions that create the inlined-at debug chains the AsmPrinter must emit
- StructurizeCFG -- CFG restructuring pass that creates reconvergence basic blocks for divergent control flow
- Dead Sync Elimination -- dead barrier elimination engine that uses the convergent attribute to identify barrier intrinsics
- SM 70-89 Architecture -- SM version gating for convergent branch intrinsics
- GPU Execution Model -- SIMT warp divergence/reconvergence background
Debug Info Verification
cicc includes a custom debug info verification pass (sub_29C8000) that validates DWARF-like debug metadata after each optimization pass in the pipeline. This is not the upstream LLVM IR Verifier (llvm::Verifier::verify(Module)); it is an NVIDIA-specific implementation derived from LLVM's CheckDebugInfoPass (in Debugify.cpp) with two significant extensions: a structured JSON reporting mechanism that tracks exactly which optimization passes degrade debug info quality, and a configurable verbosity system that allows the verification overhead to be tuned from silent to exhaustive. The pass lives in a self-contained module of approximately 93 functions in the 0x29C0000--0x29FFFFF address range, alongside the Debugify synthetic debug info injector and general pass infrastructure utilities. Its purpose is to ensure that when a developer compiles with -g or -generate-line-info, the debug metadata that cuda-gdb and Nsight Compute rely on survives the aggressive optimization pipeline intact.
| Primary function | sub_29C8000 (12,480 bytes, 434 basic blocks) |
| Address range | 0x29C8000 -- 0x29CB0C0 |
| Per-instruction verifier | sub_29C3AB0 (5,592 bytes) |
| Debugify injector | sub_29C1CB0 |
| NewPM wrappers | sub_22702B0 (NewPMCheckDebugifyPass), sub_2270390 (NewPMDebugifyPass) |
| Pipeline parser names | "check-debugify" (pass #26), "debugify" (pass #35) |
| Verbose output flag | qword_5008FC8 (bool) |
| Depth threshold | qword_5008C88 (int32) |
| Stack frame | 0x4B8 bytes (eight tracking structures) |
| Upstream origin | llvm/lib/Transforms/Utils/Debugify.cpp -- CheckDebugInfoPass |
Three Verification Modes
cicc supports three independent verification protocols, each activated by a different set of knobs. Understanding which protocol is active determines what diagnostic output to expect and how much overhead the verification adds.
Mode 1: Post-Pass Debug Info Verification (verify-each)
The default verification mode, activated by the verify-each LLVM knob (or its alias verify-after-all). The pipeline runner brackets each optimization pass with sub_29C8000 -- snapshot before, verify after:
// Pseudocode for the pipeline runner's verification protocol
// (entry: 0x29C8000, stack: 0x4B8 bytes)
snapshot_debug_metadata(M);
run_optimization_pass(M, "instcombine");
sub_29C8000(M, errs(), dbgCU, hashMap, "instcombine", 11, file, fileLen, jsonOut);
The pass name argument identifies which optimization just ran, so the JSON report can attribute any debug info degradation to the specific pass responsible. The verifier checks the full metadata inventory: subprograms, scopes, variables, types, labels, imported entities, and retained nodes. It produces ERROR diagnostics for dropped subprograms and WARNING diagnostics for dropped debug variable intrinsics.
Activation: -Xcicc -verify-each or -Xcicc -verify-after-all
Overhead: One full metadata snapshot + eight hash table constructions + per-function variable scan per optimization pass. Substantial for large modules.
Mode 2: Debugify Synthetic Injection + Verification (debugify-each)
The full Debugify cycle injects synthetic debug metadata before each pass, runs the pass, then verifies the synthetic metadata survived. This mode is more aggressive than Mode 1 because it tests every pass even on code compiled without -g.
// Debugify cycle pseudocode
sub_29C1CB0(M, "llvm.debugify"); // inject synthetic debug info
run_optimization_pass(M, "instcombine");
sub_29C8000(M, errs(), dbgCU, hashMap, "instcombine", ...); // verify
strip_debugify_metadata(M, "llvm.debugify"); // cleanup
The injector (sub_29C1CB0) creates "llvm.debugify" / "llvm.mir.debugify" named metadata nodes that serve as watermarks. The checker looks for these watermarks to distinguish synthetic from genuine debug info.
Activation: -Xcicc -debugify-each
Sub-knobs: debugify-level (locations or location+variables), debugify-quiet, debugify-func-limit, debugify-export
Mode 3: Debug Info Preservation Checking (verify-debuginfo-preserve)
A lighter-weight mode that checks only whether existing debug info survives optimization, without injecting synthetic metadata. This mode is available through the New Pass Manager infrastructure and can export results via verify-di-preserve-export.
Activation: -Xcicc -verify-debuginfo-preserve
Sub-knobs: verify-each-debuginfo-preserve, verify-di-preserve-export
Mode Selection Matrix
| Knob | Scope | Injects synthetic? | Checks variables? | JSON output? |
|---|---|---|---|---|
| verify-each | All passes | No | Yes (if -g) | If jsonOutput != NULL |
| debugify-each | All passes | Yes | Configurable via debugify-level | Via debugify-export |
| verify-debuginfo-preserve | All passes | No | Yes | Via verify-di-preserve-export |
| (none, -g active) | -- | No | No per-pass check | No |
Pipeline Integration
The verifier operates as an interleaved "check" pass. The New Pass Manager registers it via two wrappers in the pipeline construction code at 0x2270000--0x227FFFF:
| Address | Registration string | Role |
|---|---|---|
| sub_22702B0 | "NewPMCheckDebugifyPass]" | Verification after each pass |
| sub_2270390 | "NewPMDebugifyPass]" | Synthetic injection before each pass |
| sub_2270470 | "VerifierPass]" | Standard IR verifier (separate) |
The pipeline text parser (sub_2272BE0, 14KB) recognizes these as named module passes:
| Slot | Pipeline name | Class | Level |
|---|---|---|---|
| #26 | "check-debugify" | NewPMCheckDebugifyPass | Module |
| #35 | "debugify" | NewPMDebugifyPass | Module |
When debugify-each is active, the pipeline builder (sub_2277440, 60KB -- buildDefaultPipeline() equivalent) wraps every optimization pass in a debugify/check-debugify pair. When verify-each is active, only the check-debugify wrapper is inserted.
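The interleaving behavior can be sketched as follows -- a minimal Python model of how the builder wraps passes, not the actual C++ pass manager (pass names are illustrative):

```python
def build_pipeline(passes, debugify_each=False, verify_each=False):
    """Model of how the pipeline builder interleaves debugify/check-debugify
    wrappers: debugify-each injects synthetic metadata before each pass and
    verifies after; verify-each inserts only the post-pass check."""
    pipeline = []
    for p in passes:
        if debugify_each:
            pipeline.append("debugify")        # synthetic injection before the pass
        pipeline.append(p)
        if debugify_each or verify_each:
            pipeline.append("check-debugify")  # verification after the pass
    return pipeline

print(build_pipeline(["instcombine", "sroa"], verify_each=True))
# ['instcombine', 'check-debugify', 'sroa', 'check-debugify']
```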
Verification Function Signature
The function signature reconstructed from the binary:
bool sub_29C8000(
Module* module, // rdi
raw_ostream& output, // rsi -- diagnostic stream
NamedMDNode* dbgCU, // rdx -- "llvm.dbg.cu" metadata
DenseMap* hashMap, // rcx -- metadata identity table
const char* passName, // r8
size_t passNameLen, // stack+0x00
const char* fileName, // stack+0x08
size_t fileNameLen, // stack+0x10
raw_ostream* jsonOutput, // stack+0x18 -- NULL if no JSON report
...
);
// Returns: true = all checks passed, false = any violation detected
Verification Algorithm
The pass proceeds through nine sequential phases within a single function call. The 0x4B8-byte stack frame holds eight separate tracking data structures.
Phase 1: Module-Level Guard (0x29C8000 -- 0x29C807A)
Looks up the "llvm.dbg.cu" named metadata node via sub_BA8DC0 (Module::getNamedMetadata). If absent or empty, prints ": Skipping module without debug info\n" and returns 0. This is the fast path for modules compiled without -g.
Phase 2: Pre-Pass Metadata Snapshot (0x29C8080 -- 0x29C8AE5)
Initializes eight SmallVector/DenseMap structures on the stack and walks the compile unit metadata tree:
| Stack offset | Purpose | Copy helper |
|---|---|---|
| var_1F0 | DISubprogram tracking set | sub_29C6AD0 |
| var_1D0 | Scope chain working set | sub_29C1190 |
| var_1A0 | DIVariable tracking | sub_29C1060 |
| var_170 | Scope-to-function mapping | -- |
| var_140 | DICompileUnit refs | -- |
| var_130 | Primary metadata node buffer | -- |
For each DICompileUnit operand, the pass walks the subprogram list and retained types, recording every metadata node in hash tables for O(1) identity comparison. The hash function is:
uint64_t hash = ((ptr >> 4) ^ (ptr >> 9)) & (bucket_count - 1);
This is the standard DenseMap pointer hash with LLVM-layer sentinels. See Hash Table and Collection Infrastructure for the complete specification.
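In Python, the bucket computation looks like this (bucket_count must be a power of two for the mask to behave as a modulo):

```python
def densemap_bucket(ptr: int, bucket_count: int) -> int:
    """Pointer hash used for the metadata identity tables: XOR two
    shifted copies of the pointer, then mask down to the table size."""
    assert bucket_count & (bucket_count - 1) == 0, "table size must be a power of two"
    # Heap pointers are typically 16-byte aligned, so the low 4 bits
    # carry no information -- the >>4 shift discards them before mixing.
    return ((ptr >> 4) ^ (ptr >> 9)) & (bucket_count - 1)

print(densemap_bucket(0x1000, 64))  # (0x100 ^ 0x8) & 63 = 8
```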
Phase 3: DISubprogram Iteration (0x29C82BE -- 0x29C84C8)
Walks the subprogram list attached to each compile unit via linked-list traversal ([node+8] = next pointer). For each subprogram, reads the metadata tag byte at [node-18h]:
| Tag byte | DWARF tag | Action |
|---|---|---|
| 0x54 ('T') | DW_TAG_template_parameter | Skip |
| 0x55 ('U') | Compile unit / subprogram variant | Special handling |
| 0x44 ('D') | DW_TAG_subprogram | Validate |
| 0x45 ('E') | DW_TAG_lexical_block | Validate scope chain |
| 0x46 ('F') | DW_TAG_lexical_block_file | Validate scope chain |
| 0x47 ('G') | DW_TAG_namespace | Validate scope chain |
The flag byte at [rdx+21h] & 0x20 tests the "definition" bit (only defined, non-declaration subprograms are tracked). Values outside 0x44--0x47 are flagged as invalid scope types.
Phase 4: Hash Table Construction (0x29C8508 -- 0x29C8AC2)
Allocates and populates eight sorted hash tables via sub_C7D670 (aligned_alloc, alignment=8), each holding 16-byte entries [pointer, secondary_key]:
| Object offset | Table contents | Purpose |
|---|---|---|
| +18h | DISubprogram | Function-level metadata |
| +28h | DIScope | Scope hierarchy |
| +48h | DIGlobalVariable | Module-level variables |
| +58h | DILocalVariable | Function-local variables |
| +78h | DIType | Type descriptions |
| +88h | DIImportedEntity | using declarations |
| +A8h | DILabel | Label metadata |
| +B8h | Retained nodes | Misc retained metadata |
The MDNode operand access pattern used during population:
// MDNode internal layout decoding (0x29C8508+)
byte flags = *(ptr - 0x10);
if (flags & 0x02) { // distinct metadata
operands = *(ptr - 0x20); // operand array is before the node
} else {
int count = (flags >> 2) & 0x0F;
operands = ptr - 0x10 - (count * 8); // inline operands
}
Phase 5: Per-Function Debug Variable Checking (0x29C8B3B -- 0x29C9060)
Iterates every function in the module. For each, looks up its DISubprogram in the hash table and cross-references dbg.value() / dbg.declare() intrinsics against the pre-snapshot. Two diagnostic levels:
ERROR (pass dropped a subprogram entirely):
ERROR: <pass> dropped DISubprogram of <function> from <file>
ERROR: <pass> did not generate DISubprogram for <function> from <file>
WARNING (pass dropped individual variable tracking):
WARNING: <pass> drops dbg.value()/dbg.declare() for <var> from function <func> (file <file>)
The distinction between "dropped" and "did not generate" is significant: "dropped" means metadata existed before the pass and was deleted; "not-generate" means the pass created new IR (e.g., from inlining or outlining) without attaching corresponding debug metadata. This taxonomy is important for GPU compilation because kernel outlining and device function inlining frequently create new IR nodes.
The variable name is resolved by:
- Getting the DISubprogram from the metadata ref
- Calling sub_AF34D0 (DIScope::getScope()) to walk the scope chain upward
- Getting the file via operand [10h] of the scope's file ref
- Calling sub_B91420 (MDString::getString()) to convert the MDString to a StringRef
Phase 6: Per-Instruction Location Verification (0x29C8D42 -- 0x29C8D85)
Delegated to sub_29C3AB0 (5,592 bytes), which performs detailed checks:
- Every instruction with a DebugLoc has a valid DILocation
- DILocation scope chains resolve to a valid DISubprogram
- No orphaned debug locations reference deleted subprograms
- BB-level consistency: all instructions in a basic block share compatible scopes
- Dropped location tracking: emits "dropped DILocation" diagnostics
The JSON output from this sub-pass uses structured field names: "DILocation", "bb-name", "fn-name", "action" (with values "drop" or "not-generate").
Phase 7: JSON Structured Output (0x29C90BC -- 0x29C94E2)
When a non-null JSON output stream is provided (the jsonOutput parameter), the pass serializes a structured report via sub_2241E40 (YAML/JSON serializer):
{"file":"kernel.cu", "pass":"instcombine", "bugs": [
{"metadata":"DISubprogram", "name":"_Z6kernelPf", "fn-name":"_Z6kernelPf", "action":"drop"},
{"metadata":"dbg-var-intrinsic", "name":"idx", "fn-name":"_Z6kernelPf", "action":"not-generate"}
]}
This JSON reporting mechanism is an NVIDIA extension with no upstream LLVM equivalent. It feeds into NVIDIA's internal CI infrastructure to track debug info quality regressions across compiler versions. The "no-name" string serves as fallback when the pass name pointer is NULL.
The serialization calls sub_CB7060 (YAML::IO constructor) and proceeds through sub_C6D380 (object emission), sub_C6C710 (array emission), and sub_C6B0E0 (key writer). After serialization, the stream is flushed via sub_CB7080 and freed via sub_CB5B00. If the file descriptor is valid (fd != -1), it is closed via sub_C837B0 (close(fd)).
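A downstream consumer of this report might aggregate regressions by metadata kind and action. A minimal sketch (the report shape follows the fields shown above; the aggregation itself is illustrative, not part of cicc):

```python
import json
from collections import Counter

def summarize_report(report_text: str) -> Counter:
    """Count debug info regressions by (metadata kind, action) pair."""
    report = json.loads(report_text)
    return Counter((bug["metadata"], bug["action"]) for bug in report["bugs"])

sample = '''{"file":"kernel.cu", "pass":"instcombine", "bugs": [
  {"metadata":"DISubprogram", "name":"_Z6kernelPf", "fn-name":"_Z6kernelPf", "action":"drop"},
  {"metadata":"dbg-var-intrinsic", "name":"idx", "fn-name":"_Z6kernelPf", "action":"not-generate"}
]}'''
print(summarize_report(sample))
```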
Phase 8: Result Reporting and Metadata Reconstruction (0x29C94E2 -- 0x29C9A27)
Prints the summary line ("<pass>: PASS\n" or "<pass>: FAIL\n"), then reconstructs the module's metadata tables from the verified versions -- reallocating subprogram, type, variable, label, and global variable arrays and copying verified metadata back into the compile unit structures.
The result is a 3-way outcome in bit flags (combined at 0x29C9073--0x29C9080 via AND):
- Bit 0: any verification failure (determines PASS/FAIL)
- Bit 1: JSON report was requested and successfully written
The final result is PASS only if all sub-checks passed AND the JSON report (if requested) was successfully written.
Cleanup frees all eight temporary hash tables (each via sub_C7D6A0 -- sized dealloc with alignment 8) and linked-list nodes via j_j___libc_free_0; SmallVector inline buffers are detected by pointer comparison (if ptr == stack_addr, the free is skipped).
Phase 9: Return (0x29C9A12 -- 0x29C9A27)
Returns var_420 (bool) in the al register. Standard epilog restores rbx, r12--r15, rbp.
Complete Diagnostic Code Table
Every diagnostic string emitted by the debug verification subsystem, with exact provenance and trigger conditions.
Verification Pass Diagnostics (sub_29C8000)
| # | Severity | Diagnostic string | Trigger condition | Address range |
|---|---|---|---|---|
| D01 | INFO | ": Skipping module without debug info\n" | "llvm.dbg.cu" absent or empty | 0x29C8000--0x29C807A |
| D02 | ERROR | "ERROR: <pass> dropped DISubprogram of <func> from <file>\n" | DISubprogram existed pre-pass, absent post-pass | 0x29C8C08--0x29C8D2E |
| D03 | ERROR | "ERROR: <pass> did not generate DISubprogram for <func> from <file>\n" | New function has no DISubprogram | 0x29C8C08--0x29C8D2E |
| D04 | WARNING | "WARNING: <pass> drops dbg.value()/dbg.declare() for <var> from function <func> (file <file>)\n" | Variable intrinsics lost for a tracked variable | 0x29C8E4E--0x29C9060 |
| D05 | SUMMARY | "<pass>: PASS\n" | All checks passed | 0x29C94E2+ |
| D06 | SUMMARY | "<pass>: FAIL\n" | Any check failed | 0x29C94E2+ |
| D07 | ERROR | "Could not open file: <path>\n" | JSON report file I/O failure | 0x29C90BC--0x29C94E2 |
Per-Instruction Verifier Diagnostics (sub_29C3AB0)
| # | Severity | Diagnostic string | Trigger condition |
|---|---|---|---|
| D08 | ERROR | "<pass> dropped DILocation" | Instruction had DILocation pre-pass, absent post-pass |
| D09 | ERROR | "<pass> did not generate DISubprogram" | DILocation references nonexistent subprogram |
| D10 | ERROR | (scope chain invalid) | DILocation scope chain does not resolve to a valid DISubprogram |
| D11 | WARNING | (BB inconsistency) | Instructions within a basic block reference incompatible scopes |
JSON Report Field Schema
| # | Field key | Type | Values | Context |
|---|---|---|---|---|
| J01 | "file" | string | Source filename | Top-level report |
| J02 | "pass" | string | Pass name, or "no-name" if NULL | Top-level report |
| J03 | "bugs" | array | Array of bug objects | Top-level report |
| J04 | "metadata" | string | "DISubprogram", "dbg-var-intrinsic", "DILocation" | Per-bug object |
| J05 | "name" | string | Entity name (function or variable) | Per-bug object |
| J06 | "fn-name" | string | Containing function name | Per-bug object |
| J07 | "bb-name" | string | Basic block name | Per-bug object (location bugs) |
| J08 | "action" | string | "drop" or "not-generate" | Per-bug object |
Action Value Taxonomy
| Action | Meaning | Common cause in GPU compilation |
|---|---|---|
| "drop" | Pass explicitly or inadvertently deleted existing debug metadata | Dead code elimination removing a function with debug info |
| "not-generate" | Pass created new IR without attaching corresponding debug metadata | Kernel outlining, device function inlining, or loop transformation creating new BBs |
String Encoding Details
Several diagnostic strings are constructed inline using immediate mov instructions rather than string table references:
| String | Encoding | Instruction |
|---|---|---|
| "ERRO" | 0x4F525245 | mov dword [rsp+X], 0x4F525245 |
| "R:" | 0x3A52 | mov word [rsp+X+4], 0x3A52 |
| "WARNING:" | 0x3A474E494E524157 | mov qword [rsp+X], 0x3A474E494E524157 |
These inline immediate constructions avoid string table lookups and are a common LLVM raw_ostream optimization for short fixed strings.
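The encodings can be checked by reinterpreting each immediate as the little-endian byte string it stores:

```python
def imm_to_str(value: int, width: int) -> str:
    """Decode an x86 immediate as the little-endian ASCII string it encodes."""
    return value.to_bytes(width, "little").decode("ascii")

print(imm_to_str(0x4F525245, 4))          # ERRO
print(imm_to_str(0x3A52, 2))              # R:
print(imm_to_str(0x3A474E494E524157, 8))  # WARNING:
```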
Compile Unit Descriptor Layout
The verification pass reads and reconstructs a per-CU descriptor object (referenced at [rbp+var_440]) with the following layout:
| Offset | Type | Contents | Copy helper |
|---|---|---|---|
+08h | void** | Subprogram array data pointer | -- |
+10h | void** | Subprogram array end pointer | -- |
+18h | size_t | Subprogram count | -- |
+20h | void* | Scope chain data | -- |
+28h | size_t | Scope chain count | -- |
+38h | void** | Global variable array data | -- |
+40h | void** | Global variable array end | -- |
+48h | size_t | Global variable count | -- |
+50h | void* | Local variable list head | -- |
+58h | size_t | Local variable count | -- |
+68h | void** | Type array data | -- |
+70h | void** | Type array end | -- |
+78h | size_t | Type count | -- |
+80h | void* | Imported entities list | sub_29C2230 (32-byte node deep copy) |
+88h | size_t | Imported entities count | -- |
+98h | void** | Label array data | -- |
+A0h | void** | Label array end | -- |
+A8h | size_t | Label count | -- |
+B0h | void* | Retained nodes list | sub_29C0F30 |
+B8h | size_t | Retained nodes count | -- |
DISubprogram Node Layout
Accessed during Phase 3 scope chain validation:
| Offset | Type | Contents |
|---|---|---|
| [node-38h] | void* | Pointer to compile unit / parent scope |
| [node-18h] | byte | Metadata tag byte (DWARF tag discriminator) |
| [node-14h] | uint32 | Flags field (lower 27 bits = operand index) |
| [node+08h] | void* | Next pointer in linked list |
| [node+18h] | void* | Linked list head for child scopes |
| [node+20h] | void* | Linked list tail for child scopes |
| [node+28h] | void* | Variable attachment (DIVariable list) |
| [node+38h] | void* | Additional metadata ref |
| [node+48h] | void* | Subprogram scope list head |
| [node+50h] | void* | Subprogram scope list tail |
Debugify Injector (sub_29C1CB0)
The Debugify injector creates synthetic debug metadata to test whether optimization passes preserve debug info correctly. It is the counterpart to the verifier -- the injector sets up the watermarks, and the verifier checks them.
Named metadata markers:
- "llvm.debugify" -- marks the module as containing synthetic debug info (standard Debugify)
- "llvm.mir.debugify" -- marks MIR-level synthetic debug info
Behavior controlled by debugify-level:
- locations -- inject only DILocation on every instruction (cheaper, tests location preservation)
- location+variables -- inject DILocation plus synthetic dbg.value()/dbg.declare() for every SSA value (full coverage, higher overhead)
The injector assigns monotonically increasing line numbers to every instruction and creates one DILocalVariable per SSA value that produces a result. The variable names follow the pattern "dbg_var_N" where N is the SSA value index. After injection, the module has guaranteed 100% debug coverage, making any coverage loss attributable to the subsequent optimization pass.
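The numbering scheme can be sketched on a toy instruction list -- a hypothetical model of the described behavior, not the actual sub_29C1CB0 code:

```python
def debugify_inject(instructions):
    """Assign a monotonically increasing synthetic line number to every
    instruction and a dbg_var_N variable to every value-producing one."""
    locations, variables = {}, {}
    var_idx = 0
    for line, (name, produces_value) in enumerate(instructions, start=1):
        locations[name] = line
        if produces_value:
            variables[name] = f"dbg_var_{var_idx}"
            var_idx += 1
    return locations, variables

locs, dbg_vars = debugify_inject([("load", True), ("store", False), ("add", True)])
print(locs)      # {'load': 1, 'store': 2, 'add': 3}
print(dbg_vars)  # {'load': 'dbg_var_0', 'add': 'dbg_var_1'}
```

After injection every instruction has a location and every result a variable, so any coverage loss found by the checker is attributable to the pass that ran in between.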
Verbosity Control
Two global flags provide fine-grained control over verification output:
qword_5008FC8 -- Verbose Diagnostic Output Enable
Boolean flag (byte). Controls the output stream selection:
- When 0: uses sub_CB72A0 (null/discard stream constructor) -- diagnostics silently discarded
- When non-zero: uses sub_CB7330 (stderr stream accessor) -- diagnostics printed to stderr
This flag gates the ERROR and WARNING messages. The JSON structured output is controlled separately by the jsonOutput parameter. Setting qword_5008FC8 = 0 suppresses text diagnostics while still producing JSON output.
qword_5008C88 -- Metadata Depth Threshold
Signed 32-bit integer, read at 0x29C8371. Controls how deep the scope chain walk goes:
- When <= 0: the deep scope chain walk is skipped for non-subprogram metadata; only top-level DISubprogram validation runs
- When > 0: full scope chain traversal validates every DILexicalBlock, DILexicalBlockFile, and DINamespace in the hierarchy
This allows production builds to run lightweight verification (subprogram-only) while development builds run exhaustive scope chain checking.
Debugify-Specific Knobs
| Knob | Type | Default | Registration | Effect |
|---|---|---|---|---|
| debugify-quiet | bool | off | ctor_493 at 0x556960 | Suppress all debugify text output |
| debugify-func-limit | int | unlimited | ctor_493 at 0x556960 | Max functions to inject synthetic debug info into |
| debugify-level | enum | location+variables | ctor_493 at 0x556960 | locations or location+variables |
| debugify-function | string | -- | ctor_493 at 0x556960 | Restrict debugify to a single named function |
| check-debugify-function | string | -- | ctor_493 at 0x556960 | Restrict check-debugify to a single named function |
| debugify-each | bool | off | ctor_377 at 0x516190 | Wrap every pass in debugify/check-debugify |
| debugify-export | string | -- | ctor_377 at 0x516190 | Export debugify results to file |
GPU Debug Info: What PTX Needs
DWARF for PTX differs fundamentally from DWARF for x86. PTX is a virtual ISA -- there are no physical registers, no real stack, and no fixed instruction encoding. The debug metadata cicc emits serves two consumers: cuda-gdb (which maps PTX locations back to source) and ptxas (which carries debug info forward into SASS/ELF for the hardware debugger).
The .loc Directive
The AsmPrinter (sub_31D55F0) emits DWARF .loc directives before each PTX instruction that has a valid DebugLoc:
.loc 1 42 0 // file 1, line 42, column 0
ld.param.u64 %rd1, [_Z6kernelPf_param_0];
.loc 1 43 5
mul.wide.u32 %rd2, %r1, 4;
The .file directives (sub_31E4280) establish the file table, and sub_31E6100 maintains a file/line-to-MCSymbol mapping for line table construction.
The dwarf-extended-loc knob (enum: Default/Enable/Disable, registered at 0x490000 range) controls whether extended flags appear in .loc directives. When disabled, cicc emits bare .loc file line column without the is_stmt, prologue_end, or discriminator extensions. This is relevant because older ptxas versions do not parse extended .loc flags.
The line-info-inlined-at Extension
The -line-info-inlined-at LLVM knob (registered at ctor_043 / 0x48D7F0, exposed as -no-lineinfo-inlined-at in the cicc CLI, which sets -line-info-inlined-at=0 on the backend) controls whether inlined-at chains are preserved in PTX line info. When enabled (the default), every .loc directive for inlined code carries the full inlining chain so cuda-gdb can reconstruct the call stack at any point in the inlined code. When disabled, only the immediate source location is emitted, losing the inlining context but producing smaller PTX.
The -show-src / nvptx-emit-src Feature
The -show-src CLI flag (stored at flag struct offset +808, routed to the backend as -nvptx-emit-src) enables source line interleaving in PTX output. When active, the AsmPrinter annotates each .loc directive with the corresponding source line as a PTX comment:
// kernel.cu:42 float val = input[idx];
.loc 1 42 0
ld.global.f32 %f1, [%rd2];
// kernel.cu:43 val = val * val;
.loc 1 43 0
mul.f32 %f2, %f1, %f1;
This is purely a readability feature for developers inspecting PTX output. It has no effect on cuda-gdb or debug quality -- the source text is embedded as comments that ptxas ignores.
NvvmDebugVersion
The NVVM container format includes a debug version field (NvvmDebugVersion, packed as {Major:uint16, Minor:uint16} at container offset 0x08--0x09). The current version is Major=3, Minor<=2. The reader (sub_CD41B0) validates that Major equals 3 and warns if Minor exceeds 2. If absent, the default {3, 2} is assumed. This version tracks the debug metadata schema independently of the NVVM IR version, allowing debug format evolution without breaking IR compatibility.
The standalone pipeline (sub_12BFF60) performs a consistency check: if the container declares debug_info_present (bit 4 of flags) AND the debug mode flag is set AND the debug version has not been validated, it returns error code 3 (incompatible).
DbgRecord Format (LLVM 20)
cicc v13.0 uses LLVM 20's DbgRecord format by default (write-experimental-debuginfo = true, registered at ctor_025). This replaces traditional dbg.value()/dbg.declare() intrinsics with non-intrinsic debug records attached directly to instructions. Related knobs:
| Knob | Default | Registration | Effect |
|---|---|---|---|
write-experimental-debuginfo | true | ctor_025 | Use DbgRecord format for new debug info |
write-experimental-debuginfo-iterators-to-bitcode | true | ctor_018 | Serialize DbgRecords to bitcode |
preserve-input-debuginfo-format | false | ctor_018 | When true, preserve whichever format the input uses |
The verifier handles both formats: it checks for dbg.value()/dbg.declare() intrinsics AND for DbgRecord attachments.
Debug Info Stripping Passes
cicc includes a family of debug info stripping passes registered in the pipeline parser (at sub_12C6910 and related):
| Pipeline name | Slot | LLVM pass | Effect |
|---|---|---|---|
"strip-dead-debug-info" | #110 | StripDeadDebugInfoPass | Remove debug info for dead functions/globals |
"strip-debug-declare" | #112 | StripDebugDeclarePass | Remove dbg.declare() intrinsics only |
"strip-nondebug" | #113 | StripNonDebugSymbolsPass | Remove non-debug symbols (keep debug) |
"strip-nonlinetable-debuginfo" | #114 | StripNonLineTableDebugInfoPass | Strip everything except line tables |
The strip-nonlinetable-debuginfo pass is the key one for the -generate-line-info mode: it strips all debug metadata except .loc / .file directives, producing line-number-only debug info without variable locations, type descriptions, or scope trees. This is what nvcc's --generate-line-info flag triggers -- enough for profiler source correlation but not enough for stepping through code in cuda-gdb.
The core debug info stripping implementation lives at 0xAE0000 (Zone 3 of the type system module); it calls stripDebugInfo() to remove all llvm.dbg.* intrinsics from the module.
Debug Compilation Modes
cicc supports three debug info levels, controlled by CLI flags that route through the flag dispatch table:
| CLI flag | Flag offset | Backend routing | Debug level |
|---|---|---|---|
-g | +296 | -debug-compile to both linker and optimizer | Full debug info (FullDebug emission kind) |
-generate-line-info | +328 | -generate-line-info to optimizer only | Line tables only (LineTablesOnly emission kind) |
| (neither) | -- | -- | No debug info (NoDebug) |
When -g is active, cicc emits DICompileUnit with full emission kind, preserves all DISubprogram, DILocalVariable, DIType, and scope metadata through the pipeline, and the backend emits complete DWARF sections. The verifier runs at full depth.
When -generate-line-info is active, the StripNonLineTableDebugInfoPass runs early in the pipeline, leaving only line table metadata. The verifier still runs but only checks DILocation / DISubprogram consistency (variable checks are skipped because the variable metadata was intentionally stripped).
Key routing difference: -g routes to BOTH the linker (-debug-compile) and optimizer (-debug-compile), because libdevice linking needs the debug flag to preserve user debug info during merging. -generate-line-info routes to the optimizer only.
The frontend uses two independent guard mechanisms for debug emission:
- dword_4D046B4 -- global flag checked at statement/parameter level by sub_9433F0 (per-param debug) and sub_943430 (per-global debug)
- [ctx+0x170] -- compile unit pointer checked at module finalization level by sub_915400
The NVVM container carries a dedicated DebugInfo enum (3 values: NONE, LINE_INFO, DWARF) at deserialized struct offset +12, separate from the module metadata.
Complete Knob Reference
| Knob | Type | Default | Registration | Effect |
|---|---|---|---|---|
-g / -debug-compile | bool | off | ctor_043 at 0x48D7F0 | Full debug compilation |
-generate-line-info | bool | off | ctor_043 at 0x48D7F0 | Line tables only |
-no-lineinfo-inlined-at | bool | off | CLI flag dispatch | Disable inlined-at tracking (sets -line-info-inlined-at=0) |
-show-src / -nvptx-emit-src | bool | off | Flag offset +808 | Interleave source in PTX comments |
dwarf-extended-loc | enum | Default | 0x490000 range | Default/Enable/Disable extended .loc flags |
dwarf-version | unsigned | (platform) | LLVM default | DWARF version for debug sections |
debugify-each | bool | off | ctor_377 at 0x516190 | Run Debugify+CheckDebugify around every pass |
debugify-level | enum | location+variables | ctor_493 at 0x556960 | locations or location+variables |
debugify-quiet | bool | off | ctor_493 at 0x556960 | Suppress debugify diagnostics |
debugify-func-limit | int | unlimited | ctor_493 at 0x556960 | Max functions to debugify |
debugify-function | string | -- | ctor_493 at 0x556960 | Restrict debugify to named function |
check-debugify-function | string | -- | ctor_493 at 0x556960 | Restrict check-debugify to named function |
debugify-export | string | -- | ctor_377 at 0x516190 | Export debugify results to file |
verify-each | bool | off | ctor_043 at 0x48D7F0 | Run IR verifier after every pass |
verify-after-all | alias | -- | ctor_043 at 0x48D7F0 | Alias for verify-each |
verify-debuginfo-preserve | bool | off | ctor_376 at 0x512DF0 | Enable debug info preservation checking |
verify-each-debuginfo-preserve | bool | off | ctor_377 at 0x516190 | Per-pass debug info preservation |
verify-di-preserve-export | string | -- | ctor_377 at 0x516190 | Export preservation results to file |
no-inline-line-tables | bool | off | sub_29E2B40 | Prevent inlining from merging line tables |
write-experimental-debuginfo | bool | true | ctor_025 | Use DbgRecord format |
preserve-input-debuginfo-format | bool/default | false | ctor_018 | Preserve input debug format |
qword_5008FC8 | bool | off | -- | Verbose diagnostic output enable |
qword_5008C88 | int32 | >0 | -- | Metadata depth threshold (<=0 skips deep scope walk) |
CAN_FINALIZE_DEBUG | env var | -- | sub_60F290 et al. | Debug finalization control |
NVVM_IR_VER_CHK | env var | enabled | sub_12BFF60 | Override debug version checking (set "0" to disable) |
DWARF Emission Backend
The actual DWARF section emission lives in a separate module at 0x3990000--0x39DF000:
| Address | Size | Function |
|---|---|---|
sub_399B1E0 | 29KB | DwarfDebug::beginModule() -- initializes from llvm.dbg.cu |
sub_3997B50 | 33KB | .debug_aranges emission |
sub_399D1D0 | 12KB | Range list emission (DW_RLE_*) |
sub_399EB70 | 12KB | Register location expressions |
sub_39BDF60 | 38KB | .debug_names accelerator table |
sub_39B6390 | 33KB | DWARF form size calculator |
sub_215ACD0 | 8.1KB | Module-level emission entry (NVPTX Debug Info Emission) |
The module-level entry sub_215ACD0 checks *(a1+240)->field_344 to determine if DWARF is enabled, then looks up the "NVPTX DWARF Debug Writer" / "NVPTX Debug Info Emission" pass info. The NVPTX backend does not emit physical register locations (GPUs have no DWARF register numbering scheme that maps to hardware); instead, it emits virtual register references that cuda-gdb resolves through ptxas's SASS-level debug info.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| "llvm.global_ctors" utility | sub_29C00F0 | -- | -- |
| errs() diagnostic output stream accessor | sub_29C0AE0 | -- | -- |
| PassManager / PassAdaptor infrastructure ("PassManager", "PassAdaptor") | sub_29C0DC0 | -- | -- |
| Copy retained-nodes list (SmallVector deep copy) | sub_29C0F30 | -- | -- |
| Copy local-variable list | sub_29C1060 | -- | -- |
| Copy scope-chain list | sub_29C1190 | -- | -- |
| Validate scope chain connectivity | sub_29C12C0 | -- | -- |
| Debugify synthetic debug info injector ("llvm.debugify", "llvm.mir.debugify") | sub_29C1CB0 | -- | -- |
| Merge/update tracking sets after verification | sub_29C1F00 | -- | -- |
| Serialize verification result to stream | sub_29C20D0 | -- | -- |
| Copy imported-entities list (32-byte node deep copy) | sub_29C2230 | -- | -- |
| Per-instruction DILocation verifier | sub_29C3AB0 | 5,592B | -- |
| DenseMap::FindAndConstruct for tracking map | sub_29C5270 | -- | -- |
| Set insert with metadata key normalization | sub_29C6AD0 | -- | -- |
| Set insert variant (different key extraction) | sub_29C6DE0 | -- | -- |
| Debug info verification pass (main entry) | sub_29C8000 | 12,480B | -- |
| no-inline-line-tables flag handler | sub_29E2B40 | -- | -- |
| NewPMCheckDebugifyPass wrapper | sub_22702B0 | -- | -- |
| NewPMDebugifyPass wrapper | sub_2270390 | -- | -- |
| VerifierPass wrapper (standard IR verifier) | sub_2270470 | -- | -- |
| Pass pipeline text parser | sub_2272BE0 | 14KB | -- |
| buildDefaultPipeline() equivalent | sub_2277440 | 60KB | -- |
| Flag filter (checks -debug-compile, -g, -generate-line-info) | sub_12C6910 | -- | -- |
| Emit per-instruction .loc DWARF directive | sub_31D55F0 | -- | -- |
| Emit .file/.loc directives (function scope) | sub_31E4280 | -- | -- |
| insertDebugLocEntry (file/line to symbol mapping) | sub_31E6100 | -- | -- |
| DwarfDebug::beginModule() | sub_399B1E0 | 29KB | -- |
| .debug_aranges emission | sub_3997B50 | 33KB | -- |
| Module-level emission entry / NVPTX Debug Info Emission | sub_215ACD0 | 8.1KB | -- |
| NVVM IR version + debug version validator | sub_12BFF60 | ~9KB | -- |
| NVVM container debug version check | sub_CD41B0 | -- | -- |
| Emit DILocalVariable for parameter (frontend) | sub_9433F0 | -- | -- |
| Emit debug info for GlobalVariable (frontend) | sub_943430 | -- | -- |
| Set DebugLoc from EDG source position (frontend) | sub_941230 | -- | -- |
| Finalize: "Debug Info Version" = 3 (frontend) | sub_915400 | -- | -- |
LLVM Infrastructure Functions Used
| Address | Identity | Called from |
|---|---|---|
sub_BA8DC0 | Module::getNamedMetadata(StringRef) | Phase 1 |
sub_B2FC80 | isa<DISubprogram> or similar MDNode type check | Phase 3 |
sub_B2FC00 | MDNode type check (different metadata kind) | Phase 3 |
sub_B92180 | MDNode::getContext() | Phase 4 |
sub_B91420 | MDString::getString() | Phase 5 |
sub_B91A10 | MDNode::getOperand(unsigned) | Phase 4 |
sub_B14240 | MDNode operand range iterator | Phase 4 |
sub_AF34D0 | DIScope::getScope() -- walk scope chain upward | Phase 5 |
sub_AF4500 | DISubprogram::describes(Function) | Phase 5 |
sub_B58DC0 | DenseSet::insert | Phase 2 |
sub_B96E90 | DenseMap::insert_or_assign | Phase 4 |
sub_B91220 | DenseMap::erase | Phase 8 |
sub_C7D670 | aligned_alloc(size, alignment=8) | Phase 4 |
sub_C7D6A0 | aligned_free_sized(ptr, size, alignment=8) | Phase 8 |
sub_CB7330 | errs() -- get stderr raw_ostream | Phase 5 |
sub_CB72A0 | nulls() -- get null/discard raw_ostream | Phase 5 (quiet mode) |
sub_CB6200 | raw_ostream::write(const char*, size_t) | Phase 5, 7 |
sub_CB5D20 | raw_ostream::write(char) | Phase 5 |
sub_CB5B00 | raw_ostream destructor / free | Phase 7 |
sub_CB7060 | YAML::IO output constructor | Phase 7 |
sub_CB7080 | raw_ostream::flush() | Phase 7 |
NVIDIA Modifications vs Stock LLVM
The key differences from upstream LLVM's CheckDebugInfoPass:
- JSON structured output -- Upstream only prints text diagnostics. NVIDIA added a YAML/JSON serializer (sub_2241E40, sub_CB7060) that produces machine-parseable bug reports with "file", "pass", "bugs" fields and per-bug "action" classification ("drop" vs "not-generate").
- Verbosity control -- Two global flags (qword_5008FC8 for output enable, qword_5008C88 for depth threshold) allow fine-grained control over verification overhead. Upstream has only the debugify-quiet knob.
- Eight-table metadata tracking -- Upstream CheckDebugInfoPass tracks DISubprograms and debug variable intrinsics. NVIDIA's version maintains eight separate hash tables covering subprograms, scopes, global variables, local variables, types, imported entities, labels, and retained nodes -- a much more comprehensive snapshot.
- Metadata reconstruction -- After verification, NVIDIA's pass reconstructs the module's metadata tables from the verified versions (Phase 8), which upstream does not do. This means the verifier can also serve as a "repair" pass that normalizes metadata after an optimization pass corrupts it.
- No kernel-specific handling -- The verifier treats __global__ and __device__ functions identically. CUDA-specific debug info (address space annotations, shared memory debug, warp-level location info) is validated elsewhere, likely during NVPTX backend emission.
- DbgRecord format support -- cicc v13.0 defaults to the LLVM 20 DbgRecord format (write-experimental-debuginfo = true), so the verifier handles both intrinsic-based and record-based debug info transparently.
Cross-References
- AsmPrinter & PTX Body Emission -- .loc / .file directive emission, per-instruction debug annotation
- PTX Emission -- module-level emission entry, DWARF debug writer lookup
- Debug Info Pipeline -- end-to-end debug info flow from frontend to backend
- CLI Flags -- -g, -generate-line-info, -show-src flag routing
- LLVM Knobs -- debugify-*, verify-each, dwarf-* knobs
- Pipeline & Ordering -- where debug verification fits in the pass pipeline
- Hash Infrastructure -- DenseMap/DenseSet implementation used by tracking tables
- Diagnostics -- broader diagnostic and remark system
- NVVM Container -- NvvmDebugVersion field
Bitcode Reader/Writer
CICC v13.0 contains the complete LLVM 20.0.0 bitcode serialization infrastructure -- reader, writer, metadata loader, module summary IO, and the full intrinsic upgrader -- spread across two address ranges. The 0x9F0000--0xA2FFFF range hosts a first copy of the bitcode reader/writer core used by the standalone libNVVM pipeline, while the 0x1500000--0x157FFFF range hosts the primary copy used by the two-phase compilation path. Both copies are structurally identical builds of LLVM's BitcodeReader.cpp and BitcodeWriter.cpp, linked at different addresses. The reader is stock upstream LLVM 20.0.0 with no NVIDIA modifications to the deserialization logic itself. The writer, however, contains a single critical NVIDIA change: it stamps "LLVM7.0.1" as the bitcode producer identification string rather than the true "LLVM20.0.0", preserving backward compatibility with the NVVM IR ecosystem.
The bitcode subsystem sits at the boundary between all pipeline stages. The standalone pipeline validates magic bytes on entry, the module linker reads bitcode from separate compilation objects, the two-phase orchestrator serializes per-function bitcode blobs between Phase I and Phase II, and the NVVM container wraps bitcode payloads in a proprietary envelope. Every bitcode load also runs the intrinsic upgrader -- a 700+ KB AutoUpgrade subsystem that includes roughly 240 KB of effectively-dead x86 intrinsic renaming tables.
Key Facts
| Property | Value |
|---|---|
| Reader (primary copy) | sub_151B070 (0x151B070, 123 KB) -- parseFunctionBody |
| Reader (standalone copy) | sub_9F2A40 (0x9F2A40, 185 KB) -- parseFunctionBody |
| Writer | sub_1538EC0 (0x1538EC0, 58 KB) -- writeModule |
| Metadata reader | sub_A09F80 (0xA09F80, 121 KB) -- MetadataLoader::parseOneMetadata |
| X86 AutoUpgrade (name) | sub_156E800 (0x156E800, 593 KB) -- UpgradeIntrinsicFunction |
| X86 AutoUpgrade (call) | sub_A939D0 (0xA939D0, 457 KB) -- UpgradeIntrinsicCall |
| NVVM version checker | sub_157E370 (0x157E370, 7 KB) |
| NVVM version checker (standalone) | sub_12BFF60 (0x12BFF60, 9 KB) |
| Producer init (ctor_036) | 0x48CC90 (544 bytes) -- reads LLVM_OVERRIDE_PRODUCER |
| Producer init (ctor_154) | 0x4CE640 (215 bytes) -- reads LLVM_OVERRIDE_PRODUCER |
| Address range (primary) | 0x1500000--0x157FFFF |
| Address range (standalone copy) | 0x9F0000--0xA2FFFF |
| Address range (AutoUpgrade) | 0xA80000--0xABFFFF |
| Hardcoded producer string | "LLVM7.0.1" (writer), "20.0.0" (internal fallback) |
| NVVM IR version gate | major == 3, minor <= 2 |
| Upstream source | lib/Bitcode/Reader/BitcodeReader.cpp, lib/Bitcode/Writer/BitcodeWriter.cpp, lib/IR/AutoUpgrade.cpp |
Bitcode Format Basics
LLVM bitcode uses two magic signatures. The pipeline validates both at module load time:
| Magic Bytes | Meaning | Where Checked |
|---|---|---|
0x42 0x43 0xC0 0xDE ('B' 'C' 0xC0 0xDE) | Raw LLVM bitcode stream | sub_12C06E0 (module linker)
0xDE 0xC0 0x17 0x0B (0x0B17C0DE little-endian) | Bitcode wrapper format (offset + size header around raw stream) | Same function
If neither signature matches, the pipeline sets *error_code = 9 ("invalid bitcode") and aborts. The wrapper format is more common in practice -- nvcc generates wrapper-format .bc files that embed the raw stream at an offset specified in the wrapper header. The wrapper header is 20 bytes:
struct BitcodeWrapperHeader {
uint32_t magic; // 0x0B17C0DE (bytes 0xDE 0xC0 0x17 0x0B on disk)
uint32_t version; // wrapper version (0)
uint32_t offset; // byte offset to raw bitcode within file
uint32_t size; // size of raw bitcode in bytes
uint32_t cpu_type; // target CPU type (0 for NVPTX)
};
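A minimal sniffing routine matching these two signatures, per upstream LLVM's magic definitions (sniffBitcode and BitcodeKind are illustrative names; the Invalid case corresponds to the *error_code = 9 path in sub_12C06E0):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

enum class BitcodeKind { Raw, Wrapper, Invalid };

// Illustrative sketch of the magic-byte check performed at module load.
BitcodeKind sniffBitcode(const uint8_t *buf, std::size_t len) {
    if (len < 4) return BitcodeKind::Invalid;
    static const uint8_t rawMagic[4]  = {0x42, 0x43, 0xC0, 0xDE}; // 'B','C',0xC0,0xDE
    static const uint8_t wrapMagic[4] = {0xDE, 0xC0, 0x17, 0x0B}; // 0x0B17C0DE, little-endian
    if (std::memcmp(buf, rawMagic, 4) == 0)  return BitcodeKind::Raw;
    if (std::memcmp(buf, wrapMagic, 4) == 0) return BitcodeKind::Wrapper;
    return BitcodeKind::Invalid;                                  // -> "invalid bitcode"
}
```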
After magic validation, the bitstream enters the block-structured reader. LLVM bitcode is organized into nested blocks, each identified by a block ID. The reader uses abbreviation tables (defined in BLOCKINFO blocks) to decode records within each block efficiently using variable-bit-rate (VBR) encoding.
An epoch check runs after magic validation: "Incompatible epoch: Bitcode '<X>' vs current: '<Y>'". This ensures the bitcode was produced by a compatible LLVM generation.
Bitcode Reader
Module Parser (sub_1505110, 60 KB)
The top-level entry reads MODULE_BLOCK records from the bitcode stream. It processes:
- Global variable declarations and definitions
- Function declarations (bodies are deferred for lazy materialization)
- Calling conventions and comdat groups
- Module-level metadata, type tables, and value symbol tables
- Data layout and target triple strings
Error strings: "Invalid calling convention ID", "Invalid function comdat ID", "Invalid global variable comdat ID", "Invalid type for value".
parseFunctionBody (sub_151B070 / sub_9F2A40)
The function body parser is the largest single reader function. The standalone copy sub_9F2A40 is 185 KB (5,706 decompiled lines) with 174 error string references. The primary copy sub_151B070 is 123 KB. Both decode the same FUNCTION_BLOCK records:
- 57 FUNC_CODE instruction record types (switch cases 1--65), covering every LLVM IR opcode: INST_BINOP, INST_CAST, INST_GEP, INST_SELECT, INST_CMP, INST_RET, INST_BR, INST_SWITCH, INST_INVOKE, INST_CALL (opcode 85), INST_UNREACHABLE, INST_PHI, INST_ALLOCA, INST_LOAD, INST_STORE, INST_ATOMICRMW, INST_CMPXCHG, INST_FENCE, INST_EXTRACTVAL, INST_INSERTVAL, INST_LANDINGPAD, INST_RESUME, INST_CLEANUPPAD, INST_CATCHPAD, INST_CATCHSWITCH, INST_CALLBR, INST_FREEZE, and others.
- 4 nested sub-blocks: constants (0xB), metadata (0xE), use-list order (0x10), operand bundles (0x12).
- 53 unique error strings, including: "Alignment value is too large", "Invalid record", "Invalid record: Unsupported version of DISubrange", "METADATA_NAME not followed by METADATA_NAMED_NODE".
For each INST_CALL record (opcode 85), the reader calls into the AutoUpgrade machinery to rename deprecated intrinsics. This is the hook that triggers the 700+ KB x86 upgrader on every call instruction -- even though the upgrader's x86 branches are dead code for NVPTX targets.
Pseudocode for the top-level body parse loop:
Error parseFunctionBody(Function *F) {
SmallVector<uint64_t, 64> Record;
while (true) {
BitstreamEntry Entry = Stream.advance();
switch (Entry.Kind) {
case BitstreamEntry::Error:
return error("Malformed block");
case BitstreamEntry::EndBlock:
return resolveForwardRefs();
case BitstreamEntry::SubBlock:
switch (Entry.ID) {
case CONSTANTS_BLOCK_ID: // 0xB
parseConstants(); break;
case METADATA_BLOCK_ID: // 0xE
parseMetadataAttachment(); break;
case USELIST_BLOCK_ID: // 0x10
parseUseListBlock(); break;
case OPERAND_BUNDLE_TAGS_BLOCK_ID: // 0x12
parseOperandBundleTags(); break;
}
break;
case BitstreamEntry::Record:
unsigned Code = Stream.readRecord(Entry.ID, Record);
switch (Code) {
case FUNC_CODE_INST_BINOP: /* ... */ break;
case FUNC_CODE_INST_CAST: /* ... */ break;
// ... 55 more cases ...
case FUNC_CODE_INST_CALL:
// Parse callee, args, calling convention
// If callee is intrinsic:
// UpgradeIntrinsicFunction(callee, &newCallee);
// if (newCallee) UpgradeIntrinsicCall(CI, newCallee);
break;
}
}
}
}
Lazy Materialization (sub_1503DC0, 13 KB)
Function bodies are not parsed eagerly. The module parser records each function's byte offset in the bitcode stream, and materializeFunctions seeks to that position on demand. Error strings: "Could not find function in stream", "Expect function block", "Expect SubBlock", "Trying to materialize functions before seeing function blocks". The two-phase compilation exploits this by materializing individual functions for per-function Phase II optimization.
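The offset-record-and-seek idea can be sketched with a hypothetical structure (names invented; the real materializer works on a bitstream cursor, not a map):

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical model of deferred materialization: the module parser
// records a stream offset per function body and parses it on demand.
struct LazyModule {
    std::map<std::string, uint64_t> deferred; // function name -> byte offset
    int materialized = 0;

    bool materialize(const std::string &fn) {
        auto it = deferred.find(fn);
        if (it == deferred.end())
            return false; // cf. "Could not find function in stream"
        // Real reader: seek to it->second, then parseFunctionBody(...)
        deferred.erase(it);
        ++materialized;
        return true;
    }
};
```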
Bitstream Infrastructure
| Function | Address | Size | Role |
|---|---|---|---|
readBlockInfoBlock | 0x150F8E0 | 42 KB | Reads BLOCKINFO block (abbreviation definitions) |
readAbbreviatedField | 0x1510D70 | 38 KB | Expands abbreviated records (fixed, VBR, array, blob) |
readAbbrevRecord | 0x1513230 | 20 KB | Reads one abbreviation-defined record |
readRecord | 0x150E2B0 | 19 KB | Core BitstreamCursor::readRecord |
parseMetadataBlock | 0x1518180 | 29 KB | Parses METADATA_BLOCK for function-level metadata |
parseFunctionMetadata | 0x1520420 | 32 KB | Metadata/value-table builder during function parse |
parseMetadataStrings | 0x1522160 | 13 KB | Reads metadata string table |
parseTypeBlock / constants | 0x15083D0 | 26 KB | TYPE_BLOCK or CONSTANTS_BLOCK parser |
parseValueRecord | 0x1515740 | 9 KB | Value record decoder |
string table reader | 0x15140E0 | 13 KB | Bitcode string table entries |
readBlobRecord | 0x1514C40 | 9 KB | Blob-type record reader |
skipBlock | 0x15127D0 | 13 KB | Block skipping and cursor navigation |
parseModuleSummaryIndex | 0x150B5F0 | 63 KB | ThinLTO summary parser |
materializeFunctions | 0x1503DC0 | 13 KB | Lazy function body materialization |
parseModule | 0x1505110 | 60 KB | Top-level MODULE_BLOCK parser |
ThinLTO GUID lookup | 0x150A160 | 7 KB | GUID-based summary index lookup |
parseGlobalInits | 0x1504A60 | 8 KB | Global variable initializer parser |
Bitcode Writer
writeModule (sub_1538EC0, 58 KB)
The top-level writer serializes an entire Module to a bitcode stream. It orchestrates sub-writers in a fixed order:
1. Enumerate all values via ValueEnumerator (sub_15467B0, 23 KB)
2. Write identification block (with producer string -- see next section)
3. Write MODULE_BLOCK header
4. Write type table (sub_1530240, 12 KB)
5. Write attribute groups (sub_152F610, 8 KB)
6. Write global variables
7. Write function declarations
8. For each defined function: writeFunction (sub_1536CD0, 40 KB)
9. Write metadata (sub_1531F90, 27 KB) + metadata records (sub_15334D0, 8 KB)
10. Write value symbol table (sub_1533CF0, 16 KB)
11. Write named metadata / comdat records (sub_15311A0, 14 KB)
12. If ThinLTO: write module summary (sub_1535340, 26 KB)
writeFunction (sub_1536CD0, 40 KB)
Writes one FUNCTION_BLOCK containing all instructions, each encoded via writeInstruction (sub_1528720, 27 KB). Instructions are encoded as (opcode, operand_ids...) records where operand IDs are relative to the value table. The writer uses abbreviations for compact encoding of common instruction patterns.
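The relative-ID convention is standard LLVM bitcode practice and is assumed to apply unchanged here: an operand is written as the backward distance from the current instruction's value ID, so nearby definitions encode as small integers that VBR packs tightly.

```cpp
#include <cassert>
#include <cstdint>

// Relative operand references: nearby definitions -> small numbers.
uint64_t encodeRelOperand(uint64_t instValueID, uint64_t operandValueID) {
    return instValueID - operandValueID;
}
uint64_t decodeRelOperand(uint64_t instValueID, uint64_t rel) {
    return instValueID - rel;
}
```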
Value Enumeration
Before writing, the ValueEnumerator assigns a dense numeric ID to every value in the module. This is the reverse of what the reader does (mapping IDs back to Values).
| Function | Address | Size | Role |
|---|---|---|---|
enumerateModule | 0x15467B0 | 23 KB | Top-level module enumeration |
enumerateValues | 0x1542B00 | 26 KB | Assigns numeric IDs to all values |
optimizeConstants | 0x1548410 | 8 KB | Reorders constants for better compression |
TypeFinder helper | 0x153E1D0 | 7 KB | Recursive type discovery |
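The dense-numbering idea behind enumerateValues can be sketched as follows (hypothetical structure keyed by name for illustration, not the recovered implementation): each value gets the next sequential ID on first sight, and later uses resolve to the same ID.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical dense value numbering.
struct ValueNumbering {
    std::unordered_map<std::string, unsigned> ids;
    std::vector<std::string> order; // ID -> value: the reader's direction

    unsigned enumerate(const std::string &v) {
        auto it = ids.find(v);
        if (it != ids.end()) return it->second;   // already numbered
        unsigned id = static_cast<unsigned>(order.size());
        ids.emplace(v, id);
        order.push_back(v);
        return id;
    }
};
```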
Writer Function Map
| Function | Address | Size | Role |
|---|---|---|---|
writeModule | 0x1538EC0 | 58 KB | Top-level module serializer |
writeFunction | 0x1536CD0 | 40 KB | Per-function FUNCTION_BLOCK writer |
writeMetadata | 0x1531F90 | 27 KB | METADATA_BLOCK writer |
writeInstruction | 0x1528720 | 27 KB | Single instruction encoder |
writeModuleSummary | 0x1535340 | 26 KB | ThinLTO summary serializer |
writeValueSymbolTable | 0x1533CF0 | 16 KB | VALUE_SYMTAB_BLOCK writer |
writeNamedMetadata | 0x15311A0 | 14 KB | Named metadata / comdat writer |
writeType / globalVar | 0x1530240 | 12 KB | Type descriptors or global variable records |
emitAbbreviation | 0x152AB40 | 11 KB | Abbreviation definition writer |
emitRecord | 0x152A250 | 9 KB | Low-level record emission |
writeConstants helper | 0x1527BB0 | 9 KB | Constant value encoder |
writeMetadataRecords | 0x15334D0 | 8 KB | Dispatcher for 37 metadata node types |
writeAttributeGroup | 0x152F610 | 8 KB | ATTRIBUTE_GROUP_BLOCK writer |
emitVBR | 0x15271D0 | 7 KB | Variable bit-rate integer encoding |
emitCode | 0x15263C0 | 7 KB | Core abbreviated/unabbreviated record emission |
emitBlob | 0x1528330 | -- | Blob data emission |
Producer String Hack
This is the single most important NVIDIA deviation in the bitcode subsystem. Two global constructors cooperate to set the producer identification string:
ctor_036 at 0x48CC90 (544 bytes): Reads LLVM_OVERRIDE_PRODUCER from the environment. If unset, falls back to the string "20.0.0" (the true LLVM version). Stores the result in the global qword_4F837E0. Also registers disable-bitcode-version-upgrade (cl::opt<bool>).
ctor_154 at 0x4CE640 (215 bytes): Also reads LLVM_OVERRIDE_PRODUCER. Falls back to "7.0.1". Stores into a separate global.
When writeModule (sub_1538EC0) writes the IDENTIFICATION_BLOCK, it emits the string "LLVM7.0.1" as the producer. This is assembled from the prefix "LLVM" plus the version string "7.0.1" loaded from the ctor_154 global.
The consequence is that any tool reading CICC's output bitcode (including older libNVVM, nvdisasm, or third-party NVVM IR consumers) sees producer "LLVM7.0.1" and interprets the bitcode as LLVM 7.x-era IR. Internally, the IR is LLVM 20.0.0 -- all modern instruction opcodes, metadata formats, and type encodings are present. The producer string is purely a compatibility marker that tells downstream tools which NVVM IR version spec to apply, not the actual LLVM version.
Why 7.0.1 specifically: NVVM IR 2.0 was defined against LLVM 7.0.1. The NVVM toolchain ecosystem (libNVVM, nvcc's device compilation pipeline) standardized on this version string as the "NVVM IR format identifier." Upgrading the producer string would require coordinated changes across the entire CUDA toolkit and all consumers.
// Pseudocode for producer string initialization
static const char *producer_version;
void ctor_036() { // at 0x48CC90
const char *env = getenv("LLVM_OVERRIDE_PRODUCER");
if (!env) env = "20.0.0"; // true LLVM version
global_4F837E0 = env;
// Also registers: -disable-bitcode-version-upgrade (cl::opt<bool>)
}
void ctor_154() { // at 0x4CE640
const char *env = getenv("LLVM_OVERRIDE_PRODUCER");
if (!env) env = "7.0.1"; // NVVM IR compat marker
producer_version = env;
}
// In writeModule (sub_1538EC0):
void writeIdentificationBlock(BitstreamWriter &Stream) {
Stream.EnterSubblock(IDENTIFICATION_BLOCK_ID);
// Writes: "LLVM" + producer_version → "LLVM7.0.1"
Stream.EmitRecord(IDENTIFICATION_CODE_STRING, std::string("LLVM") + producer_version);
Stream.EmitRecord(IDENTIFICATION_CODE_EPOCH, CurrentEpoch);
Stream.ExitBlock();
}
Reimplementation note: A reimplementation must write "LLVM7.0.1" as the producer for compatibility with the existing NVVM ecosystem. Setting LLVM_OVERRIDE_PRODUCER to a different value will change the embedded string. The disable-bitcode-version-upgrade flag controls whether the reader's AutoUpgrade logic activates for version-mismatched bitcode.
X86 AutoUpgrade -- Why to Skip It
The intrinsic upgrader is the single largest code mass in the entire cicc binary. Two functions dominate:
| Function | Address | Size | Role |
|---|---|---|---|
UpgradeIntrinsicFunction | sub_156E800 | 593 KB | Name-based intrinsic rename lookup (271 string patterns) |
UpgradeIntrinsicCall | sub_A939D0 | 457 KB | Call instruction rewriter |
X86 intrinsic upgrade helper | sub_A8A170 | 195 KB | SSE/AVX/AVX-512 family tables |
UpgradeIntrinsicCall (2nd copy) | sub_15644B0 | 89 KB | Companion call upgrader |
NVVM upgrade dispatcher | sub_A8E250 | 52 KB | nvvm.atomic, nvvm.shfl, nvvm.cp.async, nvvm.tcgen05, nvvm.cluster, nvvm.ldg |
NVVM call rewriting | sub_A91130 | 28 KB | NVVM-specific call rewriter |
NVVM annotation metadata upgrade | sub_A84F90 | 14 KB | maxclusterrank, maxntid, etc. |
UpgradeModuleFlags | 0x156C720 | 10 KB | Module flag upgrader |
UpgradeLoopMetadata | 0x156A1F0 | 7 KB | llvm.loop.interleave.count, llvm.loop.vectorize.* |
Total intrinsic upgrader code: approximately 1.4 MB across all copies and helpers.
The x86 portion (roughly 1.0 MB) handles SSE/SSE2/SSE4.1/SSE4.2/SSSE3, AVX2, AVX-512 (mask operations, conversions, FMA variants), and ARM NEON patterns (^arm\.neon\.vld, ^arm\.neon\.vst). These branches are functionally dead for NVPTX -- no CUDA program will ever contain an @llvm.x86.sse2.padds.b intrinsic. However, the code is NOT unreachable in the CFG sense: the reader calls UpgradeIntrinsicFunction on every intrinsic name, the function does a string-prefix match, and falls through the x86/ARM branches without matching. The x86 code paths simply never activate.
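The fall-through behavior can be modeled with a simple prefix filter (illustrative only; the real dispatcher is a 593 KB string-pattern matcher, and this sketch ignores the generic llvm.* upgrades):

```cpp
#include <cassert>
#include <string>

// Illustrative prefix filter: x86/ARM families never match on an NVPTX
// target, so their upgrade tables are walked but never rewrite anything.
bool reachesLiveUpgradeTable(const std::string &name) {
    auto hasPrefix = [&](const char *p) { return name.rfind(p, 0) == 0; };
    if (hasPrefix("llvm.x86.") || hasPrefix("llvm.arm.neon."))
        return false;                    // dead branches for CUDA input
    return hasPrefix("llvm.nvvm.");      // only NVVM families can upgrade
}
```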
Reimplementation guidance: You can safely exclude the x86 and ARM AutoUpgrade tables (sub_A8A170, the x86 portions of sub_A939D0, and the ARM patterns in sub_15644B0). The NVVM-relevant upgraders must be preserved:
| Preserved | NVVM Intrinsic Families |
|---|---|
sub_A8E250 | nvvm.atomic.*, nvvm.shfl.*, nvvm.cp.async.*, nvvm.tcgen05.*, nvvm.cluster.*, nvvm.ldg.* |
sub_A91130 | NVVM-specific call rewrites |
sub_A84F90 | NVVM annotation metadata (maxclusterrank, maxntid, etc.) |
sub_156A1F0 | Loop vectorization metadata (llvm.loop.interleave.count) |
sub_156C720 | Module flags |
Stripping the x86 upgrader saves approximately 1.0 MB of binary size and significant reverse-engineering effort, with zero functional impact on GPU compilation.
Metadata Reader
MetadataLoader::parseOneMetadata (sub_A09F80, 121 KB)
The metadata reader handles 42 distinct metadata record types in a single switch statement. Each case constructs one metadata node:
- DI metadata nodes: DISubprogram, DIFile, DICompileUnit, DIVariable, DILocation, DIType, DIExpression, DISubrange, DIEnumerator, DIGlobalVariableExpression, DIModule, DINamespace, DITemplateTypeParameter, DITemplateValueParameter, DICompositeType, DIDerivedType, DIBasicType, DILexicalBlock, DILexicalBlockFile, DILabel, DIImportedEntity, DIMacro, DIMacroFile, DICommonBlock, DIGenericSubrange, DIStringType, DIArgList
- LLVM metadata nodes: MDTuple, MDString, named metadata
- NVVM annotations: nvvm.annotations (parsed as named metadata carrying per-kernel attributes)
The function is called from parseMetadataBlock (sub_1518180, 29 KB), which reads the block structure, and parseFunctionMetadata (sub_1520420, 32 KB), which processes function-level metadata attachments.
Value materialization (sub_A10370, 33 KB) handles forward references in metadata. When a metadata node references a value that hasn't been parsed yet, the materializer resolves it once the value becomes available.
Module Summary Serialization
Two pairs of functions handle ThinLTO module summary IO:
Summary Writer (sub_1535340, 26 KB)
Writes the MODULE_STRTAB_BLOCK and GLOBALVAL_SUMMARY_BLOCK into the bitcode stream. For each function/alias/global:
- Encodes the GUID hash (64-bit FNV-1a on the mangled name)
- Writes call graph edges with hotness annotations
- Writes reference edges (global value references)
- For ThinLTO: writes module path strings, type test GUIDs
Error string: "Unexpected anonymous function when writing summary".
The NVIDIA-extended summary fields (import priority, complexity budget, kernel bit, CUDA attributes) are written by the NVModuleSummary builder into the standard summary records via additional flag bits and extended record fields.
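The GUID hash named above is standard 64-bit FNV-1a with the usual constants, pinned down here for a reimplementation (assuming the identification of the hash in this analysis is correct):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Standard 64-bit FNV-1a over the mangled name.
uint64_t fnv1a64(const std::string &s) {
    uint64_t h = 0xcbf29ce484222325ull;      // FNV-1a 64-bit offset basis
    for (unsigned char c : s) {
        h ^= c;
        h *= 0x100000001b3ull;               // FNV-1a 64-bit prime
    }
    return h;
}
```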
Summary Reader (sub_150B5F0, 63 KB)
Reads the summary index from bitcode. Handles GUID hashes, function/alias summaries, module paths. Error strings: "Alias expects aliasee summary", "Invalid hash length", "Invalid Summary Block: version expected", "Malformed block".
Summary Writer (standalone copy) (sub_A2D2B0, 48 KB)
A second copy of the summary/metadata writer exists at 0xA2D2B0 in the standalone pipeline's address range.
NVVM IR Version Validation
CICC gates bitcode acceptance on two version checks:
Module-Level Version Gate (sub_157E370, 7 KB)
After parsing the module, this function reads the "nvvmir.version" named metadata node. The metadata contains a pair of integers (major, minor). The check enforces:
major == 3 AND minor <= 2
If the check fails, the function calls sub_16BD130 which emits "Broken module found, compilation aborted!" and terminates compilation. If the module passes the version check, it proceeds to sub_166CBC0 (verifyModule [MEDIUM confidence] -- identification based on call position after bitcode parsing and before optimization, consistent with LLVM's standard verify-after-parse pattern, but no diagnostic string directly confirms the function name) for structural IR verification, then sub_15ACB40 for post-verification processing.
A second instance at sub_12BFF60 (9 KB) in the standalone pipeline performs the same check with additional llvm.dbg.cu debug info presence validation.
Environment Override (NVVM_IR_VER_CHK)
The NVVM_IR_VER_CHK environment variable controls whether version validation runs at all:
| Value | Effect |
|---|---|
| Unset or non-"0" | Version check enabled (default) |
| "0" | Version check bypassed, no version mismatch errors |
The check is: if (!env || strtol(env, NULL, 10) != 0) then enforce the version. Any value that strtol parses as nonzero keeps the check enabled; the literal string "0" (or any value strtol parses as zero, such as non-numeric text) disables it.
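Putting the two gates together, the decision logic reduces to a few lines of C (a sketch; the function names are illustrative, only the conditions come from the decompilation):

```c
#include <stdbool.h>
#include <stdlib.h>

/* Returns true when the nvvmir.version gate should run, per the
 * decompiled condition: unset, or any value strtol parses as nonzero,
 * keeps the check enabled. */
bool version_check_enabled(const char *env /* getenv("NVVM_IR_VER_CHK") */) {
    return env == NULL || strtol(env, NULL, 10) != 0;
}

/* The gate itself: accept only NVVM IR major 3 with minor <= 2. */
bool version_ok(long major, long minor) {
    return major == 3 && minor <= 2;
}
```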
Two verifier instances exist:
- sub_12BFF60 at 0x12BFF60 (standalone pipeline)
- sub_2259720 at 0x2259720 (second instance, possibly duplicate link unit)
Configuration
Environment Variables
| Variable | Effect | Default |
|---|---|---|
| LLVM_OVERRIDE_PRODUCER | Overrides bitcode producer identification string | "7.0.1" (ctor_154) / "20.0.0" (ctor_036) |
| NVVM_IR_VER_CHK | Set to "0" to bypass NVVM IR version validation | Enabled |
cl::opt Flags
| Flag | Type | Default | Effect |
|---|---|---|---|
| disable-bitcode-version-upgrade | bool | false | Disable automatic bitcode upgrade for version mismatch |
| bitcode-mdindex-threshold | int | 25 | Number of metadata entries above which an index is emitted |
| disable-ondemand-mds-loading | bool | false | Disable lazy metadata loading |
| write-relbf-to-summary | bool | false | Write relative block frequency to ThinLTO function summary |
| print-summary-global-ids | bool | false | Print global IDs when reading module summary |
| import-full-type-definitions | bool | false | Import full type definitions in ThinLTO |
Differences from Upstream LLVM
| Aspect | Upstream LLVM 20.0.0 | CICC v13.0 |
|---|---|---|
| Producer string | "LLVM20.0.0" | "LLVM7.0.1" (hardcoded via ctor_154) |
| Producer override | LLVM_OVERRIDE_PRODUCER env var | Same mechanism, different default |
| Version upgrade disable | disable-bitcode-version-upgrade exists | Same, registered in ctor_036 |
| NVVM IR version gate | Does not exist | nvvmir.version metadata check (major==3, minor<=2) |
| NVVM IR version bypass | Does not exist | NVVM_IR_VER_CHK=0 environment variable |
| X86 AutoUpgrade | Active for x86 targets | Present but dead code (NVPTX only) |
| NVVM intrinsic upgrade | Does not exist | nvvm.atomic, nvvm.shfl, nvvm.cp.async, etc. upgraders added |
| NVVM annotation upgrade | Does not exist | maxclusterrank, maxntid metadata upgrader added |
| Module summary | Standard ModuleSummaryAnalysis | Extended with NVModuleSummary (import priority, kernel bit, complexity budget) |
| Binary copies | Single instance | Two copies (0x9F range, 0x150 range) at different link addresses |
Function Map
Reader (primary, 0x1500000--0x1522000)
| Address | Size | Function |
|---|---|---|
| 0x1503DC0 | 13 KB | materializeFunctions |
| 0x1504A60 | 8 KB | parseGlobalInits |
| 0x1505110 | 60 KB | parseModule |
| 0x15083D0 | 26 KB | parseTypeBlock / Constants |
| 0x150A160 | 7 KB | ThinLTO GUID lookup |
| 0x150B5F0 | 63 KB | parseModuleSummaryIndex |
| 0x150E2B0 | 19 KB | readRecord |
| 0x150F8E0 | 42 KB | readBlockInfoBlock |
| 0x1510D70 | 38 KB | readAbbreviatedField |
| 0x1513230 | 20 KB | readAbbrevRecord |
| 0x15127D0 | 13 KB | skipBlock |
| 0x15140E0 | 13 KB | string table reader |
| 0x1514C40 | 9 KB | readBlobRecord |
| 0x1515740 | 9 KB | parseValueRecord |
| 0x15177F0 | 7 KB | bitcode record helper |
| 0x1518180 | 29 KB | parseMetadataBlock |
| 0x1519820 | 7 KB | bitcode record helper |
| 0x1519BD0 | 7 KB | bitcode record helper |
| 0x151B070 | 123 KB | parseFunctionBody |
| 0x1520420 | 32 KB | parseFunctionMetadata |
| 0x1522160 | 13 KB | parseMetadataStrings |
Reader (standalone copy, 0x9F0000--0xA20000)
| Address | Size | Function |
|---|---|---|
| 0x9F2A40 | 185 KB | parseFunctionBody |
| 0xA09F80 | 121 KB | MetadataLoader::parseOneMetadata |
| 0xA10370 | 33 KB | value materialization |
| 0x9FF220 | 31 KB | writer helper |
| 0xA2D2B0 | 48 KB | module summary / metadata writer |
Writer (0x1525000--0x1549000)
| Address | Size | Function |
|---|---|---|
| 0x15263C0 | 7 KB | emitCode |
| 0x15271D0 | 7 KB | emitVBR |
| 0x1527BB0 | 9 KB | writeConstants helper |
| 0x1528720 | 27 KB | writeInstruction |
| 0x152A250 | 9 KB | emitRecord |
| 0x152AB40 | 11 KB | emitAbbreviation |
| 0x152F610 | 8 KB | writeAttributeGroup |
| 0x1530240 | 12 KB | writeType / GlobalVar |
| 0x15311A0 | 14 KB | writeNamedMetadata / comdat |
| 0x1531F90 | 27 KB | writeMetadata |
| 0x15334D0 | 8 KB | writeMetadataRecords (37 callees) |
| 0x1533CF0 | 16 KB | writeValueSymbolTable |
| 0x1535340 | 26 KB | writeModuleSummary (ThinLTO) |
| 0x1536CD0 | 40 KB | writeFunction |
| 0x1538EC0 | 58 KB | writeModule |
Intrinsic Upgrader (0xA80000--0xABFFFF + 0x1560000--0x1580000)
| Address | Size | Function |
|---|---|---|
| 0x156E800 | 593 KB | UpgradeIntrinsicFunction |
| 0xA939D0 | 457 KB | UpgradeIntrinsicCall |
| 0xA8A170 | 195 KB | X86 intrinsic upgrade helper |
| 0x15644B0 | 89 KB | UpgradeIntrinsicCall (2nd copy) |
| 0xA8E250 | 52 KB | NVVM upgrade dispatcher |
| 0xA91130 | 28 KB | NVVM call rewriting |
| 0xA84F90 | 14 KB | NVVM annotation metadata upgrade |
| 0xA7CD60 | 10 KB | UpgradeIntrinsicFunction (short, matches "nvvm.", "ftz.") |
| 0x156C720 | 10 KB | UpgradeModuleFlags |
| 0x156A1F0 | 7 KB | UpgradeLoopMetadata |
NVVM Version / Producer
| Address | Size | Function |
|---|---|---|
| 0x157E370 | 7 KB | NVVM version checker (primary) |
| 0x12BFF60 | 9 KB | NVVM version checker (standalone) |
| 0x2259720 | -- | NVVM version checker (duplicate instance) |
| 0x48CC90 | 544 B | ctor_036 -- producer init + disable-bitcode-version-upgrade |
| 0x4CE640 | 215 B | ctor_154 -- producer init ("7.0.1" default) |
Value Enumeration (0x1540000--0x1549000)
| Address | Size | Function |
|---|---|---|
| 0x1542B00 | 26 KB | enumerateValues |
| 0x15467B0 | 23 KB | enumerateModule |
| 0x1548410 | 8 KB | optimizeConstants |
| 0x15445A0 | 11 KB | metadata enumeration helper |
| 0x15450E0 | 9 KB | ValueEnumerator helper |
| 0x1547D80 | 9 KB | ValueEnumerator helper |
| 0x1543FA0 | 7 KB | ValueEnumerator helper |
| 0x1542750 | 7 KB | ValueEnumerator helper |
| 0x153E1D0 | 7 KB | TypeFinder helper |
Cross-References
- NVVM Container -- wraps bitcode in the proprietary transport format
- LTO & Module Optimization -- consumes bitcode from separate compilation objects
- NVModuleSummary Builder -- extends module summary with CUDA-specific fields; serialized by sub_1535340
- Two-Phase Compilation -- serializes/deserializes per-function bitcode between phases
- Pipeline Entry -- magic byte validation on bitcode input
- Environment Variables -- LLVM_OVERRIDE_PRODUCER, NVVM_IR_VER_CHK
- Binary Layout -- address range context for reader/writer clusters
Concurrent Compilation
CICC implements a two-phase concurrent compilation model that is entirely absent from upstream LLVM. The optimizer runs twice over the same module: Phase I performs whole-module analysis and early IR optimizations on a single thread, then Phase II runs per-function backend optimization in parallel across a thread pool. The design exploits the fact that most backend passes (instruction selection prep, register pressure reduction, peephole) are function-local and do not require cross-function information once Phase I has completed interprocedural analysis.
The two-phase protocol lives in sub_12E7E70 (9,405 bytes), which calls the same master pipeline function sub_12E54A0 twice, discriminated only by a TLS phase counter. The concurrency infrastructure spans the 0x12D4000--0x12EA000 address range and includes a GNU Make jobserver integration for build-system-aware parallelism throttling -- a feature that allows make -j8 to correctly limit total system load even when each cicc invocation itself wants to spawn threads.
| Phase I/II orchestrator | sub_12E7E70 (9,405 bytes) |
| Phase counter (TLS) | qword_4FBB3B0 -- values 1, 2, 3 |
| Concurrency eligibility | sub_12D4250 (626 bytes) |
| Function sorting | sub_12E0CA0 (23,422 bytes) |
| Concurrent entry | sub_12E1EF0 (51,325 bytes) |
| Worker entry | sub_12E7B90 (2,997 bytes) |
| Per-function callback | sub_12E8D50 |
| Per-function optimizer | sub_12E86C0 (7,687 bytes) |
| GNU jobserver init | sub_16832F0 |
| MAKEFLAGS parser | sub_1682BF0 |
| Thread pool create | sub_16D4AB0 |
| Thread pool enqueue | sub_16D5230 |
| Thread pool join | sub_16D4EC0 |
| Disable env var | LIBNVVM_DISABLE_CONCURRENT_API -- byte_4F92D70 |
| Pipeline function | sub_12E54A0 (49,800 bytes) -- called by both phases |
Two-Phase Architecture
Both phases call the same optimization pipeline function sub_12E54A0(context, input, output, opts, errCb). The only difference is the value stored in the TLS variable qword_4FBB3B0 before each call. Individual optimization passes read this TLS variable to decide whether to run: Phase I passes fire when the counter equals 1; Phase II passes fire when it equals 2. This avoids running codegen-oriented passes during analysis and vice versa.
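A Phase II-only pass gated on this counter would look roughly as follows (illustrative sketch; set_phase/current_phase stand in for the TLS accessors sub_16D40E0/sub_16D40F0, and _Thread_local stands in for the binary's pointer-based TLS slot):

```c
enum { PHASE_ANALYSIS = 1, PHASE_BACKEND = 2, PHASE_DONE = 3 };

/* Stand-in for the TLS slot qword_4FBB3B0: each thread sees its own
 * copy of the phase counter. */
static _Thread_local int g_phase = 0;

void set_phase(int p)    { g_phase = p; }   /* cf. sub_16D40E0 */
int  current_phase(void) { return g_phase; } /* cf. sub_16D40F0 */

/* A backend-only pass early-returns unless the counter reads 2,
 * matching the gating described above. Returns 1 if it ran. */
int run_backend_pass(void *function_ir) {
    (void)function_ir;
    if (current_phase() != PHASE_BACKEND)
        return 0;   /* wrong phase: skip without transforming */
    /* ... backend-only transformation would go here ... */
    return 1;
}
```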
Phase Counter Protocol
The phase counter qword_4FBB3B0 is a TLS variable accessed via sub_16D40E0 (set) and sub_16D40F0 (get). It stores a pointer to a heap-allocated 4-byte integer. Three values are defined:
| Value | Meaning | Set point |
|---|---|---|
| 1 | Phase I active -- analysis + early IR optimization | Before first sub_12E54A0 call |
| 2 | Phase II active -- backend optimization + codegen prep | Before second sub_12E54A0 call |
| 3 | Compilation complete for this module | After second sub_12E54A0 returns |
Sequential Path (sub_12E7E70)
When verbose logging is disabled and the module contains only one defined function, the orchestrator takes a fast path:
// Single-function fast path: no phase counter set at all
if (!verbose && num_defined_functions <= 1) {
sub_12E54A0(ctx, input, output, opts, errCb); // single un-phased call
return;
}
This means the optimizer runs both phases in a single invocation -- passes see no phase counter and run unconditionally. For multi-function modules or when verbose logging is active, the full two-phase protocol engages:
// Phase I
int *phase = malloc(4);
*phase = 1;
tls_set(qword_4FBB3B0, phase);
sub_12E54A0(ctx, input, output, opts, errCb);
if (error_reported(errCb))
return; // abort on Phase I error
// Concurrency decision
bool concurrent = sub_12D4250(ctx, opts);
// Diagnostic: "Concurrent=Yes" or "Concurrent=No"
// Phase II
*phase = 2;
tls_set(qword_4FBB3B0, phase);
sub_12E54A0(ctx, input, output, opts, errCb);
// Done
*phase = 3;
tls_set(qword_4FBB3B0, phase);
The diagnostic string construction between phases is notable: v46 = 3LL - (v41 == 0) computes the label length -- 3 for "Yes" when v41 is nonzero, 2 for "No" when it is zero -- then the orchestrator logs "Phase II" with the "Concurrent=Yes/No" annotation appended.
Concurrent Path (sub_12E7B90)
When the thread count exceeds 1, the orchestrator dispatches to sub_12E7B90 instead of running Phase II sequentially:
sub_12E7B90(ctx, module_ptr, thread_count, opts, ...)
|
|-- Phase I: *phase=1, sub_12E54A0(...) // whole-module, single thread
|-- sub_12D4250(ctx, opts) // eligibility check
|
+-- if eligible (>1 defined function):
| sub_12E1EF0(...) // concurrent Phase II
| *phase = 3
|
+-- else (single defined function):
*phase = 2, sub_12E54A0(...) // sequential Phase II
*phase = 3
Phase I always runs single-threaded on the whole module because interprocedural analyses (alias analysis, call graph construction, inlining decisions) require a consistent global view. Only after Phase I completes does the system split the module into per-function chunks for parallel Phase II processing.
Eligibility Check
sub_12D4250 (626 bytes) determines whether the module qualifies for concurrent compilation. The check is straightforward:
int sub_12D4250(Module *mod, Options *opts) {
int defined_count = 0;
for (Function &F : mod->functions()) {
if (!sub_15E4F60(&F)) // !isDeclaration()
defined_count++;
}
if (defined_count <= 1)
return 0; // not eligible: only 0 or 1 defined function
byte force = *(byte*)(opts + 4064); // NVVMPassOptions slot 201 (0xC9)
if (force != 0)
return force; // user-forced concurrency setting
return sub_12D3FC0(mod, opts); // auto-determine thread count
}
The key gate is defined_count > 1. A module with a single kernel and no device functions will always compile sequentially regardless of thread count settings. The opts + 4064 byte (NVVMPassOptions slot 201, type BOOL_COMPACT, default 0) allows the user to force concurrent mode on or off. When zero (default), sub_12D3FC0 auto-determines the thread count based on module characteristics.
Function Priority Sorting
Before distributing functions to worker threads, sub_12E0CA0 (23,422 bytes) sorts them by compilation priority. This step is critical for load balancing: larger or more complex functions should start compiling first so they don't become tail stragglers.
Sorting Algorithm
The sort uses a hybrid strategy consistent with libstdc++ std::sort:
| Input size | Algorithm | Function |
|---|---|---|
| Small N | Insertion sort | sub_12D48A0 |
| Large N | Introsort (quicksort + heapsort fallback) | sub_12D57D0 |
The threshold between insertion sort and introsort is 256 bytes of element data (consistent with the libstdc++ template instantiation pattern observed elsewhere in the binary).
Priority Source
Priority values come from function attributes extracted by sub_12D3D20 (585 bytes). The sorted output is a vector of (name_ptr, name_len, priority) tuples with 32-byte stride, used directly by the per-function dispatch loop to determine compilation order. Functions with higher priority (likely larger or more critical kernels) are submitted to the thread pool first.
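The 32-byte tuple and the descending-priority dispatch order can be modeled directly (a sketch using qsort in place of the binary's libstdc++ introsort; field names are inferred from the observed stride, not recovered symbols):

```c
#include <stdlib.h>

/* Hypothetical layout of one work-list entry, padded to the observed
 * 32-byte stride on LP64: (name_ptr, name_len, priority). */
typedef struct {
    const char *name_ptr;
    size_t      name_len;
    long        priority;
    long        _pad;        /* pad to 32 bytes */
} WorkItem;

/* Higher priority first, so expensive functions are dispatched to the
 * thread pool before cheap ones (tail-latency reduction). */
static int by_priority_desc(const void *a, const void *b) {
    long pa = ((const WorkItem *)a)->priority;
    long pb = ((const WorkItem *)b)->priority;
    return (pa < pb) - (pa > pb);   /* negative when a sorts first */
}

void sort_work_items(WorkItem *items, size_t n) {
    qsort(items, n, sizeof(WorkItem), by_priority_desc);
}
```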
Enumeration Phase
Before sorting, sub_12E0CA0 enumerates all functions and globals via an iterator callback table:
| Callback | Address | Purpose |
|---|---|---|
| Next function | sub_12D3C60 | Advance to next function in module |
| Iterator advance | sub_12D3C80 | Step iterator forward |
| End check | sub_12D3CA0 | Test if iterator reached end |
For each function, the enumeration:
- Checks the node type discriminator at *(byte*)(node + 16) -- type 0 = Function, type 1 = GlobalVariable
- For functions: calls sub_15E4F60 (isDeclaration check), sub_12D3D20 (priority), and sub_1649960 (name), then inserts into the v359 hash table (name to function) and the v362 hash table (name to linkage type)
- For global variables: walks the parent/linked GlobalValue chain via sub_164A820 and inserts callee references into the v365 hash table for split-module tracking
GNU Jobserver Integration
When cicc is invoked by GNU Make with -j, it can participate in the make jobserver protocol to avoid oversubscribing the system. The jobserver flag is passed from nvcc via the -jobserver CLI flag, which sets opts + 3288 (NVVMPassOptions slot 163, type BOOL_COMPACT, default 0).
Initialization (sub_16832F0)
The jobserver init function allocates a 296-byte state structure and calls sub_1682BF0 to parse the MAKEFLAGS environment variable:
int sub_16832F0(JobserverState *state, int reserved) {
memset(state, 0, 296);
state->flags[8] = 1; // initialized marker
int err = sub_1682BF0(state); // parse MAKEFLAGS
if (err) return err;
pipe(state->local_pipe); // local token pipe
// state+196 = read FD, state+200 = write FD
pthread_create(&state->thread, // state+208
NULL, token_manager, state);
reserve_vector(state, token_count);
return 0; // success
}
MAKEFLAGS Parsing (sub_1682BF0)
The parser searches the MAKEFLAGS environment variable for --jobserver-auth= and supports two formats:
| Format | Example | Mechanism |
|---|---|---|
| Pipe FDs | --jobserver-auth=3,4 | Read FD = 3, Write FD = 4 (classic POSIX pipe) |
| FIFO path | --jobserver-auth=fifo:/tmp/gmake-jobserver-12345 | Named FIFO (GNU Make 4.4+) |
The pipe format uses comma-separated read/write file descriptors inherited from the parent make process. The FIFO format uses a named pipe in the filesystem. In both cases, the jobserver protocol works the same way: a thread reads tokens from the pipe/FIFO before starting each per-function compilation, and writes tokens back when the function completes. This ensures cicc never runs more concurrent compilations than make's -j level permits.
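A minimal parser for the two --jobserver-auth= formats (a sketch of the formats described above; the real sub_1682BF0 also fills the 296-byte state structure and reports distinct error codes, which are omitted here):

```c
#include <stdio.h>
#include <string.h>

typedef struct {
    int  read_fd, write_fd;   /* pipe format */
    char fifo_path[256];      /* FIFO format (empty if unused) */
} JobserverAuth;

/* Parse "--jobserver-auth=R,W" or "--jobserver-auth=fifo:PATH" out of
 * a MAKEFLAGS string. Returns 1 on success, 0 if absent/unrecognized. */
int parse_jobserver_auth(const char *makeflags, JobserverAuth *out) {
    const char *p = strstr(makeflags, "--jobserver-auth=");
    if (!p) return 0;
    p += strlen("--jobserver-auth=");
    out->read_fd = out->write_fd = -1;
    out->fifo_path[0] = '\0';
    if (strncmp(p, "fifo:", 5) == 0) {
        /* GNU Make 4.4+: named FIFO in the filesystem */
        sscanf(p + 5, "%255[^ ]", out->fifo_path);
        return out->fifo_path[0] != '\0';
    }
    /* Classic format: comma-separated inherited pipe FDs */
    return sscanf(p, "%d,%d", &out->read_fd, &out->write_fd) == 2;
}
```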
Error Handling
if (jobserver_init_error) {
if (error_code == 5 || error_code == 6) {
// Warning: jobserver pipe not accessible (probably not in make context)
emit_warning(severity=1);
// Fall through: continue without jobserver
} else {
// Fatal: "GNU Jobserver support requested, but an error occurred"
sub_16BD130("GNU Jobserver support requested, but an error occurred", 1);
}
}
Error codes 5 and 6 are non-fatal (the jobserver pipe may not be available if cicc is invoked outside a make context). All other errors are fatal.
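The token discipline itself is simple: read one byte from the pipe/FIFO before starting a work item and write it back on completion (a sketch using POSIX read/write; the implicit token that make grants the process itself is not modeled):

```c
#include <unistd.h>

/* Acquire one jobserver token: blocks until a byte is available on the
 * token pipe, returning the token character (or -1 on error/EOF). */
int jobserver_acquire(int read_fd) {
    char tok;
    ssize_t n = read(read_fd, &tok, 1);
    return n == 1 ? (unsigned char)tok : -1;
}

/* Release a token by writing it back so other make children can run. */
int jobserver_release(int write_fd, int tok) {
    char c = (char)tok;
    return write(write_fd, &c, 1) == 1 ? 0 : -1;
}
```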
Thread Pool Management
Creation (sub_16D4AB0)
The thread pool is LLVM's standard ThreadPool (the binary contains "llvm-worker-{0}" thread naming at sub_23CE0C0). Creation occurs at line 799 of sub_12E1EF0:
int actual_threads = min(requested_threads, num_functions);
sub_16D4AB0(thread_pool, actual_threads);
The thread count is clamped to the number of functions -- there is no point spawning more threads than there are work items.
Thread Count Resolution
Thread count is resolved through a fallback chain in sub_12E7E70:
int thread_count = opts[1026]; // NVVMPassOptions slot 203 (offset 4104), default -1
if (thread_count < 0)
thread_count = opts[1036]; // NVVMPassOptions slot 205 (offset 4144), default -1
if (thread_count == 0)
thread_count = sub_22420F0(); // get_nprocs() -- number of online CPUs
| Source | Slot | Offset | Default | Meaning |
|---|---|---|---|---|
| Primary | 203 (0xCB) | 4104 | -1 (auto) | Explicit thread count |
| Fallback | 205 (0xCD) | 4144 | -1 (auto) | Secondary thread count |
| System | -- | -- | -- | get_nprocs() return value |
The -1 sentinel means "auto-detect." When both slots are -1, the system falls back to the number of online CPUs. In practice, this means a 16-core machine will spawn 16 threads for concurrent Phase II compilation unless the user or calling tool overrides the count.
Per-Function Context Packaging
Before submitting each function to the thread pool, sub_12E1EF0 builds a self-contained compilation context (4,632 bytes) containing everything the worker needs:
- Bitcode extraction: sub_1AB9F40(&buffer, *module_bc, &iterators, sub_12D4BD0, &filter_state) -- extracts the bitcode for a single function, using sub_12D4BD0 as a filter callback that checks the function name against the work hash table
- Bitcode parsing: sub_153BF40(module_data, &context, 0, 0, 0, 0) -- parses the extracted bitcode into an LLVM Module
- Context copy: copies all iterator state (v391--v395), context/options/error callbacks (v405--v409), the module name string, the full 4,480-byte NVVMPassOptions struct, and the module bitcode vectors
Each function gets its own independent copy of the options struct and module -- there is no shared mutable state between worker threads during Phase II.
Worker Entry and Completion
Work Submission
Each function is submitted to the thread pool with two callbacks:
v373 = sub_12E8D50; // completion callback (runs the optimizer)
v372 = sub_12D4D90; // destructor callback (cleanup)
sub_16D5230(work_item, thread_pool, context); // enqueue
Per-Function Callback (sub_12E8D50)
The completion callback extracts the module from the packaged context and calls the Phase II per-function optimizer:
void sub_12E8D50(Context *ctx) {
Module *mod = extract_module(ctx);
sub_12E86C0(ctx, function_index, opts, module_name);
}
Per-Function Phase II Optimizer (sub_12E86C0, 7,687 bytes)
This function sets the TLS phase counter to 2 and runs the pass pipeline on the individual function's module:
void sub_12E86C0(Context *ctx, int func_idx, Options *opts, StringRef name) {
int *phase = malloc(4);
*phase = 2;
tls_set(qword_4FBB3B0, phase);
// Run Phase II pass pipeline on this function's module
sub_12E54A0(ctx, ...);
}
Because qword_4FBB3B0 is TLS, each worker thread has its own phase counter. All worker threads see phase=2 concurrently without interference.
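The isolation property can be demonstrated with a small pthread program (illustrative; _Thread_local stands in for the binary's TLS slot, and the worker body stands in for sub_12E86C0):

```c
#include <pthread.h>

/* Each worker has its own copy of the phase counter, so setting it to
 * 2 in one thread never disturbs another -- the property that lets all
 * Phase II workers run concurrently. */
static _Thread_local int t_phase = 0;

static void *worker(void *arg) {
    t_phase = 2;                 /* Phase II in this thread only */
    *(int *)arg = t_phase;       /* report what this thread observed */
    return NULL;
}

/* Returns 1 when the main thread's counter is unaffected by workers. */
int tls_isolation_demo(void) {
    t_phase = 1;                 /* main thread: Phase I */
    int seen[2] = {0, 0};
    pthread_t th[2];
    for (int i = 0; i < 2; ++i)
        pthread_create(&th[i], NULL, worker, &seen[i]);
    for (int i = 0; i < 2; ++i)
        pthread_join(th[i], NULL);
    return t_phase == 1 && seen[0] == 2 && seen[1] == 2;
}
```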
Post-Compilation Merge
After all worker threads complete (sub_16D4EC0 joins the thread pool):
- Jobserver cleanup: sub_1682740 checks for jobserver errors and releases tokens
- Error check: if any per-function callback reported an error, the compilation fails
- Normal mode (opt_level >= 0): appends a null byte to the output buffer (bitcode stream terminator)
- Split-compile mode (opt_level < 0): re-reads each function's bitcode via sub_153BF40, links all per-function modules via sub_12F5610 (the LLVM module linker), and restores linkage attributes from the v362 hash table. Specifically:
  - Linkage values 7--8: set only the low 6 bits (external linkage types)
  - Other values: set the low 4 bits, then check (value & 0x30) != 0 for visibility bits
  - Sets byte+33 |= 0x40 (dso_local flag)
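The bit manipulation in the linkage-restoration step can be written out explicitly (a speculative sketch of the decompiled masks; the field semantics are inferred, not confirmed):

```c
#include <stdint.h>
#include <stdbool.h>

/* Restore a saved linkage value into a GlobalValue's packed byte,
 * following the masks described above: linkage codes 7-8 keep the full
 * low 6 bits; other codes keep only the low 4, with bits 0x30 treated
 * as visibility flags. */
uint8_t restore_linkage(uint8_t field, uint8_t saved, bool *has_visibility) {
    if (saved == 7 || saved == 8) {
        field = (field & ~0x3F) | (saved & 0x3F);   /* low 6 bits */
        *has_visibility = false;
    } else {
        field = (field & ~0x0F) | (saved & 0x0F);   /* low 4 bits */
        *has_visibility = (saved & 0x30) != 0;      /* visibility bits */
    }
    return field;
}

/* dso_local flag lives at bit 0x40 of the byte at offset +33. */
uint8_t set_dso_local(uint8_t byte33) { return byte33 | 0x40; }
```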
Configuration
Environment Variables
| Variable | Check | Effect |
|---|---|---|
| LIBNVVM_DISABLE_CONCURRENT_API | getenv() != NULL | Sets byte_4F92D70 = 1. Disables concurrent/thread-safe LibNVVM API usage entirely. Any non-NULL value triggers it. Checked in global constructor ctor_104 at 0x4A5810. |
| MAKEFLAGS | Parsed by sub_1682BF0 | Searched for --jobserver-auth= to enable GNU Make jobserver integration |
NVVMPassOptions Slots
| Slot | Offset | Type | Default | Purpose |
|---|---|---|---|---|
| 163 (0xA3) | 3288 | BOOL_COMPACT | 0 | Jobserver integration requested (set by -jobserver flag) |
| 201 (0xC9) | 4064 | BOOL_COMPACT | 0 | Force concurrency on/off (0 = auto) |
| 203 (0xCB) | 4104 | INTEGER | -1 | Primary thread count (-1 = auto) |
| 205 (0xCD) | 4144 | INTEGER | -1 | Fallback thread count (-1 = auto) |
CLI Flags
| Flag | Route | Effect |
|---|---|---|
| -jobserver | opt "-jobserver" | Enables GNU jobserver integration (sets slot 163) |
| -split-compile=<N> | opt "-split-compile=<N>" | Enables split-module compilation (opt_level set to -1) |
| -split-compile-extended=<N> | opt "-split-compile-extended=<N>" | Extended split-compile (also sets +1644 = 1) |
| --sw2837879 | Internal | Concurrent ptxStaticLib workaround flag |
Phase State Machine
START
|
v
[phase=1] --> sub_12E54A0 (Phase I: whole-module analysis)
|
v
error? --yes--> RETURN (abort)
|no
v
count_defined_functions()
|
+--(1 func)--> [phase=2] --> sub_12E54A0 (Phase II sequential)
| |
| v
| [phase=3] --> DONE
|
+--(N funcs, threads>1)--> sub_12E1EF0 (concurrent)
| |
| +-- sort functions by priority
| +-- create thread pool
| +-- init jobserver (if requested)
| +-- for each function:
| | extract per-function bitcode
| | parse into independent Module
| | [phase=2] per-function (TLS)
| | submit to thread pool
| +-- join all threads
| +-- link split modules (if split-compile)
| +-- [phase=3] --> DONE
|
+--(N funcs, threads<=1)--> [phase=2] --> sub_12E54A0 (sequential)
|
v
[phase=3] --> DONE
Differences from Upstream LLVM
Upstream LLVM has no two-phase compilation model. The standard LLVM pipeline runs all passes in a single invocation with no phase discrimination. CICC's approach is entirely custom:
- Phase counter TLS variable: Upstream LLVM passes have no concept of reading a global phase counter to decide whether to run. Every pass in CICC must check qword_4FBB3B0 and early-return if it belongs to the wrong phase.
- Per-function module splitting: Upstream LLVM's splitModule() (in llvm/Transforms/Utils/SplitModule.h) exists for ThinLTO and GPU offloading, but CICC's splitting at sub_1AB9F40 with the sub_12D4BD0 filter callback is a custom implementation integrated with the NVVMPassOptions system.
- GNU jobserver integration: No upstream LLVM tool participates in the GNU Make jobserver protocol. This is entirely NVIDIA-specific, implemented to play nicely with make -j in CUDA build systems.
- Function priority sorting: Upstream LLVM processes functions in module iteration order. CICC's priority-based sorting via sub_12E0CA0 ensures that expensive functions start compiling first, reducing tail latency in the thread pool.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| Function iterator: next | sub_12D3C60 | ~200 | -- |
| Function iterator: advance | sub_12D3C80 | ~230 | -- |
| Function iterator: end check | sub_12D3CA0 | ~260 | -- |
| Function attribute/priority query | sub_12D3D20 | 585 | -- |
| Auto thread count determination | sub_12D3FC0 | 3,600 | -- |
| Concurrency eligibility check | sub_12D4250 | 626 | -- |
| Insertion sort (small N) | sub_12D48A0 | -- | -- |
| Per-function bitcode filter callback | sub_12D4BD0 | 2,384 | -- |
| Work item destructor callback | sub_12D4D90 | 2,742 | -- |
| Introsort (large N) | sub_12D57D0 | -- | -- |
| Function sorting and enumeration | sub_12E0CA0 | 23,422 | -- |
| Concurrent compilation top-level entry | sub_12E1EF0 | 51,325 | -- |
| Master pipeline assembly (both phases) | sub_12E54A0 | 49,800 | -- |
| Concurrent worker entry | sub_12E7B90 | 2,997 | -- |
| Phase I/II orchestrator | sub_12E7E70 | 9,405 | -- |
| Per-function Phase II optimizer | sub_12E86C0 | 7,687 | -- |
| Per-function completion callback | sub_12E8D50 | -- | -- |
| LLVM module linker (post-merge) | sub_12F5610 | 7,339 | -- |
| Bitcode reader/verifier | sub_153BF40 | -- | -- |
| isDeclaration() check | sub_15E4F60 | -- | -- |
| Get function name | sub_1649960 | -- | -- |
| Walk to parent GlobalValue | sub_164A820 | -- | -- |
| Jobserver error check/cleanup | sub_1682740 | -- | -- |
| MAKEFLAGS --jobserver-auth= parser | sub_1682BF0 | -- | -- |
| GNU jobserver init (296-byte state) | sub_16832F0 | -- | -- |
| TLS set (qword_4FBB3B0) | sub_16D40E0 | -- | -- |
| TLS get (qword_4FBB3B0) | sub_16D40F0 | -- | -- |
| Thread pool create | sub_16D4AB0 | -- | -- |
| Thread pool join | sub_16D4EC0 | -- | -- |
| Thread pool enqueue work item | sub_16D5230 | -- | -- |
| Per-function bitcode extraction | sub_1AB9F40 | -- | -- |
| get_nprocs() wrapper | sub_22420F0 | -- | -- |
Cross-References
- Entry Point & CLI -- pipeline dispatch that leads to the optimizer, including -jobserver flag routing
- Optimizer Pipeline -- sub_12E54A0, the pipeline function called by both phases
- NVVMPassOptions -- the 222-slot options table including thread count and jobserver slots
- Environment Variables -- LIBNVVM_DISABLE_CONCURRENT_API and MAKEFLAGS
- CLI Flags -- -jobserver, -split-compile, -split-compile-extended
- Bitcode I/O -- sub_153BF40, the bitcode reader used for per-function module extraction
Diagnostics & Optimization Remarks
CICC v13.0 contains three independent diagnostic systems that operate at different phases of compilation and serve different audiences. The EDG frontend diagnostic engine handles C++/CUDA language-level errors and warnings with rich terminal formatting or SARIF JSON output. The LLVM optimization remark infrastructure reports pass-level decisions (what was optimized, what was missed, and why) through the standard DiagnosticInfo hierarchy. NVIDIA's custom "profuse" framework provides verbose per-pass diagnostic output that is entirely separate from both EDG diagnostics and LLVM remarks, controlled by dedicated knobs like profuseinline and profusegvn.
Understanding these three layers is essential for reimplementation because they share no code. EDG diagnostics live in the 0x670000-0x6FFFFF address range and operate on EDG's internal diagnostic record format. LLVM remarks use the stock OptimizationRemarkEmitter analysis pass and the DiagnosticInfoOptimizationBase class hierarchy. The profuse framework is a pure NVIDIA invention that writes directly to stderr through cl::opt<bool> guards with no connection to either of the other two systems.
| EDG terminal emitter | sub_681D20 (37KB, 1,342 lines) at 0x681D20 |
| EDG dispatch/SARIF emitter | sub_6837D0 (20KB) at 0x6837D0 |
| Diagnostic format selector | unk_4D04198: 0 = text, 1 = SARIF |
| Format CLI flag | --diagnostics_format=text|sarif (case 0x125 in sub_617BD0) |
| EDG output mode CLI | --output_mode text|sarif (case 293 in lgenfe_main) |
| LLVM remark registration | ctor_152 at 0x4CE3F0 (3 regex cl::opts) |
| LLVM remark YAML serializer | sub_15CAD70 (13KB) at 0x15CAD70 |
| LLVM remark bitstream serializer | sub_F01350 (23KB) at 0xF01350 |
| Profuse inlining knob | profuseinline at 0x4DBEC0 (ctor_186_0), default off |
| Profuse GVN knob | profusegvn at 0x4FAE7E0 (ctor_201), default true |
| Diagnostic output stream | qword_4F07510 (FILE*, typically stderr) |
| Terminal width | dword_4D039D0 (columns, for word-wrapping) |
| ANSI color enable | dword_4F073CC[0] (nonzero = enabled) |
| Upstream LLVM equivalent | llvm/include/llvm/IR/DiagnosticInfo.h, llvm/lib/Analysis/OptimizationRemarkEmitter.cpp |
EDG Frontend Diagnostics
Dispatch Architecture
Every EDG frontend diagnostic passes through sub_6837D0, which acts as the single dispatch point. This function performs filtering (severity threshold, duplicate suppression, pragma-based suppression), increments error/warning counters, and then routes to one of two renderers based on the global unk_4D04198:
sub_6837D0(diag_record)
|
+-- severity < byte_4F07481[0]? --> suppress (return)
+-- duplicate? (byte_4CFFE80[4*errnum+2] bit flags) --> count only
+-- pragma disabled? (sub_67D520) --> suppress
+-- error limit reached? (unk_4F074B0 + unk_4F074B8 >= unk_4F07478) --> error 1508, abort
|
+-- unk_4D04198 == 0 --> sub_681D20(diag) [terminal text renderer]
+-- unk_4D04198 == 1 --> inline SARIF JSON [JSON renderer within sub_6837D0]
The format is selected by the --diagnostics_format flag (case 0x125 in sub_617BD0), which is surfaced as --output_mode text|sarif in the lgenfe CLI.
Diagnostic Record Layout
EDG diagnostic records are approximately 192-byte structures organized as a tree. Each record can have child diagnostics, notes, context diagnostics (include-stack annotations), and an extra child list, all stored as linked lists.
| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 4 | type | 0 = top-level, 1 = unknown, 2 = child-with-parent, 3 = continuation |
| +8 | 8 | next_sibling | Linked list next pointer |
| +16 | 8 | parent_diag | Pointer to parent diagnostic node |
| +24 | 8 | child_list | Linked list of child diagnostics |
| +40 | 8 | extra_child_list | Secondary child list (always emitted) |
| +56 | 8 | note_list | Linked list of attached notes |
| +72 | 8 | context_list | Context diagnostics (include-stack annotations) |
| +96 | 4 | has_source_location | Nonzero if source info is present |
| +100 | 2 | column_number | Column in source line (unsigned short) |
| +120 | 8 | source_file_info | Passed to sub_723260 to get filename string |
| +128 | 4 | line_number | Source line number (unsigned int) |
| +136 | 4 | file_id | File table index (0 = no file) |
| +140 | 2 | column_end | End column for underlining range |
| +144 | 4 | is_command_line | Nonzero means "command line" prefix |
| +152 | 8 | source_entity | If nonzero, use sub_723640 for decorated location |
| +160 | 8 | display_name_ptr | Filename string pointer |
| +168 | 4 | display_line | Line number for display |
| +172 | 4 | tab_stop_width | Tab stop setting for source display |
| +176 | 4 | diagnostic_number | Numeric ID for -W flags, becomes SARIF ruleId |
| +180 | 1 | severity | Severity code (see severity enum below) |
Terminal Text Renderer (sub_681D20)
The 37KB terminal renderer is the larger and more complex of the two backends. It handles ANSI color output, word-wrapping to terminal width, source context display with caret underlining, and recursive child diagnostic emission.
Location prefix. The source location is formatted before the severity label. For file-based diagnostics, sub_722FC0 or sub_723640 produces the filename, followed by (line_number) in parentheses, wrapped in ANSI color code 5 (file path color). Command-line diagnostics use string ID 1490 ("command line"). Diagnostics with no file have no location prefix.
Severity label. The label string is looked up via sub_67C860(string_id) from a localized string table. The string table base v57 is offset by 0 for normal diagnostics, 1 for command-line diagnostics. When diagnostic numbering is enabled (unk_4D04728 set) and severity is 5 or below with a nonzero diagnostic number at +176, the renderer appends #<number> after the severity label, converted by sub_67D2D0.
ANSI color system. CICC does not emit standard ANSI escape sequences directly. Instead, it uses an internal 2-byte marker system where byte 0 is 0x1B (ESC) and byte 1 is a color code from 1 to 5. These internal markers are translated to real terminal escapes by the output layer.
| Internal Code | Semantic | Typical Terminal Mapping |
|---|---|---|
| 1 | Reset/default | \033[0m |
| 2 | Error | Red |
| 3 | Caution/severe-warning | Yellow/magenta |
| 4 | Location highlight | Bold/cyan |
| 5 | File path / remark | Dim/blue |
Color output is gated by dword_4F073CC[0] (nonzero = enabled) and dword_4F073C8 (nonzero = "rich" escape mode; zero = "simple" mode that skips escape bytes entirely).
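The marker translation described above can be sketched as follows. This is a minimal illustration, not the binary's code: the concrete escape sequences per color code are assumptions (the table above only gives "typical" mappings), and the two gating flags stand in for dword_4F073CC[0] and dword_4F073C8.

```c
#include <string.h>

/* Hypothetical mapping of the internal 1..5 color codes to real escapes.
 * The exact sequences are assumptions based on the table above. */
static const char *color_escape(int code) {
    switch (code) {
    case 1: return "\033[0m";   /* reset/default */
    case 2: return "\033[31m";  /* error: red */
    case 3: return "\033[33m";  /* caution: yellow */
    case 4: return "\033[1m";   /* location: bold */
    case 5: return "\033[34m";  /* file path: blue */
    default: return "";
    }
}

/* Translate a buffer containing internal 2-byte markers (0x1B, code)
 * into `out`. When color output is disabled or simple mode is active,
 * the markers are stripped rather than expanded. */
void render_markers(const char *in, char *out, size_t outsz,
                    int color_enabled, int rich_mode) {
    size_t o = 0;
    for (size_t i = 0; in[i] != '\0'; i++) {
        if (in[i] == '\x1b' && in[i + 1] >= 1 && in[i + 1] <= 5) {
            if (color_enabled && rich_mode) {
                const char *esc = color_escape(in[i + 1]);
                size_t n = strlen(esc);
                if (o + n < outsz) { memcpy(out + o, esc, n); o += n; }
            }
            i++;                /* consume the 2-byte marker either way */
        } else if (o + 1 < outsz) {
            out[o++] = in[i];
        }
    }
    out[o] = '\0';
}
```

In "simple" mode the marker pair is consumed but nothing is emitted, which matches the description of dword_4F073C8 == 0 skipping escape bytes entirely.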
Word-wrapping. Two code paths exist depending on whether ANSI colors are active.
Without colors (Path A), the algorithm is straightforward: compute available width as dword_4D039D0 - left_margin, scan for the last space within that width, break there, and emit newline plus indent. The left margin and continuation indent depend on the diagnostic type:
| Type (+0) | Left Margin | Continuation Indent |
|---|---|---|
| 0 (top-level) | 0 | 10 |
| 1 | 12 | 22 |
| 2 (child) | 10 or 12 | 20 or 22 |
| 3 (continuation) | 1 | 11 |
For type 2, the margin is +2 if the current diagnostic is not the first child of its parent.
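A minimal sketch of the Path A (colorless) algorithm, writing into a caller-supplied buffer rather than a stream for easy inspection. The break-at-last-space logic follows the description above; the handling of runs of spaces at the break point is an assumption.

```c
#include <stdio.h>
#include <string.h>

/* Wrap `msg` at the last space that fits in (terminal_width - margin)
 * columns, continuing subsequent lines at `cont_indent` columns, per
 * the margin/indent table above. Falls back to a forced break when no
 * space fits. */
void wrap_plain(const char *msg, int terminal_width,
                int left_margin, int cont_indent, char *out) {
    int avail = terminal_width - left_margin;
    out += sprintf(out, "%*s", left_margin, "");
    while ((int)strlen(msg) > avail) {
        int brk = -1;
        for (int i = 0; i < avail; i++)
            if (msg[i] == ' ') brk = i;          /* last space that fits */
        if (brk < 0) brk = avail;                /* forced break */
        out += sprintf(out, "%.*s\n%*s", brk, msg, cont_indent, "");
        msg += brk;
        while (*msg == ' ') msg++;               /* skip the break space */
        avail = terminal_width - cont_indent;
    }
    sprintf(out, "%s", msg);
}
```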
With colors (Path B), the algorithm tracks character-by-character with color state (v40 = current color, v41 = at-start-of-line flag, v152 = remaining columns). On encountering an ESC marker, it consumes the 2-byte pair and updates color state via sub_67BBF0. When the column limit is hit, the algorithm attempts to break at the last recorded space position (with buffer rewind to v147), falling back to a forced break at the current position.
The global qword_4F07468 controls wrap behavior: the low 32 bits disable wrapping entirely when nonzero, and the high 32 bits suppress source context display when nonzero.
Source context display. After the message text, the renderer displays the source line with caret underlining. sub_729B10(file_id, ...) retrieves source line data. Each source position entry is a linked list node with a 24+ byte layout: +0 next pointer, +8 source text pointer, +16 entry type (0 = normal char, 1 = same-position, 2 = 2-byte char, 3 = tab), +24 replacement character. The display renders two lines: the source text and a caret/tilde underline line, where ^ marks the error column and ~ extends the range to column_end. Multi-byte character handling uses sub_721AB0 to determine byte counts.
Recursive emission. After the main diagnostic and source context, child diagnostics are emitted recursively in this order: child_list (+24), note_list (+56, skipped for severity 2 remarks), context_list (+72, with parent pointer set before recursion), extra_child_list (+40). After all children, a blank line separator is emitted (unless compact mode is active), the output buffer is null-terminated, and the result is written via fputs to qword_4F07510 followed by fflush.
Machine-readable log. When qword_4D04908 (log FILE*) is set and the diagnostic type is not 3 (continuation), the renderer writes a single-line record:
<severity-char> "<filename>" <line> <col> <message>\n
The severity character is indexed from the string "rwweeccccCli" by (severity - 4). For child diagnostics, the character is lowercased.
| Index | Character | Meaning |
|---|---|---|
| 0 (sev 4) | r | remark |
| 1 (sev 5) | w | warning |
| 2 (sev 6) | w | caution (displayed as warning) |
| 3 (sev 7) | e | error |
| 4 (sev 8) | e | error (promoted) |
| 5 (sev 9) | c | catastrophe |
| 6 (sev 10) | c | catastrophe |
| 7 (sev 11) | C | catastrophe (alternate) |
| 8 | l | unknown |
| 9 | i | internal error |
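The record format and severity-character lookup can be sketched directly from the string literal. This is illustrative code, not the binary's: only the "rwweeccccCli" literal, the (severity - 4) index, and the lowercase-for-children rule are taken from the analysis above.

```c
#include <ctype.h>
#include <stdio.h>

/* Severity character for the machine-readable log line, indexed from
 * the literal "rwweeccccCli" by (severity - 4). Child diagnostics are
 * lowercased (a no-op for the already-lowercase entries). */
static char severity_char(int severity, int is_child) {
    static const char chars[] = "rwweeccccCli";
    char c = chars[severity - 4];
    return is_child ? (char)tolower((unsigned char)c) : c;
}

/* Emit one single-line record:  <sev-char> "<filename>" <line> <col> <message> */
void log_record(FILE *log, int severity, int is_child,
                const char *file, int line, int col, const char *msg) {
    fprintf(log, "%c \"%s\" %d %d %s\n",
            severity_char(severity, is_child), file, line, col, msg);
}
```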
SARIF JSON Renderer
The SARIF backend is implemented inline within sub_6837D0. It does not emit a complete SARIF document (there is no $schema key and no runs[] envelope); instead, it writes one JSON object per diagnostic as a comma-separated stream to qword_4F07510. The caller or a post-processing tool is expected to wrap the stream.
Each diagnostic object has this structure:
{
"ruleId": "EC<number>",
"level": "error"|"warning"|"remark"|"catastrophe"|"internal_error",
"message": {"text": "<JSON-escaped message>"},
"locations": [
{
"physicalLocation": {
"artifactLocation": {"uri": "file://<path>"},
"region": {"startLine": N, "startColumn": N}
}
}
],
"relatedLocations": [
{
"message": {"text": "..."},
"physicalLocation": { ... }
}
]
}
The ruleId is constructed by sprintf("%lu", *(uint32*)(diag+176)) -- the decimal diagnostic number prefixed with "EC". The level string is mapped from the severity byte at +180 via a switch statement. The message.text is produced by sub_683690, which renders the diagnostic text into qword_4D039E8 via sub_681B50 and then copies it character-by-character into qword_4D039D8 with JSON escaping of " and \ characters. The locations array is present only when *(diag+136) != 0 (valid file ID). The physicalLocation is built by sub_67C120, which calls sub_729E00 to decompose the packed source position and sub_722DF0 to resolve the file ID to a path string. The relatedLocations array carries note sub-diagnostics from the linked list at diag+72.
Multiple diagnostics are comma-separated: a comma is prepended before { when unk_4F074B0 + unk_4F074B8 > 1 (more than one diagnostic emitted so far).
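The character-by-character escaping step attributed to sub_683690 above can be sketched as follows. This is a simplified stand-in: per the description, only " and \ are escaped, and the growable-buffer plumbing is replaced by a plain output buffer.

```c
#include <stddef.h>

/* Copy `in` to `out`, backslash-escaping only '"' and '\\' as the
 * SARIF message renderer is described to do. Returns the escaped
 * length; output is truncated to fit outsz with a terminating NUL. */
size_t json_escape(const char *in, char *out, size_t outsz) {
    size_t o = 0;
    for (; *in && o + 2 < outsz; in++) {
        if (*in == '"' || *in == '\\')
            out[o++] = '\\';    /* escape prefix */
        out[o++] = *in;
    }
    out[o] = '\0';
    return o;
}
```

Note that a full SARIF producer would also need to escape control characters (newlines, tabs) to emit valid JSON; the binary's handling of those is not described here.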
Include-stack annotations. When include depth (dword_4F04C64) is greater than zero, sub_6837D0 walks the include stack (776-byte records at qword_4F04C68) calling sub_67B7E0 to build #include context annotations. These are linked as children at diag+40/+48. Error 453 gives "in file included from ..." context, error 1150 gives ellipsis "..." when too many include levels exist, and errors 1063/1064 give file-reference footers.
Warning-as-error promotion. When a warning (severity 5) has been emitted and unk_4D04728 is set, the function creates a synthetic "warnings treated as errors" diagnostic via sub_67D610(0xE7D, ..., 4) with severity 4 (remark), then recursively calls sub_6837D0 on it.
Diagnostic Filtering and Suppression
Filtering happens in sub_6837D0 before either renderer is invoked:
- Severity threshold: `byte_4F07481[0]` stores the minimum severity. Diagnostics below this level are silently suppressed.
- Duplicate detection: `byte_4CFFE80[4*errnum + 2]` bit flags track "already seen" diagnostics. Bit 0 marks first occurrence, bit 1 marks already emitted. On second hit, the diagnostic is counted but not emitted.
- Pragma suppression: `sub_67D520` checks whether the diagnostic is disabled via `#pragma diag_suppress` or similar EDG pragmas. `sub_67D470` records the suppression.
- Error limit: When `unk_4F074B0 + unk_4F074B8 >= unk_4F07478`, error 1508 ("error limit reached") is emitted and `sub_7235F0(9)` aborts compilation.
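The duplicate-detection step can be sketched as a small bit-flag check. The 4-byte stride and the +2 byte offset follow the description of byte_4CFFE80 above; the bit assignments (bit 0 = seen, bit 1 = emitted) are taken from the list, and the function shape is an illustration, not the binary's.

```c
#include <stdint.h>

enum { DIAG_SEEN = 1, DIAG_EMITTED = 2 };

/* Returns 1 if the diagnostic should be rendered, 0 if it is a
 * duplicate and should only be counted. `flags` models the
 * byte_4CFFE80 table: one 4-byte record per diagnostic number, with
 * the flag byte at offset +2. */
int should_emit_once(uint8_t *flags, unsigned errnum) {
    uint8_t *f = &flags[4 * errnum + 2];
    if (*f & DIAG_EMITTED)
        return 0;                       /* second hit: count only */
    *f |= DIAG_SEEN | DIAG_EMITTED;     /* first occurrence */
    return 1;
}
```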
Diagnostic Severity Enum
The severity byte at diag+180 encodes the following levels, used by both the terminal and SARIF renderers:
| Value | Name | Terminal Color | SARIF Level | Log Char | Label |
|---|---|---|---|---|---|
| 2 | remark | ESC 5 (blue) | "remark" | R | R |
| 4 | warning | ESC 5 (blue) | "warning" | r | W |
| 5 | caution | ESC 3 (yellow) | "warning" | w | W (lowercase) |
| 6 | severe-warning | ESC 3 (yellow) | (falls through to error) | w | E (lowercase) |
| 7 | error | ESC 2 (red) | "error" | e | E |
| 8 | error (promoted) | ESC 2 (red) | "error" | e | E |
| 9 | catastrophe | ESC 2 (red) | "catastrophe" | c | C |
| 10 | catastrophe | ESC 2 (red) | "catastrophe" | c | C |
| 11 | internal-error | ESC 2 (red) | "internal_error" | i | special |
Severity values 9, 10, and 11 are fatal: after emission, sub_7AFBD0 and sub_7235F0(severity) terminate compilation. sub_7AFBD0 is tentatively identified as longjmp-style error propagation [LOW confidence]: it is called on fatal error paths and does not return to its caller, consistent with longjmp or exit, but it could also be a custom abort-style handler; no setjmp/longjmp string evidence was found. Internal errors (11) additionally prepend "(internal error) " to the log output and use the prefix for error 3709.
Note: severity 2 (remark) is distinct from LLVM optimization remarks -- it is an EDG frontend remark (e.g., template instantiation notes). Remarks at severity 2 suppress their note_list children during recursive emission.
LLVM Optimization Remarks
Registration and CLI Surface
Three cl::opt<std::string> knobs are registered at ctor_152 (0x4CE3F0), each taking a regex pattern:
| Knob | Description | Filters |
|---|---|---|
pass-remarks | Enable optimization remarks from passes whose name matches the pattern | Passed (successful) optimizations |
pass-remarks-missed | Enable missed optimization remarks | Optimizations that were considered but not applied |
pass-remarks-analysis | Enable analysis remarks | Intermediate analysis results and explanations |
These are stock LLVM cl::opt registrations. CICC exposes them through the flag catalog (sub_9624D0) via the -inline-info convenience flag, which routes to the opt phase as:
-Xopt -pass-remarks=inline
-Xopt -pass-remarks-missed=inline
-Xopt -pass-remarks-analysis=inline
Additional remark-related knobs registered at ctor_376_0 (0x512DF0):
| Knob | Purpose |
|---|---|
pass-remarks-with-hotness | Include PGO hotness information in remarks |
pass-remarks-hotness-threshold | Minimum hotness for remark emission |
pass-remarks-output | File path for remark output (YAML or bitstream) |
pass-remarks-filter | Additional filter for remark pass names |
pass-remarks-format | Format: yaml or bitstream |
The -w flag (suppress warnings) routes to both opt and llc as -w. The -Werror flag routes to both as -Werror, promoting warnings to errors.
Remark Emission Protocol
LLVM passes emit remarks through a three-step protocol observed consistently across all analyzed passes:
Step 1: Construct the remark. The pass creates a DiagnosticInfoOptimizationBase subclass object via one of these constructors:
| Constructor | Address | Creates |
|---|---|---|
sub_B17560 | 0xB17560 | OptimizationRemark (pass succeeded) |
sub_15CA330 | 0x15CA330 | OptimizationRemark (alternative constructor) |
sub_15CA540 | 0x15CA540 | OptimizationRemarkMissed (pass failed/skipped) |
sub_B178C0 | 0xB178C0 | Warning-level DiagnosticInfo (non-remark warning) |
The constructor takes a pass name string (e.g., "coro-split", "wholeprogramdevirt", "loop-distribute") and a remark ID string (e.g., "Devirtualized", "Distribute", "CoroSplit").
Step 2: Build the message. The message is assembled through a builder pattern:
| Builder Function | Address | Purpose |
|---|---|---|
sub_B18290 | 0xB18290 | Append raw string to remark message |
sub_B16430 | 0xB16430 | Create named string attribute (e.g., "FunctionName") |
sub_B16B10 | 0xB16B10 | Create named integer attribute (e.g., "frame_size") |
sub_B16530 | 0xB16530 | Append named value (used in analysis remarks) |
sub_B180C0 | 0xB180C0 | Finalize and prepare remark for emission |
A typical emission sequence (from CoroSplit at 0x24F05D1):
call sub_B17560("coro-split", "CoroSplit") // create remark
call sub_B18290("Split '") // append prefix
call sub_B16430("function", fn_name) // named attribute
call sub_B18290("' (frame_size=") // literal text
call sub_B16B10("frame_size", N) // integer attribute
call sub_B18290(", align=") // literal text
call sub_B16B10("align", M) // integer attribute
call sub_B18290(")") // closing paren
Resulting remark text: Split '<function_name>' (frame_size=N, align=M)
Step 3: Publish. sub_1049740 publishes the remark to the diagnostic handler registered on the LLVMContext. The handler consults the pass-remarks / pass-remarks-missed / pass-remarks-analysis regex filters to decide whether to emit or suppress the remark.
After emission, remark objects are cleaned up: vtable-based destructors free the remark structure, and SSO string cleanup checks whether each temporary string pointer differs from its inline buffer address (indicating heap allocation that needs free).
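The SSO cleanup check described above (free only when the string pointer no longer aims at the inline buffer) is the standard small-string-optimization teardown. A minimal sketch, with an illustrative struct layout that is not the binary's exact one:

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative SSO string: `data` points either at `inline_buf`
 * (short strings) or at a heap block (long strings). */
struct sso_string {
    char *data;
    size_t len;
    char inline_buf[16];
};

void sso_init(struct sso_string *s, const char *src) {
    s->len = strlen(src);
    if (s->len < sizeof s->inline_buf) {
        memcpy(s->inline_buf, src, s->len + 1);
        s->data = s->inline_buf;        /* short: stay inline */
    } else {
        s->data = malloc(s->len + 1);   /* long: heap-allocate */
        memcpy(s->data, src, s->len + 1);
    }
}

/* The cleanup check from the text: free only when the pointer differs
 * from the inline buffer address, i.e. heap allocation occurred. */
void sso_destroy(struct sso_string *s) {
    if (s->data != s->inline_buf)
        free(s->data);
}
```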
Remark Categories
Standard LLVM categories:
| Category | YAML Tag | Meaning |
|---|---|---|
| Passed | !Passed | Optimization was successfully applied |
| Missed | !Missed | Optimization was considered but not applied |
| Analysis | !Analysis | Intermediate analysis information |
| Failure | !Failure | Internal failure during optimization |
NVIDIA-specific categories added to the remark framework:
| Category | YAML Tag | Purpose |
|---|---|---|
| AnalysisFPCommute | !AnalysisFPCommute | GPU floating-point commutativity analysis feedback |
| AnalysisAliasing | !AnalysisAliasing | GPU memory aliasing analysis feedback |
These NVIDIA-specific categories are registered in the YAML serializer at sub_15CAD70 and the YAML parser at sub_C30A00.
Serialization Backends
YAML serializer (sub_15CAD70, 13KB at 0x15CAD70): Emits structured YAML with fields Pass, Name, DebugLoc, and the remark type tag. Uses a vtable-based streaming API at offsets +96 (writeKey), +120 (beginMapping), +128 (endMapping).
Bitstream serializer (sub_F01350, 23KB at 0xF01350): Emits remarks in LLVM's binary bitstream format (used for -fsave-optimization-record). Record types include "Remark", "Remark header", "Remark debug location", "Remark hotness", "Argument with debug location", and "Argument". Uses sub_EFD2C0 for VBR-encoded record emission and sub_EFCCF0 for abbreviation definitions.
Remark serializer factory (sub_C2E790, 6KB at 0xC2E790): llvm::remarks::createRemarkSerializer dispatches to YAML or bitstream format based on configuration. Returns an error for unknown formats: "Unknown remark serializer format.".
OptimizationRemarkEmitter Analysis
Two analysis passes provide remark emission capability to function-level and machine-function-level passes:
| Pass | Pipeline Name | Level |
|---|---|---|
OptimizationRemarkEmitterAnalysis | "opt-remark-emit" (pipeline ID 181) | Function analysis |
MachineOptimizationRemarkEmitterAnalysis | "machine-opt-remark-emitter" (pipeline ID 467) | MachineFunction analysis |
Passes that emit remarks must request the appropriate analysis and store the resulting OptimizationRemarkEmitter*. For example, the TwoAddressInstruction pass stores it at this+272, obtained via analysis lookup unk_4FC4534.
Passes Known to Emit Remarks
This is a non-exhaustive list of passes observed emitting optimization remarks in the binary:
| Pass | Remark Name | Remark Examples |
|---|---|---|
| CoroSplit | "coro-split" | Split '<fn>' (frame_size=N, align=M) |
| WholeProgramDevirt | "wholeprogramdevirt" | Devirtualized '<fn>' |
| LoopDistribute | "loop-distribute" | Distribute, NoUnsafeDeps, TooManySCEVRuntimeChecks |
| LoopVectorize | "loop-vectorize" | Vectorization success/failure details |
| LoopUnroll | "loop-unroll" | Unroll factor and failure reasons |
| LoopInterchange | "loop-interchange" | Cannot interchange loops... |
| LICM | "licm" | Hoist success/failure reasons |
| SLPVectorizer | "slp-vectorizer" | SLP vectorization decisions |
| MachinePipeliner | "pipeliner" | Pipelined succesfully! [sic] |
| MachineOutliner | "machine-outliner" | Outlining decisions |
| OpenMP SPMD Transform | "openmp-opt" | OMP120 (remark), OMP121 (warning) |
| InstCombine | "instcombine" | Visit decisions (via instcombine-visit filter) |
| FastISel | "fastisel" | FastISel failure reports |
| IRCE | "irce" | Range check elimination decisions |
| TwoAddressInstruction | "twoaddressinstruction" | Two-address conversion decisions |
NVIDIA Profuse Framework
Design and Purpose
The "profuse" diagnostic framework is an NVIDIA-specific verbose output system that has no connection to the LLVM OptimizationRemark infrastructure. It predates LLVM's remark system and serves a different purpose: providing NVIDIA compiler engineers with extremely detailed, unstructured diagnostic output from specific optimization passes.
The name "profuse" is unfortunately overloaded in the cicc binary. Two completely unrelated systems use the word:
- PGO profuse: The `profuse` knob registered at `ctor_375` (0x512720) is a boolean that enables profile-guided optimization data consumption. It is set via `-profile-instr-use <file>`, which routes to `-Xopt -profuse=true -Xopt -proffile=<file>`. This is a PGO control flag, not a diagnostic system.
- Diagnostic profuse: The `profuseinline` and `profusegvn` knobs are NVIDIA diagnostic toggles that control verbose output from specific optimization passes. These are the "profuse framework" discussed here.
profuseinline
Registered at ctor_186_0 (0x4DBEC0) as a cl::opt<bool> with default value off (false).
When enabled, the NVIDIA custom inliner (sub_1864060, the shouldInline / inline cost computation) emits verbose diagnostic output for every inlining decision. This includes the computed cost, threshold comparison, argument type-size coercion details, and the final accept/reject decision.
The profuse inlining output goes directly to stderr through fprintf-style calls within the inliner code. It is not routed through OptimizationRemarkEmitter and does not appear in remark YAML/bitstream output. This is distinct from the LLVM inline-remark-attribute knob which annotates the IR with remark metadata.
The -inline-info CLI flag does not enable profuseinline. Instead, -inline-info routes to the three standard pass-remarks knobs filtered for "inline". To enable profuse output, one must pass -Xopt -profuseinline=true (or -Xcicc -opt -profuseinline=true through nvcc).
Comparison of the two diagnostic channels for inlining:
| Feature | profuseinline | -inline-info (pass-remarks) |
|---|---|---|
| Output format | Unstructured stderr text | Structured LLVM remark |
| Controlled by | cl::opt<bool> | Regex filter on pass name |
| Default | Off | Off |
| YAML/bitstream output | No | Yes (if -pass-remarks-output set) |
| Cost model details | Yes (full cost breakdown) | No (accept/reject only) |
| NVIDIA-specific metrics | Yes (GPU opcode bonus, struct analysis) | No |
profusegvn
Registered at ctor_201 (0x4E0990) as a cl::opt<bool> with default value true (enabled). Global address: 0x4FAE7E0. Description: "profuse for GVN".
When the knob is active (which it is by default), the GVN pass (sub_1900BB0, 83KB) emits verbose diagnostic output at the following decision points:
- Value replacement decisions (when a leader is found in the value numbering table)
- Store/load expression hash table matches
- PRE (Partial Redundancy Elimination) insertion decisions
The output is written directly to stderr, bypassing the LLVM remark system entirely. The profuse GVN output is not captured by -pass-remarks-output and does not appear in remark YAML or bitstream files.
To disable the verbose output, pass -Xopt -profusegvn=false. The fact that this defaults to true (unlike profuseinline which defaults to false) suggests it may be gated by an additional runtime check (possibly wizard mode or an optimization level gate) to prevent user-visible noise in release builds.
Profuse vs. LLVM Remarks Summary
| Aspect | Profuse Framework | LLVM Optimization Remarks |
|---|---|---|
| Origin | NVIDIA custom | Upstream LLVM |
| Passes | Inliner, GVN only (observed) | Most optimization passes |
| Output | Raw stderr fprintf | Structured DiagnosticInfo |
| Format | Unstructured text | YAML, bitstream, or terminal |
| Filtering | Per-knob boolean | Regex on pass name |
| Serialization | None | YAML and bitstream serializers |
| IDE integration | None | SARIF (with post-processing) |
| Default | Off (inline) / On (GVN) | Off (requires -pass-remarks) |
Filtering and Configuration
CLI Flags for Diagnostic Control
EDG frontend diagnostics (Phase I):
| Flag | Route | Effect |
|---|---|---|
--diagnostics_format=sarif | EDG direct | Switch output to SARIF JSON |
--output_mode text|sarif | EDG direct (case 293) | Same as above, alternative spelling |
-w | opt -w, llc -w | Suppress all warnings |
-Werror | opt -Werror, llc -Werror | Promote warnings to errors |
--error_limit N | EDG direct | Maximum errors before abort (unk_4F07478) |
#pragma diag_suppress N | EDG source | Suppress specific diagnostic by number |
LLVM optimization remarks (Phase II / opt):
| Flag | Route | Effect |
|---|---|---|
-inline-info | opt: -pass-remarks=inline, -pass-remarks-missed=inline, -pass-remarks-analysis=inline | Enable inline-specific remarks |
-Xopt -pass-remarks=<regex> | opt direct | Enable passed remarks matching pattern |
-Xopt -pass-remarks-missed=<regex> | opt direct | Enable missed remarks matching pattern |
-Xopt -pass-remarks-analysis=<regex> | opt direct | Enable analysis remarks matching pattern |
-Xopt -pass-remarks-output=<file> | opt direct | Write remarks to file (YAML or bitstream) |
-Xopt -pass-remarks-format=yaml|bitstream | opt direct | Select output format |
-Xopt -pass-remarks-with-hotness | opt direct | Include PGO hotness in remarks |
-Xopt -pass-remarks-hotness-threshold=N | opt direct | Minimum hotness for emission |
-Xopt -pass-remarks-filter=<regex> | opt direct | Additional pass name filter |
NVIDIA profuse diagnostics:
| Flag | Route | Effect |
|---|---|---|
-Xopt -profuseinline=true | opt direct | Enable verbose inlining diagnostics |
-Xopt -profusegvn=false | opt direct | Disable verbose GVN diagnostics (on by default) |
Debug and verbose output:
| Flag | Route | Effect |
|---|---|---|
-enable-verbose-asm | llc -asm-verbose | Verbose assembly comments |
-show-src | llc -nvptx-emit-src | Embed source in PTX output |
-time-passes | special (must be only flag) | Time each LLVM pass |
Global Variables Controlling Diagnostic Behavior
| Address | Type | Name | Purpose |
|---|---|---|---|
unk_4D04198 | int | diagnostic_format | 0 = text, 1 = SARIF |
byte_4F07481[0] | byte | min_severity_threshold | Minimum severity for emission |
unk_4F074B0 | uint | error_count | Running error counter |
unk_4F074B8 | uint | warning_count | Running warning/non-error counter |
unk_4F07478 | uint | error_limit | Maximum errors before abort |
unk_4F07490 | flag | print_counters | Whether to print summary counters |
unk_4D04728 | byte | diag_numbering | Diagnostic numbering enabled |
unk_4D042B0 | byte | command_line_mode | Command-line diagnostic prefix |
unk_4D042B8 | flag | werror_flag | Promote severity to 7 for warnings |
dword_4D039D0 | int | terminal_width | Columns for word-wrapping |
dword_4F073CC[0] | int | ansi_color_enabled | ANSI color output flag |
dword_4F073C8 | int | rich_escape_mode | Rich (2-byte ESC) vs simple mode |
qword_4F07468 | int64 | wrap_control | Low32: disable wrap. High32: suppress context |
qword_4F07510 | FILE* | diag_output_stream | Output stream (stderr) |
qword_4D04908 | FILE* | diag_log_file | Machine-readable log file |
byte_4CFFE80 | array | diag_seen_flags | Per-diagnostic duplicate tracking |
Growable String Buffer Infrastructure
All three diagnostic systems share the same growable string buffer used for message formatting. The buffer structure appears at qword_4D039D8 (output buffer), qword_4D039E0 (prefix buffer), and qword_4D039E8 (header/message buffer):
| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 8 | (tag/type) | Unused or type discriminator |
| +8 | 8 | capacity | Maximum bytes before realloc |
| +16 | 8 | length | Current write position |
| +24 | 8 | (unused) | Padding |
| +32 | 8 | data | char* pointer to the actual buffer |
| Helper | Address | Operation |
|---|---|---|
sub_823800 | 0x823800 | Reset/clear buffer (set length to 0) |
sub_823810 | 0x823810 | Grow buffer capacity (realloc) |
sub_8238B0 | 0x8238B0 | Append data: memcpy(buf->data + buf->length, str, len) |
sub_8237A0 | 0x8237A0 | Allocate new buffer (initial capacity = 1024) |
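A C rendering of the buffer structure and its append/grow helpers, using the field offsets from the table above. The 1024-byte initial capacity matches sub_8237A0's description; the doubling growth policy and the NUL termination on append are assumptions for illustration.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Field offsets mirror the table: +0 tag, +8 capacity, +16 length,
 * +24 padding, +32 data pointer. */
struct growbuf {
    uint64_t tag;        /* +0: unused/type discriminator */
    uint64_t capacity;   /* +8 */
    uint64_t length;     /* +16 */
    uint64_t pad;        /* +24 */
    char *data;          /* +32 */
};

/* Allocate a new buffer (role of sub_8237A0). */
struct growbuf *gb_new(void) {
    struct growbuf *b = calloc(1, sizeof *b);
    b->capacity = 1024;                  /* initial capacity per the table */
    b->data = malloc(b->capacity);
    b->data[0] = '\0';
    return b;
}

/* Append data (role of sub_8238B0), growing via realloc (sub_823810).
 * The doubling policy is an assumption. */
void gb_append(struct growbuf *b, const char *s, size_t n) {
    while (b->length + n + 1 > b->capacity) {
        b->capacity *= 2;
        b->data = realloc(b->data, b->capacity);
    }
    memcpy(b->data + b->length, s, n);   /* matches the memcpy in sub_8238B0 */
    b->length += n;
    b->data[b->length] = '\0';
}
```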
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
sub_67B780 | 0x67B780 | -- | EDG: Increment error/warning counters |
sub_67B7E0 | 0x67B7E0 | -- | EDG: Build include-stack annotation |
sub_67B9F0 | 0x67B9F0 | -- | EDG: Diagnostic record pool allocator |
sub_67BB20 | 0x67BB20 | -- | EDG: Argument node allocator |
sub_67BBF0 | 0x67BBF0 | -- | EDG: Set ANSI color state for output |
sub_67BD40 | 0x67BD40 | -- | EDG: Emit newline/flush for source context |
sub_67BDC0 | 0x67BDC0 | -- | EDG: Load file metadata and tab stop width |
sub_67C120 | 0x67C120 | -- | EDG/SARIF: Emit physicalLocation JSON |
sub_67C860 | 0x67C860 | -- | EDG: Localized string lookup by ID |
sub_67D2D0 | 0x67D2D0 | -- | EDG: Convert internal diag ID to user-visible number |
sub_67D470 | 0x67D470 | -- | EDG: Record pragma-based suppression |
sub_67D520 | 0x67D520 | -- | EDG: Check pragma-based suppression |
sub_67D610 | 0x67D610 | -- | EDG: Create synthetic diagnostic (warnings-as-errors) |
sub_681B50 | 0x681B50 | -- | EDG: Populate message text into header buffer |
sub_681D20 | 0x681D20 | 37KB | EDG: Terminal text diagnostic renderer |
sub_683690 | 0x683690 | -- | EDG/SARIF: Emit JSON-escaped message object |
sub_6837D0 | 0x6837D0 | 20KB | EDG: Diagnostic dispatch and SARIF renderer |
sub_721AB0 | 0x721AB0 | -- | EDG: Multi-byte character byte count |
sub_722DF0 | 0x722DF0 | -- | EDG/SARIF: Resolve file-id to path string |
sub_722FC0 | 0x722FC0 | -- | EDG: Format filename into buffer |
sub_723260 | 0x723260 | -- | EDG: Get filename string from file info |
sub_723640 | 0x723640 | -- | EDG: Get decorated source location string |
sub_729B10 | 0x729B10 | -- | EDG: Retrieve file/line data for source context |
sub_729E00 | 0x729E00 | -- | EDG/SARIF: Decompose packed source position |
sub_729F80 | 0x729F80 | -- | EDG: Promote severity (hard error) |
sub_7235F0 | 0x7235F0 | -- | EDG: Fatal exit with severity code |
sub_7AF1D0 | 0x7AF1D0 | -- | EDG: Newline character mapping lookup |
sub_823800 | 0x823800 | -- | Shared: Reset/clear growable string buffer |
sub_823810 | 0x823810 | -- | Shared: Grow/realloc string buffer |
sub_8237A0 | 0x8237A0 | -- | Shared: Allocate new growable buffer |
sub_8238B0 | 0x8238B0 | -- | Shared: Append to string buffer |
sub_B16430 | 0xB16430 | -- | LLVM Remark: Create named string attribute |
sub_B16530 | 0xB16530 | -- | LLVM Remark: Append named value |
sub_B16B10 | 0xB16B10 | -- | LLVM Remark: Create named integer attribute |
sub_B157E0 | 0xB157E0 | -- | LLVM Remark: Get DebugLoc for remark source location |
sub_B17560 | 0xB17560 | -- | LLVM Remark: Construct OptimizationRemark (passed) |
sub_B178C0 | 0xB178C0 | -- | LLVM Remark: Construct warning-level DiagnosticInfo |
sub_B180C0 | 0xB180C0 | -- | LLVM Remark: Finalize and prepare remark for emission |
sub_B18290 | 0xB18290 | -- | LLVM Remark: Append raw string to remark message |
sub_B2BE50 | 0xB2BE50 | -- | LLVM Remark: getRemarkStreamer |
sub_B6EA50 | 0xB6EA50 | -- | LLVM Remark: isEnabled check |
sub_B6F970 | 0xB6F970 | -- | LLVM Remark: getRemarkFilter |
sub_B91220 | 0xB91220 | -- | LLVM Remark: Free remark string |
sub_C2E790 | 0xC2E790 | 6KB | LLVM Remark: createRemarkSerializer factory |
sub_C302C0 | 0xC302C0 | 4KB | LLVM Remark: YAML remark serializer emit |
sub_C30A00 | 0xC30A00 | 6KB | LLVM Remark: YAML remark parser (6 type tags) |
sub_C31010 | 0xC31010 | 8KB | LLVM Remark: YAML remark field parser |
sub_EFCCF0 | 0xEFCCF0 | 9KB | LLVM Remark: Bitstream abbreviation emitter |
sub_EFD2C0 | 0xEFD2C0 | 18KB | LLVM Remark: Bitstream record writer |
sub_EFE900 | 0xEFE900 | 30KB | LLVM Remark: Bitstream remark parser |
sub_F01350 | 0xF01350 | 23KB | LLVM Remark: Bitstream remark serializer |
sub_1049740 | 0x1049740 | -- | LLVM Remark: Publish remark to diagnostic handler |
sub_15CA330 | 0x15CA330 | -- | LLVM Remark: OptimizationRemark constructor |
sub_15CA540 | 0x15CA540 | -- | LLVM Remark: OptimizationRemarkMissed constructor |
sub_15CAB20 | 0x15CAB20 | -- | LLVM Remark: OptimizationRemark::operator<<(StringRef) |
sub_15CAD70 | 0x15CAD70 | 13KB | LLVM Remark: YAML remark serializer (NVIDIA-extended) |
sub_1DCCCA0 | 0x1DCCCA0 | -- | LLVM Remark: OptimizationRemarkEmitter::emit |
Cross-References
- Entry Point & CLI -- flag routing for `-w`, `-Werror`, `-inline-info`, `-Xopt` pass-through
- GVN -- `profusegvn` knob and GVN diagnostic output
- Inliner Cost Model -- `profuseinline` knob and inline cost diagnostics
- LLVM Pass Pipeline -- `opt-remark-emit` and `machine-opt-remark-emitter` analysis pass registration
- EDG Frontend -- EDG option registration including `--diagnostics_format`
- CLI Flags -- complete flag-to-pipeline routing table
- Knobs -- `profuseinline`, `profusegvn`, and remark-related knobs
- AsmPrinter -- remark emission during code generation
Hash Table and Collection Infrastructure
Every associative container in cicc v13.0 is built from the same handful of primitives: a pointer-hash DenseMap/DenseSet with quadratic probing, a wyhash-v4-family string hasher, and a SmallVector with inline buffer optimization. Before this page existed, the same hash table description was duplicated across 30+ wiki pages. This is the single source of truth. If you are reimplementing cicc's data structures, start here.
There are no NVIDIA-specific modifications to the DenseMap hashing or probing logic -- cicc links the LLVM 20.0.0 implementation unmodified. The only NVIDIA-original hash infrastructure is the wyhash-v4 string hasher used for the builtin name table.
DenseMap Layout
Two variants exist, distinguished by bucket stride. Both share the same 28-byte inline header, the same hash function, the same probing sequence, the same sentinel values, and the same growth policy. The header is always embedded directly inside a larger structure (context object, analysis result, pass state) -- never heap-allocated on its own.
Variant A -- DenseSet (8 bytes/bucket)
| Offset | Size | Type | Field |
|---|---|---|---|
| +0 | 8 | uint64_t | NumEntries |
| +8 | 8 | ptr | Buckets (heap-allocated array) |
| +16 | 4 | uint32_t | NumItems (live entries) |
| +20 | 4 | uint32_t | NumTombstones |
| +24 | 4 | uint32_t | NumBuckets (always power of 2) |
Bucket array size: NumBuckets * 8 bytes. Each bucket holds either a valid pointer, an empty sentinel, or a tombstone sentinel.
Variant B -- DenseMap (16 bytes/bucket)
Same 28-byte header. Each bucket holds a key-value pair at a 16-byte stride:
v30 = (_QWORD *)(buckets + 16LL * slot); // sub_163D530 line 561
*v30 = key; // +0: key
v30[1] = value; // +8: value
Variant B is used by the SelectionDAG builder (context offsets +120 and +152), the NVVM IR node uniquing tables, and any subsystem that maps pointers to pointers.
Where the Variants Appear
| Subsystem | Variant | Context offset | Purpose |
|---|---|---|---|
| NVVM IR uniquing (sub_162D4F0) | B (16B) | context qw[130..178] | Node deduplication per opcode |
| SelectionDAG builder (sub_163D530) | B (16B) | +120, +152 | Node mapping |
| SelectionDAG builder (sub_163D530) | A (8B) | +184 | Worklist set |
| Per-node analysis structures | A (8B) | +72 inside v381 | Visited set |
| CSSA PHI map (sub_3720740) | B (16B) | r15+0x60 | PHI-to-ID mapping |
| Coroutine spill tracking | B (16B) | +0x18 inline | Spill/reload tracking |
| Builtin name table | custom (12B stride) | context+480 | Name-to-ID with hash cache |
Pointer Hash Function
Every DenseMap/DenseSet instance in cicc that uses pointer keys employs the same hash:
hash(ptr) = (ptr >> 9) ^ (ptr >> 4)
This is LLVM's DenseMapInfo<void*>::getHashValue, unchanged. The right-shift by 4 discards the low bits that are always zero due to 8- or 16-byte alignment. The right-shift by 9 mixes in higher-order address bits to break up the stride patterns that arise from slab allocation (where consecutive objects are separated by a fixed power-of-two). The XOR combines these two views of the pointer into a single hash value that distributes well for both heap-allocated and slab-allocated objects.
Representative decompiled evidence (appears identically in dozens of functions):
v9 = (v12 - 1) & (((unsigned int)v11 >> 9) ^ ((unsigned int)v11 >> 4));
Integer-Key Hash Variant
A separate hash function is used for DenseMap<unsigned, T> instances (integer keys rather than pointers):
hash(key) = key * 37
This is LLVM's DenseMapInfo<unsigned>::getHashValue. It appears in the instruction emitter (sub_2E29BA0), the two-address pass (sub_1F4E3A0), the vector legalization tables, and the SelectionDAG instruction selection cost table (sub_3090F90). Integer-key maps use a different sentinel pair: 0xFFFFFFFF (empty) and 0xFFFFFFFE (tombstone).
wyhash v4 String Hasher -- sub_CBF760
The NVVM builtin name table uses a separate, NVIDIA-original hash function for string keys. sub_C92610 is a thin wrapper that tail-calls sub_CBF760. The function dispatches on input length into six code paths, each using different constant sets and mixing strategies:
Length Dispatch Table
| Length | Strategy | Constants |
|---|---|---|
| 0 | Return constant | 0x2D06800538D394C2 |
| 1--3 | 3-byte read + XOR + multiply | seed 0x87275A9B, mul 0xC2B2AE3D27D4EB4F, avalanche 0x165667B19E3779F9 |
| 4--8 | 2x uint32 + combine + rotate | XOR 0xC73AB174C5ECD5A2, mul 0x9FB21C651E98DF25 |
| 9--16 | 2x uint64 + 128-bit multiply | XOR 0x6782737BEA4239B9 / 0xAF56BC3B0996523A, avalanche 0x165667919E3779F9 |
| 17--128 | Paired 16B reads from both ends | Per-pair constants, 128-bit multiplies, length mixed with 0x61C8864E7A143579 |
| 129--240 | Extended mixing | Delegates to sub_CBF370 |
| > 240 | Bulk processing | Delegates to sub_CBF100 |
Pseudocode (length 1--3, the most common case for short builtins)
fn wyhash_short(data: &[u8], len: usize) -> u32 {
let a = data[0] as u64;
let b = data[len / 2] as u64;
let c = data[len - 1] as u64;
let combined = a | (b << 8) | (c << 16) | (len as u64) << 24;
let mixed = combined ^ 0x87275A9B;
let wide = mixed.wrapping_mul(0xC2B2AE3D27D4EB4F);
let folded = wide ^ (wide >> 32);
let result = folded.wrapping_mul(0x165667B19E3779F9);
(result ^ (result >> 32)) as u32
}
Pseudocode (length 17--128, covering most __nvvm_* names)
fn wyhash_medium(data: &[u8], len: usize) -> u32 {
let pairs = [
(0x1CAD21F72C81017C, 0xBE4BA423396CFEB8), // pair 0
(0x1F67B3B7A4A44072, 0xDB979083E96DD4DE), // pair 1
(0x2172FFCC7DD05A82, 0x78E5C0CC4EE679CB), // pair 2
// ... additional pairs for 64/96/128 thresholds
];
let (mut v8, mut v10) = (0u64, 0u64);
// read 16 bytes from front, 16 from back, mix with pair constants
for i in 0..((len + 15) / 32) {
let front = read_u128(&data[i * 16..]);
let back = read_u128(&data[len - (i + 1) * 16..]);
(v8, v10) = mix_128(v8, v10, front, back, pairs[i]);
}
let combined = v8 ^ v10 ^ (len as u64 ^ 0x61C8864E7A143579);
let result = 0x165667919E3779F9u64.wrapping_mul(combined ^ (combined >> 37));
(result ^ (result >> 32)) as u32
}
The final return value is always a uint32 -- the high dword of the 64-bit result XORed with the low dword. Most NVVM builtin names are 8--35 bytes, hitting the fast 4--8, 9--16, and 17--128 paths.
Probing Strategy
All DenseMap instances use quadratic probing with triangular-number increments:
slot = hash & (capacity - 1) // initial probe
step = 1
loop:
if bucket[slot] == key -> found
if bucket[slot] == EMPTY -> not found (insert here)
if bucket[slot] == TOMBSTONE -> record for reuse
slot = (slot + step) & (capacity - 1)
step++
The probe sequence for initial position h visits:
h, h+1, h+3, h+6, h+10, h+15, h+21, ...
h + T(k) where T(k) = k*(k+1)/2 (triangular numbers)
This guarantees that for a power-of-2 table size n, all n slots are visited before any index repeats: the triangular numbers T(0) .. T(n-1) are pairwise distinct modulo n whenever n is a power of 2, so the first n probe positions form a permutation of the table.
Comparison Guard (Builtin Table)
The builtin name hash table (sub_C92740, sub_C92860) adds a triple comparison guard before performing the expensive memcmp:
- Cached hash equality: hash_cache[slot] == search_hash
- Length equality: entry->length == search_length
- Content equality: memcmp(search_data, entry->string_data, length) == 0
The hash cache is stored in a separate array immediately after the bucket array and the end-of-table sentinel. This layout avoids polluting bucket cache lines with hash values that are only needed on collision.
Probing Label: "Linear" vs "Quadratic"
Some analysis reports describe the probing as "linear" because the step variable increments by 1 each iteration. The actual probe position advances quadratically (by accumulating triangular numbers). Both descriptions refer to the same code. This page uses the technically precise term: quadratic probing with triangular numbers.
Growth Policy
Load Factor Threshold -- 75%
After every successful insertion, the map checks whether to grow:
if (4 * (NumItems + 1) >= 3 * NumBuckets)
// load factor > 75% -> double capacity
new_capacity = 2 * NumBuckets
Tombstone Compaction -- 12.5%
If the load factor is acceptable but tombstones have accumulated:
elif (NumBuckets - NumTombstones - NumItems <= NumBuckets >> 3)
// fewer than 12.5% of slots are truly empty
// rehash at same capacity to clear tombstones
new_capacity = NumBuckets
Rehash Procedure -- sub_C929D0
1. calloc(new_capacity + 1, bucket_stride) for the new array.
2. Write the end-of-table sentinel at position new_capacity.
3. For each live (non-empty, non-tombstone) entry in the old table, reinsert into the new table using quadratic probing.
4. Copy the cached hash (if the table has a hash cache).
5. Track the new position of a "current slot" pointer so the caller can continue using the entry it just inserted.
6. Free the old array.
7. Reset NumTombstones to 0.
8. Update NumBuckets to new_capacity.
9. Return the new position of the tracked slot.
Capacity Constraints
- Power of 2: always. Enforced by the bit-smearing pattern x |= x>>1; x |= x>>2; x |= x>>4; ...; x += 1.
- Minimum: 64 buckets for standard DenseMap instances. The builtin name table starts at 16 and grows through 16 -> 32 -> 64 -> 128 -> 256 -> 512 -> 1024 as its 770 entries are inserted.
- Allocation: sub_22077B0 (operator new[]), freed via j___libc_free_0.
Sentinel Values
Two sentinel families exist, distinguished by magnitude. Both are chosen to be impossible values for aligned pointers.
NVVM-Layer Sentinels (small magnitude)
Used by the NVVM IR uniquing tables, the SelectionDAG builder maps, and the builtin name table:
| Role | Value | Hex | Why safe |
|---|---|---|---|
| Empty | -8 | 0xFFFFFFFFFFFFFFF8 | Low 3 bits = 0b000 after masking, but no 8-byte-aligned pointer is this close to (uint64_t)-1 |
| Tombstone | -16 | 0xFFFFFFFFFFFFFFF0 | Same reasoning, distinct from -8 |
The builtin name table also uses a value of 2 as an end-of-table sentinel placed at bucket_array[capacity].
LLVM-Layer Sentinels (large magnitude)
Used by the majority of LLVM pass infrastructure -- SCEV, register coalescing, block placement, SLP vectorizer, StructurizeCFG, machine pipeliner, prolog-epilog, and others:
| Role | Value | Hex |
|---|---|---|
| Empty | -4096 | 0xFFFFFFFFFFFFF000 |
| Tombstone | -8192 | 0xFFFFFFFFFFFFE000 |
Integer-Key Sentinels
Used by DenseMap<unsigned, T> instances (instruction emitter, two-address pass):
| Role | Value | Hex |
|---|---|---|
| Empty | 0xFFFFFFFF | 32-bit all-ones |
| Tombstone | 0xFFFFFFFE | 32-bit all-ones minus 1 |
Which Sentinel Set to Expect
| Subsystem | Sentinel pair |
|---|---|
| NVVM IR uniquing, SelectionDAG builder | -8 / -16 |
| Builtin name table | -8 (tombstone), 0 (empty), 2 (end marker) |
| SCEV, block placement, SLP vectorizer | -4096 / -8192 |
| Register coalescing, machine pipeliner | -4096 / -8192 |
| StructurizeCFG, prolog-epilog | -4096 / -8192 |
| Instruction emitter, two-address | 0xFFFFFFFF / 0xFFFFFFFE |
| Coroutine spill tracking | -4096 / -8192 |
| CSSA PHI map | -4096 / -8192 |
| Debug verify | -4096 / -8192 |
| LazyCallGraph | -4096 / -8192 |
The -8/-16 pair appears exclusively in NVVM-layer (NVIDIA-original) code. The -4096/-8192 pair is the standard LLVM DenseMapInfo<void*> sentinel set. The difference is cosmetic -- both pairs are safe for the same reasons -- but it reveals code provenance: if you see -8/-16, the code was written or heavily modified by NVIDIA; if you see -4096/-8192, it is stock LLVM.
SmallVector Pattern
SmallVector is the universal dynamic array throughout cicc, with two growth implementations:
Layout
[BeginPtr | Size | Capacity | InlineData...]
| Offset | Size | Field |
|---|---|---|
| +0 | 8 | data_ptr (points to inline buffer initially, heap after growth) |
| +8 | 4 | size (live element count) |
| +12 | 4 | capacity (allocated slots) |
| +16 | N | Inline buffer (N = InlineCapacity * element_size) |
When size == capacity on insertion, the vector grows.
Growth Functions
| Function | Address | Description |
|---|---|---|
| SmallVector::grow | sub_C8D5F0 | Generic growth -- copies elements, used for non-POD types |
| SmallVectorBase::grow_pod | sub_C8D7D0 | POD-optimized growth -- uses realloc when buffer is heap-allocated |
| SmallVector::grow (MIR) | sub_16CD150 | Second copy in the MachineIR address range, identical logic |
| SmallVector::grow (extended) | sub_C8E1E0 | Larger variant (11KB), handles edge cases |
Growth Policy
The standard LLVM SmallVector growth: double the current capacity, or jump straight to the required capacity if doubling is not enough. If the current buffer is the inline buffer, malloc a new heap buffer and memcpy the contents. If the buffer is already on the heap, realloc it (for POD types) or malloc + copy + free (for non-POD types).
new_capacity = max(2 * old_capacity, required_capacity)
if (data_ptr == &inline_buffer)
heap_buf = malloc(new_capacity * elem_size)
memcpy(heap_buf, inline_buffer, size * elem_size)
else
// POD: heap_buf = realloc(data_ptr, new_capacity * elem_size)
// non-POD: heap_buf = malloc(...); copy; free(old)
data_ptr = heap_buf
capacity = new_capacity
Common Inline Capacities
Observed across the codebase:
| Inline capacity | Element size | Total inline bytes | Typical use |
|---|---|---|---|
| 2 | 8 | 16 | SCEV delinearization terms |
| 4 | 8 | 32 | LazyCallGraph SCC lists, basic block worklists |
| 8 | 8 | 64 | NVVMReflect call collection, PHI operand lists |
| 16 | 8 | 128 | AA evaluation pointer sets |
| 22 | 8 | 176 | Printf argument arrays (stack-allocated) |
| 8 | 56 | 448 | SROA slice descriptors |
Builtin Name Table -- Specialized Hash Table
The builtin name table at context+480 is a specialized variant that does not use the standard DenseMap layout. It stores string entries rather than pointers, includes a parallel hash cache, and uses the wyhash function instead of the pointer hash.
Table Structure (20 bytes)
| Offset | Size | Field |
|---|---|---|
| +0 | 8 | bucket_array_ptr |
| +8 | 4 | capacity (power of 2) |
| +12 | 4 | count (live entries) |
| +16 | 4 | tombstone_count |
Memory Layout
[0 .. 8*cap-1] bucket_array: cap QWORD pointers
[8*cap .. 8*cap+7] sentinel: value 2 (end-of-table)
[8*cap+8 .. 8*cap+8+4*cap-1] hash_cache: uint32 per slot
String Entry (heap-allocated via sub_C7D670)
| Offset | Size | Field |
|---|---|---|
| +0 | 8 | string_length |
| +8 | 4 | builtin_id (set after insertion) |
| +16 | N+1 | Null-terminated string data |
Total allocation: length + 17 bytes, 8-byte aligned. The string data offset (16) is stored at hashtable+20 for use during comparison.
See Builtins for the complete 770-entry builtin ID inventory.
Usage Across the Compiler
Subsystems Using DenseMap (pointer hash, -8/-16 sentinels)
- NVVM IR uniquing (sub_162D4F0): 8+ DenseMap instances in the NVVM context object, one per opcode range (0x04--0x1F). Tables at fixed qword-indexed offsets, spaced 32 bytes apart.
- SelectionDAG builder (sub_163D530): Three maps at context offsets +120, +152, +184. Maps A and B are 16-byte-stride (key-value); Set C is 8-byte-stride (keys only).
- Per-node analysis structures: Embedded DenseSet at +72 within analysis objects created during DAG construction.
- Memory space optimization (sub_1C6A6C0): DenseMap-style tables for address space tracking.
Subsystems Using DenseMap (pointer hash, -4096/-8192 sentinels)
- SCEV (sub_F03CD0 and family): Expression caching, range computation, back-edge taken count.
- Register coalescing (sub_1F2F8F0): Already-coalesced set, equivalence class map.
- Block placement (sub_2E3B720): Chain membership, tail-merge candidates.
- SLP vectorizer (sub_1ACCE50): AllOps and Scalars hash tables (32-byte entries).
- StructurizeCFG (sub_1B66CF0): Flow-block mapping, region membership.
- Machine pipeliner (sub_20C40D0): Schedule stage tracking.
- CSSA (sub_3720740): PHI-to-ID mapping.
- Debug/verify (sub_265D050): Instruction validation tables.
- LazyCallGraph (sub_D1A040): Edge membership, SCC identity.
Subsystems Using DenseMap (integer hash key * 37)
- Instruction emitter (sub_2E1F350): Opcode-to-constraint mapping. Sentinels: 0xFFFFFFFF / 0xFFFFFFFE.
- Two-address pass (sub_1F4BFE0): TiedOperandMap (56-byte entries, 4 inline). EqClassMap.
- Vector legalization (sub_3302A00): Type-split record mapping.
- SelectionDAG isel (sub_3090F90): Argument cost table.
Subsystems Using wyhash (string keys)
- Builtin name table (sub_90AEE0): 770 NVVM/CUDA builtin names. Uses the specialized 20-byte table header with hash cache.
- This is the only known use of sub_CBF760 in cicc.
Key Functions
| Function | Address | Size | Role |
|---|---|---|---|
| DenseMap pointer hash | inline | -- | (ptr >> 9) ^ (ptr >> 4) -- always inlined |
| DenseMap integer hash | inline | -- | key * 37 -- always inlined |
| wyhash v4 | sub_CBF760 | ~4 KB | String hash, length-dispatched |
| wyhash wrapper | sub_C92610 | tiny | Tail-calls sub_CBF760 |
| Builtin insert-or-find | sub_C92740 | ~2 KB | Quadratic probe with hash cache |
| Builtin find-only | sub_C92860 | ~1 KB | Read-only variant of sub_C92740 |
| Builtin rehash | sub_C929D0 | ~1 KB | 75% load factor, tombstone compaction |
| Builtin table init | sub_C92620 | tiny | Creates 16-bucket initial table |
| SmallVector::grow | sub_C8D5F0 | ~2 KB | Generic element growth |
| SmallVectorBase::grow_pod | sub_C8D7D0 | ~5 KB | POD-optimized realloc growth |
| SmallVector::grow (MIR) | sub_16CD150 | ~2 KB | Duplicate in MachineIR range |
| SmallPtrSet::insertOrFind | sub_C9A3C0 | ~16 KB | Small pointer set with growth |
| DenseMap grow (LLVM passes) | varies per pass | -- | Each pass has its own inlined or outlined rehash |
Cross-References
- Builtins -- Hash Table and ID Inventory -- complete 770-entry builtin table with wyhash usage
- DenseMap and Symbol Table Structures -- original page (now a subset of this one, kept for EDG node layout)
- NVVM IR Node -- NVVM context object with DenseMap uniquing tables
- CSSA -- PHI hash map with -4096/-8192 sentinels
- Register Coalescing -- integer-key and pointer-key hash map variants
- SLP Vectorizer -- 32-byte-entry DenseMap with -4096/-8192 sentinels
- SCEV -- SCEV expression caching with -4096/-8192 sentinels
- Instruction Emitter -- integer-key hash with key * 37
CoroSplit & CoroFrame: Coroutine Lowering on GPU
cicc v13.0 carries the complete LLVM coroutine lowering pipeline -- CoroEarly, CoroSplit, CoroElide, CoroAnnotationElide, and CoroCleanup -- largely unchanged from upstream LLVM 19. The pass infrastructure processes C++20 co_await/co_yield/co_return coroutines emitted by the EDG 6.6 frontend, splitting a single coroutine function into separate resume, destroy, and cleanup functions while computing a coroutine frame struct to carry live state across suspend points. NVIDIA adds one proprietary intrinsic (llvm.nvvm.coro.create.suspend) and emits a .pragma "coroutine" annotation in PTX, but the core splitting and frame layout algorithms are stock LLVM. The practical constraint is that coroutine frame allocation on GPU defaults to malloc in device heap -- extremely expensive on current architectures -- making CoroElide (which replaces heap allocation with a caller-stack alloca) the pass that determines whether GPU coroutines are viable or pathological.
Key Facts
| Property | Value |
|---|---|
| CoroSplit pass entry | sub_24EF980 (71 KB, address range 0x24EF980--0x24F2300) |
| CoroFrame layout computation | sub_24F6730 (11,249 bytes, stack frame 5,624 bytes) |
| Core frame layout workhorse | sub_24F5860 (called from CoroFrame) |
| createResumeFunction | sub_2284030 |
| createDestroyFunction | sub_2284040 |
| CoroEarly pass | sub_24DCD10 (41 KB) |
| CoroElide pass | sub_24DF350 (80 KB) |
| CoroAnnotationElide pass | sub_24E2340 (33 KB) |
| CoroSplit Cloner/Driver | sub_25CA370 (55 KB) |
| CoroFrame Materializer | sub_25C5C80 (49 KB, heap-to-stack frame layout) |
| CoroFrame Spill Analysis | sub_25C1030 (37 KB) |
| Pass name / debug type | "CoroSplit" / "coro-split" (at 0x4388A37 / 0x4387AC3) |
| Coroutine metadata table | unk_4F8FAE8 |
| Pipeline parser ID | #156 (CGSCC pass, param: reuse-storage) |
| CoroElide pipeline ID | #220 (Function pass) |
| CoroAnnotationElide pipeline ID | #155 (CGSCC pass) |
| CoroEarly pipeline ID | #29 (Module pass) |
| CoroCleanup pipeline ID | #28 (Module pass) |
| NVIDIA intrinsic | llvm.nvvm.coro.create.suspend (single constant integer argument) |
| PTX annotation | .pragma "coroutine"; |
The Coroutine Lowering Pipeline
Five passes run in a fixed sequence across the optimizer pipeline. The first and last are module-level bookends; the middle three do the real work inside the CGSCC (Call Graph SCC) pipeline where inlining decisions interact with coroutine splitting.
CoroEarly (module) Lowers coroutine setup intrinsics.
Materializes the NoopCoro.Frame global.
Replaces llvm.coro.resume, llvm.coro.destroy,
llvm.coro.promise, llvm.coro.free with
concrete operations on the frame pointer.
|
v
CoroSplit (CGSCC) Identifies coroutine functions by scanning for
llvm.coro.suspend / llvm.coro.end intrinsics.
Invokes CoroFrame to compute the frame layout.
Clones the function into resume + destroy variants.
Builds the state machine dispatch switch.
|
v
CoroAnnotationElide (CGSCC) Annotation-driven elision: when the callee is
marked "elide_safe_attr" and the call site has
".noalloc", converts heap alloc to alloca in the
caller's frame. New in LLVM 19 / cicc v13.0.
|
v
CoroElide (function) Classic elision: proves the coroutine frame
lifetime is bounded by the caller, replaces
coro.alloc with alloca. Emits optimization
remarks "'<name>' elided in '<caller>'" or
"'<name>' not elided in '<caller>'".
|
v
CoroCleanup (module) Removes remaining coroutine intrinsic stubs
that survived lowering (e.g., coro.subfn.addr).
Final cleanup pass -- no coroutine intrinsics
survive past this point.
The coro-cond module analysis (registered in the pipeline parser at sub_2337E30) gates whether the coroutine passes activate at all. If no function in the module contains llvm.coro.id, the entire pipeline is skipped. This zero-cost guard is important because the vast majority of CUDA kernels contain no coroutines.
CoroSplit as a CGSCC Pass
CoroSplit is registered as CGSCC pass #156 with an optional reuse-storage parameter. When reuse-storage is active, the pass attempts to reuse the storage of coroutine frames that are provably dead -- relevant for generators where the frame is allocated once and resumed many times. In the CGSCC context, CoroSplit runs alongside the inliner (inline) and function-attrs, allowing newly split resume/destroy functions to be immediately considered for inlining into callers within the same SCC.
CoroSplit: Suspend Point Detection and Function Splitting
Detection Phase
sub_24EF980 iterates over every function in the module. For each function, it scans all instructions using a bitmask-based opcode test to identify coroutine suspension intrinsics:
// Suspend point detection (at 0x24F00E6)
// Stack frame: 0x860+ bytes, callee-saved: r15, r14, r13, r12, rbx
// Key locals:
// [rbp-0x7F8] = outer iteration end pointer
// [rbp-0x7E8] = current coroutine info
// [rbp-0x7E0] = suspend point instruction
// [rbp-0x740] = original coroutine function
// [rbp-0x750] = resume function pointer
// [rbp-0x748] = destroy function pointer
uint8_t opcode = inst->getOpcode();
unsigned normalized = opcode - 0x22;
if (normalized > 51) continue; // not in range [0x22, 0x55]
uint64_t mask = 0x8000000000041ULL;
if (!((mask >> normalized) & 1)) continue; // bit not set
The bitmask 0x8000000000041 encodes three intrinsic opcodes:
| Bit position | Opcode | Intrinsic |
|---|---|---|
| 0 | 0x22 | llvm.coro.suspend -- normal suspend point |
| 6 | 0x28 | llvm.coro.suspend.retcon -- returned-continuation suspend |
| 51 | 0x55 | llvm.coro.end -- coroutine termination |
This single 64-bit bt (bit-test) instruction replaces what would otherwise be a three-way comparison or switch -- the same range-check-plus-bitmask pattern upstream LLVM uses for Intrinsic::ID membership tests.
Validation
After finding a suspend point, CoroSplit validates the coroutine structure (at 0x24F010E):
// Coroutine validation pseudocode (0x24F010E-0x24F0179)
Value *coro_id_inst = ...;
if (coro_id_inst->getOpcode() != 0x55) // must be 'U' = coro.id
goto skip;
Function *parent = coro_id_inst->getParent(); // [rax-20h]
if (!parent || parent->getOpcode() != 0) // entry block check
goto skip;
Value *promise = coro_id_inst->getOperand(4); // [rcx+50h]
if (parent->getContext() != promise) // [rax+18h] == promise
goto skip;
if (!(parent->getFlags() & 0x20)) // "has personality" bit 5 of +0x21
goto skip;
if (parent->getIntrinsicID() != 59) // 0x3B = coro.id
goto skip;
This is a thorough validation ensuring:
- The instruction is indeed llvm.coro.id (opcode 0x55 = 'U', intrinsic ID 59 = 0x3B)
- It belongs to a valid function (parent pointer non-null, starts with opcode 0)
- The promise alloca matches between coro.id and function context
- The function has the correct personality (bit 5 of byte at offset +0x21)
- The intrinsic ID equals 59 (cmp dword [rax+24h], 0x3B)
Nested coroutines receive additional validation (at 0x24F017F): the pass checks that coro.begin (opcode range 0x1E--0x28, ID 57 = 0x39) references the correct parent function, preventing cross-coroutine confusion when one coroutine is nested inside another.
// Nested coroutine check (0x24F017F-0x24F01D6)
unsigned operand_count = inst->getNumOperands() & 0x7FFFFFF; // mask out type bits
Value *parent_ref = inst->getOperand(-operand_count); // computed offset
if (parent_ref != current_function)
goto skip; // different coroutine -- do not cross wires
uint8_t begin_opcode = begin_inst->getOpcode();
if (begin_opcode - 0x1E > 0x0A) // must be in [0x1E, 0x28]
goto skip; // not a coro.begin-related instruction
Value *frame_ptr = begin_inst->getOperand(2); // [rdx+28h]
Suspend Point Collection
Validated suspend points are collected into a deduplicated array. The dedup check at 0x24F02F9 scans existing entries, following def-use chains ([rbx+10h]) to avoid processing the same suspend point twice when multiple CFG paths reach it. For each suspend point, the pass extracts the value operand at instruction offset +0x28.
// Suspend point collection with dedup (0x24F02F9-0x24F040A)
unsigned count = suspend_array_size;
for (unsigned i = 0; i < count; i++) {
if (suspend_array[i] == new_suspend)
goto already_collected; // follow chain: [rbx+10h]
}
// Extract value operand:
Value *value_operand = suspend_inst->getOperand(2); // [rdx+28h]
suspend_array[count++] = new_suspend;
The Split Algorithm
After collecting all suspend points, the split proceeds in three phases:
Phase 1: Frame layout computation. CoroSplit invokes sub_24F6730 (CoroFrame) to determine which SSA values are live across suspend points and must be stored in the frame struct (see the CoroFrame section below).
Phase 2: Function cloning and specialization. The split mode field at [rbp-0x3F8] controls which function variants are created:
// Function splitting dispatch (at 0x24F0540)
int split_mode = frame_state->split_mode; // [rbp-0x3F8]
if (split_mode == 0) {
// Returned-continuation style: destroy function only
Function *destroy = createDestroyFunction(state, orig_fn, suspends, ...);
} else if (split_mode >= 1 && split_mode <= 3) {
// Standard C++20 coroutine: both resume and destroy
Function *resume = sub_2284030(state, orig_fn, suspends, coro_info,
destroy_data, resume_data);
Function *destroy = sub_2284040(state, orig_fn, suspends, coro_info,
destroy_data, resume_data);
}
sub_2284030 (createResumeFunction) and sub_2284040 (createDestroyFunction) each:
- Clone the original coroutine function via sub_D2E510 (function cloner)
- Replace the coroutine frame parameter with a typed pointer to the frame struct
- Insert a switch statement at the entry block dispatching on the suspend index stored in the frame (__coro_index)
- Replace each llvm.coro.suspend with a return instruction
- Wire function pointers (__resume_fn, __destroy_fn) into the frame header at offsets +0x00 and +0x08
Phase 3: Metadata and remark emission. After splitting, the pass registers the new functions in the coroutine metadata table at unk_4F8FAE8 via sub_BC1CD0, then emits an optimization remark:
// Remark emission (0x24F05D1-0x24F06E8)
sub_B17560(remark, "CoroSplit", "coro-split"); // create remark
sub_B18290(remark, "Split '"); // prefix
sub_BD5D20(orig_fn, name_buf); // get function name
sub_B16430(remark, "function", name_buf); // named attribute
sub_B18290(remark, "' (frame_size=");
sub_B16B10(remark, "frame_size", frame_size); // integer attribute
sub_B18290(remark, ", align=");
unsigned align = 1u << alignment_log2;
sub_B16B10(remark, "align", align);
sub_B18290(remark, ")");
sub_1049740(remark); // publish to diagnostic handler
The format is: Split '<function_name>' (frame_size=N, align=M) where N is the computed frame size in bytes and M is 1 << alignment_log2.
The .corodispatch Trampoline
The CoroSplit dispatcher at sub_3160A60 (48 KB, second code cluster) generates a .corodispatch function -- a lightweight trampoline that:
- Loads __coro_index from the coroutine frame at offset +0x10
- Switches on the index value to select the correct resume point
- Uses musttail call semantics to jump to the target without growing the stack
The string "MustTailCall.Before.CoroEnd" confirms it enforces musttail on the final resume-to-end transition. Additional strings in this function include ".from." (used to construct the dispatch label name), "CoroEnd", "CoroSave", and "CoroSuspend" (marking the IR structures being dispatched through).
For GPU targets, the musttail semantics are critical: stack space is per-thread local memory, and growing it across coroutine bounces would rapidly exhaust the limited local memory budget.
CoroFrame: Frame Layout Computation
sub_24F6730 is the largest and most complex function in the coroutine pipeline, with a 5,624-byte stack frame (0x15F8) -- one of the largest in the entire cicc binary. Its job: determine which SSA values are live across suspend points and must be "spilled" into the coroutine frame struct.
Algorithm Overview
The algorithm is a BFS-based cross-suspend-point liveness analysis:
1. Initialize tracking structures. Two hash tables with 16-byte entries, sentinel 0xFFFFFFFFFFFFF000, hash function (val >> 4) ^ (val >> 9). Initial capacity 8 entries each.
2. Iterate all instructions. Walk every basic block and instruction. A visitor callback ([visitor+18h], virtual call) classifies each instruction as relevant or not to the frame computation.
3. BFS traversal. A deque with 512-byte blocks (64 pointer-sized entries per block) drives BFS over the CFG. The core computation at sub_24F5860 determines which values cross which suspend points.
4. Spill set computation. Values that are defined before a suspend point and used after it must be stored in the frame. The result is a set of (value, suspend_point) pairs.
5. Frame layout. The frame type builder (at sub_3169200 in the second code cluster) arranges spill slots into a struct.
Frame Struct Layout
The coroutine frame is a flat C struct with a fixed header followed by computed spill slots:
struct __coro_frame { // type name: ".coro_frame_ty"
void (*__resume_fn)(struct __coro_frame *); // +0x00 resume function pointer
void (*__destroy_fn)(struct __coro_frame *); // +0x08 destroy function pointer
uint32_t __coro_index; // +0x10 suspend point state variable
// --- header ends, spill slots begin ---
// padding for alignment (computed per-coroutine)
// spill slots ordered by descending alignment requirement
// promise storage (if promise_type is non-trivial)
// alloca copies (stack variables that survive suspend)
};
The frame variable is named "__coro_frame" and the type is ".coro_frame_ty". The suspend point index field "__coro_index" is the state variable for the resume switch dispatch: value 0 means "initial entry", value N means "resumed at suspend point N", and a poison/unreachable value means "coroutine has returned".
The frame type builder at sub_3169200 (46 KB) constructs the StructType using these rules:
- The two function pointers (__resume_fn, __destroy_fn) always occupy the first 16 bytes
- __coro_index occupies bytes 16--19 (i32)
- Remaining spill slots are sorted by alignment (largest first) to minimize padding
- The promise alloca (if present) is placed at a known offset so llvm.coro.promise can compute it
- Total frame size and alignment are recorded for the split remark
Spill/Reload Code Generation
The spill/reload generator at sub_31650D0 (47 KB) creates the actual load/store instructions that move values between SSA registers and the coroutine frame:
- A basic block named "AllocaSpillBB" is inserted at the function entry. All alloca instructions that need to survive across suspend points are moved here and replaced with GEP+store into the frame.
- A basic block named "PostSpill" follows, branching to the original entry logic.
- At each suspend point, ".spill.addr" store instructions write live SSA values into their frame slots.
- After each resume point, ".reload" load instructions fetch values back from frame slots into fresh SSA values.
The naming convention (.spill.addr, .reload) is important for debugging: these instructions appear in -print-after-all dumps and identify coroutine frame traffic distinctly from normal loads/stores.
Detailed BFS Liveness Algorithm
// Pseudocode for sub_24F5860 core frame computation
void computeFrameLayout(Function *F, SmallVector<SuspendPoint> &suspends) {
// Step 1: Build definition map
  DenseMap<Value*, uint32_t> def_map;   // sentinel 0xFFFFFFFFFFFFF000
  DenseMap<Value*, uint32_t> cross_map; // sentinel 0xFFFFFFFFFFFFF000
// Step 2: Walk all basic blocks, identify definitions
for (BasicBlock &BB : *F) {
for (Instruction &I : BB) {
if (visitor->isRelevant(&I)) // virtual call [visitor+18h]
def_map.insert(&I, generation++);
}
}
// Step 3: For each suspend point, BFS forward to find uses
Deque<BasicBlock*> worklist; // 512-byte blocks, 64 entries each
for (SuspendPoint &SP : suspends) {
worklist.clear();
worklist.push_back(SP.getParent());
while (!worklist.empty()) {
BasicBlock *BB = worklist.pop_front();
for (Instruction &I : *BB) {
for (Value *Op : I.operands()) {
if (def_map.count(Op) && def_before_suspend(Op, SP)) {
// This value is defined before SP and used after it
cross_map.insert({Op, SP.getIndex()});
spill_set.add(Op);
}
}
}
for (BasicBlock *Succ : successors(BB))
worklist.push_back(Succ);
}
}
// Step 4: Build frame struct from spill set
// Sort spill slots by alignment (descending) then by size
// Compute offsets, padding, total frame size
}
The liveness phase is O(instructions × suspend_points) per coroutine; each BFS is O(V + E), where V is the number of basic blocks and E the number of CFG edges.
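The Step 4 slot-packing can be simulated in a few lines. This is a hedged Python sketch, not cicc's code: the 16-byte header modeling the resume/destroy function pointers that lead a switch-lowered coroutine frame is an assumption about the prologue, and the sort key follows the "alignment descending, then size" rule stated in the pseudocode.

```python
def align_up(n, a):
    """Round n up to the next multiple of a (a must be a power of two)."""
    return (n + a - 1) & ~(a - 1)

def layout_frame(slots, header_size=16, header_align=8):
    """slots: list of (name, size, align) spill entries.
    Returns (offsets, total_size, frame_align). The header models the
    resume/destroy function pointers assumed to lead the frame."""
    # Sort by alignment (descending), then by size (descending)
    ordered = sorted(slots, key=lambda s: (-s[2], -s[1]))
    offsets = {}
    off = header_size
    max_align = header_align
    for name, size, align in ordered:
        off = align_up(off, align)     # insert padding as needed
        offsets[name] = off
        off += size
        max_align = max(max_align, align)
    return offsets, align_up(off, max_align), max_align
```

Packing the highest-alignment slots first minimizes padding, which is why the frame computation sorts before assigning offsets.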
Data Structures
Frame info (0x138 = 312 bytes, allocated via sub_22077B0):
| Offset | Size | Description |
|---|---|---|
| +0x00 | 8 | Spill array pointer |
| +0x08 | 8 | Reserved (initially 0) |
| +0x10 | 8 | Reference count (initially 1) |
| +0x18--+0x98 | 128 | Embedded hash table for spill tracking (16-byte stride, sentinel-filled) |
| +0x98 | 8 | Pointer to inner table (self-referential) |
| +0xA0 | 8 | Capacity encoding (0x800000000) |
| +0x128 | 8 | Back-reference to visitor context |
| +0x130 | 8 | Back-reference to suspend point array |
Spill entry (0x48 = 72 bytes):
| Offset | Size | Description |
|---|---|---|
| +0x00 | 8 | Coroutine function pointer |
| +0x08 | 8 | Buffer pointer (inline or heap) |
| +0x10 | 8 | Capacity encoding (6 entries inline) |
| +0x18--+0x48 | 48 | Inline buffer for small spill sets |
The inline buffer holds up to 6 spill entries without heap allocation. When exceeded, the buffer externalizes to the heap; cleanup at 0x24F6CB0 checks [entry+0x08] against [entry+0x18] to determine whether free() is needed.
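The externalization behavior is a classic small-buffer optimization. A minimal Python sketch of the semantics (the real structure stores raw pointers; `needs_free()` mirrors the pointer comparison at 0x24F6CB0, and the class name is invented for illustration):

```python
class SpillSet:
    """Models the 0x48-byte spill entry: up to 6 entries live in the
    inline buffer; a 7th forces externalization to heap storage."""
    INLINE_CAPACITY = 6

    def __init__(self):
        self._inline = []   # models the 48-byte inline buffer at +0x18
        self._heap = None   # models a heap-allocated buffer

    def add(self, value):
        if self._heap is not None:
            self._heap.append(value)
        elif len(self._inline) < self.INLINE_CAPACITY:
            self._inline.append(value)
        else:
            # Externalize: copy inline entries out, buffer pointer now
            # diverges from the inline storage address
            self._heap = self._inline + [value]
            self._inline = []

    def needs_free(self):
        # Analogous to: [entry+0x08] != [entry+0x18]
        return self._heap is not None

    def __len__(self):
        return len(self._heap) if self._heap is not None else len(self._inline)
```

The design avoids a heap allocation for the common case of small spill sets, which matters because frame computation runs once per suspend point.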
BFS deque:
| Parameter | Value |
|---|---|
| Block map allocation | 0x40 bytes (8 pointers) |
| Data block allocation | 0x200 bytes (512 bytes, 64 pointer entries) |
| Block pointers | [rbp-0x340]=front, [rbp-0x338]=count(8), [rbp-0x330]=begin |
Hash Table Policy
Both hash tables in CoroFrame share identical parameters (see hash-infrastructure.md for the universal pattern):
- Hash function: (val >> 4) ^ (val >> 9) -- the same hash used throughout cicc
- Entry size: 16 bytes (8-byte key + 8-byte metadata)
- Empty sentinel: 0xFFFFFFFFF000
- Load factor threshold: 75% (triggers growth when count * 4 >= capacity * 3)
- Tombstone cleanup: 12.5% (rehash when tombstones > capacity >> 3)
- Growth factor: 2x (capacity doubles on each growth)
- Collision resolution: linear probing
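The policy above is compact enough to model end to end. This is a self-contained Python sketch, not the binary's implementation: Python sentinels stand in for 0xFFFFFFFFF000, and keys are plain integers rather than 16-byte key/metadata pairs.

```python
EMPTY, TOMBSTONE = object(), object()

class ProbeTable:
    """Open-addressing table with the documented policy: (v>>4)^(v>>9) hash,
    linear probing, 2x growth at 75% load, rehash at 12.5% tombstones."""

    def __init__(self, capacity=16):
        self.cap = capacity                 # power of two
        self.slots = [EMPTY] * capacity
        self.count = 0
        self.tombstones = 0

    @staticmethod
    def _hash(val):
        return (val >> 4) ^ (val >> 9)

    def _find(self, key):
        """Linear probe; returns (slot_index, found)."""
        i = self._hash(key) & (self.cap - 1)
        first_dead = None
        while True:
            s = self.slots[i]
            if s is EMPTY:
                return (first_dead if first_dead is not None else i), False
            if s is TOMBSTONE:
                if first_dead is None:
                    first_dead = i
            elif s == key:
                return i, True
            i = (i + 1) & (self.cap - 1)

    def insert(self, key):
        idx, found = self._find(key)
        if found:
            return
        if self.slots[idx] is TOMBSTONE:
            self.tombstones -= 1
        self.slots[idx] = key
        self.count += 1
        if self.count * 4 >= self.cap * 3:      # 75% load -> double
            self._rehash(self.cap * 2)
        elif self.tombstones > (self.cap >> 3):  # 12.5% tombstones
            self._rehash(self.cap)

    def remove(self, key):
        idx, found = self._find(key)
        if found:
            self.slots[idx] = TOMBSTONE
            self.count -= 1
            self.tombstones += 1

    def _rehash(self, new_cap):
        live = [s for s in self.slots if s is not EMPTY and s is not TOMBSTONE]
        self.cap, self.slots = new_cap, [EMPTY] * new_cap
        self.count = self.tombstones = 0
        for k in live:
            self.insert(k)

    def __contains__(self, key):
        return self._find(key)[1]
```

Note that the right-shift hash clusters keys that share high bits into adjacent buckets; linear probing tolerates this at a 75% cap because probe chains stay short.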
GPU-Specific Constraints: The Heap Allocation Problem
Why Device Malloc Is Pathological
Standard LLVM coroutines allocate the frame on the heap via operator new (with get_return_object_on_allocation_failure as the fallback when a nothrow allocation returns null). On GPU, this calls into device-side malloc, which has severe limitations:
Fixed-size heap. The device heap is controlled by cudaLimitMallocHeapSize (default 8 MB across the entire GPU). A kernel launching 65,536 threads, each with a 256-byte coroutine frame, requires 16 MB of heap -- already exceeding the default. Increasing the limit helps, but the heap must be pre-allocated before kernel launch, wasting memory for non-coroutine workloads.
Serialized allocation. Device malloc implementation uses a global free list protected by atomics. Within a warp, threads attempting simultaneous allocation serialize on this atomic. Across warps on the same SM, L2 cache line bouncing on the free-list head pointer creates further contention. Under heavy allocation pressure (hundreds of concurrent warps), the effective throughput of device malloc can drop to single-digit allocations per microsecond -- three orders of magnitude slower than a register read.
Fragmentation under concurrency. Thousands of threads allocating and freeing small frames (64--512 bytes) rapidly fragment the device heap. The device allocator does not perform compaction. Once fragmented, even a heap with sufficient total free space may fail individual allocations, causing malloc to return nullptr and triggering coroutine allocation failure paths (if the user provided get_return_object_on_allocation_failure) or program termination.
Memory latency hierarchy. The cost difference between frame locations is dramatic:
| Location | Latency | Bandwidth per SM | Notes |
|---|---|---|---|
| Registers | 0 cycles | N/A (direct) | Best case -- values that don't cross suspends |
| Local memory (L1 hit) | ~28 cycles | ~12 TB/s | Stack alloca destination after CoroElide |
| Local memory (L1 miss, L2 hit) | ~200 cycles | ~3 TB/s | Large frames that spill L1 |
| Global memory (device heap) | ~400-800 cycles | ~1 TB/s | Default without CoroElide |
| Device malloc overhead | ~2000+ cycles | N/A | Free-list atomic contention |
The combined overhead of malloc latency + global memory access latency makes un-elided coroutines 50--100x slower than elided ones on GPU. This is the fundamental reason CoroElide is the most performance-critical coroutine optimization for GPU targets.
CoroElide: The GPU Escape Analysis
sub_24DF350 (80 KB -- the largest coroutine pass) implements the classic heap allocation elision. It runs as a function-level pass (#220 in the pipeline parser), meaning it analyzes each caller individually after CoroSplit has already split the coroutine.
Elision Preconditions
For each llvm.coro.id call site in the caller, CoroElide attempts to prove that:
- No handle escape. The coroutine handle (pointer to __coro_frame) does not escape the caller's scope. Specifically, the handle is not stored to memory visible to other threads, not passed to functions that might store it, and not returned from the caller. On GPU, the "visible to other threads" criterion is complicated by shared memory (addrspace(3)) and generic address space (addrspace(0)) casts -- a handle stored through a generic pointer could be visible to any thread.
- No external aliases. No alias of the handle is created that could outlive the caller. This includes GEPs into the frame, bitcasts, and pointer arithmetic. The alias analysis at this stage uses the results from the function-level AA pipeline.
- Full consumption. All suspend/resume/destroy calls on this coroutine handle are within the caller function. If the handle is passed to a helper function that calls coroutine_handle::resume(), the coroutine is not fully consumed from CoroElide's perspective (unless that helper was inlined first by the CGSCC inliner running in the same SCC iteration).
- Callee identity known. The coroutine callee must be identifiable (not an indirect call through a function pointer). CoroElide needs to read the callee's frame size and alignment from the split remark metadata to size the alloca correctly.
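The four preconditions can be phrased as a predicate over the handle's uses. This toy Python model is purely illustrative: the use-kind tags are invented names, whereas cicc's escape analysis walks actual IR use chains.

```python
# Invented use-kind tags for illustration; cicc analyzes IR uses directly.
ESCAPING = {
    "store_to_memory",            # handle written somewhere visible
    "pass_to_unknown_fn",         # callee might stash the handle
    "return_from_caller",         # handle outlives the caller
    "cast_to_generic_addrspace",  # GPU-specific: could alias shared/global
    "store_to_shared_memory",     # GPU-specific: visible to the whole CTA
}

def can_elide(uses, callee_known=True):
    """uses: list of (kind, detail) pairs describing every use of the
    coroutine handle in the caller. True only if all preconditions hold."""
    if not callee_known:          # precondition 4: need frame size/align
        return False
    destroyed_locally = False
    for kind, _detail in uses:
        if kind in ESCAPING:      # preconditions 1-2: no escape, no alias
            return False
        if kind == "destroy":
            destroyed_locally = True
    return destroyed_locally      # precondition 3: fully consumed here
```

The model makes the GPU-specific failure visible: a single generic-address-space cast flips the result to False even when every resume/destroy is local.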
The Elision Transformation
When all preconditions are satisfied, CoroElide performs this rewrite:
```llvm
; BEFORE elision (caller code):
%id = call token @llvm.coro.id(i32 0, ptr null, ptr null, ptr null)
%need = call i1 @llvm.coro.alloc(token %id)
br i1 %need, label %alloc, label %begin

alloc:
  %mem = call ptr @operator_new(i64 FRAME_SIZE)   ; <-- heap allocation
  br label %begin

begin:
  %phi = phi ptr [ %mem, %alloc ], [ null, %entry ]
  %hdl = call ptr @llvm.coro.begin(token %id, ptr %phi)
  ; ... use coroutine ...
  call void @llvm.coro.resume(ptr %hdl)
  call void @llvm.coro.destroy(ptr %hdl)
```

```llvm
; AFTER elision:
%frame = alloca [FRAME_SIZE x i8], align FRAME_ALIGN   ; <-- stack allocation
%hdl = call ptr @llvm.coro.begin(token %id, ptr %frame)
; ... use coroutine ...
call void @llvm.coro.resume(ptr %hdl)
; destroy is elided (frame on stack, automatically freed)
```
The key changes:
- llvm.coro.alloc is replaced with false (allocation not needed)
- The operator new call is deleted
- An alloca of the correct size and alignment is inserted in the caller's entry block
- The coro.begin now points at the stack alloca
- llvm.coro.free is replaced with a no-op (stack memory does not need explicit deallocation)
- The destroy function call may be simplified since stack deallocation is automatic
On NVPTX, the alloca maps to per-thread local memory (address space 5). Local memory accesses go through the L1 cache and are dramatically faster than device malloc followed by global memory access.
Elision Failure Modes on GPU
Several GPU-specific patterns defeat CoroElide:
- Generic address space cast. If the coroutine handle is cast to addrspace(0) (generic), the compiler cannot prove it stays in local memory. Generic pointers are indistinguishable from shared or global pointers at the IR level, so the escape analysis conservatively assumes the handle escapes.
- Coroutine handle in shared memory. Storing the handle to addrspace(3) (shared memory) makes it visible to all threads in the CTA. Even if the programmer knows only one thread uses it, CoroElide cannot prove this.
- Cross-function resume. A common pattern where the coroutine is created in one device function and resumed in another (e.g., a scheduler loop calling resume on handles from a queue). The handle passed as a function argument escapes the creator.
- Opaque allocator. If the coroutine uses a custom allocator (via promise_type::operator new), CoroElide may not recognize the allocation/deallocation pattern.
Diagnostic Output
CoroElide emits remarks through the standard optimization remark infrastructure:
- Success: "'<coroutine_name>' elided in '<caller_name>'" (via -Rpass=coro-elide)
- Failure: "'<coroutine_name>' not elided in '<caller_name>'" (via -Rpass-missed=coro-elide)
For GPU developers, the failure remark is the most important diagnostic. An un-elided coroutine on GPU is a performance disaster. The recommended debugging workflow:
nvcc -Xptxas -v --compiler-options="-Rpass-missed=coro-elide" foo.cu
CoroAnnotationElide: Developer-Asserted Elision
sub_24E2340 (33 KB) is the newer annotation-driven elision from LLVM 19. It looks for the "elide_safe_attr" function attribute and ".noalloc" suffix on coroutine function names. When both are present, elision proceeds without the full escape analysis -- the developer has asserted safety.
This is particularly useful for GPU code where the developer knows the coroutine is single-thread-scoped but the compiler cannot prove it due to pointer-to-generic-address-space casts. The "caller_presplit" attribute marks the caller as needing coroutine lowering, enabling the annotation elide pass to fire during the CGSCC iteration before the caller itself is split.
CoroAnnotationElide runs as CGSCC pass #155, meaning it fires before CoroSplit (#156) in the same CGSCC iteration. This ordering allows the annotation-based elision to rewrite allocation sites before CoroSplit performs the split, avoiding the need for a second pass.
The llvm.nvvm.coro.create.suspend Intrinsic
This is the sole NVIDIA-proprietary coroutine intrinsic. The NVVM verifier enforces:
"llvm.nvvm.coro.create.suspend must have exactly one argument, which must be a constant integer"
The constant integer argument likely encodes a suspend-point identifier or mode. This intrinsic appears in the NVVM intrinsic table alongside llvm.nvvm.stacksave and llvm.nvvm.stackrestore, suggesting it interacts with the local memory stack for frame placement. Its exact lowering is handled by the NVVM-specific intrinsic lowering pass rather than the standard CoroSplit pipeline.
PTX .pragma "coroutine"
The AsmPrinter (documented in asmprinter.md) optionally emits .pragma "coroutine"; in the function header. This is triggered by metadata nodes with type byte 'N' (0x4E) linked to the current function via the list at this+792. The pragma is the first thing emitted in the function prologue (step (a) in the PTX header emission sequence at sub_215A3C0), before even the .entry/.func keyword.
The pragma signals to ptxas that the function uses coroutine semantics, potentially affecting register allocation and scheduling decisions in the assembler. The exact ptxas behavior triggered by this pragma is not documented publicly, but it likely increases the local memory budget and adjusts the register allocation heuristics for the state-machine dispatch pattern.
Warp Divergence at Suspend Points
A fundamental tension exists between SIMT execution and coroutine suspend. When one thread in a warp suspends while others do not, the warp diverges. The resume dispatch switch (the __coro_index-based state machine) creates a divergence point: threads may be at different suspend indices, requiring the hardware to serialize execution paths. This is identical to how any data-dependent branch causes divergence, but the impact is amplified because coroutine state machines typically have many switch cases (one per suspend point).
The StructurizeCFG pass (see structurizecfg.md) runs after coroutine lowering and will structurize the resume switch, potentially introducing additional control flow to manage reconvergence. On SM 70+ architectures with Independent Thread Scheduling, diverged threads can reconverge at any point, but the switch still introduces warp-level serialization proportional to the number of distinct __coro_index values active within the warp.
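The "serialization proportional to distinct index values" claim lends itself to a toy cost model. This sketch assumes exactly one serialized pass through the switch per distinct __coro_index active in the warp, which is a simplifying assumption, not measured hardware behavior:

```python
def resume_switch_passes(coro_indices):
    """coro_indices: the __coro_index value of each active thread in a warp.
    Under the one-pass-per-distinct-index assumption, the warp serializes
    into this many execution passes through the resume dispatch switch."""
    return len(set(coro_indices))
```

A fully converged warp (all 32 threads at the same suspend point) pays one pass; a warp whose threads are spread across 32 different suspend points pays the worst case of 32.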
The Second Code Cluster (0x3150000 Region)
The binary contains a second, independent cluster of coroutine functions, likely from a different compilation unit or LTO merge:
| Function | Address | Size |
|---|---|---|
| CoroFrame layout computation | 0x3171DA0 | 55 KB |
| CoroSplit splitting logic | 0x316D160 | 49 KB |
| CoroSplit dispatcher (.corodispatch, MustTailCall.Before.CoroEnd) | 0x3160A60 | 48 KB |
| Spill/reload generation (AllocaSpillBB, PostSpill, .reload, .spill.addr) | 0x31650D0 | 47 KB |
| Frame type builder (__coro_frame, .coro_frame_ty, __coro_index) | 0x3169200 | 46 KB |
| CoroElide heap allocation elision | 0x315A7B0 | 41 KB |
| Attributor analysis helper | 0x3150D70 | 43 KB |
| Attributor analysis helper | 0x314DBB0 | 40 KB |
These functions reference the same string literals and implement the same algorithms as the primary cluster. The primary cluster at 0x24D--0x25C and this cluster at 0x314--0x317 are structurally identical -- they differ only in binary address due to compilation unit or LTO merge ordering.
Additionally, three helper functions in the primary cluster's vicinity handle specialized aspects:
| Function | Address | Size |
|---|---|---|
| CoroSplit Cloner/Driver (calls CoroFrame helpers) | sub_25CA370 | 55 KB |
| CoroFrame Materializer (heap-to-stack frame layout) | sub_25C5C80 | 49 KB |
| CoroFrame Spill Analysis helper | sub_25C1030 | 37 KB |
sub_25C5C80 (CoroFrame Materializer) is particularly relevant: this is the function that actually rewrites the IR to replace heap allocation with stack-based frame placement after CoroElide has proven safety. It materializes the frame struct type, inserts the alloca, and rewires all frame access GEPs.
Error Conditions in the Second Cluster
The CoroSplit implementation at 0x316D160 emits two diagnostic errors:
- "Coroutines cannot handle non static allocas yet" -- triggered when a coroutine body contains a VLA (variable-length array) or alloca() with a dynamic size. The frame layout computation requires compile-time-known sizes for all frame slots. Dynamic allocas would require a separate heap allocation per suspend-resume cycle.
- "alignment requirement of frame variables" -- triggered when a spill slot requires alignment exceeding the frame's maximum supported alignment. This can occur with over-aligned types (e.g., alignas(256) variables that must survive across suspends).
The CoroFrame at 0x3171DA0 emits:
- "token definition separated from use by suspend point" -- a fatal error when an LLVM token value (which cannot be stored to memory) crosses a suspend boundary. Tokens are used for exception handling state and musttail call tracking; they are inherently non-materializable.
- "Unable to handle alias with unknown offset before CoroBegin" -- triggered when a GEP with a non-constant offset operates on a value computed before coro.begin. The frame layout computation needs constant offsets to compute spill slot positions.
EDG Frontend Support
The EDG 6.6 frontend fully implements C++20 coroutine semantics in two key functions:
- sub_87AFA0 (14 KB) -- Coroutine body processor. Resolves promise_type methods: initial_suspend, final_suspend, unhandled_exception, get_return_object, get_return_object_on_allocation_failure. Generates the coroutine body scaffolding including the implicit try-catch around user code.
- sub_87BD00 (6 KB) -- Coroutine trait resolver. Looks up std::coroutine_traits<R, Args...>::promise_type, std::coroutine_handle, return_value, return_void. The EDG IL walker maps these as IL node type 64 (il_coroutine), with expression sub-type 0x21 (coroutine_expr). The IL copier handles coroutine handles as entity type 72 (coroutine_handle).
The frontend does not restrict coroutines to host-side code. The EDG configuration sets COROUTINE_ENABLING_POSSIBLE = 1 globally, meaning __device__ functions can be coroutines. The full coroutine IR (with llvm.coro.id, llvm.coro.begin, llvm.coro.suspend, etc.) flows into the NVVM optimizer pipeline regardless of the function's execution space.
Diagnostic Strings
| String | Location | Meaning |
|---|---|---|
"Split '<name>' (frame_size=N, align=M)" | CoroSplit remark | Successful coroutine split |
"' elided in '" | CoroElide | Frame allocation replaced with alloca |
"' not elided in '" | CoroElide | Elision failed, heap allocation remains |
"Coroutines cannot handle non static allocas yet" | 0x316D160 | VLA or dynamic alloca inside coroutine body |
"alignment requirement of frame variables" | 0x316D160 | Frame alignment constraint exceeded |
"token definition separated from use by suspend point" | 0x3171DA0 | Token value crosses suspend boundary (error) |
"Unable to handle alias with unknown offset before CoroBegin" | 0x3171DA0 | GEP with non-constant offset on pre-begin alias |
"llvm.nvvm.coro.create.suspend must have exactly one argument, which must be a constant integer" | NVVM verifier | Malformed NVIDIA coroutine intrinsic |
"AllocaSpillBB" | 0x31650D0 | Entry block for spill alloca instructions |
"PostSpill" | 0x31650D0 | Block following spill setup |
".spill.addr" | 0x31650D0 | Store to coroutine frame slot |
".reload" | 0x31650D0 | Load from coroutine frame slot after resume |
".corodispatch" | 0x3160A60 | Dispatch trampoline function name |
"MustTailCall.Before.CoroEnd" | 0x3160A60 | Musttail semantics on final transition |
".from." | 0x3160A60 | Dispatch label name construction |
"NoopCoro.Frame" | 0x24DCD10 | Global no-op coroutine frame (CoroEarly) |
"caller_presplit" | 0x24E2340 | Attribute marking pre-split caller |
"elide_safe_attr" | 0x24E2340 | Attribute asserting elision safety |
".noalloc" | 0x24E2340 | Function name suffix for annotation elide |
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| CoroEarly pass entry | sub_24DCD10 | 41 KB | -- |
| CoroElide pass entry | sub_24DF350 | 80 KB | -- |
| CoroAnnotationElide pass entry | sub_24E2340 | 33 KB | -- |
| CoroSplit pass entry | sub_24EF980 | 71 KB | -- |
| Core frame layout computation | sub_24F5860 | -- | -- |
| CoroFrame layout entry | sub_24F6730 | 11 KB | -- |
| CoroFrame Spill Analysis helper | sub_25C1030 | 37 KB | -- |
| CoroFrame Materializer (heap-to-stack) | sub_25C5C80 | 49 KB | -- |
| CoroSplit Cloner/Driver | sub_25CA370 | 55 KB | -- |
| createResumeFunction | sub_2284030 | -- | -- |
| createDestroyFunction | sub_2284040 | -- | -- |
| Function cloner (used for resume/destroy) | sub_D2E510 | -- | -- |
| Frame-already-computed check | sub_B2D610 | -- | -- |
| Get function name string | sub_BD5D20 | -- | -- |
| Register in coroutine metadata table | sub_BC1CD0 | -- | -- |
| Create optimization remark | sub_B17560 | -- | -- |
| Publish remark to diagnostic handler | sub_1049740 | -- | -- |
| Allocator (frame info, spill entries, BFS deque) | sub_22077B0 | -- | -- |
| coro-cond module analysis checker | sub_2337E30 | 15 KB | -- |
| Attributor helper (coroutine attributes) | sub_314DBB0 | 40 KB | -- |
| Attributor helper (coroutine attributes) | sub_3150D70 | 43 KB | -- |
| CoroElide (second cluster) | sub_315A7B0 | 41 KB | -- |
| CoroSplit dispatcher (.corodispatch) | sub_3160A60 | 48 KB | -- |
| Spill/reload generation | sub_31650D0 | 47 KB | -- |
| Frame type builder | sub_3169200 | 46 KB | -- |
| CoroSplit splitting logic (second cluster) | sub_316D160 | 49 KB | -- |
| CoroFrame layout (second cluster) | sub_3171DA0 | 55 KB | -- |
| EDG coroutine body processor | sub_87AFA0 | 14 KB | -- |
| EDG coroutine trait resolver | sub_87BD00 | 6 KB | -- |
Cross-References
- Pipeline & Ordering -- where coroutine passes sit in the optimization sequence
- SROA -- SROA interacts with coroutine frame allocas; decomposes aggregate allocas into scalar SSA values
- AsmPrinter & PTX Body Emission -- .pragma "coroutine" emission
- Inliner Cost Model -- inlining decisions for split resume/destroy functions
- StructurizeCFG -- structurizes the resume dispatch switch
- Hash Infrastructure -- universal DenseMap pattern used by CoroFrame
- Diagnostics & Optimization Remarks -- remark emission protocol
- Address Spaces -- local (5), shared (3), generic (0) spaces relevant to elision
OpenMP Runtime Declaration Table
cicc embeds a 194-entry table of OpenMP runtime function declarations at sub_312CF50 (0x312CF50, 117 KB decompiled). This single function is the authoritative source for every __kmpc_*, omp_*, and __tgt_* device-runtime call the compiler can emit into NVPTX IR. It defines the complete ABI contract between compiler-generated GPU code and the OpenMP device runtime library (libomptarget / libomp). The function takes an integer case index (0--193), constructs the corresponding FunctionType, checks whether the symbol already exists in the module via Module::getNamedValue, and if absent, creates a Function::Create with ExternalLinkage. The result is registered into a context-local array so that any later codegen pass can reference a runtime function by its numeric index without reconstructing the type.
Upstream LLVM defines the same runtime function set declaratively in llvm/include/llvm/Frontend/OpenMP/OMPKinds.def using the __OMP_RTL macro, which the OMPIRBuilder expands at construction time. cicc's table is a procedural equivalent: a giant switch(a3) with 194 cases that does exactly what OMPKinds.def + OMPIRBuilder::initialize() do, but compiled into the binary rather than generated from a .def file. The ordering of cases 0--193 matches the upstream OMPRTL_ enum one-to-one, confirming that cicc v13.0 tracks LLVM 18.x's OpenMP runtime interface.
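The get-or-create protocol of this table is simple enough to model. In this hedged Python sketch, Module and Function are stand-ins for the LLVM classes, only four of the 194 entries are reproduced (names and signatures taken from the tables on this page), and "attr26" labels the special post-creation attribute:

```python
# Subset of the 194-entry table: index -> (name, signature, is_varargs)
RUNTIME_TABLE = {
    0:  ("__kmpc_barrier",           "void(ident_t*, i32)",                  False),
    5:  ("__kmpc_global_thread_num", "i32(ident_t*)",                        False),
    7:  ("__kmpc_fork_call",         "void(ident_t*, i32, kmpc_micro, ...)", True),
    16: ("__kmpc_get_warp_size",     "i32()",                                False),
}

class Function:
    """Stand-in for llvm::Function with ExternalLinkage."""
    def __init__(self, name, sig, varargs):
        self.name, self.sig, self.varargs = name, sig, varargs
        self.attrs = set()

class Module:
    """Stand-in for llvm::Module; symbols models the named-value table."""
    def __init__(self):
        self.symbols = {}
    def get_named_value(self, name):   # Module::getNamedValue equivalent
        return self.symbols.get(name)

def get_or_create_runtime_fn(module, cache, index):
    """Mirrors the sub_312CF50 flow: cached slot -> symbol lookup ->
    Function::Create -> register into the context-local array."""
    if index in cache:
        return cache[index]
    name, sig, varargs = RUNTIME_TABLE[index]
    fn = module.get_named_value(name)
    if fn is None:
        fn = Function(name, sig, varargs)
        module.symbols[name] = fn
        if varargs:                    # fork_call / fork_teams special-case
            fn.attrs.add("attr26")
    cache[index] = fn
    return fn
```

Later codegen passes can then reference any runtime function by its numeric index without reconstructing its type, exactly the property the binary's context-local array provides.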
Key Facts
| Property | Value |
|---|---|
| Entry point | sub_312CF50 @ 0x312CF50 |
| Decompiled size | 117 KB |
| Total entries | 194 (indices 0--193) |
| Sentinel | index 193 = __last (void function, marks table end) |
| Varargs entries | 2: index 7 (__kmpc_fork_call), index 118 (__kmpc_fork_teams) |
| Linkage for all entries | ExternalLinkage (encoded as 0x103 = 259) |
| Special attribute | Attribute #26 applied to indices 7 and 118 post-creation |
| Registration helper | sub_3122A50(context, index, funcDecl) |
| Type construction | sub_BCF480 = FunctionType::get |
| Symbol lookup | sub_BA8CB0 = Module::getNamedValue |
| Function creation | sub_B2C660 = Function::Create |
| Upstream equivalent | OMPKinds.def __OMP_RTL entries + OMPIRBuilder::initialize() |
Context Object Type Cache
The first parameter a1 points to the OpenMP runtime context object. Starting at offset +2600, it contains a pre-allocated cache of LLVM types used to construct function signatures, avoiding redundant Type::get* calls:
| Offset | Type | LLVM equivalent |
|---|---|---|
| +2600 | void | Type::getVoidTy |
| +2608 | i1 | Type::getInt1Ty |
| +2616 | i8 | Type::getInt8Ty |
| +2624 | i16 | Type::getInt16Ty |
| +2632 | i32 | Type::getInt32Ty |
| +2640 | i64 | Type::getInt64Ty |
| +2648 | i8* | PointerType::get(i8, 0) |
| +2664 | i32* | PointerType::get(i32, 0) |
| +2672 | i64* | PointerType::get(i64, 0) |
| +2680 | double | Type::getDoubleTy |
| +2688 | i64 / size_t | DataLayout::getIntPtrType |
| +2704 | i8* (generic ptr) | PointerType::get(i8, 0) |
| +2712 | i8** | PointerType::get(i8*, 0) |
| +2720 | i8*** | PointerType::get(i8**, 0) |
| +2752 | kmp_critical_name* | [8 x i32]* |
| +2784 | ident_t* | {i32, i32, i32, i32, i8*}* |
| +2800 | __tgt_kernel_arguments* | 13-field struct pointer |
| +2816 | __tgt_async_info* | {i8*}* |
| +2896 | KernelEnvironmentTy* | {ConfigEnv, ident_t*, DynEnv*}* |
| +2912 | KernelLaunchEnvironmentTy* | {i32, i32}* |
| +2928 | kmpc_micro | void(i32*, i32*, ...)* (varargs microtask) |
| +2944 | kmp_reduce_func | void(i8*, i8*)* |
| +2960 | kmp_copy_func | void(i8*, i8*)* |
| +3008 | kmpc_ctor | i8*(i8*)* |
| +3024 | kmp_routine_entry_t | i32(i32, i8*)* |
| +3040 | kmp_ShuffleReductFctPtr | void(i8*, i16, i16, i16)* |
| +3056 | kmp_InterWarpCopyFctPtr | void(i8*, i32)* |
| +3072 | kmp_ListGlobalFctPtr | void(i8*, i32, i8*)* |
This layout mirrors the OMP_TYPE, OMP_STRUCT_TYPE, and OMP_FUNCTION_TYPE sections of upstream OMPKinds.def. The struct type definitions for ident_t, KernelEnvironmentTy, and __tgt_kernel_arguments match the upstream __OMP_STRUCT_TYPE declarations exactly.
Execution Modes: SPMD vs Generic
GPU OpenMP kernels operate in one of two execution modes, and the choice fundamentally determines which runtime functions the compiler emits:
| Mode | Value | Description | Worker threads |
|---|---|---|---|
| Generic | 1 | Master-worker state machine. Only thread 0 runs serial code; workers spin in a polling loop (__kmpc_barrier_simple_generic). Parallel regions are dispatched via __kmpc_kernel_prepare_parallel / __kmpc_kernel_parallel. | Idle until parallel region |
| SPMD | 2 | All threads execute the same code from kernel entry. Serial sections between parallel regions are guarded by tid == 0 checks with shared-memory output promotion and __kmpc_barrier_simple_spmd barriers. | Active from first instruction |
| Generic-SPMD | 3 | Transient state during the Generic-to-SPMD transformation. Never observed at runtime. | N/A |
The execution mode is encoded in a bit-vector attached to the kernel function's metadata. The runtime function __kmpc_target_init (index 155) reads the KernelEnvironmentTy struct which embeds the ConfigurationEnvironmentTy -- the first byte of that inner struct encodes the execution mode. __kmpc_is_spmd_exec_mode (index 186) queries it at runtime.
The SPMD-vs-Generic distinction affects which runtime calls appear in the generated IR:
- Generic mode kernels call __kmpc_kernel_prepare_parallel, __kmpc_kernel_parallel, __kmpc_kernel_end_parallel, __kmpc_barrier_simple_generic, and the full __kmpc_fork_call microtask dispatch.
- SPMD mode kernels call __kmpc_parallel_51 (index 158) for nested parallelism, __kmpc_barrier_simple_spmd for synchronization, and __kmpc_alloc_shared / __kmpc_free_shared for shared-memory output promotion between guarded and parallel sections.
- Both modes call __kmpc_target_init / __kmpc_target_deinit for kernel lifecycle management.
Call Generation Infrastructure
When any codegen pass needs a runtime function, it calls sub_312CF50(omp_context + 400, existing_value, case_index). The omp_context object (typically at a2+208 in the pass state) contains both the type cache (+2600..+3072) and the runtime function array. If Module::getNamedValue finds the symbol already declared, it is returned immediately; otherwise a new declaration is created and registered.
Once a declaration is obtained, sub_921880 (create runtime library call instruction) builds the CallInst node with the argument list from current SSA values, attaches debug/source location metadata, and inserts it at the specified basic block position.
Primary Consumers
| Pass | Address | Size | Runtime Entries Used |
|---|---|---|---|
| Generic-to-SPMD transform | sub_26968A0 | 61 KB | 6 (thread ID), 180 (alloc_shared), 181 (free_shared), 187 (barrier_simple_spmd) |
| State machine generation | sub_2678420 | 41 KB | 155 (target_init), 156 (target_deinit), 171 (kernel_parallel), 172 (kernel_end_parallel), 188 (barrier_simple_generic) |
| Parallel region outliner | sub_313D1B0 | 47 KB | 7 (fork_call), 158 (parallel_51) |
| Parallel region merging | sub_2680940 | 52 KB | 180 (alloc_shared), 181 (free_shared), 187 (barrier_simple_spmd) |
| Attributor OpenMP driver | sub_269F530 | 63 KB | All -- identifies/folds known runtime calls by index |
Complete Runtime Function Table
All 194 entries, organized by functional category. The "Index" column is the switch case in sub_312CF50 and the slot in the context's runtime function array. Signatures use LLVM IR type syntax. The "Call Generation" column describes how and when cicc emits each call.
Standard OpenMP Runtime (0--13)
| Index | Function | Signature | Purpose | Call Generation |
|---|---|---|---|---|
| 0 | __kmpc_barrier | void(ident_t*, i32) | Explicit barrier | Emitted for #pragma omp barrier. On GPU compiles to __syncthreads(). OpenMPOpt may replace with index 187 (SPMD barrier) |
| 1 | __kmpc_cancel | i32(ident_t*, i32, i32) | Cancel construct | Third param: cancel kind (1=parallel, 2=sections, 3=for, 4=taskgroup). Returns nonzero if cancellation pending |
| 2 | __kmpc_cancel_barrier | void(ident_t*, i32) | Implicit barrier + cancel check | Generated at end of worksharing constructs when cancel is possible |
| 3 | __kmpc_error | void(ident_t*, i32, i8*) | Runtime error | Second param: severity (1=warning, 2=fatal). Third: message string pointer |
| 4 | __kmpc_flush | void(ident_t*) | Memory fence | #pragma omp flush. On GPU: __threadfence() or scope-specific fence |
| 5 | __kmpc_global_thread_num | i32(ident_t*) | Get global thread ID | On GPU: blockIdx*blockDim+threadIdx. Emitted at start of every region needing a thread identifier |
| 6 | __kmpc_get_hardware_thread_id_in_block | i32() | threadIdx.x equivalent | Direct PTX %tid.x wrapper. Used by SPMD transform (sub_26968A0) to build tid==0 guards. Lookup: sub_312CF50(..., 6) |
| 7 | __kmpc_fork_call | void(ident_t*, i32, kmpc_micro, ...) | Fork parallel region (varargs) | Second param: shared variable count. Third: outlined microtask pointer. Remaining: shared variables. On GPU Generic mode triggers worker state machine dispatch. Attribute #26 applied post-create |
| 8 | __kmpc_fork_call_if | void(ident_t*, i32, i32, i8*, i32) | Conditional fork | Third param: if-clause condition. If false, region executes serially |
| 9 | __kmpc_omp_taskwait | void(ident_t*, i32) | Taskwait | #pragma omp taskwait |
| 10 | __kmpc_omp_taskyield | i32(ident_t*, i32, i32) | Task yield point | Third param: end-of-task flag |
| 11 | __kmpc_push_num_threads | void(ident_t*, i32, i32) | Set thread count | num_threads(N) clause. Pushes count for next parallel region |
| 12 | __kmpc_push_proc_bind | void(ident_t*, i32, i32) | Set affinity | proc_bind(spread/close/master). Third param encodes binding policy |
| 13 | __kmpc_omp_reg_task_with_affinity | i32(ident_t*, i32, i8*, i32, i8*) | Register task with affinity info | OMP 5.0 affinity clause |
Index 7 (__kmpc_fork_call) and index 118 (__kmpc_fork_teams) are the only two varargs entries. Both receive special post-processing: sub_B994D0 sets function attribute #26 (likely the convergent attribute or a varargs-related marker), checked via sub_B91C10. This prevents the optimizer from incorrectly splitting, duplicating, or removing these calls.
Hardware Query (14--16)
| Index | Function | Signature | Purpose |
|---|---|---|---|
| 14 | __kmpc_get_hardware_num_blocks | i32() | gridDim.x equivalent |
| 15 | __kmpc_get_hardware_num_threads_in_block | i32() | blockDim.x equivalent |
| 16 | __kmpc_get_warp_size | i32() | Warp size (32 on NVIDIA) |
These three functions have no parameters -- they are direct wrappers around PTX special registers (%nctaid.x, %ntid.x, and a compile-time constant 32).
OMP Standard Library API (17--45)
| Index | Function | Signature | Purpose |
|---|---|---|---|
| 17 | omp_get_thread_num | i32() | Thread ID within team |
| 18 | omp_get_num_threads | i32() | Threads in current team |
| 19 | omp_get_max_threads | i32() | Max threads available |
| 20 | omp_in_parallel | i32() | Inside parallel region? |
| 21 | omp_get_dynamic | i32() | Dynamic adjustment enabled? |
| 22 | omp_get_cancellation | i32() | Cancellation enabled? |
| 23 | omp_get_nested | i32() | Nested parallelism enabled? |
| 24 | omp_get_schedule | void(i32*, i32*) | Query loop schedule |
| 25 | omp_get_thread_limit | i32() | Max total threads |
| 26 | omp_get_supported_active_levels | i32() | Max supported nesting |
| 27 | omp_get_max_active_levels | i32() | Current max nesting |
| 28 | omp_get_level | i32() | Current nesting depth |
| 29 | omp_get_ancestor_thread_num | i32(i32) | Ancestor thread ID |
| 30 | omp_get_team_size | i32(i32) | Team size at nesting level |
| 31 | omp_get_active_level | i32() | Active parallel nesting |
| 32 | omp_in_final | i32() | Inside final task? |
| 33 | omp_get_proc_bind | i32() | Current binding policy |
| 34 | omp_get_num_places | i32() | Number of places |
| 35 | omp_get_num_procs | i32() | Available processors |
| 36 | omp_get_place_proc_ids | void(i32, i32*) | Processor IDs in place |
| 37 | omp_get_place_num | i32() | Current place number |
| 38 | omp_get_partition_num_places | i32() | Places in partition |
| 39 | omp_get_partition_place_nums | void(i32*) | Place numbers in partition |
| 40 | omp_get_wtime | double() | Wall clock time |
| 41 | omp_set_num_threads | void(i32) | Set thread count |
| 42 | omp_set_dynamic | void(i32) | Enable/disable dynamic |
| 43 | omp_set_nested | void(i32) | Enable/disable nesting |
| 44 | omp_set_schedule | void(i32, i32) | Set loop schedule |
| 45 | omp_set_max_active_levels | void(i32) | Set max nesting |
These are the user-facing OpenMP API functions. On GPU, most return compile-time constants or trivial register reads. The Attributor-based OpenMP driver (sub_269F530) can fold many of these to constants when the execution mode and team configuration are statically known -- for example, omp_get_num_threads folds to the blockDim.x launch parameter.
Begin/End (53--54)
| Index | Function | Signature | Purpose |
|---|---|---|---|
| 53 | __kmpc_begin | void(ident_t*, i32) | Library initialization (rarely used on GPU) |
| 54 | __kmpc_end | void(ident_t*) | Library shutdown |
Master/Masked Constructs (46--49)
| Index | Function | Signature | Purpose | Call Generation |
|---|---|---|---|---|
| 46 | __kmpc_master | i32(ident_t*, i32) | Enter master region | Returns 1 for master thread (thread 0), 0 for all others. IRGen wraps user code in if(__kmpc_master(..)) {...} |
| 47 | __kmpc_end_master | void(ident_t*, i32) | Exit master region | Called at end of master block |
| 48 | __kmpc_masked | i32(ident_t*, i32, i32) | Enter masked region (OMP 5.1) | Third param is the filter ID (which specific thread executes). Replaces master in OMP 5.1 |
| 49 | __kmpc_end_masked | void(ident_t*, i32) | Exit masked region | Called at end of masked block |
Critical Sections (50--52)
| Index | Function | Signature | Purpose | Call Generation |
|---|---|---|---|---|
| 50 | __kmpc_critical | void(ident_t*, i32, kmp_critical*) | Enter critical section | On GPU: atomic spin-lock acquire on the 32-byte lock variable |
| 51 | __kmpc_critical_with_hint | void(ident_t*, i32, i32, kmp_critical*) | Enter with lock hint | Hint encodes contention strategy (uncontended, contended, speculative, non-speculative) |
| 52 | __kmpc_end_critical | void(ident_t*, i32, kmp_critical*) | Exit critical section | Atomic release on lock variable |
On GPU, critical sections use atomic operations on global memory. The kmp_critical_name type is [8 x i32] (32 bytes), used as an atomic lock variable. The _with_hint variant accepts a contention hint that the GPU runtime maps to different atomic strategies.
Reduction (55--58)
| Index | Function | Signature | Purpose |
|---|---|---|---|
| 55 | __kmpc_reduce | i32(ident_t*, i32, i32, i64, i8*, kmp_reduce_func, kmp_critical*) | Begin reduction (blocking) |
| 56 | __kmpc_reduce_nowait | i32(ident_t*, i32, i32, i64, i8*, kmp_reduce_func, kmp_critical*) | Begin reduction (non-blocking) |
| 57 | __kmpc_end_reduce | void(ident_t*, i32, kmp_critical*) | End reduction (blocking) |
| 58 | __kmpc_end_reduce_nowait | void(ident_t*, i32, kmp_critical*) | End reduction (non-blocking) |
These are the standard reduction protocol entries. On GPU, the compiler typically prefers the NVIDIA-specific shuffle-based reductions (indices 176--178) which are significantly faster.
Static Loop Scheduling (61--70)
| Index | Function | Signature |
|---|---|---|
| 61--64 | __kmpc_for_static_init_{4,4u,8,8u} | void(ident_t*, i32, i32, i32*, {i32,i64}*, {i32,i64}*, {i32,i64}*, {i32,i64}*, {i32,i64}, {i32,i64}) |
| 65 | __kmpc_for_static_fini | void(ident_t*, i32) |
| 66--69 | __kmpc_distribute_static_init_{4,4u,8,8u} | Same 9-param shape as 61--64 |
| 70 | __kmpc_distribute_static_fini | void(ident_t*, i32) |
The _4 / _4u / _8 / _8u suffixes indicate signed-32, unsigned-32, signed-64, and unsigned-64 loop variable types respectively. All static_init functions take 9 parameters: location, thread ID, schedule type, a pointer to the is-last flag, pointers to the lower/upper/stride bounds, and the increment and chunk size passed by value.
Dynamic Dispatch (71--87)
Indices 71--74 handle distribute + dynamic dispatch initialization. Indices 75--82 handle standard dispatch_init and dispatch_next for the four integer widths. Indices 83--87 are dispatch finalization. Total: 17 entries covering the full dynamic loop scheduling interface.
Team Static & Combined Distribute-For (88--95)
Indices 88--91 (__kmpc_team_static_init_{4,4u,8,8u}) handle team-level static work distribution. Indices 92--95 (__kmpc_dist_for_static_init_{4,4u,8,8u}) are the combined distribute parallel for static init, taking 10 parameters (the extra parameter is the distribute upper bound pointer).
Tasking (98--116)
19 entries covering the full OpenMP tasking interface:
| Index | Function | Signature | Purpose |
|---|---|---|---|
| 98 | __kmpc_omp_task_alloc | i8*(ident_t*, i32, i32, i64, i64, kmp_routine_entry_t) | Allocate task descriptor (6 params). Returns kmp_task_t*. Params: flags, sizeof_task, sizeof_shareds, task_entry |
| 99 | __kmpc_omp_task | i32(ident_t*, i32, i8*) | Submit allocated task for execution. Third param is the kmp_task_t* from task_alloc |
| 100 | __kmpc_end_taskgroup | void(ident_t*, i32) | End #pragma omp taskgroup |
| 101 | __kmpc_taskgroup | void(ident_t*, i32) | Begin taskgroup |
| 102 | __kmpc_omp_task_begin_if0 | void(ident_t*, i32, i8*) | Begin immediate task (when if clause evaluates to false) |
| 103 | __kmpc_omp_task_complete_if0 | void(ident_t*, i32, i8*) | Complete immediate task |
| 104 | __kmpc_omp_task_with_deps | i32(ident_t*, i32, i8*, i32, i8*, i32, i8*) | Task with dependency list (7 params). Params: task, ndeps, dep_list, ndeps_noalias, noalias_list |
| 105 | __kmpc_taskloop | void(ident_t*, i32, i8*, i32, i64*, i64*, i64, i32, i32, i64, i8*) | #pragma omp taskloop (11 params). Params: task, if_val, lb_p, ub_p, st, nogroup, sched, grainsize, task_dup |
| 106 | __kmpc_taskloop_5 | void(ident_t*, i32, i8*, i32, i64*, i64*, i64, i32, i32, i64, i8*, i32) | OMP 5.1 taskloop (12 params). Extra param: modifier |
| 107 | __kmpc_omp_target_task_alloc | i8*(ident_t*, i32, i32, i64, i64, kmp_routine_entry_t, i64) | Target-offload task allocation (7 params). Extra i64: device_id |
| 108 | __kmpc_taskred_modifier_init | i8*(ident_t*, i32, i32, i32, i8*) | Init task reduction with modifier (5 params). Params: is_ws, num, data |
| 109 | __kmpc_taskred_init | i8*(i32, i32, i8*) | Init task reduction (basic) |
| 110 | __kmpc_task_reduction_modifier_fini | void(ident_t*, i32, i32) | Finalize task reduction |
| 111 | __kmpc_task_reduction_get_th_data | i8*(i32, i8*, i8*) | Get thread-local reduction data |
| 112 | __kmpc_task_reduction_init | i8*(i32, i32, i8*) | Init task reduction (alternate path) |
| 113 | __kmpc_task_reduction_modifier_init | i8*(i8*, i32, i32, i32, i8*) | Init with full modifier (5 params) |
| 114 | __kmpc_proxy_task_completed_ooo | void(i8*) | Out-of-order proxy task completion. Used for detached tasks |
| 115 | __kmpc_omp_wait_deps | void(ident_t*, i32, i32, i8*, i32, i8*) | Wait on task dependencies (6 params) |
| 116 | __kmpc_omp_taskwait_deps_51 | void(ident_t*, i32, i32, i8*, i32, i8*, i32) | OMP 5.1 dependency wait (7 params). Extra param: nowait modifier |
Index 106 (__kmpc_taskloop_5) and index 116 (__kmpc_omp_taskwait_deps_51) are OMP 5.1 additions with an extra modifier parameter compared to their predecessors.
Teams and Cancellation (117--121)
| Index | Function | Signature | Purpose |
|---|---|---|---|
| 117 | __kmpc_cancellationpoint | i32(ident_t*, i32, i32) | Cancellation point check |
| 118 | __kmpc_fork_teams | void(ident_t*, i32, kmpc_micro, ...) | Fork teams region (varargs) |
| 119 | __kmpc_push_num_teams | void(ident_t*, i32, i32, i32) | Set team count |
| 120 | __kmpc_push_num_teams_51 | void(ident_t*, i32, i32, i32, i32) | Set team count (OMP 5.1, 5 params) |
| 121 | __kmpc_set_thread_limit | void(ident_t*, i32, i32) | Set per-team thread limit |
Copyprivate and Threadprivate (122--124)
| Index | Function | Signature | Purpose |
|---|---|---|---|
| 122 | __kmpc_copyprivate | void(ident_t*, i32, i64, i8*, kmp_copy_func, i32) | #pragma omp copyprivate. Broadcasts private data from single thread to all others. 6 params |
| 123 | __kmpc_threadprivate_cached | i8*(ident_t*, i32, i8*, i64, i8***) | Get/allocate threadprivate variable data. 5 params |
| 124 | __kmpc_threadprivate_register | void(ident_t*, i8*, kmpc_ctor, void*, void*) | Register threadprivate with ctor, copy-ctor, dtor callbacks |
Doacross Synchronization (125--128)
Cross-iteration dependencies for #pragma omp ordered depend(source/sink).
| Index | Function | Signature | Purpose |
|---|---|---|---|
| 125 | __kmpc_doacross_init | void(ident_t*, i32, i32, i8*) | Init doacross tracking. Params: num_dims, dims_info |
| 126 | __kmpc_doacross_post | void(ident_t*, i32, i64*) | Post (source): signal iteration completion |
| 127 | __kmpc_doacross_wait | void(ident_t*, i32, i64*) | Wait (sink): wait for iteration to complete |
| 128 | __kmpc_doacross_fini | void(ident_t*, i32) | Finalize doacross tracking |
Memory Allocators (129--136)
| Index | Function | Signature | Purpose |
|---|---|---|---|
| 129 | __kmpc_alloc | i8*(i32, i64, i8*) | OpenMP allocator alloc. Params: gtid, size, allocator |
| 130 | __kmpc_aligned_alloc | i8*(i32, i64, i64, i8*) | Aligned allocation. Params: gtid, align, size, allocator |
| 131 | __kmpc_free | void(i32, i8*, i8*) | Free allocated memory. Params: gtid, ptr, allocator |
| 132 | __tgt_interop_init | void(ident_t*, i32, i8**, i32, i32, i32, i8*, i32) | OMP 5.1 foreign runtime interop init (8 params) |
| 133 | __tgt_interop_destroy | void(ident_t*, i32, i8**, i32, i32, i32, i8*) | Destroy interop object (7 params) |
| 134 | __tgt_interop_use | void(ident_t*, i32, i8**, i32, i32, i32, i8*) | Use interop object (7 params) |
| 135 | __kmpc_init_allocator | i8*(i32, i32, i8*, i8*) | Init OpenMP allocator. Params: gtid, memspace, num_traits, traits |
| 136 | __kmpc_destroy_allocator | void(i32, i8*) | Destroy allocator |
Target Offloading (137--153)
17 entries implementing the host-side target offloading protocol. These are primarily used when cicc compiles host code that launches GPU kernels, not within device code itself:
| Index | Function | Signature | Params | Purpose |
|---|---|---|---|---|
| 137 | __kmpc_push_target_tripcount_mapper | void(ident_t*, i64, i64) | 3 | Set iteration count for target region. Params: device_id, trip_count |
| 138 | __tgt_target_mapper | i32(ident_t*, i64, i8*, i32, i8**, i8**, i64*, i64*, i8**, i8**) | 10 | Launch target region with data mapping |
| 139 | __tgt_target_nowait_mapper | (14 params) | 14 | Async target launch. Adds depobj count/list, noalias count/list |
| 140 | __tgt_target_teams_mapper | (12 params) | 12 | Target teams launch. Adds num_teams, thread_limit, mappers |
| 141 | __tgt_target_teams_nowait_mapper | (16 params) | 16 | Async target teams. Most complex host-side offload call |
| 142 | __tgt_target_kernel | i32(ident_t*, i64, i32, i32, i8*, __tgt_kernel_args*) | 6 | New-style kernel launch (takes __tgt_kernel_arguments*) |
| 143 | __tgt_target_kernel_nowait | (10 params) | 10 | Async new-style launch. Adds depobj info |
| 144 | __tgt_target_data_begin_mapper | (9 params) | 9 | Map data to device |
| 145 | __tgt_target_data_begin_nowait_mapper | (13 params) | 13 | Async map-to |
| 146 | __tgt_target_data_begin_mapper_issue | (10 params) | 10 | Split-phase issue for async map-to |
| 147 | __tgt_target_data_begin_mapper_wait | void(i64, __tgt_async_info*) | 2 | Split-phase wait for async map-to |
| 148 | __tgt_target_data_end_mapper | (9 params) | 9 | Map data from device |
| 149 | __tgt_target_data_end_nowait_mapper | (13 params) | 13 | Async map-from |
| 150 | __tgt_target_data_update_mapper | (9 params) | 9 | Data update (host-to-device or device-to-host) |
| 151 | __tgt_target_data_update_nowait_mapper | (13 params) | 13 | Async data update |
| 152 | __tgt_mapper_num_components | i64(i8*) | 1 | Query user-defined mapper component count |
| 153 | __tgt_push_mapper_component | void(i8*, i8*, i8*, i64, i64, i8*) | 6 | Register mapper component. Params: handle, base, begin, size, type, name |
Task Completion Event (154)
| Index | Function | Signature | Purpose |
|---|---|---|---|
| 154 | __kmpc_task_allow_completion_event | i8*(ident_t*, i32, i8*) | Allow completion event for detached tasks (OMP 5.0) |
GPU Kernel Lifecycle (155--158)
These are the most important entries for device-side GPU OpenMP code.
| Index | Function | Signature | Purpose | Call Generation |
|---|---|---|---|---|
| 155 | __kmpc_target_init | i32(KernelEnvironmentTy*, KernelLaunchEnvironmentTy*) | Kernel entry | First call in every GPU OpenMP kernel. State machine generator (sub_2678420) emits this at entry. KernelEnvironmentTy carries ConfigurationEnvironmentTy (first byte = execution mode) |
| 156 | __kmpc_target_deinit | void() | Kernel exit | Last call in every GPU OpenMP kernel. Emitted by state machine generator |
| 157 | __kmpc_kernel_prepare_parallel | void(i8*) | Generic: signal workers | Master thread writes outlined function pointer to shared memory, then signals workers to execute it. Replaced by __kmpc_parallel_51 after SPMD conversion |
| 158 | __kmpc_parallel_51 | void(ident_t*, i32, i32, i32, i32, i8*, i8*, i8**, i64) | OMP 5.1 GPU parallel dispatch | 9 params: if_expr, num_threads, proc_bind, fn, wrapper_fn, shared_args, num_shared_args. Used by parallel region outliner (sub_313D1B0) on SPMD kernels. Replaces fork_call for GPU |
__kmpc_target_init is the first runtime call in every GPU OpenMP kernel. In Generic mode, it returns -1 for worker threads (which should enter the polling loop) and 0 for the master thread. In SPMD mode, it returns 0 for all threads. The KernelEnvironmentTy struct carries the ConfigurationEnvironmentTy which encodes the execution mode, team sizes, and runtime configuration.
New-Style Static Loops, OMP 5.1+ (159--170)
12 entries implementing the callback-based loop interface introduced in OpenMP 5.1:
| Index | Function | Signature |
|---|---|---|
| 159--162 | __kmpc_for_static_loop_{4,4u,8,8u} | void(ident_t*, i8*, i8*, {i32,i64}, {i32,i64}, {i32,i64}) |
| 163--166 | __kmpc_distribute_static_loop_{4,4u,8,8u} | void(ident_t*, i8*, i8*, {i32,i64}, {i32,i64}) |
| 167--170 | __kmpc_distribute_for_static_loop_{4,4u,8,8u} | void(ident_t*, i8*, i8*, {i32,i64}, {i32,i64}, {i32,i64}, {i32,i64}) |
Unlike the old-style _init/_fini pairs, these new-style loops take function pointer callbacks (i8* for the loop body and data pointer) and handle initialization + execution + finalization in a single call.
Legacy Kernel-Mode Parallel (171--174)
| Index | Function | Signature | Purpose |
|---|---|---|---|
| 171 | __kmpc_kernel_parallel | i1(i8**) | Generic mode: worker checks if parallel work available |
| 172 | __kmpc_kernel_end_parallel | void() | Generic mode: worker signals completion |
| 173 | __kmpc_serialized_parallel | void(ident_t*, i32) | Execute parallel region serially (if(0) parallel) |
| 174 | __kmpc_end_serialized_parallel | void(ident_t*, i32) | End serialized parallel |
These are the Generic-mode worker-side functions. __kmpc_kernel_parallel returns true when the master thread has dispatched work via __kmpc_kernel_prepare_parallel, writing the outlined function pointer into the output parameter.
Warp-Level Primitives (175, 179, 189--190)
| Index | Function | Signature | Purpose |
|---|---|---|---|
| 175 | __kmpc_shuffle_int32 | i32(i32, i16, i16) | Warp shuffle for 32-bit value |
| 179 | __kmpc_shuffle_int64 | i64(i64, i16, i16) | Warp shuffle for 64-bit value |
| 189 | __kmpc_warp_active_thread_mask | i64() | Active lane mask (PTX activemask) |
| 190 | __kmpc_syncwarp | void(i64) | Warp-level barrier with mask |
The shuffle functions take (value, lane_offset, warp_size) and implement butterfly-pattern data exchange for intra-warp reductions. These compile down to PTX shfl.sync instructions.
NVIDIA Device Reduction (176--178)
| Index | Function | Signature | Purpose |
|---|---|---|---|
| 176 | __kmpc_nvptx_parallel_reduce_nowait_v2 | i32(ident_t*, i64, i8*, ShuffleReductFctPtr, InterWarpCopyFctPtr) | Intra-CTA parallel reduction |
| 177 | __kmpc_nvptx_teams_reduce_nowait_v2 | i32(ident_t*, i32, i8*, i64, i8*, ShuffleReductFctPtr, InterWarpCopyFctPtr, ListGlobalFctPtr, ListGlobalFctPtr, ListGlobalFctPtr, ListGlobalFctPtr) | Cross-CTA team reduction (11 params) |
| 178 | __kmpc_reduction_get_fixed_buffer | i8*() | Get global reduction scratch buffer |
These are the GPU-specific reduction entries -- the single most important performance-critical runtime calls for OpenMP on NVIDIA GPUs. The parallel reduction (index 176) uses a two-phase approach: (1) intra-warp reduction via shuffle, then (2) inter-warp reduction via shared memory copy. The compiler generates the ShuffleReductFctPtr and InterWarpCopyFctPtr callback functions as outlined helpers that the runtime calls during the reduction tree.
The teams reduction (index 177) adds four ListGlobalFctPtr callbacks for managing global memory buffers across CTAs, plus an extra size parameter. This is the most complex runtime call in the entire table, with 11 parameters.
Shared Memory Management (180--184)
| Index | Function | Signature | Purpose |
|---|---|---|---|
| 180 | __kmpc_alloc_shared | i8*(i64) | Dynamic shared memory allocation |
| 181 | __kmpc_free_shared | void(i8*, i64) | Free shared memory |
| 182 | __kmpc_begin_sharing_variables | void(i8***, i64) | Begin variable sharing protocol |
| 183 | __kmpc_end_sharing_variables | void() | End sharing protocol |
| 184 | __kmpc_get_shared_variables | i8**() | Get shared variable array |
__kmpc_alloc_shared / __kmpc_free_shared are heavily used in the SPMD transformation's guarded output mechanism: values computed by the master thread that are needed by all threads are stored into dynamically-allocated shared memory, synchronized via barrier, then loaded by all threads.
SPMD Mode Detection (185--188)
| Index | Function | Signature | Purpose |
|---|---|---|---|
| 185 | __kmpc_parallel_level | i16(ident_t*, i32) | Current parallel nesting depth |
| 186 | __kmpc_is_spmd_exec_mode | i8() | Returns 1 if SPMD, 0 if Generic |
| 187 | __kmpc_barrier_simple_spmd | void(ident_t*, i32) | Lightweight barrier for SPMD mode (bar.sync) |
| 188 | __kmpc_barrier_simple_generic | void(ident_t*, i32) | State-machine barrier for Generic mode |
The two barrier variants reflect the fundamental mode difference. __kmpc_barrier_simple_spmd compiles to a single bar.sync instruction. __kmpc_barrier_simple_generic involves polling a shared-memory flag because workers are in a state-machine loop that must check for new work after each barrier.
Profiling (191--192) and Sentinel (193)
| Index | Function | Signature | Purpose |
|---|---|---|---|
| 191 | __llvm_profile_register_function | void(i8*) | PGO: register function for profiling |
| 192 | __llvm_profile_register_names_function | void(i8*, i64) | PGO: register name table |
| 193 | __last | void() | Sentinel marking table end |
The two __llvm_profile_* entries support profile-guided optimization instrumentation on GPU. The __last sentinel at index 193 is a void-to-void function that marks the end of the table; it is never called at runtime.
Declaration Construction Protocol
For each runtime function, sub_312CF50 follows an identical protocol:
// Pseudocode for a typical case (e.g., case 0: __kmpc_barrier)
case 0: {
// 1. Build parameter type array from cached types
Type *params[] = { ctx->ident_t_ptr, ctx->i32_ty }; // a1+2784, a1+2632
// 2. Construct FunctionType
FunctionType *fty = FunctionType::get(
ctx->void_ty, // return type (a1+2600)
params, 2, // param array + count
/*isVarArg=*/false
);
// 3. Check if symbol already exists in module
    Value *existing = module->getNamedValue("__kmpc_barrier");
if (existing == a2) // a2 is the existing-check value
return existing;
// 4. Create new function declaration
Function *decl = Function::Create(
fty,
259, // linkage = ExternalLinkage (0x103)
"__kmpc_barrier",
module
);
// 5. Register in context table
registerRuntimeFunction(a1, /*index=*/0, decl); // sub_3122A50
return decl;
}
The linkage value 259 (0x103) decodes as ExternalLinkage with the DLLImport storage class flag set. This is consistent across all 194 entries.
For the two varargs entries (indices 7 and 118), the FunctionType::get call passes isVarArg=true, and after Function::Create, the code calls sub_B994D0 to add attribute #26 and sub_B91C10 to verify it was applied. Attribute #26 likely corresponds to a convergent-or-varargs marker that prevents the optimizer from incorrectly transforming these calls.
Comparison with Upstream LLVM OMPKinds.def
cicc's table maps one-to-one with the __OMP_RTL entries in LLVM 18.x's OMPKinds.def. The ordering is identical: the enum OMPRTL___kmpc_barrier = 0 corresponds to cicc's case 0, and so on through OMPRTL___last = 193 at case 193.
Key differences from upstream:
- **Procedural vs declarative.** Upstream uses X-macros (`__OMP_RTL`) expanded by `OMPIRBuilder::initialize()` to lazily create declarations on first use. cicc's `sub_312CF50` is a compiled switch statement that eagerly creates declarations when requested by case index.
- **Type representation.** Upstream uses opaque pointer types (`PointerType::get(Ctx, 0)`) throughout. cicc preserves typed pointers (`i8*`, `i32*`, `i64*`, struct pointers) in its type cache, consistent with LLVM's pre-opaque-pointer era. This is because cicc's internal IR (NVVM IR) still uses typed pointers even though upstream LLVM has migrated to opaque pointers.
- **Missing entries.** cicc lacks `__kmpc_push_num_threads_strict` (present in latest upstream) and uses `__kmpc_parallel_51` where upstream LLVM 18.x defines `__kmpc_parallel_60` with a slightly different signature. The `_51` name indicates cicc v13.0 targets the OMP 5.1 runtime ABI, not the OMP 6.0 draft.
- **Attribute handling.** Upstream `OMPKinds.def` includes extensive attribute sets (`GetterAttrs`, `SetterAttrs`, etc.) that annotate runtime functions with `nounwind`, `nosync`, `nofree`, `willreturn`, and memory-effect attributes for optimization. cicc applies only attribute #26 to the two varargs functions and otherwise relies on the OpenMPOpt pass to infer attributes.
- **Interop ABI divergence.** The `__tgt_interop_*` entries (indices 132--134) in cicc take a slightly different parameter list than upstream: cicc includes an extra `i32` parameter at the end that upstream encodes differently, reflecting a minor ABI divergence in the interop interface.
Configuration Knobs
All LLVM cl::opt knobs related to OpenMP optimization, as found in the cicc binary:
| Knob | Type | Default | Effect |
|---|---|---|---|
| openmp-opt-disable | bool | false | Disable all OpenMP optimizations |
| openmp-opt-enable-merging | bool | false | Enable parallel region merging |
| openmp-opt-disable-internalization | bool | false | Skip function internalization |
| openmp-opt-disable-deglobalization | bool | false | Skip global-to-local promotion |
| openmp-opt-disable-spmdization | bool | false | Skip Generic-to-SPMD transformation |
| openmp-opt-disable-folding | bool | false | Skip ICV folding |
| openmp-opt-disable-state-machine-rewrite | bool | false | Skip state machine optimization |
| openmp-opt-disable-barrier-elimination | bool | false | Skip redundant barrier removal |
| openmp-opt-inline-device | bool | varies | Inline device runtime calls |
| openmp-opt-verbose-remarks | bool | false | Emit detailed optimization remarks |
| openmp-opt-max-iterations | int | varies | Fixed-point iteration limit for analysis |
| openmp-opt-shared-limit | int | varies | Max shared memory for SPMD output promotion |
| openmp-opt-print-module-after | bool | false | Dump module IR after OpenMP optimization |
| openmp-opt-print-module-before | bool | false | Dump module IR before OpenMP optimization |
| openmp-deduce-icv-values | bool | varies | Deduce Internal Control Variable values |
| openmp-print-icv-values | bool | false | Print deduced ICV values |
| openmp-print-gpu-kernels | bool | false | Print identified GPU kernels |
| openmp-hide-memory-transfer-latency | bool | false | Overlap data transfers with computation |
The openmp-opt-shared-limit knob is particularly relevant for the SPMD transformation: it caps the total amount of shared memory allocated for guarded output promotion. If the serial sections between parallel regions produce too many live-out values, the SPMD transformation may be abandoned when the shared memory budget is exceeded.
Diagnostic Strings
The OpenMP subsystem emits two diagnostics during SPMD transformation:
| Code | Severity | Message |
|---|---|---|
| OMP120 | Remark | "Transformed generic-mode kernel to SPMD-mode." |
| OMP121 | Warning | "Value has potential side effects preventing SPMD-mode execution. Add [[omp::assume(\"ompx_spmd_amenable\")]] to the called function to override" |
OMP120 is emitted by sub_26968A0 on successful Generic-to-SPMD conversion. OMP121 is emitted for each call instruction that references a function not in the SPMD-amenable set, explaining why the transformation failed and providing the user with the override attribute.
Pipeline Integration
The OpenMP passes are registered in the pipeline under three names:
| Pipeline ID | Pass Name | Level | Description |
|---|---|---|---|
| 75 | openmp-opt | Module | Pre-link OpenMP optimization |
| 76 | openmp-opt-postlink | Module | Post-link OpenMP optimization |
| 154 | openmp-opt-cgscc | CGSCC | Call-graph-level OpenMP optimization |
The runtime declaration table (sub_312CF50) is invoked lazily from any of these passes when they need to emit a runtime call. The SPMD transformation is part of the module-level openmp-opt pass.
Execution Mode Call Patterns
The execution mode fundamentally determines which runtime functions appear in generated IR. These pseudocode patterns show the exact call sequences emitted by the state machine generator (sub_2678420) and the SPMD transformation (sub_26968A0).
Generic Mode Kernel (mode byte = 1)
entry:
ret = __kmpc_target_init(KernelEnv, LaunchEnv) // [155]
if (ret == -1) goto worker_loop // worker threads
// master thread: user code
__kmpc_kernel_prepare_parallel(outlined_fn_ptr) // [157]
__kmpc_barrier_simple_generic(loc, gtid) // [188]
// ... more serial + parallel sections ...
__kmpc_target_deinit() // [156]
worker_loop:
while (true) {
__kmpc_barrier_simple_generic(loc, gtid) // [188]
    if (__kmpc_kernel_parallel(&fn)) {             // [171]
      fn(args);
      __kmpc_kernel_end_parallel()                 // [172]
    }
__kmpc_barrier_simple_generic(loc, gtid) // [188]
}
SPMD Mode Kernel -- Simple (mode byte = 2, single parallel region)
After successful Generic-to-SPMD transformation:
entry:
__kmpc_target_init(KernelEnv, LaunchEnv) // [155], returns 0 for all
tid = __kmpc_get_hardware_thread_id_in_block() // [6]
is_main = (tid == 0)
br is_main, user_code, exit.threads
user_code:
// all threads: user code
__kmpc_parallel_51(loc, gtid, ...) // [158], for nested
__kmpc_barrier_simple_spmd(loc, gtid) // [187]
exit.threads:
__kmpc_target_deinit() // [156]
SPMD Mode Kernel -- Complex (guarded regions, multiple parallel regions)
entry:
__kmpc_target_init(...) // [155]
region.check.tid:
tid = __kmpc_get_hardware_thread_id_in_block() // [6]
cmp = icmp eq tid, 0
br cmp, region.guarded, region.barrier
region.guarded:
... master-only serial code ...
shared_ptr = __kmpc_alloc_shared(sizeof(result)) // [180]
store result -> shared_ptr
region.guarded.end:
br region.barrier
region.barrier:
__kmpc_barrier_simple_spmd(loc, gtid) // [187]
result = load from shared_ptr
__kmpc_barrier_simple_spmd(loc, gtid) // [187], post-load
__kmpc_free_shared(shared_ptr, size) // [181]
... all threads continue with result ...
exit:
__kmpc_target_deinit() // [156]
The SPMD transformation eliminates the worker state machine entirely. Workers no longer idle-spin in a polling loop; they participate in computation from the kernel's first instruction. Serial sections between parallel regions are wrapped in tid==0 guards with shared-memory output promotion and barriers.
SPMD-Amenable Function Table
The SPMD transformation maintains a hash set of functions that are safe to call from all threads simultaneously, located at *(omp_context + 208) + 34952 (base pointer), +34968 (capacity).
| Property | Value |
|---|---|
| Hash function | Open-addressing with linear probing |
| Slot computation | ((addr >> 9) ^ (addr >> 4)) & (capacity - 1) |
| Sentinel | -4096 (empty slot marker) |
| Contents | Functions pre-analyzed or annotated with [[omp::assume("ompx_spmd_amenable")]] |
When a call instruction references a function not in this set, the SPMD transformation fails for that kernel and emits OMP121: "Value has potential side effects preventing SPMD-mode execution. Add [[omp::assume(\"ompx_spmd_amenable\")]] to the called function to override".
Functional Category Summary
| Category | Count | Indices |
|---|---|---|
| Thread hierarchy and hardware query | 39 | 0--6, 14--16, 17--45 |
| Work sharing / loop scheduling | 47 | 61--95, 159--170 |
| Tasking | 20 | 98--116, 154 |
| Synchronization | 13 | 0, 2, 4, 50--52, 59--60, 96--97, 187--188, 190 |
| Target offloading / data mapping | 17 | 137--153 |
| GPU execution mode | 10 | 155--158, 171--174, 185--186 |
| Warp primitives | 4 | 175, 179, 189--190 |
| NVIDIA device reduction | 3 | 176--178 |
| Shared memory management | 5 | 180--184 |
| Memory allocators | 8 | 129--136 |
| Copyprivate / threadprivate | 3 | 122--124 |
| Doacross synchronization | 4 | 125--128 |
| Teams / cancellation | 5 | 117--121 |
| Master / masked | 4 | 46--49 |
| Reduction (standard) | 4 | 55--58 |
| Begin / end | 2 | 53--54 |
| Profiling | 2 | 191--192 |
| Sentinel | 1 | 193 |
| Total | 194 | 0--193 |
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| sub_312CF50 | 0x312CF50 | -- | OpenMP runtime declaration factory (194-case switch) |
| sub_3122A50 | 0x3122A50 | -- | registerRuntimeFunction(context, index, funcDecl) |
| sub_2686D90 | 0x2686D90 | 215 KB | OpenMP runtime declaration table (outer wrapper) |
| sub_26968A0 | 0x26968A0 | 61 KB | Generic-to-SPMD transformation |
| sub_2680940 | 0x2680940 | 52 KB | Parallel region merging |
| sub_2678420 | 0x2678420 | 41 KB | State machine generation for Generic mode |
| sub_269F530 | 0x269F530 | 63 KB | Attributor-based OpenMP optimization driver |
| sub_313D1B0 | 0x313D1B0 | 47 KB | Parallel region outliner |
| sub_BCF480 | 0xBCF480 | -- | FunctionType::get(retTy, paramTys, count, isVarArg) |
| sub_BA8CB0 | 0xBA8CB0 | -- | Module::getNamedValue(name) |
| sub_B2C660 | 0xB2C660 | -- | Function::Create(funcTy, linkage, name, module) |
| sub_B994D0 | 0xB994D0 | -- | addAttribute(26, value) -- set function attribute |
| sub_B91C10 | 0xB91C10 | -- | hasAttribute(26) -- check function attribute |
| sub_B9C770 | 0xB9C770 | -- | Attribute construction (varargs attribute) |
| sub_B8C960 | 0xB8C960 | -- | Attribute kind construction |
| sub_B2BE50 | 0xB2BE50 | -- | Function::getContext() |
| sub_921880 | 0x921880 | -- | Create runtime library call instruction |
| sub_5FB5C0 | 0x5FB5C0 | -- | OpenMP variant processing (%s$$OMP_VARIANT%06d) |
OpenMP Variant Processing
cicc also supports OpenMP variant dispatch during EDG front-end processing. The function sub_5FB5C0 at 0x5FB5C0 handles mangled names with the format %s$$OMP_VARIANT%06d, which the front-end generates for #pragma omp declare variant constructs. This is separate from the runtime declaration table and operates at the source-level AST rather than at the LLVM IR level.
Cross-References
- Generic-to-SPMD Transformation -- the primary consumer of the runtime table, performing mode conversion using entries 6, 155, 156, 180, 181, 187, 188
- Pipeline & Ordering -- where openmp-opt (ID 75), openmp-opt-postlink (ID 76), and openmp-opt-cgscc (ID 154) sit in the pass pipeline
- CLI Flags -- compiler flags that control OpenMP code generation
- LLVM Knobs -- the openmp-opt-* knobs listed above
- Kernel Metadata -- how KernelEnvironmentTy and execution mode are set during IR generation
- Hash Infrastructure -- the open-addressing hash table pattern used by the SPMD-amenable function set
- GPU Execution Model -- broader context on SPMD vs Generic execution
Generic-to-SPMD Transformation
The Generic-to-SPMD transformation (sub_26968A0, 61 KB, ~1807 lines) is cicc's most impactful OpenMP target optimization. It converts GPU kernels from Generic execution mode -- where thread 0 acts as a master running serial code through a state machine while all other threads idle at a barrier -- into SPMD mode, where every thread in the block executes the same code from the first instruction. The transformation eliminates the worker state machine loop entirely, removes warp divergence at kernel entry, replaces heavyweight generic barriers with lightweight SPMD barriers (__syncthreads), and enables the hardware scheduler to fill warps from the very first cycle. On real workloads this routinely yields 2-4x speedups for simple target parallel for regions. The pass emits diagnostic OMP120 on success and OMP121 when a callee's side effects prevent conversion.
Key Facts
| Property | Value |
|---|---|
| Function address | sub_26968A0 |
| Decompiled size | 61 KB (~1807 lines) |
| Pass registration | openmp-opt (pipeline slot 75, Module pass) |
| Post-link variant | openmp-opt-postlink (slot 76) |
| CGSCC variant | openmp-opt-cgscc (slot 154) |
| Parameters | a1 = PassState, a2 = ModuleContext, a3 = OutputFlag |
| Eligibility flag | *(a1+241) -- boolean, set by prior analysis |
| Parallel region array | *(a1+280) base, *(a1+288) count |
| Diagnostic handler | *(a2+4392) |
| Success diagnostic | OMP120: "Transformed generic-mode kernel to SPMD-mode." |
| Failure diagnostic | OMP121: "Value has potential side effects preventing SPMD-mode execution" |
Generic vs SPMD Execution Model
Understanding the two execution modes is essential before examining the transformation.
| Aspect | Generic Mode | SPMD Mode |
|---|---|---|
| Thread roles | Thread 0 = master; threads 1..N-1 = workers | All threads execute same code |
| Kernel entry | __kmpc_target_init returns tid for master, -1 for workers | __kmpc_target_init returns tid for all |
| Serial code | Master executes directly | Wrapped in if (tid == 0) guard |
| Parallel region | Master signals workers via parallel_level; workers wake, execute outlined fn, re-barrier | All threads already executing; outlined fn body inlined |
| Barrier type | __kmpc_barrier_simple_generic (poll-based state machine) | __kmpc_barrier_simple_spmd (maps to bar.sync / __syncthreads) |
| Worker idle loop | while(true) { barrier(); if(parallel_level) { exec(); barrier(); } } | No idle loop -- eliminated entirely |
| Warp divergence | Warps containing thread 0 diverge at entry gate | No divergence at entry |
| Occupancy | Lower -- workers consume registers/shared mem while idle | Higher -- all resources used productively |
| Execution mode constant | 1 (OMP_TGT_EXEC_MODE_GENERIC) | 2 (OMP_TGT_EXEC_MODE_SPMD) |
| Transition marker | -- | 3 (OMP_TGT_EXEC_MODE_GENERIC_SPMD, intermediate during transform) |
In Generic mode the runtime creates a CTA (Cooperative Thread Array) where only thread 0 enters user code. The remaining N-1 threads enter a polling loop: they call __kmpc_barrier_simple_generic, check the parallel_level variable, and if a parallel region has been entered by the master, they wake up, execute the outlined parallel function, then return to polling. This "state machine" pattern is the primary performance bottleneck -- it wastes cycles on barrier polling, causes massive warp divergence on the first warp (which contains both the master and worker lanes), and prevents the scheduler from issuing useful work for idle threads.
SPMD mode eliminates all of this. Every thread begins executing user code at kernel entry. Serial code sections that cannot be parallelized are protected by lightweight tid == 0 guards, with results broadcast to all threads through shared memory and bar.sync barriers.
Legality Analysis
The transformation is gated by a boolean eligibility flag at *(a1+241), which is computed by a prior analysis pass (not sub_26968A0 itself). The analysis determines eligibility based on three conditions:
Condition 1: Kernel is Currently in Generic Mode
The execution mode bit-vector's low byte must equal 1 (Generic). This is checked at line 429 of the decompiled output:
// sub_2674090/sub_2674040 read the execution mode attribute
mode_bv = get_exec_mode(a1 + 304);
if (mode_bv.size <= 64)
mode_val = mode_bv.inline_data;
else
mode_val = *mode_bv.data_ptr;
if ((uint8_t)mode_val != 1) // Not Generic mode
return;
Condition 2: All Callees are SPMD-Amenable
Every call instruction reachable from the kernel's parallel regions must reference a function in the SPMD-amenable function set. This set lives at *(a2+208) + 34952 (base pointer) with capacity at offset +34968.
// SPMD-amenable lookup (open-addressing hash set)
bool is_spmd_amenable(void *func_ptr, void **table_base, uint64_t capacity) {
uint64_t hash = ((uintptr_t)func_ptr >> 9) ^ ((uintptr_t)func_ptr >> 4);
uint64_t slot = hash & (capacity - 1);
while (true) {
void *entry = table_base[slot];
if (entry == func_ptr) return true;
if (entry == (void*)-4096) return false; // empty sentinel
slot = (slot + 1) & (capacity - 1); // linear probe
}
}
Functions are pre-populated in this set if they have been analyzed as side-effect free (from the caller's perspective in SPMD context), or if the programmer annotated them with [[omp::assume("ompx_spmd_amenable")]]. When a callee fails this check, the pass takes Path A (non-SPMD candidate path, lines 1692-1806) and emits OMP121 for each offending call:
warning: Value has potential side effects preventing SPMD-mode execution.
Add `[[omp::assume("ompx_spmd_amenable")]]` to the called function
to override [OMP121]
The diagnostic is constructed via sub_B178C0 (warning constructor), message appended via sub_B18290, and emitted through sub_1049740 to the handler at *(a2+4392).
Condition 3: No Unresolvable Side Effects
The kernel must not contain operations that are inherently unsafe when executed by multiple threads simultaneously -- for example, I/O operations with ordering requirements, or accesses to thread-local storage that assumes single-thread access.
Legality Pseudocode
function is_spmd_eligible(kernel, module_ctx):
// Check current execution mode
mode = read_exec_mode(kernel.attributes)
if mode != GENERIC:
return false
// Scan all parallel regions
for region in kernel.parallel_regions:
for inst in region.instructions:
if is_call_like(inst): // opcode 34, 40, or 85
callee = get_callee(inst)
if callee.is_declaration:
if callee not in module_ctx.spmd_amenable_set:
emit_diagnostic(OMP121, inst.location,
"Value has potential side effects...")
return false
return true
The call-like instruction detection uses a bitmask test: (opcode - 34) <= 0x33 followed by bittest(0x8000000000041, opcode - 34). The mask sets bits 0, 6, and 51, so the test matches opcodes 34, 40, and 85 -- the three LLVM call-family instructions (call, callbr, and invoke; opcode 85 for invoke also appears in the Phase 2 exception-handler check below).
Transformation Algorithm
Once eligibility is confirmed, sub_26968A0 takes Path B (lines 407-1691). The path splits based on kernel complexity:
Simple Case: Single Parallel Region
When *(a1+160) == 0 and *(a1+224) == 0, the kernel has a single parallel region with no intervening serial code. This is the fast path (lines 432-672).
function transform_simple_spmd(kernel, module_ctx):
entry_bb = get_entry_block(kernel)
func_scope = get_function_scope(kernel)
thread_config = get_thread_configuration(kernel, module_ctx)
// 1. Create new basic blocks
user_code_bb = create_region("main.thread.user_code")
exit_bb = create_exit_block("exit.threads")
register_in_worklist(user_code_bb)
register_in_worklist(exit_bb)
// 2. Insert thread-id check at entry
tid = call __kmpc_get_hardware_thread_id_in_block() // runtime call ID 6
is_main = icmp eq tid, 0
br is_main, user_code_bb, exit_bb
// 3. Move original parallel body into user_code_bb
// (all threads execute this -- the parallel outlined fn
// is effectively inlined into the kernel)
// 4. Update execution mode: Generic(1) -> SPMD(2)
// Intermediate: set mode 3 (GENERIC_SPMD) then overwrite to 2
bv_entry = create_bitvector_entry(*(kernel+304+8), 3, 0)
current = read_attribute(*(kernel+304))
*(kernel+304) = insert_attribute(current, bv_entry, key=0, value=1)
// 5. Emit success diagnostic
if diagnostic_handler_registered(module_ctx+4392):
emit_remark(OMP120, "Transformed generic-mode kernel to SPMD-mode.")
The resulting CFG is straightforward:
entry:
%tid = call i32 @__kmpc_get_hardware_thread_id_in_block()
%is_main = icmp eq i32 %tid, 0
br i1 %is_main, label %user_code, label %exit.threads
user_code: ; all threads execute
... original parallel body ...
br label %exit.threads
exit.threads:
ret void
Complex Case: Multiple Parallel Regions
When the kernel contains multiple parallel regions with serial code between them, the pass executes a four-phase transformation (lines 720-1676).
Phase 1: Deduplicate Parallel Regions (lines 720-760)
Multiple parallel regions may call the same outlined function. The pass deduplicates by function pointer using an inline hash set:
function dedup_regions(parallel_regions):
seen = HashSet() // inline small-buffer optimization
unique = []
for region in parallel_regions:
fn_ptr = region.outlined_function // offset+40
if fn_ptr not in seen:
seen.insert(fn_ptr)
unique.append(region)
return unique
Phase 2: Identify Non-SPMD-Safe Instructions (lines 768-873)
For each parallel region, the pass walks the CFG successor chain and identifies instructions with side effects that are not SPMD-compatible:
function find_guarded_ranges(region, module_ctx):
ranges = []
first_unsafe = null
last_unsafe = null
for inst in walk_cfg_successors(region):
if is_side_effecting_call(inst):
// Skip known-safe calls (global dtors at module_ctx+208+32432)
if inst.callee == module_ctx.global_dtor_fn:
continue
// For invoke instructions: check if exception handler count is 0
if inst.opcode == 85: // invoke
if get_eh_handler_count(inst) == 0:
continue // can be simplified
if first_unsafe == null:
first_unsafe = inst
last_unsafe = inst
else:
if first_unsafe != null:
ranges.append((first_unsafe, last_unsafe))
first_unsafe = null
last_unsafe = null
if first_unsafe != null:
ranges.append((first_unsafe, last_unsafe))
return ranges
The pass then calls sub_B444E0 to insert guard instructions at each range boundary.
Phase 3: Build Guarded Region Descriptors (lines 876-1059)
Each parallel region is looked up in the function-to-region-tracker hash map at *(a2+144). This map uses a splitmix64-variant hash:
uint64_t hash_function_key(uint64_t name_hash, uint64_t addr_hash) {
uint64_t raw = name_hash ^ (16 * addr_hash);
uint64_t h = raw * 0xBF58476D1CE4E5B9ULL;
h = (h >> 31) ^ (h * 0x1CE4E5B9ULL);
return h;
}
The map stores 24-byte keys (module pointer, name pointer, auxiliary pointer) with a sentinel key of (-4096, qword_4FEE4D0, qword_4FEE4D8). Each entry's value (at +24) points to a guarded region tracker structure:
| Offset | Type | Description |
|---|---|---|
| +472 | i32 | Work counter |
| +480 | ptr | Block pointer array base |
| +488 | i64 | Capacity |
| +492 | i32 | Current size |
| +500 | i8 | Initialized flag |
Phase 4: Split and Rewire CFG (lines 1060-1670)
For each (first_instr, last_instr) pair identified in Phase 2, the pass creates five new basic blocks and rewires the CFG:
function create_guarded_region(first_instr, last_instr, module_ctx):
parent_bb = first_instr.parent
// 1. Split into 5 blocks
guarded_end_bb = split_block(parent_bb, after=last_instr, name="region.guarded.end")
barrier_bb = split_block(guarded_end_bb, at_start, name="region.barrier")
exit_bb = split_block(barrier_bb, at_start, name="region.exit")
guarded_bb = split_block(parent_bb, at=first_instr, name="region.guarded")
check_tid_bb = split_block(parent_bb, at=terminator, name="region.check.tid")
// 2. Register all blocks in worklist
for bb in [guarded_end_bb, barrier_bb, exit_bb, guarded_bb, check_tid_bb]:
register_in_worklist(bb)
// 3. Handle escaping values (shared memory promotion)
has_broadcast = false
for inst in guarded_bb:
outside_uses = [u for u in inst.uses if u.parent != guarded_bb]
if outside_uses:
has_broadcast = true
// Allocate shared memory for output
alloc = create_alloca(
type = inst.type,
address_space = 7, // shared memory
name = sanitize(inst.name) + ".guarded.output.alloc"
)
// Store result from master thread (inside guarded block)
create_store(inst, alloc, insert_in=guarded_bb)
// Load from all threads (after barrier)
load = create_load(
type = inst.type,
ptr = alloc,
name = sanitize(inst.name) + ".guarded.output.load",
insert_in = barrier_successor
)
// Rewrite all outside uses
replace_all_uses_outside(inst, load, guarded_bb)
// 4. Insert thread-id check
tid = call __kmpc_get_hardware_thread_id_in_block() // call ID 6
cmp = icmp eq tid, 0
br cmp, guarded_bb, barrier_bb
// 5. Insert SPMD barrier
call __kmpc_barrier_simple_spmd(ident, tid) // call ID 187
// 6. If broadcast values exist, insert second barrier after loads
if has_broadcast:
call __kmpc_barrier_simple_spmd(ident, tid) // ensures loads complete
The resulting CFG for a complex kernel with serial code between two parallel regions:
entry:
...
region.check.tid:
%tid = call i32 @__kmpc_get_hardware_thread_id_in_block()
%cmp = icmp eq i32 %tid, 0
br i1 %cmp, label %region.guarded, label %region.barrier
region.guarded: ; master thread only
... serial code ...
store %result, %shared_mem ; broadcast output
br label %region.guarded.end
region.guarded.end:
br label %region.barrier
region.barrier:
call void @__kmpc_barrier_simple_spmd(%ident, %tid)
%result = load %shared_mem ; all threads read
call void @__kmpc_barrier_simple_spmd(%ident, %tid) ; if broadcast
br label %region.exit
region.exit:
... next parallel region (all threads) ...
Name Sanitization
Output variable names are sanitized for use as global symbol names. Non-alphanumeric, non-underscore characters are replaced with .:
// Identical logic in both cicc and upstream LLVM
char sanitize_char(char c) {
if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
(c >= '0' && c <= '9') || c == '_')
return c;
return '.';
}
Shared Memory Output Promotion
When a value computed inside a guarded region (master-only code) is needed by all threads after the barrier, the pass promotes it through shared memory. This is the cicc implementation of what upstream LLVM calls "broadcast values." The sequence is:
- Allocate: sub_B30000 creates an address-space-7 (shared/local) allocation with suffix .guarded.output.alloc. The allocation node is 80 bytes, subtype 7.
- Store: sub_B4D460 emits a store from the master thread's computed value into shared memory. Placed inside the guarded block, before the branch to region.guarded.end.
- First barrier: __kmpc_barrier_simple_spmd (runtime call ID 187) ensures the store is globally visible to all threads in the CTA.
- Load: sub_B4D230 emits a load from shared memory with suffix .guarded.output.load. Placed in the barrier successor block so all threads read the broadcast value.
- Second barrier: If broadcast values exist, a second __kmpc_barrier_simple_spmd call ensures all threads have completed their loads before the shared memory is potentially reused.
- Use rewriting: sub_256E5A0 replaces every use of the original value outside the guarded block with the loaded value.
State Machine Elimination
The state machine elimination is the core performance win of the SPMD transformation. Understanding the state machine that gets eliminated -- and its fallback generator -- is essential for reimplementation.
Generic-Mode Worker State Machine (What Gets Eliminated)
In Generic mode, __kmpc_target_init (runtime call ID 155) returns -1 for all threads except thread 0 (the master). The kernel entry code branches on this return value: thread 0 falls through to user code, while threads 1..N-1 jump to the worker state machine loop. This loop is the performance bottleneck that the SPMD transformation eliminates.
The complete Generic-mode kernel structure, as generated by the runtime and optionally customized by sub_2678420:
// Generic mode kernel entry (before SPMD transformation)
void __omp_offloading_kernel(KernelEnvironmentTy *env, KernelLaunchEnvironmentTy *launch_env) {
int ret = __kmpc_target_init(env, launch_env); // [155]
if (ret == -1)
goto worker_state_machine;
// === MASTER THREAD (thread 0) ===
// User code: serial sections + parallel dispatch
...
__kmpc_kernel_prepare_parallel(outlined_fn_ptr); // [157] signal workers
__kmpc_barrier_simple_generic(loc, gtid); // [188] wake workers
// ... workers execute outlined_fn ...
__kmpc_barrier_simple_generic(loc, gtid); // [188] wait for workers
// ... more serial code ...
__kmpc_target_deinit(); // [156]
return;
worker_state_machine:
// === WORKER THREADS (threads 1..N-1) ===
// sub_2678420 generates this structure with these exact labels:
worker_state_machine.begin:
__kmpc_barrier_simple_generic(loc, gtid); // [188] poll barrier
.is_active.check:
bool active = __kmpc_kernel_parallel(&fn); // [171] check for work
if (!active)
goto .done.barrier;
.parallel_region.check:
if (fn == known_outlined_fn_1)
goto .parallel_region.execute;
// ... more checks for known outlined functions ...
goto .fallback.execute;
.parallel_region.execute:
known_outlined_fn_1(args); // direct call (devirtualized)
goto .done.barrier;
.fallback.execute:
fn(args); // indirect call (generic)
.done.barrier:
__kmpc_kernel_end_parallel(); // [172] signal completion
__kmpc_barrier_simple_generic(loc, gtid); // [188] sync barrier
goto worker_state_machine.begin;
.finished:
return;
}
The state machine consumes five runtime calls per parallel-region invocation per worker thread: two __kmpc_barrier_simple_generic (ID 188) for poll/sync barriers, one __kmpc_kernel_parallel (ID 171) to check for dispatched work, one indirect or direct call to the outlined function, and one __kmpc_kernel_end_parallel (ID 172) to signal completion. Each __kmpc_barrier_simple_generic call compiles to a poll loop on a shared-memory flag -- not a hardware bar.sync -- because the generic barrier must handle the asymmetric wakeup protocol where the master thread signals workers through __kmpc_kernel_prepare_parallel.
Worker State Machine Generator: sub_2678420 (41 KB)
When the SPMD transformation fails (eligibility flag *(a1+241) == 0), cicc falls back to sub_2678420, which builds a customized state machine that is more efficient than the default runtime state machine. The customization replaces the indirect fn(args) call in .fallback.execute with a direct-call dispatch table when the set of outlined parallel functions is statically known.
| Property | Value |
|---|---|
| Function address | sub_2678420 |
| Decompiled size | 41 KB |
| Basic block labels | worker_state_machine.begin, .is_active.check, .parallel_region.check, .parallel_region.execute, .fallback.execute, .done.barrier, .finished |
| Diagnostics | OMP130, OMP131, OMP132, OMP133 |
The generator has two modes:
Mode 1: Remove unused state machine (OMP130). When the kernel has zero parallel regions (e.g., a #pragma omp target with no nested parallel), the state machine is dead code. sub_2678420 removes the entire worker loop and emits: "Removing unused state machine from generic-mode kernel." (OMP130).
Mode 2: Rewrite with customized dispatch (OMP131). When the kernel has N known parallel regions, the generator builds a switch/cascade of direct-call comparisons in .parallel_region.check and .parallel_region.execute, avoiding the overhead of indirect calls through __kmpc_kernel_parallel's function pointer. It emits: "Rewriting generic-mode kernel with a customized state machine." (OMP131).
// Customized state machine pseudocode (sub_2678420 output)
function build_custom_state_machine(kernel, parallel_regions):
// Create the 7 basic blocks with the labels above
begin_bb = create_block("worker_state_machine.begin")
active_bb = create_block(".is_active.check")
check_bb = create_block(".parallel_region.check")
exec_bb = create_block(".parallel_region.execute")
fallback_bb = create_block(".fallback.execute")
barrier_bb = create_block(".done.barrier")
finished_bb = create_block(".finished")
// Entry: poll barrier
in begin_bb:
call __kmpc_barrier_simple_generic(loc, gtid) // [188]
br .is_active.check
// Check if master dispatched work
in active_bb:
%active = call i1 @__kmpc_kernel_parallel(&fn) // [171]
br %active, .parallel_region.check, .done.barrier
// Devirtualized dispatch: compare fn pointer against known functions
in check_bb:
for i, region in enumerate(parallel_regions):
%cmp = icmp eq fn, @outlined_fn_i
br %cmp, .parallel_region.execute.i, next_check
br .fallback.execute // no match -- use indirect call
// Direct call to known function (avoids indirect branch penalty)
in exec_bb:
for each matched region:
call @outlined_fn_i(args)
br .done.barrier
// Fallback: indirect call (should be unreachable if analysis is complete)
in fallback_bb:
call fn(args) // indirect
br .done.barrier
// End parallel + sync barrier
in barrier_bb:
call __kmpc_kernel_end_parallel() // [172]
call __kmpc_barrier_simple_generic(loc, gtid) // [188]
br worker_state_machine.begin
// Optional: exit (reached via __kmpc_target_deinit signaling)
in finished_bb:
ret void
The runtime calls consumed by sub_2678420:
| Call ID | Function | Role in State Machine |
|---|---|---|
| 155 | __kmpc_target_init | Kernel entry; returns -1 for workers |
| 156 | __kmpc_target_deinit | Kernel exit cleanup |
| 157 | __kmpc_kernel_prepare_parallel | Master signals workers with outlined fn pointer |
| 171 | __kmpc_kernel_parallel | Worker checks if work is dispatched; returns fn ptr |
| 172 | __kmpc_kernel_end_parallel | Worker signals completion of parallel region |
| 188 | __kmpc_barrier_simple_generic | Poll-based barrier (shared-memory flag loop) |
SPMD Amenability Analysis Pipeline
The eligibility flag at *(a1+241) -- which gates whether sub_26968A0 attempts the SPMD transformation -- is computed by the Attributor-based OpenMP optimization driver at sub_269F530 (63 KB). This driver orchestrates interprocedural fixed-point analysis using the standard LLVM Attributor framework.
The analysis pipeline:
sub_269F530 (OpenMP Attributor Driver, 63 KB)
|
+-- sub_251BBC0 (AbstractAttribute infrastructure)
| Creates abstract attributes for each kernel, including
| the SPMD-compatibility tracker that will become a1+241.
|
+-- sub_251CD10 (Attributor::runTillFixpoint, 53 KB)
| Iterates up to openmp-opt-max-iterations (default: 256)
| times, updating abstract attribute states until convergence.
|
+-- sub_26747F0 (OpenMP kernel info collector)
Populates the PassState structure (a1) with:
a1+72: function handle
a1+160: serial-code-present flag
a1+224: multiple-region flag
a1+241: SPMD-eligible boolean <-- the gate
a1+280: parallel region array base
a1+288: parallel region count
a1+304: execution mode attribute map
The fixed-point analysis in sub_251CD10 converges by iterating over all abstract attributes until none change state. For SPMD eligibility, the key attribute tracks three conditions that must all hold:
- Execution mode is Generic (mode byte == 1). Read via sub_2674090/sub_2674040 from the kernel's attribute map at *(a1+304). If the kernel is already SPMD or Bare, no transformation is needed.
- All reachable callees are SPMD-amenable. The analysis walks every call/invoke/callbr instruction in every parallel region of the kernel. Each callee is looked up in the SPMD-amenable function set at *(a2+208)+34952. This set is populated by two sources:
  - Automatic population: When sub_312CF50 (the 194-case runtime declaration factory) creates a runtime function declaration, that function is automatically added to the set if it is known to be thread-safe (most __kmpc_* functions, all omp_* query functions).
  - User annotation: Functions declared with [[omp::assume("ompx_spmd_amenable")]] are inserted into the set by the attribute parser.
  The set uses the standard DenseMap infrastructure with LLVM-layer sentinels (-4096 / -8192); see Hash Table and Collection Infrastructure. If any callee fails the lookup, the analysis sets *(a1+241) = 0 and the transformation will emit OMP121 diagnostics instead.
- No unresolvable side effects. Operations that are inherently unsafe when executed by all threads simultaneously -- such as I/O with ordering requirements, thread-local storage accesses assuming single-thread semantics, or calls to external functions with unknown side-effect profiles -- prevent SPMDization.
The Attributor driver at sub_269F530 also feeds into sub_2678420 (state machine generator) for kernels that fail SPMD eligibility, and into sub_2680940 (parallel region merging) for kernels that pass. The decision tree:
sub_269F530 analysis complete
|
+-- a1+241 == 1 (SPMD-eligible)
| |
| +-- a1+160 == 0 && a1+224 == 0 --> sub_26968A0 simple path
| +-- otherwise --> sub_26968A0 complex path
|
+-- a1+241 == 0 (not SPMD-eligible)
|
+-- has parallel regions --> sub_2678420 (custom state machine)
+-- no parallel regions --> sub_2678420 (remove dead state machine)
How the SPMD Transform Eliminates the State Machine
The actual elimination happens in sub_26968A0 and proceeds differently for simple vs. complex kernels, but the core mechanism is the same: replace the asymmetric master/worker execution model with symmetric all-thread execution.
Step 1: Remove the __kmpc_target_init return-value gate. In Generic mode, __kmpc_target_init returns -1 for workers and the kernel branches workers to the state machine loop. In SPMD mode, the return value is not used as a gate -- all threads fall through to user code. The transformation does not literally delete the __kmpc_target_init call (it is still needed for runtime initialization), but changes the execution mode attribute so the runtime initializes all threads as active.
Step 2: Eliminate the worker loop entirely. The basic blocks worker_state_machine.begin, .is_active.check, .parallel_region.check, .parallel_region.execute, .fallback.execute, .done.barrier, and .finished become dead code once the execution mode flips to SPMD. They are not explicitly deleted by sub_26968A0; instead, setting mode=2 in the KernelEnvironmentTy means the runtime never creates the worker branch, so the dead blocks are eliminated by subsequent DCE passes.
Step 3: Replace barrier primitives. Every __kmpc_barrier_simple_generic (ID 188) in the kernel is replaced with __kmpc_barrier_simple_spmd (ID 187). The difference:
- Generic barrier (ID 188): poll-based. Workers spin-check a shared-memory flag. The master writes the flag, then workers read it. This involves memory fences, cache-line bouncing, and potential bank conflicts. Compiles to a ld.volatile.shared + branch loop.
- SPMD barrier (ID 187): hardware-based. Maps directly to PTX bar.sync / CUDA __syncthreads(). Single instruction, handled by the warp scheduler with zero polling overhead.
Step 4: Guard serial code. For the simple case (single parallel region), this is just:
%tid = call i32 @__kmpc_get_hardware_thread_id_in_block() ; [6]
%is_main = icmp eq i32 %tid, 0
br i1 %is_main, label %user_code, label %exit.threads
For the complex case (multiple parallel regions with serial gaps), the 5-block guarded region structure is created for each serial section, with shared-memory output promotion and double-barrier synchronization as described in Phase 4 above.
Step 5: Update execution mode. The kernel attribute is rewritten from Generic (1) to SPMD (2) via the intermediate GENERIC_SPMD (3) marker. This is the final, irreversible step. Once the mode is set, __kmpc_target_init at runtime will launch all threads into user code instead of routing N-1 threads to a state machine.
Performance Impact of Elimination
The state machine elimination saves:
| Source of overhead | Generic mode | SPMD mode | Savings |
|---|---|---|---|
| Worker idle polling | N-1 threads spin in __kmpc_barrier_simple_generic | No idle threads | 100% of idle cycles |
| Barrier latency | Poll-based shared-memory loop (10s-100s of cycles) | Hardware bar.sync (single cycle dispatch) | ~10-100x per barrier |
| Warp divergence at entry | Warp 0 diverges (thread 0 = master, threads 1-31 = workers) | No divergence | 1 warp fully utilized |
| Indirect calls | __kmpc_kernel_parallel returns fn ptr for indirect dispatch | No indirect calls -- outlined fn body inlined/direct | Branch predictor pressure eliminated |
| Register pressure | Workers hold state machine registers while idle | No state machine registers | Improved occupancy |
| Shared memory | Generic barriers use shared-memory flags | Only guarded-output allocations use shared memory | Reduced shared memory pressure |
On a typical #pragma omp target parallel for kernel, the SPMD transformation eliminates 5 runtime calls per parallel-region per worker-thread per iteration of the state machine loop. For a 256-thread CTA with one parallel region, that is 255 threads x 5 calls = 1,275 eliminated runtime calls per kernel invocation.
Execution Mode Update
When the transformation succeeds, the kernel's execution mode attribute is updated from Generic (1) to SPMD (2). The update goes through an intermediate GENERIC_SPMD (3) state:
// At LABEL_227 (shared success path)
bv_entry = sub_ACD640(*(a1+304+8), /*mode=*/3, /*aux=*/0); // create mode-3 entry
current = sub_2673FD0(*(a1+304)); // read current attrs
*(a1+304) = sub_AAAE30(current, bv_entry, {key=0}, 1); // write SPMD mode
The execution mode encoding matches upstream LLVM's OMPTgtExecModeFlags:
| Value | Name | Meaning |
|---|---|---|
| 0 | OMP_TGT_EXEC_MODE_BARE | Bare mode (no runtime) |
| 1 | OMP_TGT_EXEC_MODE_GENERIC | Generic (state machine) |
| 2 | OMP_TGT_EXEC_MODE_SPMD | SPMD (all threads active) |
| 3 | OMP_TGT_EXEC_MODE_GENERIC_SPMD | Generic-mode kernel being converted to SPMD (intermediate marker) |
The mode is stored in the KernelEnvironmentTy global variable that __kmpc_target_init reads at kernel launch. Setting it to SPMD tells the runtime to skip the state machine setup and launch all threads directly into user code.
Limitations: What Prevents SPMDization
The following constructs cause the pass to emit OMP121 and fall back to Generic mode:
- Calls to non-SPMD-amenable functions: Any callee not in the SPMD-amenable set blocks transformation. The user override is [[omp::assume("ompx_spmd_amenable")]].
- Nested parallelism: Kernels with nested #pragma omp parallel regions inside a target region cannot be SPMDized because the worker threads are already participating.
- Tasking constructs: #pragma omp task, taskloop, and taskgroup create runtime-managed work units incompatible with the SPMD execution model.
- Critical sections and ordered regions: These constructs require specific thread-identity semantics that conflict with SPMD guards.
- Unresolvable side effects: Calls to external functions whose side-effect profile is unknown (no declaration with convergent or spmd_amenable annotations).
- Exception handling with unresolvable handlers: Invoke instructions with non-zero exception handler counts that cannot be simplified block the transformation (checked via sub_BD2BC0).
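The amenability check in the first and fifth bullets reduces to a set-membership test. A minimal sketch, assuming the set semantics described above (the real binary walks the open-addressing table at *(a2+208)+34952; names and containers here are stand-ins):

```cpp
#include <cassert>
#include <set>
#include <string>

// Illustrative predicate: a call site blocks SPMDization unless the callee
// is in the SPMD-amenable set or carries the user override
// [[omp::assume("ompx_spmd_amenable")]]. A true result triggers OMP121.
bool callBlocksSPMD(const std::string &callee,
                    const std::set<std::string> &amenableSet,
                    bool calleeHasSpmdAmenableAssume) {
  if (calleeHasSpmdAmenableAssume)
    return false;  // user override wins
  return amenableSet.count(callee) == 0;  // unknown side effects
}
```
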
Comparison with Upstream LLVM OpenMPOpt
The cicc SPMD transformation in sub_26968A0 is a proprietary reimplementation that predates upstream LLVM's SPMDization and differs in several significant ways:
| Aspect | Upstream LLVM OpenMPOpt | cicc sub_26968A0 |
|---|---|---|
| Framework | Attributor-based (AAKernelInfo) | Standalone pass, direct IR mutation |
| Analysis approach | Fixed-point iteration via SPMDCompatibilityTracker | Pre-computed boolean flag at a1+241 |
| Guarded regions | insertInstructionGuardsHelper using SplitBlock | Custom 5-block split with explicit worklist registration |
| Broadcast mechanism | GlobalVariable in shared memory (internal linkage, UndefValue init) | alloca in address space 7 (shared) via sub_B30000 |
| Barrier | __kmpc_barrier_simple_spmd | Same: __kmpc_barrier_simple_spmd (call ID 187) |
| Hash tables | LLVM DenseSet / SmallPtrSet | Custom open-addressing with -4096 sentinel (details) |
| Region merging | Separate openmp-opt-enable-merging flag (disabled by default) | Integrated into the complex path; always runs when needed |
| State machine fallback | buildCustomStateMachine in same AAKernelInfo::manifest | Separate function sub_2678420 (41 KB) |
| Diagnostic IDs | OMP120, OMP121 (identical) | OMP120, OMP121 (identical) |
| ompx_spmd_amenable override | Same attribute name | Same attribute name |
The key architectural difference is that upstream LLVM uses the Attributor framework's fixed-point iteration to converge on SPMD compatibility, while cicc separates the analysis (which sets a1+241) from the transformation (which is sub_26968A0). This separation allows cicc to make a single pass over the IR for the transformation rather than iterating to a fixpoint, at the cost of less flexibility in handling interdependent kernels.
Upstream's region merging is behind openmp-opt-enable-merging and disabled by default. cicc's complex path (Phase 3a-3d) performs region merging unconditionally when a kernel has multiple parallel regions with serial gaps, suggesting NVIDIA found merging beneficial enough for GPU targets to enable it by default.
Configuration Knobs
All knobs are standard LLVM cl::opt registrations present in the cicc binary. These match upstream LLVM options:
| Knob | Type | Default | Effect |
|---|---|---|---|
| openmp-opt-disable | bool | false | Disables all OpenMP optimizations |
| openmp-opt-disable-spmdization | bool | false | Disables SPMD transformation specifically |
| openmp-opt-disable-deglobalization | bool | false | Disables device memory deglobalization |
| openmp-opt-disable-folding | bool | false | Disables OpenMP folding optimizations |
| openmp-opt-disable-state-machine-rewrite | bool | false | Disables custom state machine generation |
| openmp-opt-disable-barrier-elimination | bool | false | Disables barrier elimination optimizations |
| openmp-opt-disable-internalization | bool | false | Disables function internalization |
| openmp-opt-enable-merging | bool | false | Enables parallel region merging (disabled upstream by default; cicc's complex path merges regardless) |
| openmp-opt-inline-device | bool | false | Inlines all applicable device functions |
| openmp-opt-verbose-remarks | bool | false | Enables more verbose optimization remarks |
| openmp-opt-max-iterations | unsigned | 256 | Maximum attributor fixpoint iterations |
| openmp-opt-shared-limit | unsigned | UINT_MAX | Maximum shared memory usage for broadcast values |
| openmp-opt-print-module-before | bool | false | Dumps IR before OpenMP optimizations |
| openmp-opt-print-module-after | bool | false | Dumps IR after OpenMP optimizations |
Note: The openmp-opt-shared-limit knob controls how much shared memory can be consumed by broadcast value allocations in guarded regions. If the limit is exceeded, the transformation will not proceed for additional guarded outputs. The default of UINT_MAX effectively means no limit.
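The budget semantics of the note above can be modeled as a running allocation counter. This is a sketch under the assumption that the pass accounts per guarded output; the real accounting granularity is not confirmed from the binary:

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Illustrative model of the openmp-opt-shared-limit budget: each broadcast
// allocation draws from a running total, and a guarded output whose size
// would exceed the limit is simply not promoted to shared memory.
struct SharedBudget {
  uint64_t limit = std::numeric_limits<unsigned>::max();  // UINT_MAX default
  uint64_t used = 0;

  // Returns true if the allocation fits; false means the transformation
  // skips this guarded output.
  bool tryAllocate(uint64_t bytes) {
    if (used + bytes > limit)
      return false;
    used += bytes;
    return true;
  }
};
```
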
Diagnostic Strings
| Code | Severity | Message | Trigger |
|---|---|---|---|
| OMP120 | Remark | "Transformed generic-mode kernel to SPMD-mode." | Successful transformation (both simple and complex paths) |
| OMP121 | Warning | "Value has potential side effects preventing SPMD-mode execution. Add [[omp::assume(\"ompx_spmd_amenable\")]] to the called function to override" | Callee not in SPMD-amenable set |
| OMP130-OMP133 | Various | State machine diagnostics | sub_2678420 (fallback, not this pass) |
| OMP150 | Remark | Parallel region merging | sub_2697xxx (separate merging diagnostics) |
Diagnostics are emitted only when a handler is registered at *(a2+4392) and the handler's isEnabled virtual method (vtable offset +48) returns true. The construction follows the pattern: sub_B174A0 (remark) or sub_B178C0 (warning) builds a DiagnosticInfo, sub_B18290 appends the message text, and sub_1049740 emits to the handler.
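The two-step gate (handler present, handler enabled) can be sketched as follows. Only the check order is taken from the decompilation; the class shape, names, and the CountingHandler are invented for illustration:

```cpp
#include <cassert>
#include <string>

// Sketch of the diagnostic gating: a handler pointer lives at *(a2+4392)
// and its isEnabled virtual (vtable offset +48 in the binary) gates emission.
struct DiagnosticHandler {
  virtual ~DiagnosticHandler() = default;
  virtual bool isEnabled() const = 0;
  virtual void handle(const std::string &msg) = 0;
};

// sub_B174A0/sub_B178C0 build the DiagnosticInfo, sub_B18290 appends the
// message text; this models the final emission step (sub_1049740).
bool emitDiagnostic(DiagnosticHandler *handler, const std::string &msg) {
  if (handler == nullptr || !handler->isEnabled())
    return false;  // silently dropped: no handler, or handler disabled
  handler->handle(msg);
  return true;
}

// Minimal concrete handler for demonstration.
struct CountingHandler final : DiagnosticHandler {
  bool enabled = true;
  int count = 0;
  bool isEnabled() const override { return enabled; }
  void handle(const std::string &) override { ++count; }
};
```
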
Runtime Call Dependencies
The transformation uses these runtime functions from the OpenMP runtime declaration table:
| Call ID | Function | Signature | Usage |
|---|---|---|---|
| 6 | __kmpc_get_hardware_thread_id_in_block | i32() | Thread identification for tid == 0 guards |
| 180 | __kmpc_alloc_shared | i8*(i64) | Allocate shared memory for guarded output promotion (complex path) |
| 181 | __kmpc_free_shared | void(i8*, i64) | Free shared memory allocations at kernel exit (complex path) |
| 187 | __kmpc_barrier_simple_spmd | void(ident_t*, i32) | Lightweight SPMD barrier (maps to PTX bar.sync) |
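The runtime calls above combine into the guard-and-broadcast shape: the hardware thread ID feeds a tid == 0 guard, the single result lands in shared storage (the addrspace(7) alloca), and the SPMD barrier separates producer from consumers. A sequential miniature, with the loop over tids standing in for the GPU's parallel threads:

```cpp
#include <cassert>
#include <vector>

// Sequential model of one guarded statement: only "thread" 0 executes it,
// the result is broadcast through shared storage, and every thread reads
// it back after the barrier point.
std::vector<int> guardedBroadcast(int numThreads) {
  int sharedSlot = 0;  // models the shared-memory (addrspace 7) alloca
  for (int tid = 0; tid < numThreads; ++tid)
    if (tid == 0)       // guarded region: one thread executes
      sharedSlot = 42;  // the side-effecting statement
  // --- __kmpc_barrier_simple_spmd would synchronize here ---
  std::vector<int> perThread(numThreads);
  for (int tid = 0; tid < numThreads; ++tid)
    perThread[tid] = sharedSlot;  // every thread observes the broadcast
  return perThread;
}
```
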
The state machine fallback (sub_2678420) uses a different set of runtime calls, all of which become dead code after successful SPMD transformation:
| Call ID | Function | Signature | Eliminated by SPMD |
|---|---|---|---|
| 155 | __kmpc_target_init | i32(KernelEnvironmentTy*, KernelLaunchEnvironmentTy*) | Return value no longer gates workers |
| 156 | __kmpc_target_deinit | void() | Retained (still needed for cleanup) |
| 157 | __kmpc_kernel_prepare_parallel | void(i8*) | Eliminated -- no worker dispatch needed |
| 171 | __kmpc_kernel_parallel | i1(i8**) | Eliminated -- no worker polling loop |
| 172 | __kmpc_kernel_end_parallel | void() | Eliminated -- no worker completion signal |
| 188 | __kmpc_barrier_simple_generic | void(ident_t*, i32) | Replaced with ID 187 (SPMD barrier) |
Additionally, the SPMD-amenable function set at *(a2+208)+34952 is populated by the runtime table builder (sub_312CF50) during module initialization. Functions declared via sub_312CF50 cases 0-193 are automatically considered, along with user-annotated functions.
Function Map
| Role | Function | Address | Size |
|---|---|---|---|
| Generic-to-SPMD transformation pass (this function, 61 KB) | sub_26968A0 | -- | -- |
| Worker state machine generation (Generic fallback, 41 KB) | sub_2678420 | -- | -- |
| Attributor-based OpenMP optimization driver (63 KB, sets a1+241) | sub_269F530 | -- | -- |
| Parallel region merging (52 KB) | sub_2680940 | -- | -- |
| AbstractAttribute infrastructure (Attributor framework) | sub_251BBC0 | -- | -- |
| Attributor::runTillFixpoint (53 KB, fixed-point iteration engine) | sub_251CD10 | -- | -- |
| OpenMP kernel info collector (populates PassState) | sub_26747F0 | -- | -- |
| Attributor Module Pass entry point (51 KB) | sub_2591C20 | -- | -- |
| Read execution mode from attribute map | sub_2674090 | -- | -- |
| Read execution mode (alternate entry) | sub_2674040 | -- | -- |
| Get parallel region thread configuration | sub_250CBE0 | -- | -- |
| Read attribute from kernel attribute map | sub_2673FD0 | -- | -- |
| Create secondary barrier call | sub_2673A60 | -- | -- |
| OpenMP runtime call table lookup by ID (194-case switch, 117 KB) | sub_312CF50 | -- | -- |
| registerRuntimeFunction (registers declaration in table) | sub_3122A50 | -- | -- |
| Parallel region outliner (47 KB, creates .omp_par functions) | sub_313D1B0 | -- | -- |
| Get function entry basic block | sub_25096F0 | -- | -- |
| Get function scope / debug info | sub_BD5C60 | -- | -- |
| Build CFG region (start/end blocks) | sub_AA8550 | -- | -- |
| Build exit/cleanup block | sub_AA4D50 | -- | -- |
| Split basic block | sub_F36960 | -- | -- |
| Allocate IR instruction node | sub_BD2C40 | -- | -- |
| Fill instruction as runtime-call value load | sub_B4A410 | -- | -- |
| Create integer constant (zero for tid check) | sub_AD64C0 | -- | -- |
| Create integer constant (alternate entry, used in complex path) | sub_AD6530 | -- | -- |
| Create icmp instruction | sub_B52500 | -- | -- |
| Create branch instruction (opcode 3) | sub_B4C9A0 | -- | -- |
| Create shared-memory alloca (addr space 7) | sub_B30000 | -- | -- |
| Create store instruction | sub_B4D460 | -- | -- |
| Create load instruction | sub_B4D230 | -- | -- |
| Replace all uses of a value | sub_256E5A0 | -- | -- |
| Create runtime library call instruction | sub_921880 | -- | -- |
| Create bit-vector entry | sub_ACD640 | -- | -- |
| Insert into attribute map | sub_AAAE30 | -- | -- |
| Register block in pass manager worklist | sub_D695C0 | -- | -- |
| Construct remark DiagnosticInfo | sub_B174A0 | -- | -- |
| Construct warning DiagnosticInfo | sub_B178C0 | -- | -- |
| Append string to diagnostic message | sub_B18290 | -- | -- |
| Emit diagnostic to handler | sub_1049740 | -- | -- |
| Check if instruction is a call | sub_B46970 | -- | -- |
| Check if instruction is an invoke | sub_B46420 | -- | -- |
| Get invoke exception handler count | sub_BD2BC0 | -- | -- |
| Insert guard instructions at range boundary | sub_B444E0 | -- | -- |
| Fast-path comparison instruction creation | sub_AAB310 | -- | -- |
| Full comparison instruction creation | sub_B523C0 | -- | -- |
| Build name from debug info + suffix | sub_CA0F50 | -- | -- |
| Ref-count increment on metadata/debug-info | sub_B96E90 | -- | -- |
| Ref-count decrement on metadata/debug-info | sub_B91220 | -- | -- |
| Transfer metadata ownership between blocks | sub_B976B0 | -- | -- |
| Get terminator's successor block pointer | sub_986580 | -- | -- |
| Add operand bundle to instruction | sub_B99FD0 | -- | -- |
| Duplicate metadata reference | sub_266EF50 | -- | -- |
| Process entry block terminator successor | sub_B491C0 | -- | -- |
| Get instruction value type | sub_ACA8A0 | -- | -- |
| Get IR node name | sub_BD5D20 | -- | -- |
| Vector push_back (dynamic arrays) | sub_C8CC70 | -- | -- |
| Vector reserve/grow | sub_C8D5F0 | -- | -- |
Cross-References
- OpenMP Runtime Declaration Table -- complete runtime function table (sub_312CF50), including __kmpc_barrier_simple_spmd (ID 187) and __kmpc_get_hardware_thread_id_in_block (ID 6)
- Entry Point & CLI -- how OpenMP target offloading flags reach the optimizer
- LLVM Optimizer -- pipeline slots 75/76/154 where openmp-opt runs
- CLI Flags -- openmp-opt-* knob documentation
LTO & Module Optimization
CICC v13.0 implements Link-Time Optimization as a five-pass pipeline that exploits the GPU's closed-world compilation model for optimization opportunities unavailable to CPU compilers. In CPU LTO, the linker merges partially-optimized object files and runs a second round of optimization on the combined module. The fundamental constraint is that shared libraries, dynamic loading, and symbol interposition limit what the optimizer can assume about the complete program. On GPU, none of these constraints exist. Every __device__ function that can execute on the hardware must be statically visible at compile time -- there is no device-side dlopen, no .so files, no PLT/GOT, no symbol preemption. This closed-world guarantee means the LTO pipeline can inline aggressively across translation units, devirtualize every virtual call site against a complete class hierarchy, and promote or split global variables with full knowledge that no external observer will access the original symbols.
The LTO pipeline runs after the main LLVM optimizer (tier 0-3 passes) has performed per-module optimization. It is triggered when cicc processes bitcode from separate compilation (nvcc --device-c / -dc mode), where each .cu file compiles to a relocatable device object containing LLVM bitcode in the NVVM container. The device linker (nvlink) merges these objects and reinvokes cicc in LTO mode, passing the combined bitcode through the LTO pipeline before final PTX emission. In whole-program compilation (the default), the pipeline is still partially active -- GlobalOpt and the inliner run regardless, but the summary-based import machinery is skipped because there is only one module.
| LTO pipeline entry | sub_12F5F30 (0x12F5F30, 37.8 KB) |
| NVModuleSummary driver | sub_D81040 (0xD81040, 56 KB) |
| Summary builder | sub_D7D4E0 (0xD7D4E0, 74 KB) |
| Address range (summary cluster) | 0xD60000--0xD82000 |
| Address range (import/inline cluster) | 0x1850000--0x186CA00 |
| NVVM container IRLevel for LTO | NVVM_IR_LEVEL_LTO (value 1) |
| Compile mode for separate compilation | NVVM_COMPILE_MODE_SEPARATE_ABI (value 2) |
| Module flags read | EnableSplitLTOUnit, UnifiedLTO, ThinLTO |
Why LTO Matters for GPU
Three properties of GPU execution make LTO dramatically more valuable than on CPU:
Function calls are expensive. Every GPU function call marshals arguments through the .param calling convention via st.param / ld.param instruction sequences. A function with 8 struct arguments can generate hundreds of cycles of marshaling overhead that inlining eliminates entirely. Cross-module inlining -- which requires LTO -- is the primary mechanism for removing this cost for functions defined in separate translation units. See the inliner cost model for the full cost analysis.
Register pressure determines performance. Occupancy is bounded by per-thread register usage, with discrete cliff boundaries. Call boundaries force the backend to save and restore registers across the call site, often spilling to local memory (device DRAM, 200-800 cycle latency). LTO enables cross-module inlining, which in turn enables cross-function register allocation -- the single most impactful optimization for GPU code.
Indirect calls are catastrophic. An indirect call in PTX (call.uni through a register) prevents backend inlining, forces full register spills, destroys instruction scheduling freedom, and creates warp-divergence hazards. Whole-program devirtualization, which requires LTO-level visibility of the complete type hierarchy, converts indirect calls to direct calls and enables all downstream optimizations.
Regular LTO vs ThinLTO
CICC supports both regular (monolithic) LTO and ThinLTO. The LTO driver at sub_D81040 reads three module flags via sub_BA91D0 to determine which mode is active:
| Module Flag | Effect |
|---|---|
| EnableSplitLTOUnit | Enables the split LTO unit mechanism for type metadata |
| UnifiedLTO | Enables LLVM's unified LTO pipeline (combined thin+regular) |
| ThinLTO | Activates summary-based import and the two-phase declaration merge in sub_D7D4E0 |
Regular LTO merges all translation units into a single LLVM module, then runs the full optimization pipeline on the merged result. This gives the optimizer complete visibility but has O(n) memory cost in the total program size and serializes compilation. For GPU programs this is often acceptable because device code is typically smaller than host code.
ThinLTO builds per-module summaries (via NVModuleSummary), uses the summaries to make import decisions without loading full bitcode, then imports selected functions and optimizes each module independently. The builder's a8 parameter (thinlto_mode flag) activates Phase 2 of the summary builder, which performs a second walk over declarations to merge forward-declared and defined symbol tables. This mode enables parallel per-module optimization at the cost of less global visibility.
In practice, NVIDIA's toolchain (nvcc + nvlink) uses regular LTO as the default for device code, because the closed-world model and relatively small code size (compared to CPU programs) make the memory and compile-time cost acceptable. ThinLTO is available for large CUDA programs where compile time is a concern, activated by passing -dlto to nvcc (device LTO) or -flto=thin through the driver.
LTO Pipeline
The LTO pipeline executes five major passes in a fixed order. Each pass consumes the output of its predecessor:
┌────────────────────────────────────────────────────────────────────────┐
│ NVVM Container (IRLevel=1) │
│ LLVM Bitcode + Module Flags │
└────────────────────┬───────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ 1. NVModuleSummary Builder (sub_D7D4E0, 74 KB) │
│ Build per-function summaries with 4-level import priority, │
│ complexity budget, CUDA attribute flags, call graph edges │
└────────────────────┬──────────────────────────────────────────-┘
│ ModuleSummaryIndex
▼
┌─────────────────────────────────────────────────────────────────┐
│ 2. ThinLTO Function Import (sub_1854A20, 4.3 KB) │
│ Summary-guided cross-module import with floating-point │
│ threshold computation, priority-class multipliers, │
│ global import budget cap │
└────────────────────┬──────────────────────────────────────────-┘
│ Materialized functions + thinlto_src_module metadata
▼
┌─────────────────────────────────────────────────────────────────┐
│ 3. Inliner (sub_1864060 + sub_2613930 + sub_38576C0) │
│ Four parallel cost models: NVIDIA custom (20K budget), │
│ LLVM standard (225), New PM CGSCC + ML, NVPTX target │
└────────────────────┬──────────────────────────────────────────-┘
│ Inlined module
▼
┌─────────────────────────────────────────────────────────────────┐
│ 4. GlobalOpt (sub_18612A0, 65 KB) │
│ Small-constant promotion (≤2047 bits), SRA for structs │
│ (≤16 fields), malloc/free elimination, address-space-aware │
└────────────────────┬──────────────────────────────────────────-┘
│ Optimized globals
▼
┌─────────────────────────────────────────────────────────────────┐
│ 5. WholeProgramDevirtualization (sub_2703170, 13 KB) │
│ Type-test metadata → vtable resolution → direct calls │
│ Red-black tree for type info lookup, 0x90-byte records │
└────────────────────┬──────────────────────────────────────────-┘
│
▼
Dead Kernel Elimination + GlobalDCE
→ Standard optimizer pipeline (tier 0-3)
→ Code generation + PTX emission
The LTO pipeline entry at sub_12F5F30 (37.8 KB) orchestrates this sequence and also runs dead kernel elimination -- removing __global__ functions that are never referenced by host-side kernel launches. This is a GPU-specific optimization: on CPU, the linker preserves all externally-visible entry points, but in GPU LTO the compiler knows the complete set of kernel launch sites from the host code.
LTO Pipeline Entry -- sub_12F5F30 Algorithm
sub_12F5F30 (0x12F5F30, 37,797 bytes) is the top-level LTO orchestrator. It is called after the CLI parser (sub_12F7D90) has resolved the compilation mode bitmask and the LTO argument vector has been populated from the -Xlto forwarding meta-flag. The function operates in three distinct modes determined by the mode bitmask in a13:
| Mode | Bitmask | CLI Flag | Behavior |
|---|---|---|---|
| gen-lto | 0x21 | -gen-lto | Emit partially-optimized bitcode for later linking. No dead-kernel pass. |
| full LTO | 0x23 | -lto | Full merge + optimize + dead-kernel elimination + emit PTX. |
| link-lto | 0x26 | -link-lto | Link pre-existing LTO bitcode modules, run full pipeline. |
The function's argument list is reconstructed from the LTO output vector v330 (the fourth CLI routing vector, populated by -Xlto and the six -host-ref-* flags). It receives the merged LLVM module, the host reference tables, and the compilation options struct.
Pseudocode: sub_12F5F30 Top-Level
function sub_12F5F30(module, lto_args, options, error_cb):
# ---- Phase A: Parse LTO-specific arguments ----
mode = NONE
trace_enabled = false
optimize_unused_vars = false
host_refs = HostRefTable{} # 6-field table: ek, ik, ec, ic, eg, ig
force_device_c = false
for arg in lto_args:
switch arg:
case "-gen-lto": mode = GEN_LTO
case "-link-lto": mode = LINK_LTO
case "-olto": lto_opt_level = next_arg()
case "--device-c": device_c = true
case "--force-device-c": force_device_c = true
case "--trace": trace_enabled = true
case "-optimize-unused-variables": optimize_unused_vars = true
case "-has-global-host-info": has_host_info = true
case "-host-ref-ek=*": host_refs.ek = parse_symbol_list(value)
case "-host-ref-ik=*": host_refs.ik = parse_symbol_list(value)
case "-host-ref-ec=*": host_refs.ec = parse_symbol_list(value)
case "-host-ref-ic=*": host_refs.ic = parse_symbol_list(value)
case "-host-ref-eg=*": host_refs.eg = parse_symbol_list(value)
case "-host-ref-ig=*": host_refs.ig = parse_symbol_list(value)
# ---- Phase B: Build preserved-symbol sets ----
# Collect symbols from llvm.used and llvm.metadata named metadata
used_set = collect_named_metadata(module, "llvm.used")
metadata_set = collect_named_metadata(module, "llvm.metadata")
# Merge host reference tables into a unified "referenced from host" set.
# The 6 host-ref flags encode three entity types x two reference modes:
# e = explicit reference (symbol name appears in host launch site)
# i = implicit reference (symbol address taken on host side)
# k = kernel (__global__), c = constant (__constant__), g = global (__device__)
host_referenced_kernels = host_refs.ek UNION host_refs.ik
host_referenced_constants = host_refs.ec UNION host_refs.ic
host_referenced_globals = host_refs.eg UNION host_refs.ig
# ---- Phase C: Decide what to preserve ----
preserved = used_set UNION metadata_set UNION host_referenced_kernels
if NOT optimize_unused_vars:
preserved = preserved UNION host_referenced_constants
UNION host_referenced_globals
# ---- Phase D: Dead kernel/variable elimination ----
if mode == GEN_LTO:
# gen-lto: emit bitcode only, skip elimination
return emit_lto_bitcode(module)
if has_host_info:
dead_kernel_elimination(module, preserved, trace_enabled)
if optimize_unused_vars:
dead_variable_elimination(module, preserved,
host_referenced_constants,
host_referenced_globals,
trace_enabled)
# ---- Phase E: Run the 5-pass LTO pipeline ----
if mode == LINK_LTO or mode == FULL_LTO:
run_module_summary_builder(module) # sub_D7D4E0 via sub_D81040
run_thinlto_import(module) # sub_1854A20 (if ThinLTO)
run_inliner(module) # sub_1864060 + sub_2613930
run_globalopt(module) # sub_18612A0
run_whole_program_devirt(module) # sub_2703170
run_global_dce(module) # final GlobalDCE sweep
# ---- Phase F: Hand off to optimizer pipeline ----
return module # returned to sub_12E7E70 for tier 0-3 passes
Host Reference Flag Encoding
The six -host-ref-* flags are the mechanism by which nvlink communicates host-side symbol usage to cicc's LTO pass. nvlink inspects the host-side relocatable objects and emits a semicolon-separated list of device symbol names for each flag. The two-letter suffix encodes:
| Suffix | Entity Type | Reference Kind |
|---|---|---|
| -host-ref-ek | __global__ kernel | Explicit (launch site in host code) |
| -host-ref-ik | __global__ kernel | Implicit (address taken, e.g. &myKernel) |
| -host-ref-ec | __constant__ variable | Explicit (cudaMemcpyToSymbol target) |
| -host-ref-ic | __constant__ variable | Implicit (address taken) |
| -host-ref-eg | __device__ global variable | Explicit (cudaMemcpyToSymbol target) |
| -host-ref-ig | __device__ global variable | Implicit (address taken) |
The -has-global-host-info flag signals that nvlink has provided complete host reference information. When this flag is absent, sub_12F5F30 conservatively preserves all externally-visible symbols -- the dead kernel/variable elimination pass is skipped entirely.
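The parse_symbol_list helper from the Phase A pseudocode can be made concrete: nvlink passes each -host-ref-* value as a semicolon-separated symbol list. The delimiter follows the pseudocode; the binary's actual container type is unknown, so a std::set stands in:

```cpp
#include <cassert>
#include <set>
#include <string>

// Splits the semicolon-separated value of a -host-ref-* flag into a symbol
// set, skipping empty segments (e.g. from a trailing ";").
std::set<std::string> parseSymbolList(const std::string &value) {
  std::set<std::string> symbols;
  std::string::size_type start = 0;
  while (start < value.size()) {
    auto end = value.find(';', start);
    if (end == std::string::npos)
      end = value.size();
    if (end > start)
      symbols.insert(value.substr(start, end - start));
    start = end + 1;
  }
  return symbols;
}
```
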
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| sub_12F5F30 | 0x12F5F30 | 37.8 KB | LTO pipeline entry and dead-symbol orchestrator |
| sub_12F5610 | 0x12F5610 | 7.3 KB | LLVM module linker wrapper (Linker::linkModules) |
| sub_12F7D90 | 0x12F7D90 | 14.3 KB | CLI argument parser (architecture, opt level, flags) |
| sub_12F4060 | 0x12F4060 | 15.7 KB | TargetMachine creation with NVIDIA options |
| sub_1C13840 | 0x1C13840 | -- | Global/function iterator used for dead-code sweep |
| sub_12F1650 | 0x12F1650 | 5.2 KB | Bitcode reader variant A |
| sub_12F11C0 | 0x12F11C0 | 5.2 KB | Bitcode reader variant B |
Dead Kernel Elimination Algorithm
Dead kernel elimination is the most impactful GPU-specific optimization in the LTO pipeline. It exploits the closed-world model: every __global__ function that will ever execute must have a corresponding <<<>>> launch site (or cudaLaunchKernel call) in the host code that nvlink has already seen. Any kernel not in the host reference set is dead.
This pass cannot exist on CPU. A CPU linker must preserve all non-hidden external functions because shared libraries loaded at runtime via dlopen could call them. On GPU there is no dlopen, no dynamic symbol resolution, no PLT. The set of reachable kernels is completely determined at link time.
Pseudocode: dead_kernel_elimination
function dead_kernel_elimination(module, preserved_set, trace):
# Walk all functions in the module via sub_1C13840 iterator
worklist = []
for func in module.functions():
if func.isDeclaration():
continue
cc = func.getCallingConv()
# PTX calling convention 71 = __global__ (kernel entry point)
# PTX calling convention 72 = __device__ (device function)
# PTX calling convention 95 = CUDA internal (managed init)
if cc != 71:
continue # only eliminate kernels, not device functions
name = func.getName()
if name in preserved_set:
continue # referenced from host, or in llvm.used -- keep it
# This kernel has no host launch site.
if trace:
emit_diagnostic("no reference to kernel " + name)
worklist.append(func)
# ---- Remove dead kernels ----
for func in worklist:
# Before erasing, check if any device-side indirect references exist.
# On GPU, device-side function pointers (callback patterns) can reference
# kernels via address-of. Check use_empty():
if NOT func.use_empty():
# Has device-side users -- cannot safely remove.
# (This is rare: kernels are almost never called from device code.)
continue
func.replaceAllUsesWith(UndefValue)
func.eraseFromParent()
return len(worklist)
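The sweep above can be condensed into a runnable miniature: only definitions with the __global__ calling convention (71), no entry in the preserved set, and no device-side uses are erased. The Fn record is a stand-in for llvm::Function, not the binary's actual layout:

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Stand-in for llvm::Function with just the fields the sweep consults.
struct Fn {
  std::string name;
  unsigned callingConv;  // 71 = __global__, 72 = __device__
  unsigned numUses;      // device-side references (the use_empty() check)
  bool isDeclaration;
};

// Removes dead kernels in place; returns how many were erased.
std::size_t deadKernelElimination(std::vector<Fn> &module,
                                  const std::set<std::string> &preserved) {
  std::size_t removed = 0;
  for (auto it = module.begin(); it != module.end();) {
    bool dead = !it->isDeclaration && it->callingConv == 71 &&
                preserved.count(it->name) == 0 && it->numUses == 0;
    if (dead) {
      it = module.erase(it);
      ++removed;
    } else {
      ++it;
    }
  }
  return removed;
}
```
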
Pseudocode: dead_variable_elimination
When -optimize-unused-variables is enabled, the same logic extends to __device__ and __constant__ global variables:
function dead_variable_elimination(module, preserved_set,
host_constants, host_globals, trace):
worklist = []
for gv in module.globals():
if gv.isDeclaration():
continue
name = gv.getName()
if name in preserved_set:
continue
as = gv.getAddressSpace()
# Address space 1 = global, address space 4 = constant
if as == 4 and name NOT in host_constants:
if trace:
emit_diagnostic("no reference to variable " + name)
worklist.append(gv)
elif as == 1 and name NOT in host_globals:
if trace:
emit_diagnostic("no reference to variable " + name)
worklist.append(gv)
for gv in worklist:
if NOT gv.use_empty():
continue # still referenced from device code
gv.eraseFromParent()
return len(worklist)
The --trace-lto CLI flag (which maps to --trace in the LTO argument vector via the flag catalog at line 2394) enables the diagnostic messages. When active, cicc prints one line per eliminated symbol to stderr, enabling build-system integration and debugging of unexpected kernel removal.
Module Merge Process
Before sub_12F5F30 can perform dead-kernel elimination or any LTO optimization, the separate-compilation bitcode modules must be merged into a single LLVM module. This merge happens in two layers: the NVIDIA module linker wrapper sub_12F5610 (7.3 KB) and the underlying LLVM IRLinker at sub_16786A0 (61 KB).
Two-Level Linking Architecture
nvlink extracts .nv_fatbin bitcode sections
|
v
┌─────────────────────────────────────────────────────────────┐
│ NVIDIA Module Loader (sub_12C06E0, 63 KB) │
│ - Validates bitcode magic (0x0B17C0DE wrapper/0xDEC04342)   │
│ - Checks IR version via sub_12BFF60 │
│ - Validates target triple (must be "nvptx64-*") │
│ - Single-module fast path: return directly if N=1 │
│ - Multi-module: normalize triples, set matching DataLayout │
└─────────────────────┬───────────────────────────────────────┘
│ N validated modules
v
┌─────────────────────────────────────────────────────────────┐
│ NVIDIA Module Linker Wrapper (sub_12F5610, 7.3 KB) │
│ - Selects primary module (typically the largest) │
│ - For each secondary module: │
│ Copy triple from primary → secondary │
│ Call IRLinker to merge secondary into primary │
│ - Post-link: restore linkage attributes from hash table │
│ Values 7-8: external linkage (low 6 bits) │
│ Other: set low 4 bits + visibility from bits 4-5 │
│ Set dso_local flag (byte+33 |= 0x40) │
└─────────────────────┬───────────────────────────────────────┘
│
v
┌─────────────────────────────────────────────────────────────┐
│ LLVM IRLinker::run (sub_16786A0, 61 KB) │
│ - Allocates 0x2000-byte DenseMap for symbol resolution │
│ - Hash function: (addr >> 9) ^ (addr >> 4) │
│ - Resolves COMDAT groups (sub_167DAB0, 39 KB) │
│ - Links global value prototypes (sub_1675980, 37 KB) │
│ - Links function bodies (sub_143B970, 14 KB) │
│ - Merges named metadata (llvm.dbg.cu, llvm.used, etc.) │
│ - Resolves llvm.global_ctors / llvm.global_dtors ordering │
│ - Maps values across modules via DenseMap<Value*, Value*> │
│ - Tombstone sentinels: empty=-8, deleted=-16 │
└─────────────────────┬───────────────────────────────────────┘
│ single merged module
v
sub_12F5F30 (LTO pipeline entry)
Pseudocode: Module Merge (sub_12F5610 + sub_12C06E0)
function module_merge(module_list, llvm_ctx, options):
# ---- Step 1: Load and validate all modules (sub_12C06E0) ----
modules = []
for entry in module_list:
buf = open_buffer(entry.data, entry.length, entry.name)
# Validate bitcode magic
magic = read_u32(buf, 0)
if magic != 0x0B17C0DE and magic != 0xDEC04342:
error("invalid bitcode: " + entry.name)
return NULL
module = parse_bitcode(buf, llvm_ctx) # sub_15099C0
# Check IR version compatibility (sub_12BFF60)
if ir_version_check(module_list, module, flags) != 0:
error(entry.name + ": error: incompatible IR detected. "
"Possible mix of compiler/IR from different releases.")
return NULL
# Validate target triple
triple = module.getTargetTriple()
if NOT triple.startswith("nvptx64-"):
error("Module does not contain a triple, "
"should be 'nvptx64-'")
return NULL
modules.append(module)
# ---- Step 2: Single-module fast path ----
if len(modules) == 1:
return modules[0]
# ---- Step 3: Multi-module linking (sub_12F5610) ----
# Save linkage attributes before linking (they get modified)
linkage_map = DenseMap<StringRef, u8>{}
for module in modules:
for func in module.functions():
linkage_map[func.getName()] = func.getLinkage()
for gv in module.globals():
linkage_map[gv.getName()] = gv.getLinkage()
# Select primary module and link secondaries into it
primary = modules[0]
for i in range(1, len(modules)):
secondary = modules[i]
# Normalize: copy DataLayout from primary to secondary
secondary.setDataLayout(primary.getDataLayout())
secondary.setTargetTriple(primary.getTargetTriple())
# IRLinker::run (sub_16786A0)
# Resolves COMDATs, links globals, maps values, merges metadata
err = Linker::linkModules(primary, secondary)
if err:
error("<module_name>: link error: <details>")
return NULL
# ---- Step 4: Restore linkage attributes ----
# During linking, LLVM may promote linkage (e.g., internal -> external)
# to resolve cross-module references. Restore the original linkage
# where possible, preserving the correct visibility for PTX emission.
for func in primary.functions():
name = func.getName()
if name in linkage_map:
original = linkage_map[name]
if original in [7, 8]: # external linkage variants
func.setLinkage(original & 0x3F)
else:
func.setLinkage(original & 0x0F)
if (original & 0x30) != 0:
func.setVisibility(original >> 4)
func.setDSOLocal(true) # byte+33 |= 0x40
for gv in primary.globals():
# same linkage restoration logic
...
return primary
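The linkage-byte decode in Step 4 above follows a small bit layout: values 7-8 (external-linkage variants) keep their low 6 bits, anything else keeps the low 4 bits plus a visibility field in bits 4-5, and dso_local is forced on (byte+33 |= 0x40 in the binary). A sketch with invented field names:

```cpp
#include <cassert>
#include <cstdint>

struct RestoredLinkage {
  unsigned linkage;
  unsigned visibility;  // 0 when bits 4-5 were clear
  bool dsoLocal;
};

// Decodes one saved linkage byte per the pseudocode's Step 4.
RestoredLinkage decodeLinkageByte(uint8_t saved) {
  RestoredLinkage r{0, 0, true};  // dso_local always set post-link
  if (saved == 7 || saved == 8) {
    r.linkage = saved & 0x3F;     // external-linkage variants keep 6 bits
  } else {
    r.linkage = saved & 0x0F;     // others keep the low 4 bits
    if ((saved & 0x30) != 0)
      r.visibility = saved >> 4;  // visibility from bits 4-5
  }
  return r;
}
```
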
Key Data Structures in the Merge
| Structure | Location | Details |
|---|---|---|
| Value map DenseMap | Allocated in sub_16786A0 | 0x2000 bytes (8192), hash: (addr >> 9) ^ (addr >> 4), quadratic probing |
| Linkage hash table | Stack-allocated in sub_12E1EF0 (v362) | Maps StringRef name to original linkage byte |
| Function-to-module map | Stack-allocated in sub_12E1EF0 (v359) | Maps StringRef name to function pointer for split-module dispatch |
| COMDAT group map | Internal to sub_167DAB0 | Tracks COMDAT selection kinds: any / exact-match / largest / no-dup / same-size |
| Named metadata merge list | Internal to sub_1671B40 | Special handling for llvm.dbg.cu, llvm.used, llvm.compiler.used, llvm.global_ctors, llvm.global_dtors, llvm.global.annotations |
| Module config flag | dword_4F99BC0 | Controls linker behavior variant |
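The value-map probe sequence can be illustrated with a small sketch. Only the hash expression `(addr >> 9) ^ (addr >> 4)` comes from the decompilation; the slot count (512, i.e. the 0x2000-byte buffer divided into 16-byte entries) and the triangular-number quadratic step are assumptions:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Sketch of the value-map hashing in sub_16786A0. SLOTS and the exact
 * quadratic step are assumptions; the hash mix is from the binary. */
enum { SLOTS = 512 };

static size_t vm_hash(const void *key) {
    uintptr_t a = (uintptr_t)key;
    return (size_t)((a >> 9) ^ (a >> 4)) & (SLOTS - 1);
}

/* Quadratic probing: home, home+1, home+3, home+6, ... (mod SLOTS). */
static size_t vm_probe(size_t home, unsigned attempt) {
    return (home + (size_t)attempt * (attempt + 1) / 2) & (SLOTS - 1);
}
```

Shifting by 9 and 4 discards pointer alignment bits before mixing, which is the usual reason for this style of pointer hash.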
Split-Module Compilation and Re-Linking
When concurrent compilation is active (thread count > 1 and multiple defined functions), the optimization pipeline uses a split-module strategy: each function is extracted into its own bitcode module, optimized independently in a thread pool, and then re-linked. The split/re-link cycle uses the same sub_12F5610 linker wrapper:
- Split (sub_1AB9F40): extracts per-function bitcode using a filter callback (sub_12D4BD0) that selects a single function by name from the function-to-module hash table.
- Optimize (thread pool via sub_16D5230): each worker runs sub_12E86C0 (Phase II optimizer) with qword_4FBB3B0 = 2.
- Re-link (sub_12F5610): merges all per-function bitcode modules back into a single module.
- Restore linkage (v362 hash table): the linkage attributes saved before the split are written back to prevent linkage promotion artifacts.
This cycle is orchestrated by sub_12E1EF0 (51 KB, the top-level concurrent compilation entry). The GNU Jobserver integration (sub_16832F0) throttles thread pool size to match the build system's -j level when cicc is invoked from make.
Separate Compilation and the NVVM Container
When nvcc --device-c compiles a .cu file, cicc produces an NVVM container with CompileMode = NVVM_COMPILE_MODE_SEPARATE_ABI (value 2) and IRLevel = NVVM_IR_LEVEL_LTO (value 1). This container wraps partially-optimized LLVM bitcode -- the per-module optimizer has run, but cross-module optimization has not. The bitcode is embedded in the ELF .nv_fatbin section of the relocatable object file.
At link time, nvlink extracts the bitcode sections from all input objects, concatenates them, and passes the result back to cicc in LTO mode. cicc deserializes each container, links the bitcode modules via LLVM's Linker::linkModules, and then runs the LTO pipeline described above on the merged module. The pipeline sees the complete device program for the first time at this point.
The IRLevel enum controls which optimizations have already been applied:
| IRLevel | Value | Meaning |
|---|---|---|
| NVVM_IR_LEVEL_UNIFIED_AFTER_DCI | 0 | Default: fully optimized, no LTO needed |
| NVVM_IR_LEVEL_LTO | 1 | Partially optimized, awaiting LTO pipeline |
| NVVM_IR_LEVEL_OPTIX | 2 | OptiX pipeline IR (separate optimization model) |
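A minimal C rendering of the two container fields, using only the values documented above (the enum member names follow the NVVM API convention used in the text; other members of these enums are not reproduced here):

```c
#include <assert.h>

/* Sketch of the container header fields described above. Only the
 * documented values are listed; the enums likely have more members. */
typedef enum {
    NVVM_COMPILE_MODE_SEPARATE_ABI = 2   /* produced by nvcc --device-c */
} NVVMCompileMode;

typedef enum {
    NVVM_IR_LEVEL_UNIFIED_AFTER_DCI = 0, /* fully optimized, no LTO needed */
    NVVM_IR_LEVEL_LTO               = 1, /* awaiting the LTO pipeline */
    NVVM_IR_LEVEL_OPTIX             = 2  /* OptiX pipeline IR */
} NVVMIRLevel;

/* A container carrying (SEPARATE_ABI, LTO) signals nvlink that the merged
 * module still needs the cross-module pipeline. */
static int needs_lto(NVVMIRLevel level) { return level == NVVM_IR_LEVEL_LTO; }
```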
Pass Inventory
| Pass | Entry Point | Size | Pipeline Slot | Type | Sub-page |
|---|---|---|---|---|---|
| NVModuleSummary Builder | sub_D7D4E0 | 74 KB | N/A (called from driver) | Analysis | module-summary.md |
| NVModuleSummary Driver | sub_D81040 | 56 KB | N/A (LTO entry) | Module | module-summary.md |
| ThinLTO Function Import | sub_1854A20 | 4.3 KB | Slot 43 ("function-import") | Module | thinlto-import.md |
| ThinLTO Threshold Engine | sub_1853180 | 5.1 KB | N/A (called from import driver) | Utility | thinlto-import.md |
| NVIDIA Custom Inliner | sub_1864060 | 75 KB | CGSCC pass | CGSCC | inliner-cost.md |
| LLVM Standard InlineCost | sub_30DC7E0 | 51 KB | N/A (library) | Analysis | inliner-cost.md |
| New PM CGSCC Inliner | sub_2613930 | 69 KB | CGSCC pass | CGSCC | inliner-cost.md |
| NVPTX Target Cost Modifier | sub_38576C0 | 58 KB | N/A (target hook) | Target | inliner-cost.md |
| GlobalOpt | sub_18612A0 | 65 KB | Slot 45 ("globalopt") | Module | globalopt.md |
| WholeProgramDevirt | sub_2703170 | 13 KB | Slot 121 ("wholeprogramdevirt") | Module | devirtualization.md |
Key Differences from CPU LTO
| Aspect | CPU LTO | CICC GPU LTO |
|---|---|---|
| Import threshold | 100 instructions (default) | Priority-class multipliers, global budget at dword_4FAB120 |
| Cold import | 0x multiplier (never import cold) | Imports cold functions if priority >= 2 |
| Inline budget | 225 (LLVM default) | 20,000 (NVIDIA custom), 89x larger |
| Devirt conservatism | Must handle DSOs, hidden visibility | Full type hierarchy always visible |
| Code size concern | Bloats .text, impacts cache/pages | No shared libs; size is secondary to register pressure |
| Address spaces | Trivial (flat memory model) | 5+ address spaces; GlobalOpt must preserve AS through splits |
| Dead symbol elimination | Linker GC sections | Dead kernel elimination in sub_12F5F30 |
| Threshold comparison | Integer instruction count | Floating-point threshold with hotness/linkage/priority multipliers |
| ML-guided inlining | Available upstream | Integrated via InlineAdvisor at sub_2609820 with model at sub_29B2CD0 |
LTO Knob Summary
NVModuleSummary Knobs
| Knob | Default | Effect |
|---|---|---|
| dword_4F87C60 (global override) | 0 | When nonzero, forces all symbols to importable; value 2 = conservative comdat handling |
ThinLTO Import Knobs
Registered in ctor_184_0 (0x4DA920) and ctor_029 (0x489C80):
| Knob | Type | Default | Effect |
|---|---|---|---|
| import-instr-limit | int | 100 | Base instruction count threshold for import |
| import-hot-multiplier | float | 10.0 | Multiplier applied to threshold for hot callsites |
| import-cold-multiplier | float | 0.0 | Multiplier for cold callsites (0 = never import cold on CPU) |
| dword_4FAB120 | int | -1 | Global import budget; negative = unlimited |
| dword_4FAA770 | int | 0 | Current import count (runtime accumulator) |
| summary-file | string | -- | Path to external summary file for ThinLTO |
| function-import | -- | -- | Pipeline registration string (slot 43) |
| disable-thinlto-funcattrs | bool | false | Disable ThinLTO function attribute propagation |
| thinlto-workload-def | string | -- | Workload definition file for priority-guided import |
Inliner Knobs
Registered in ctor_186_0 (0x4DBEC0):
| Knob | Type | Default | Effect |
|---|---|---|---|
| inline-budget | int | 20,000 | Per-caller inlining cost budget (NVIDIA custom model) |
| inline-total-budget | int | -- | Global total budget across all callers |
| inline-adj-budget1 | int | -- | Adjusted per-caller budget (secondary) |
| nv-inline-all | bool | off | Force inline every function call |
| profuseinline | bool | off | Verbose inlining diagnostic output |
| inline-switchctrl | int | -- | Heuristic tuning for switch statements |
| inline-threshold | int | 225 | LLVM standard model threshold (separate from NVIDIA's 20K) |
| function-inline-cost-multiplier | float | -- | New PM: penalty multiplier for recursive functions |
GlobalOpt Knobs
No dedicated cl::opt flags. All thresholds are hardcoded:
| Parameter | Value | Description |
|---|---|---|
| Max bits for promotion | 2,047 (0x7FF) | Globals exceeding this fall through to SRA |
| Max struct fields for SRA | 16 | Structs with >16 fields are not split |
| Hash table load factor | 75% | Triggers rehash of processed-globals table |
| Pipeline position | Step 30 (tier 2/3) | After GlobalDCE, before LoopVectorize |
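The hardcoded gates in the table above reduce to simple predicates; a sketch under the assumption that the load-factor check is the conventional used/capacity comparison (predicate names are illustrative):

```c
#include <assert.h>

/* Sketch of GlobalOpt's hardcoded thresholds. "Exceeding 2047 bits" and
 * ">16 fields" are inclusive/exclusive exactly as documented above. */
static int can_promote(unsigned bit_width)  { return bit_width <= 2047; } /* 0x7FF */
static int can_sra(unsigned field_count)    { return field_count <= 16; }

/* 75% load factor: rehash when used/capacity >= 3/4, in integer math. */
static int needs_rehash(unsigned used, unsigned capacity) {
    return used * 4 >= capacity * 3;
}
```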
Devirtualization Knobs
| Knob | Type | Default | Effect |
|---|---|---|---|
| wholeprogramdevirt | -- | -- | Pipeline registration string (slot 121) |
The pass has no NVIDIA-specific tuning knobs. It relies entirely on the completeness of type_test metadata produced by the NVModuleSummary builder.
Cross-References
- NVModuleSummary Builder -- 4-level import priority, complexity budget, CUDA attribute tracking
- ThinLTO Function Import -- threshold computation, priority-class multipliers, global budget
- Inliner Cost Model -- four parallel models, .param address space cost, ML advisory
- GlobalOpt for GPU -- address-space-aware SRA, small-constant promotion, malloc elimination
- Whole-Program Devirtualization -- closed-world virtual call resolution, type test metadata
- NVVM Container Format -- IRLevel enum, CompileMode, bitcode payload encoding
- LLVM Optimizer -- LTO pipeline entry at sub_12F5F30, tier system
- LazyCallGraph & CGSCC -- call graph infrastructure used by the CGSCC inliner
- Entry Point & CLI -- flag catalog routing to lto output vector, -dc mode
NVModuleSummary Builder
CICC replaces LLVM's ModuleSummaryAnalysis with a custom NVModuleSummary subsystem that extends the ModuleSummaryIndex with GPU-specific information. The builder at sub_D7D4E0 (74 KB, 2571 decompiled lines) walks every global value in a module, constructs per-function summaries with CUDA-aware call graph edges, assigns four-level import priorities using a custom priority table, tracks function complexity on a profile-guided budget, and records CUDA-specific attributes such as address-space linkage, kernel-vs-device classification, and device memory reference patterns. The summary is the data source for all downstream ThinLTO decisions -- the ThinLTO importer reads these summaries to decide which functions to pull across module boundaries, and the inliner cost model consumes the complexity budget to calibrate cross-module inline thresholds.
Upstream LLVM's computeFunctionSummary (in ModuleSummaryAnalysis.cpp) counts instructions, builds call graph edges from CallBase operands, collects reference edges by walking instruction operands, and records type test / devirtualization metadata. It produces a FunctionSummary with a flat instruction count and a call edge list annotated with CalleeInfo::HotnessType (Unknown/Cold/None/Hot). NVIDIA's replacement does all of this, then adds: a 4-level import priority classification per function, a 28-bit profile-scaled complexity budget, CUDA address-space tracking (filtering out device-memory-only declarations from import candidacy), kernel identification via first-instruction opcode probing, six separate CUDA-specific accumulator structures for device call context, and a two-phase declaration re-walk that merges forward-declared and defined symbol tables for ThinLTO.
| Builder entry | sub_D7D4E0 (0xD7D4E0, 74 KB) |
| LTO driver | sub_D81040 (0xD81040, 56 KB) |
| Per-function analyzer | sub_D741C0 (0xD741C0, 19 KB) |
| Call graph analyzer | sub_D6EA70 (0xD6EA70, 19 KB) |
| Summary packer | sub_D77220 (0xD77220) |
| Summary serializer | sub_1535340 (0x1535340, 26 KB) |
| Summary parser | sub_150B5F0 (0x150B5F0, 63 KB) |
| Address range | 0xD60000--0xD82000 (full NVModuleSummary cluster) |
| Stack frame | 1,552 bytes (0x610) |
Summary Fields Beyond Upstream
Upstream LLVM's FunctionSummary stores instruction count, call edges with hotness, reference edges, type test GUIDs, and a few flags (norecurse, returndoesnotalias, etc). NVIDIA extends this with the following per-function fields:
| Field | Encoding | Width | Description |
|---|---|---|---|
| Import priority | *entry & 0x7 | 3 bits | 4-level priority: 0 = not importable, 1 = low, 2 = standard, 3 = force-import |
| Address-taken flag | *entry & 0x8 | 1 bit | Set if sub_B49220(GV) returns true (function has its address taken) |
| Complexity budget | *entry >> 4 | 28 bits | Profile-scaled importance, max 0xFFFFFFF (268,435,455) |
| Kernel bit | flags & (1 << 9) | 1 bit | Set if first instruction opcode is 36 (kernel entry point) |
| Has-unwind-info | flags & (1 << 0) | 1 bit | sub_B2DCC0(func) -- has personality function |
| Not-inline | flags & (1 << 1) | 1 bit | Function marked noinline |
| Read-none | flags & (1 << 2) | 1 bit | Attribute #34 readnone |
| No-unwind | flags & (1 << 3) | 1 bit | Attribute #22 nounwind |
| Will-return | flags & (1 << 4) | 1 bit | Attribute #31 willreturn |
| No-return | flags & (1 << 5) | 1 bit | Attribute #3 noreturn |
| Must-progress | flags & (1 << 6) | 1 bit | Attribute #41 mustprogress |
| Has-visible-alias | flags & (1 << 7) | 1 bit | Accumulated alias visibility flag |
| Has-non-importable-refs | flags & (1 << 8) | 1 bit | References symbols that cannot be imported |
| Has-any-import | module flag bit 6 | 1 bit | OR of device-ref, has-typed-symbol, has-non-importable |
The per-entry summary record in the primary hash table is 16 bytes. The lower 32 bits pack the priority/address-taken/budget fields. The upper 64 bits hold a pointer to the full FunctionSummary record built by sub_D77220.
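The 32-bit packed word can be sketched directly from the bit layout in the table above (the helper names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the lower 32 bits of the 16-byte summary record:
 * bits 0-2 = import priority, bit 3 = address-taken, bits 4-31 = budget. */
#define BUDGET_MAX 0x0FFFFFFFu  /* 28-bit ceiling, 268,435,455 */

static uint32_t pack_entry(uint32_t priority, int addr_taken, uint32_t budget) {
    if (budget > BUDGET_MAX) budget = BUDGET_MAX;  /* clamp to 28 bits */
    return (priority & 0x7) | (addr_taken ? 0x8 : 0) | (budget << 4);
}

static uint32_t entry_priority(uint32_t e)  { return e & 0x7; }
static int      entry_addr_taken(uint32_t e){ return (e & 0x8) != 0; }
static uint32_t entry_budget(uint32_t e)    { return e >> 4; }
```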
Builder Algorithm
The builder executes in three phases within sub_D7D4E0. The LTO driver sub_D81040 calls the builder after reading module flags (EnableSplitLTOUnit, UnifiedLTO, ThinLTO) and iterating all functions via a callback iterator.
Phase 1: Global Value Walk (lines 559--1671)
The module's global value list is a linked list rooted at Module+72 (the GlobalList field). The sentinel node is at Module+72 itself; the first real element is at Module+80.
// Phase 1: iterate all GlobalValues in the module
GlobalValue *sentinel = (GlobalValue *)(module + 72);
GlobalValue *cur = *(GlobalValue **)(module + 80);
while (cur != sentinel) {
    uint8_t opcode = cur->ir_node[0];  // IR opcode byte
    switch (opcode) {
    case 61: /* '=' -- Function definition */
        process_function(cur);
        break;
    case 62: /* '>' -- GlobalVariable */
        process_global_variable(cur);
        break;
    case 34: /* '"' -- Alias (kind 1) */
    case 40: /* '(' -- Alias (kind 2) */
        process_alias(cur);
        break;
    case 85: /* 'U' -- Declaration/extern */
        process_declaration(cur);
        break;
    }
    cur = cur->next;
}
For each function (opcode 61), the builder performs:
1. Import priority assignment. Queries the ImportPriorityTable via sub_D84370(table, func, PSI, 0). If found and the table is non-null, the priority is determined:
entry = getImportKind(priority_table, func, PSI, 0);
if (entry.found) {
    if (isImported(priority_table, entry))                      // sub_D84440
        priority = 3;  // force-import
    else if (isImportCandidate(priority_table, entry, 3) == 0)  // sub_D84450
        priority = 2;  // standard importable
    else
        priority = 1;  // low priority
} else {
    priority = 0;  // not importable
}
2. Complexity budget computation. When ProfileSummaryInfo is available and the function was found in the priority table, the builder computes a profile-scaled importance value:
uint64_t profile_count = getProfileCount(PSI, func);  // sub_FDD860
uint64_t threshold = getHotThreshold(PSI);            // sub_FDC4B0
if (profile_count exists) {
    APInt importance = computeScaledImportance(profile_count, threshold);  // sub_F04200
    normalizeImportPriority(&importance, 8);  // sub_D78C90: right-shift by 8
    budget += importance.getZExtValue();
    budget = min(budget, 0xFFFFFFF);  // clamp to 28-bit max
}
// Pack into entry: lower 4 bits = priority | address_taken, upper 28 bits = budget
*entry_word = (budget << 4) | (*entry_word & 0xF);
The 28-bit budget is consumed downstream by ThinLTO to decide how much inlining budget to allocate for functions imported from other modules. A budget of 0 means the function has no profile data and gets the baseline threshold; a budget near the 268M ceiling means the function is extremely hot and will receive aggressive cross-module inlining.
3. Call graph edge construction. For functions with call graph info (bit 5 of byte 7: func->ir_node[7] & 0x20), the builder extracts two kinds of edges:
- Direct call edges from attribute group #35: the callee list. Each callee gets a GUID via sub_9E27D0, and edges are collected into a temporary vector (4-byte stride per GUID).
- Reference edges with type info from attribute group #34: operand bundles encoding reference edges with type metadata. Each reference carries a CalleeType byte and parameter type pairs extracted from MDNode operands. The MDNode decoding walks: operand -> parent (opcode 1 = MDString) -> offset 136 (opcode 17 = MDTuple) -> string data at offset 24.
Call graph edge records are 136 bytes each (stride 136 in the edge vector) and contain source name, target name, and edge attributes. Type-metadata edges are 72 bytes each.
4. CUDA address-space filtering. When the CUDA-mode flag (a6) is set and a declaration has address space 25 in its type chain, the function sets the device-reference flag (v327). Functions whose type resolves to address space 25 are excluded from import candidacy -- device-memory-only declarations cannot be cross-module imported in ThinLTO. The check:
if (cuda_mode && is_declaration(func)) {
    Type *ty = func->type_at_offset_minus_2;
    if (getAddressSpace(ty) == 25) {
        has_device_ref = true;
        goto skip_import;  // do not mark as importable
    }
}
Address space 25 appears to be an internal NVVM encoding for device-side linkage. This differs from the standard NVPTX address spaces (0 = generic, 1 = global, 3 = shared, 4 = constant, 5 = local). The summary records this flag so the importer can avoid attempting to import device-side-only symbols, which would fail at link time.
5. CUDA call context collection. For functions with the device attribute bit (func[33] & 0x20), the builder calls sub_D7CF70 to populate six parallel accumulator structures:
| Accumulator | Offset | Likely content |
|---|---|---|
| v408 | +0 | Direct device call targets |
| v415 | +1 | Shared memory references |
| v422 | +2 | Texture/surface references |
| v429 | +3 | Constant memory references |
| v436 | +4 | Kernel launch edges |
| a5 | +5 | Additional context (passed from caller) |
These six vectors capture the GPU-specific dependency information that upstream LLVM's summary has no concept of. The ThinLTO importer uses this to make GPU-aware import decisions -- for example, a function that references shared memory in another module must also import the shared memory declaration.
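The budget arithmetic from step 2 above reduces to a shift-and-clamp; a minimal sketch, treating computeScaledImportance as an opaque input (the function name is illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the budget update: the scaled importance is right-shifted by 8
 * (normalizeImportPriority / sub_D78C90) and the running budget is clamped
 * to the 28-bit ceiling so it fits the packed summary word. */
static uint32_t fold_importance(uint32_t budget, uint64_t scaled_importance) {
    uint64_t sum = (uint64_t)budget + (scaled_importance >> 8);
    return sum > 0x0FFFFFFFu ? 0x0FFFFFFFu : (uint32_t)sum;
}
```

Functions without profile data never enter this path and keep a budget of 0, which is why a zero budget downstream means "baseline threshold" rather than "cold".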
Phase 2: ThinLTO Declaration Re-Walk (lines 1673--1911)
When thinlto_mode (parameter a8) is true, the builder performs a second pass over forward-declared symbols:
Step 1. Re-walk function declarations collected during Phase 1. For each, remove from the "seen" set and re-analyze via sub_D7B190 into a secondary hash table.
Step 2. Re-walk global variable declarations through a separate dedup mechanism using sub_C8CA60 for hash-based deduplication.
Step 3. Merge the secondary (forward-declared) and primary (defined) hash tables. On collision -- the same symbol appears as both declared and defined -- sub_D76140 removes the entry from the defined table and sub_D7AF10 re-inserts into the merged table with updated visibility. This merge ensures that the summary captures cross-module edges even for symbols that are only forward-declared in the current module.
The two-phase design is necessary because CUDA compilation units frequently contain forward declarations of device functions defined in other translation units. Without this re-walk, the summary would miss the cross-module edges for these declarations, and ThinLTO would fail to import them.
Phase 3: Finalize and Emit (lines 1912--2569)
Module-level flag assembly. After processing all globals, the builder computes two flag words:
// v134: module-level attribute summary (bits 0-10)
v134 = (linkage & 0xF)               // bits 0-3
     | ((visibility & 0x3) << 4)     // bits 4-5
     | (has_any_import << 6)         // bit 6: OR of v327|v316|v358
     | (has_comdat << 7)             // bit 7
     | (has_comdat_attr << 8)        // bit 8
     | (dll_storage_class << 9);     // bits 9-10

// v143: per-function flags (bits 0-9)
v143 = has_unwind_info                 // bit 0
     | (not_inline << 1)               // bit 1
     | (readnone << 2)                 // bit 2
     | (nounwind << 3)                 // bit 3
     | (willreturn << 4)               // bit 4
     | (noreturn << 5)                 // bit 5
     | (mustprogress << 6)             // bit 6
     | (has_visible_alias << 7)        // bit 7
     | (has_non_importable_refs << 8)  // bit 8
     | (is_kernel << 9);               // bit 9
The kernel detection walks to the function's first instruction via offset 24, verifies the opcode is in range 30--40 (basic block terminators), and checks specifically for opcode 36, which encodes a kernel entry point. This is how the summary distinguishes __global__ kernel functions from __device__ helper functions without relying on metadata -- it inspects the compiled IR structure directly.
Summary record packing. All collected data is packed into the final FunctionSummary via sub_D77220, which takes 14 arguments:
sub_D77220(
    &result,              // output FunctionSummary*
    module_flags,         // v134
    instruction_count,    // v324
    function_flags,       // v143 (includes kernel bit)
    &priority_slice,      // import priority table slice
    guid_ref_list,        // GUID reference list
    &typed_refs,          // type-checked reference list (72-byte entries)
    &typed_edges,         // typed call graph edges (136-byte entries)
    &simple_edges,        // simple call graph edges (GUID array)
    device_context,       // CUDA device context edges
    additional_edges,     // extra edge data
    &bundle_refs,         // operand bundle references
    &cross_module_calls,  // cross-module call records
    &param_types          // per-parameter type metadata
);
The result is stored via sub_D7A690(index, func, &result) which merges the summary into the module-level index.
Callback invocation. The a9 parameter is a callback object with vtable layout: a9+16 points to a shouldSkip() predicate; a9+24 points to a processFunction(a9, GlobalValue*) handler. When shouldSkip() returns null, the callback is invoked for each function. The callback result is processed by sub_D8D9B0 which extracts additional summary information (likely profile or LTO-specific metadata).
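A structural sketch of the callback object, assuming an x86-64 layout where the two recovered offsets are pointer-sized vtable-style slots (the struct and field names are illustrative; only the offsets come from the decompilation):

```c
#include <assert.h>
#include <stddef.h>

typedef struct GlobalValue GlobalValue;  /* opaque here */

/* Sketch of the a9 callback object: a predicate at +16 and a per-function
 * handler at +24; the first 16 bytes are an unknown header. */
typedef struct SummaryCallback {
    char pad[16];                                              /* unknown */
    int  (*shouldSkip)(struct SummaryCallback *);              /* +16 */
    void (*processFunction)(struct SummaryCallback *, GlobalValue *); /* +24 */
} SummaryCallback;

/* Per the text, the handler runs only when shouldSkip() returns null. */
static void invoke_for(SummaryCallback *cb, GlobalValue *gv) {
    if (cb->shouldSkip && cb->shouldSkip(cb))
        return;
    cb->processFunction(cb, gv);
}
```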
Serialization and the NVVM Container
The summary is serialized into bitcode by sub_1535340 (writeModuleSummary, 26 KB). This function writes a MODULE_STRTAB_BLOCK and GLOBALVAL_SUMMARY_BLOCK into the LLVM bitcode stream using standard bitcode encoding (VBR integers, abbreviation-driven records). The strings "ThinLTO" and "Unexpected anonymous function when writing summary" appear in this function.
On the reading side, sub_150B5F0 (parseModuleSummaryIndex, 63 KB) and sub_9EBD80 (parseGlobalSummaryBlock, 82 KB) deserialize the summary from bitcode back into the in-memory ModuleSummaryIndex. These parsers handle GUID hashes, function/alias/global summaries, and module paths.
The bitcode writer at sub_1538EC0 writes the producer string as "LLVM7.0.1" despite CICC being built on LLVM 20.0.0 internally -- this is the NVVM IR compatibility layer. The summary blocks are embedded in this bitcode stream alongside the IR, so the NVVM container format (see NVVM Container) carries both the IR and its summary in a single bitcode file.
Import Priority System
The 4-level priority system is the primary extension over upstream LLVM's binary importable/not-importable model. Upstream uses GlobalValueSummary::ImportKind which is essentially a boolean; NVIDIA introduces graduated priority levels that feed a floating-point threshold multiplier in the importer.
| Level | Value | Meaning | Importer behavior |
|---|---|---|---|
| 0 | 0b000 | Not importable | Never imported |
| 1 | 0b001 | Low priority | Threshold multiplied by cold multiplier (dword_4FAACC0) |
| 2 | 0b010 | Standard | Threshold multiplied by default multiplier (dword_4FAB040) |
| 3 | 0b011 | Force-import | Threshold multiplied by hot multiplier (dword_4FAAE80) |
The importer at sub_1853180 converts the integer base threshold to float, multiplies by the per-priority-level constant, converts back to integer, and compares against the function's cost from the summary (stored at offset 0x40 in the summary entry). A fourth multiplier (dword_4FAADA0) handles "critical" priority (priority class 4 in the importer's switch), though the summary builder only produces levels 0--3.
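The threshold test in sub_1853180 can be sketched as follows. The hot (10.0) and cold (0.0) multipliers mirror the documented knob defaults; the default and critical multiplier values are placeholders, since the binary loads them from globals whose runtime values vary:

```c
#include <assert.h>

/* Sketch of the importer's priority-scaled threshold comparison.
 * Multipliers for priority 2 and 4 are assumptions. */
static int should_import(int base_threshold, int priority, unsigned cost) {
    static const float multiplier[5] = {
        0.0f,   /* 0: not importable */
        0.0f,   /* 1: low -> cold multiplier (dword_4FAACC0 default) */
        1.0f,   /* 2: standard -> default multiplier (dword_4FAB040, assumed) */
        10.0f,  /* 3: force -> hot multiplier (dword_4FAAE80 default) */
        100.0f  /* 4: critical (dword_4FAADA0, value assumed) */
    };
    float limit = (float)base_threshold * multiplier[priority];
    return (float)cost <= limit;  /* cost from summary entry offset 0x40 */
}
```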
For comdat/linkonce symbols discovered during Phase 3, a special minimum priority applies:
min_priority = 3 * (dword_4F87C60 != 2) + 1;
// dword_4F87C60 == 2: min_priority = 1 (conservative)
// dword_4F87C60 != 2: min_priority = 4 (aggressive import)
Hash Table Infrastructure
The builder manages multiple open-addressing hash tables with different entry sizes. All use the standard DenseMap pointer hash and growth policy; see Hash Table and Collection Infrastructure for the common implementation.
| Table | Entry size | Probe strategy | Purpose |
|---|---|---|---|
| Primary (v384--v387) | 16 bytes | Linear probing | Main summary entries (ptr + metadata) |
| Secondary (v388--v393) | 8 bytes | Linear probing | Forward-declared symbol GUIDs |
| GUID dedup (v406--v407) | 8 bytes | Linear scan + memmove | Deduplication during merge |
| Seen set (v451--v455) | Variable | Flat array or hash | Tracks processed GlobalValues |
The "seen set" has two modes selected by v455: when v455 = 1, it uses a flat inline buffer at v456 with HIDWORD(v453) as the count; when v455 = 0, it switches to a hash table via sub_C8CA60. This dual-mode design optimizes for the common case of small modules (flat scan is faster when count is low) while scaling to large modules.
Rehash strategy: new_capacity = max(64, next_power_of_2(4 * current_count)). The power-of-2 is computed via _BitScanReverse. If the new capacity equals the old, the table is cleared in-place via memset to the empty sentinel (0xFF for 8-byte entries, 0xF8 for 16-byte entries). Otherwise the old buffer is freed and a new one allocated via sub_C7D670 (aligned_alloc(8, size)).
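The sizing rule reduces to a few lines; a sketch under the assumption that _BitScanReverse is used only to round up to a power of two (the loop below replaces the intrinsic for portability):

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for the _BitScanReverse-based round-up. */
static uint32_t next_pow2(uint32_t v) {
    uint32_t p = 1;
    while (p < v) p <<= 1;
    return p;
}

/* Rehash sizing: next power of two at or above 4 * count, floor 64. */
static uint32_t new_capacity(uint32_t count) {
    uint32_t want = next_pow2(4 * count);
    return want < 64 ? 64 : want;
}
```

Growing to 4x the live count keeps the post-rehash load factor at or below 25%, which is why the in-place clear path (new capacity == old) is only hit on shrinking workloads.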
Knobs and Global Variables
| Symbol | Type | Default | Effect |
|---|---|---|---|
| dword_4F87C60 | int | 0 | Import priority override: 0 = normal, 1 = force all importable, 2 = conservative mode |
| qword_4F878A8 | bool | false | When set in ThinLTO mode, forces re-analysis of all referenced-but-undefined symbols |
| byte_3F871B3 | byte | (varies) | Cross-module GUID namespace prefix, distinguishes same-named symbols across modules |
| dword_4FAB120 | int | -1 | Global import budget (-1 = unlimited) |
| dword_4FAA770 | int | 0 | Running count of imports performed |
| dword_4FAAE80 | float | (varies) | Hot function threshold multiplier |
| dword_4FAACC0 | float | (varies) | Cold function threshold multiplier |
| dword_4FAADA0 | float | (varies) | Critical section threshold multiplier |
| dword_4FAB040 | float | (varies) | Default threshold multiplier |
| byte_4FAAA20 | bool | false | Enable thinlto_src_module metadata annotation on imported functions |
The dword_4F87C60 override is the most impactful knob. Setting it to 1 makes every function importable regardless of its linkage or visibility, which is useful for whole-program optimization but can cause link-time explosions. Setting it to 2 enables conservative mode where comdat symbols get minimal priority (level 1 instead of 4), preventing aggressive cross-module import of weakly-linked symbols.
Comparison with Upstream ModuleSummaryAnalysis
| Aspect | Upstream LLVM | CICC NVModuleSummary |
|---|---|---|
| Entry point | computeFunctionSummary() | sub_D7D4E0 (2571 lines vs ~400) |
| Priority levels | Binary (importable or not) | 4 levels (0--3) with float multipliers |
| Complexity metric | Flat instruction count | 28-bit profile-scaled budget |
| Call edge annotation | CalleeInfo::HotnessType (4 values) | 136-byte records with full type metadata |
| Address space awareness | None | Filters device-only (AS 25) from import |
| Kernel detection | None | Opcode-36 probe for __global__ functions |
| Declaration re-walk | None | Two-phase merge of declared + defined |
| CUDA context | None | 6 accumulators for device call patterns |
| Hash table sizing | LLVM DenseMap | Custom open-addressing with dual-mode seen set |
| Profile integration | BFI-based hotness | ProfileSummaryInfo scaled budget |
| Serialization | Standard ModuleSummaryIndex bitcode | Same format, extended fields |
The most architecturally significant difference is the priority system. Upstream LLVM makes a binary import/no-import decision based on a single threshold comparison. NVIDIA's 4-level system allows the importer to process functions in priority order (primary/secondary/tertiary passes in sub_1854A20) with different threshold multipliers per level, enabling much finer control over cross-module optimization aggressiveness.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| NVModuleSummary::buildModuleSummary() -- main builder | 0xD7D4E0 | 74 KB | -- |
| NVModuleSummary::runOnModule() -- LTO driver | 0xD81040 | 56 KB | -- |
| NVModuleSummary::analyzeFunction() | 0xD741C0 | 19 KB | -- |
| NVModuleSummary::processGlobalRef() | 0xD6FF50 | 47 KB | -- |
| NVModuleSummary::collectGlobalInfo() | 0xD6A180 | 21 KB | -- |
| NVModuleSummary::analyzeCallGraph() | 0xD6EA70 | 19 KB | -- |
| NVModuleSummary::visitInstruction() | 0xD7B190 | 9 KB | -- |
| Alias processing helper | 0xD738B0 | 11 KB | -- |
| NVModuleSummary::computeImportCost() | 0xD72D40 | 9 KB | -- |
| NVModuleSummary::resolveReferences() | 0xD64DE0 | 16 KB | -- |
| NVModuleSummary::getTypeMetadata() | 0xD669C0 | 11 KB | -- |
| NVModuleSummary::processTypeId() | 0xD640E0 | 12 KB | -- |
| NVModuleSummary::computeVisibility() | 0xD63080 | 11 KB | -- |
| Summary serialization helper (recursive) | 0xD60CE0 | 15 KB | -- |
| Summary serialization helper | 0xD61E90 | 10 KB | -- |
| NVModuleSummary::packFunctionSummary() -- 14-arg final packer | 0xD77220 | -- | -- |
| NVModuleSummary::addInlineSummary() -- CUDA context collector | 0xD7CF70 | -- | -- |
| NVModuleSummary::addEdge() | 0xD76530 | -- | -- |
| NVModuleSummary::addRef() | 0xD768F0 | -- | -- |
| NVModuleSummary::addSpecialGlobal() (llvm.used etc.) | 0xD76CA0 | -- | -- |
| NVModuleSummary::addTypeRef() | 0xD76D40 | -- | -- |
| NVModuleSummary::computeNextPrime() -- hash table sizing | 0xD76FC0 | -- | -- |
| NVModuleSummary::getModuleHash() | 0xD771D0 | -- | -- |
| NVModuleSummary::destroyEdgeList() | 0xD77880 | -- | -- |
| NVModuleSummary::destroyRefList() | 0xD786F0 | -- | -- |
| NVModuleSummary::compareImportPriority() | 0xD788E0 | -- | -- |
| NVModuleSummary::computeSymbolHash() | 0xD789D0 | -- | -- |
| NVModuleSummary::resizeTable() | 0xD78B00 | -- | -- |
| NVModuleSummary::normalizeImportPriority() | 0xD78C90 | -- | -- |
| NVModuleSummary::addCallEdge() | 0xD793D0 | -- | -- |
| Rehash/resize (next power-of-2, min 64) | 0xD79200 | -- | -- |
| NVModuleSummary::copyTable() | 0xD7A410 | -- | -- |
| NVModuleSummary::mergeSymbols() | 0xD7A690 | -- | -- |
| NVModuleSummary::computeFinalOrder() | 0xD7AC80 | -- | -- |
| NVModuleSummary::getOrInsertSummary() | 0xD7BAA0 | -- | -- |
| NVModuleSummary::visitGlobalValue() | 0xD7BD50 | -- | -- |
| NVModuleSummary::getImportKind() | 0xD84370 | -- | -- |
| NVModuleSummary::isImported() | 0xD84440 | -- | -- |
| NVModuleSummary::isImportCandidate() | 0xD84450 | -- | -- |
| NVModuleSummary::processInliningDecisions() | 0xD8B020 | 21 KB | -- |
| NVModuleSummary::computeInlineBenefit() | 0xD8C2B0 | 8 KB | -- |
| NVModuleSummary::buildCalleeList() | 0xD8D9B0 | 9 KB | -- |
| NVModuleSummary::cloneModuleSummary() | 0xD8E7E0 | 32 KB | -- |
| GUID lookup/creation (namespace-aware) | 0x9CA390 | -- | -- |
| Get attribute group by kind from GlobalValue | 0xB91C10 | -- | -- |
| ProfileSummaryInfo::getProfileCount() | 0xFDD860 | -- | -- |
| ProfileSummaryInfo::getHotThreshold() | 0xFDC4B0 | -- | -- |
| writeModuleSummary() -- bitcode serializer | 0x1535340 | 26 KB | -- |
| parseModuleSummaryIndex() -- bitcode deserializer | 0x150B5F0 | 63 KB | -- |
Cross-References
- Inliner Cost Model -- consumes complexity budget for cross-module inline decisions
- ThinLTO Function Import -- reads summaries, applies threshold multipliers per priority level
- NVVM Container Format -- the bitcode container that carries serialized summaries
- GlobalOpt -- uses summary visibility information for global optimization
- WholeProgramDevirtualization -- consumes type test GUIDs from the summary
Inliner Cost Model
CICC v13.0 contains four parallel inliner cost models -- an architecturally unusual design that reflects both the historical evolution of NVIDIA's compiler and the fundamental differences between GPU and CPU inlining economics. The NVIDIA custom inliner at 0x1864060 (75 KB, 2135 decompiled lines) uses a 20,000-unit budget that is 89x the upstream LLVM default of 225. Roughly 60% of the custom inliner's code computes type-size comparisons for argument coercion cost, because on GPU the dominant cost of a function call is not instruction count but .param address-space marshaling. Alongside the custom model, CICC also links the standard LLVM InlineCostAnalysis at 0x30DC7E0 (51 KB), a New Pass Manager CGSCC inliner at 0x2613930 (69 KB) with ML-based advisory support, and an NVPTX target-specific cost modifier at 0x38576C0 (58 KB) that injects a +2000 bonus for GPU intrinsics.
| Model A: NVIDIA custom | sub_1864060 (0x1864060, 75 KB, CGSCC) |
| Model B: LLVM standard | sub_30DC7E0 (0x30DC7E0, 51 KB, InlineCostAnalysis) |
| Model C: New PM CGSCC | sub_2613930 (0x2613930, 69 KB, recursive SCC) |
| Model D: NVPTX target | sub_38576C0 (0x38576C0, 58 KB, opcode-based) |
| Knob constructor | ctor_186_0 (0x4DBEC0, 14 KB) |
| LLVM knob constructor | ctor_625_0 / ctor_715_0 (0x58FAD0, 27 KB) |
Why Four Inliner Models
The four models are not truly interchangeable alternatives -- they serve overlapping but distinct roles in the compilation pipeline:
Model A is the original NVIDIA inliner, predating the LLVM 14+ New Pass Manager. It operates on NVIDIA's internal NVVM IR node format (not LLVM IR), walks the callee body with bespoke type-size arithmetic, and is the only model that understands .param-space argument coercion costs. It runs inside the legacy CGSCC inliner framework via sub_186CA00 (Inliner::inlineCallsImpl). When CICC runs in its default optimization pipeline, this is the model that makes the bulk of inlining decisions.
Model B is upstream LLVM's InlineCostAnalysis::analyzeCall, compiled into CICC essentially unmodified. It uses LLVM's instruction-counting cost model with a 225-unit default threshold, the inline-threshold, inlinedefault-threshold, and PGO deferral knobs. It exists because CICC links the full LLVM codebase and certain LLVM passes (e.g., the always-inliner, sample-profile inliner) call into getInlineCost / analyzeCall directly.
Model C is the New Pass Manager's CGSCC inliner at 0x2613930. It handles recursive SCC splitting, carries the function-inline-cost-multiplier knob for penalizing recursive functions, and can delegate decisions to an InlineAdvisor (sub_2609820, 57 KB). The advisor supports three modes registered in the pipeline parser: default, development (training), and release (inference). The ML model inference path lives at sub_29B2CD0 / sub_29B4290. CICC registers the pipeline string "inliner-ml-advisor-release" for the release mode (parser slot 49).
Model D is an NVPTX target-specific cost modifier at 0x38576C0 that adjusts inline costs based on opcode analysis. Its primary contribution is a +2000 cost bonus for functions containing opcode tag 9 instructions (see Opcode Tag 9 Bonus below). This runs as a layer on top of whichever primary cost model is active, modifying the accumulated cost at offset+72 and comparing against the threshold at offset+76.
The historical layering is: NVIDIA built Model A first for their custom NVVM IR, then LLVM matured its own inliner (Model B), then the New PM arrived with ML advisory (Model C), and NVPTX target hooks added GPU-specific adjustments (Model D). Rather than consolidating, NVIDIA kept all four because each handles a different phase or code path in the pipeline.
The .param Address Space Problem
Understanding the NVIDIA inliner requires understanding why GPU function calls are so expensive compared to CPU calls. On x86, a function call requires moving arguments into registers or onto the stack, a CALL instruction, and a RET; the total overhead is typically 5-20 cycles.
On NVIDIA GPUs, there is no hardware call stack for registers. The PTX calling convention works through the .param address space:
- Caller declares `.param` variables via `DeclareParam` (opcode 505) or `DeclareScalarParam` (opcode 506) for each argument.
- Caller stores argument values into `.param` space via `st.param` instructions (opcodes 571-573 for StoreV1/V2/V4).
- Caller emits the `call` instruction referencing the `.param` declarations.
- Callee loads arguments from `.param` space via `ld.param` instructions.
- Return values come back through `.param` space via `ld.param` (opcodes 515-516, 568-570 for LoadRetParam / LoadV1/V2/V4).
- Byval arguments (structs passed by value) copy the entire struct to `.param` space field by field.
Each function call therefore generates O(n) st.param + O(n) ld.param instructions where n is the number of arguments, plus register save/restore if the callee needs more registers than are available (spills go to local memory, which is device DRAM -- hundreds of cycles). Additionally, call boundaries destroy instruction scheduling freedom, prevent cross-boundary register allocation, and create branch divergence hazards at the call/return sites.
This is why NVIDIA's default inline budget of 20,000 is not as aggressive as it sounds: inlining a function with 50 instructions but 8 struct arguments might save hundreds of cycles of .param marshaling overhead.
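The per-call marshaling traffic implied by this sequence can be reduced to a simple counting model. This is a hedged sketch: it assumes one scalar `st.param`/`ld.param` pair per argument and ignores the vectorized StoreV2/V4 and LoadV2/V4 forms, which would lower the counts.

```python
def param_marshaling_ops(num_args, num_returns=0):
    """Count .param memory operations for one call, per the sequence above:
    one st.param per argument (caller side) plus one ld.param per argument
    (callee side), plus an ld.param for each returned value. Worst-case
    scalar model; vectorized V2/V4 forms would reduce these counts."""
    arg_ops = 2 * num_args   # st.param (caller) + ld.param (callee)
    ret_ops = num_returns    # ld.param for each return value
    return arg_ops + ret_ops

# A call with 8 arguments generates at least 16 .param memory operations.
assert param_marshaling_ops(8) == 16
```

This is the O(n) per-argument traffic the text describes, before any register save/restore or divergence penalties are counted.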
Model A: NVIDIA Custom Inliner
Knob Inventory
All knobs are registered in ctor_186_0 at 0x4DBEC0:
| Knob | Type | Default | Purpose |
|---|---|---|---|
| `inline-budget` | int | 20,000 | Per-caller inlining cost budget |
| `inline-total-budget` | int | (none) | Global total budget across all callers in the module |
| `inline-adj-budget1` | int | (none) | Secondary per-caller budget, dynamically adjusted |
| `nv-inline-all` | bool | off | Force inline every function call unconditionally |
| `profuseinline` | bool | off | Verbose inlining diagnostics (NVIDIA profuse framework) |
| `inline-switchctrl` | int | (none) | Switch-statement inlining heuristic tuning |
| `inline-numswitchfunc` | int | (none) | Penalty based on number of switch statements in the callee |
| `inline-maxswitchcases` | int | (none) | Maximum switch cases before the cost penalty applies |
| `disable-inlined-alloca-merging` | bool | off | Disable post-inline alloca merging |
CLI surface mapping:
| User Flag | Routed To |
|---|---|
| `-aggressive-inline` | `-inline-budget=40000` (2x default) |
| `-disable-inlining` | `-disable-inlining` |
| `-inline-budget=N` | Sets the per-caller budget directly |
| `-inline-info` | Diagnostic flag for inline decisions |
Entry and Early Bail-Outs
The entry point sub_1864060 takes four arguments: a1 = function/callsite node, a2 = context, a3 = callback, a4 = data pointer. The function performs a series of eligibility checks before any cost computation:
Intrinsic name check. Calls sub_1649960(a1) to retrieve the function name. If the name starts with the 4-byte magic 0x6D6C6C6C (an LLVM intrinsic prefix) followed by '.', returns 0 immediately. LLVM intrinsics are never inlined through this path.
Pre-analysis walk. Initializes a 32-byte inline-analysis state struct via sub_1ACF5D0, then calls sub_1ACF600 which delegates to sub_1ACF0B0. This walks the callee body to collect basic metrics (instruction count, call count, basic block count). If the pre-analysis returns nonzero, the function is not analyzable.
Linkage check. Reads the byte at a1+32. The low nibble encodes linkage class: values 7 (linkonce_odr) and 8 (weak_odr) are eligible for inlining. Bits [7:6] encode visibility: 0x2 = hidden (OK), 0x1 = protected (bail). The function also requires byte at a1+16 == 3 (function definition, not declaration), bit 0 of byte at a1+80 == 0 (no noinline attribute), and sub_15E4F60(a1) returning false (no optnone).
function shouldInline(callsite):
name = getName(callsite.callee)
if name starts with LLVM_INTRINSIC_PREFIX:
return NEVER_INLINE
state = initAnalysisState()
if preAnalyze(callsite.callee, state) != 0:
return NEVER_INLINE
linkage = callsite.callee.linkage
if linkage not in {linkonce_odr, weak_odr}:
return NEVER_INLINE
if callsite.callee.isDeclaration:
return NEVER_INLINE
if callsite.callee.hasNoinline:
return NEVER_INLINE
if callsite.callee.hasOptnone:
return NEVER_INLINE
// ... proceed to cost computation
Callee Body Scan
After eligibility checks pass, the inliner walks the callee's operand/argument list (linked list at a1+8). Each argument node is classified by its type tag at byte offset +16 via sub_1648700:
| Tag Range | Meaning | Action |
|---|---|---|
| <= 0x17 | Basic types or call-like | If tag == 5 (phi): recurse into operands, check all > 0x17; otherwise bail |
| 0x36 (54) | Load-like instruction | Collect into loads vector |
| 0x37 (55) | Store-like instruction | Collect into stores vector |
| 0x47 (71, 'G') | Aggregate/GEP | Enter sub-operand scan |
The loads and stores are accumulated into two SmallVectors (v357, v360) with initial inline capacity of 4 elements each. These vectors are the input to the argument coercion cost check.
Load-Store Combinatorial Bail-Out
Before proceeding to the expensive type-size computation, the function checks:
if (num_loads * num_stores > 100):
return BAIL_OUT // Too expensive argument copy pattern
This prevents inlining functions where argument materialization would create a quadratic load-store explosion. Consider a function taking 4 struct-by-value arguments, each with 30 fields: that is 120 loads times 120 stores = 14,400 combinations, far above the 100 threshold. Without this guard, the type-size computation engine below would take unreasonable time.
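The guard can be expressed directly; the 100-combination limit is taken from the decompiled check, and the struct-argument scenario below reuses the worked example from the text.

```python
def passes_load_store_guard(num_loads, num_stores, limit=100):
    """Combinatorial bail-out from the decompiled check: reject callees
    whose argument materialization would pair every collected load with
    every collected store, i.e. a quadratic explosion."""
    return num_loads * num_stores <= limit

# 4 struct-by-value args with 30 fields each: 120 loads x 120 stores.
assert not passes_load_store_guard(120, 120)   # 14,400 > 100 -> bail out
assert passes_load_store_guard(10, 10)         # exactly 100 -> proceed
```

Note the comparison is strict (`> 100` bails), so a callee sitting exactly at 100 combinations still reaches the type-size engine.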
Type-Size Computation Engine
The bulk of sub_1864060 -- lines 1140 through 2100, approximately 60% of the function -- is a type-size computation engine. This is the single most distinctive feature of the NVIDIA inliner: where LLVM counts instructions, NVIDIA computes byte-level argument coercion costs.
The engine walks NVVM IR type nodes and computes byte sizes for each argument at both the callsite (actual argument) and the callee (formal parameter). The type tag dispatch is repeated 8+ times across different contexts:
| Type Tag | Type | Size Computation |
|---|---|---|
| 0x01 | half | 16 bits |
| 0x02 | float | 32 bits |
| 0x03 | double | 64 bits |
| 0x04 | fp80 | 80 bits |
| 0x05 | fp128 | 128 bits |
| 0x06 | ppc_fp128 | 128 bits |
| 0x07 | pointer | sub_15A9520(module, 0) for target pointer size |
| 0x08 | array | element_type_size * count (recursive) |
| 0x09 | x86_mmx | 64 bits |
| 0x0A | vector | element_type_size * count (recursive) |
| 0x0B | integer | (dword >> 8) bits |
| 0x0C | function | Recurse (unusual, but handled) |
| 0x0D | struct | sub_15A9930 for layout size |
| 0x0E | packed struct | Manual: 8 * count * align * ceil |
| 0x0F | named type | sub_15A9520(module, type_id) |
| 0x10 | opaque/token | element_type_size * count |
The byte-size formula applied uniformly is:
byte_size = (multiplier * bit_width + 7) >> 3
The core comparison at the heart of the cost model:
if callee_arg_size > callee_formal_size:
// Argument is being widened at the call boundary
// This costs extra st.param + ld.param instructions
// Proceed to next comparison level (accumulate cost)
else:
// Sizes match or shrink -- this argument pair is OK
Arguments are processed in groups of 4 (loop unrolled at line 2098: v142 += 4, --v306 where v306 = num_stores * 8 >> 5, i.e., groups of 4 store arguments). Remainder arguments (1-3 after the groups-of-4 loop) are handled by the type compatibility check function sub_185CCC0 which calls sub_15CCEE0 for type matching.
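A minimal sketch of the tag dispatch and rounding formula, covering a handful of the 16 tags. The dictionary node shapes and the 64-bit pointer width are assumptions standing in for the real NVVM IR type nodes and the target query `sub_15A9520`; the tag values and the `(bits + 7) >> 3` rounding follow the table above.

```python
POINTER_BITS = 64  # assumption: 64-bit generic pointers

def type_bits(node):
    """Bit width per the tag table above (subset of the 16 tags)."""
    tag = node["tag"]
    if tag == 0x01: return 16                  # half
    if tag == 0x02: return 32                  # float
    if tag == 0x03: return 64                  # double
    if tag == 0x07: return POINTER_BITS        # pointer (target-dependent)
    if tag in (0x08, 0x0A):                    # array / vector: recursive
        return type_bits(node["elem"]) * node["count"]
    if tag == 0x0B: return node["dword"] >> 8  # integer: width in bits [15:8]
    raise ValueError(f"unmodeled tag {tag:#x}")

def byte_size(node, multiplier=1):
    # The uniform rounding formula: byte_size = (multiplier * bit_width + 7) >> 3
    return (multiplier * type_bits(node) + 7) >> 3

i1 = {"tag": 0x0B, "dword": 1 << 8}                   # 1-bit integer
vec4f = {"tag": 0x0A, "count": 4, "elem": {"tag": 0x02}}
assert byte_size(i1) == 1      # 1 bit rounds up to 1 byte
assert byte_size(vec4f) == 16  # 4 x 32 bits = 16 bytes
```

The recovered engine repeats this dispatch 8+ times inline rather than factoring it into one helper; the sketch collapses that into a single function for readability.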
Struct Layout Walk
The helper sub_185B2A0 (3 KB) performs a stack-based DFS walk of struct type trees to count fields. It handles pointer types (tag 15), struct types (tag 13/14), and array types (tag 16). The walk has a hard depth limit of 20 levels, preventing runaway recursion on deeply nested struct definitions.
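A sketch of the field-counting walk with the depth-20 cutoff. Recursion stands in here for the binary's explicit stack, and the node shape is invented for illustration; only the depth limit and the struct/array/leaf handling come from the recovered helper.

```python
MAX_DEPTH = 20  # hard limit observed in sub_185B2A0

def count_fields(ty, depth=0):
    """Count leaf fields of a type tree, descending through structs and
    arrays, bailing out at depth 20. Returns None on depth overflow."""
    if depth >= MAX_DEPTH:
        return None
    kind = ty["kind"]
    if kind == "struct":
        total = 0
        for field in ty["fields"]:
            n = count_fields(field, depth + 1)
            if n is None:
                return None
            total += n
        return total
    if kind == "array":
        n = count_fields(ty["elem"], depth + 1)
        return None if n is None else n * ty["count"]
    return 1  # scalar or pointer leaf

pair = {"kind": "struct", "fields": [{"kind": "scalar"}, {"kind": "scalar"}]}
arr = {"kind": "array", "count": 3, "elem": pair}
assert count_fields(arr) == 6   # 3 elements x 2 fields
```

The depth limit is what keeps the walk linear even on pathological, deeply nested type definitions.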
Argument Coercion Check
The helper sub_185D7C0 (9 KB) classifies each callee operand and determines whether argument coercion is needed at the inline callsite. For each operand in the callee's argument linked list at a1+8, it:
- Reads the instruction tag via `sub_1648700`.
- Computes the formal parameter type size.
- Computes the actual argument type size at the callsite.
- If sizes differ, flags this argument as requiring coercion (extra cost).
- If the argument is a struct, invokes the struct layout walk to count individual field copies.
Callsite Transformation
When the callee qualifies for "alias inline" (replacing a call with direct body substitution), the function:
- Allocates a new 88-byte IR node via `sub_1648A60(88, 1)`.
- Builds a function reference node via `sub_15F8BC0`.
- Builds a call replacement node via `sub_15F9660`.
- Walks callee operands to collect phi nodes into a worklist.
- For each phi: copies via `sub_1596970`, updates operands via `sub_15F2120`, replaces references via `sub_1648780`.
- Deletes original phis via `sub_159D850`.
- Performs final callsite replacement via `sub_164D160` + `sub_15E55B0`.
Switch Statement Heuristics
Three dedicated knobs control inlining of switch-heavy functions. On GPU, large switch statements are particularly costly because:
- Branch divergence: Each thread in a warp may take a different case, serializing execution.
- No branch prediction hardware: Every divergent branch pays full penalty.
- Control flow reconvergence: The hardware must synchronize threads after the switch, wasting cycles.
The inline-switchctrl knob tunes the general heuristic sensitivity. inline-numswitchfunc penalizes functions containing many switch statements. inline-maxswitchcases sets a case-count ceiling beyond which a switch-heavy callee is considered too expensive to inline regardless of other factors.
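How the three knobs combine is not recovered from the binary, so the following is a purely hypothetical shape showing how a case-count ceiling and a per-switch penalty could interact; the knob names are real, but the default values and the combination logic are invented for illustration.

```python
def switch_penalty(num_switches, max_cases_seen,
                   numswitchfunc_penalty=500,   # invented scale factor
                   maxswitchcases=32):          # invented default ceiling
    """Hypothetical combination of the three switch knobs: a hard ceiling
    on case count (inline-maxswitchcases) and a per-switch-statement
    penalty (inline-numswitchfunc). Returns None when the callee is
    considered too switch-heavy to inline at all."""
    if max_cases_seen > maxswitchcases:
        return None                 # exceeds inline-maxswitchcases ceiling
    return num_switches * numswitchfunc_penalty

assert switch_penalty(0, 0) == 0          # no switches, no penalty
assert switch_penalty(2, 8) == 1000       # two switches, modest cases
assert switch_penalty(1, 64) is None      # one huge switch -> reject
```

Whatever the real arithmetic, the observable effect is the same: switch-heavy callees must clear a higher bar before the divergence and reconvergence costs are accepted.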
nv-inline-all: Force-All Mode
The nv-inline-all knob bypasses cost analysis entirely and forces inlining of every call. This is used for specific compilation modes where the call graph must be completely flattened:
- OptiX ray tracing: The hardware intersection pipeline requires a single monolithic function. All user-defined intersection, closest-hit, any-hit, and miss programs must be inlined into a single continuation function.
- Aggressive LTO: When doing whole-program optimization with small modules, flattening removes all call overhead.
Three-Level Budget System
NVIDIA uses a three-level budget to control inlining granularity:
- `inline-budget` (default 20,000): Per-caller limit. Caps how much code can be inlined into a single function, preventing any one function from becoming unreasonably large.
- `inline-total-budget`: Module-wide limit. Caps the total amount of inlining across all callers in the compilation unit.
- `inline-adj-budget1`: A secondary per-caller limit that may be dynamically adjusted based on context -- for example, kernel entry points (`__global__` functions) may receive a higher adjusted budget because they are the outermost scope and benefit most from aggressive inlining.
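The budget accounting can be sketched as follows. The knob names and the 20,000 default come from the table above; the charge/reject mechanics are an assumed model, not recovered logic.

```python
class InlineBudget:
    """Assumed accounting model for the per-caller and module-wide caps.
    Each accepted inline decision charges its cost against both levels;
    a decision is rejected if it would overrun either cap."""
    def __init__(self, per_caller=20_000, module_total=None):
        self.per_caller = per_caller        # inline-budget
        self.module_total = module_total    # inline-total-budget (optional)
        self.spent_module = 0
        self.spent_by_caller = {}

    def try_charge(self, caller, cost):
        spent = self.spent_by_caller.get(caller, 0)
        if spent + cost > self.per_caller:
            return False
        if self.module_total is not None and \
           self.spent_module + cost > self.module_total:
            return False
        self.spent_by_caller[caller] = spent + cost
        self.spent_module += cost
        return True

b = InlineBudget(per_caller=20_000, module_total=25_000)
assert b.try_charge("kernelA", 15_000)
assert not b.try_charge("kernelA", 6_000)   # would exceed the 20,000 per-caller cap
assert b.try_charge("kernelB", 10_000)      # module total reaches exactly 25,000
assert not b.try_charge("kernelB", 1)       # module-wide budget exhausted
```

The dynamically adjusted `inline-adj-budget1` would slot in as a per-caller override of `per_caller` for favored callers such as kernel entry points.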
The threshold adjustment helper at sub_1868880 (12 KB) modifies thresholds based on calling context through pure arithmetic on cost/threshold values (no string evidence, entirely numeric).
Alloca Merging
The disable-inlined-alloca-merging knob controls post-inline stack allocation merging. On GPU, "stack" means local memory, which is device DRAM (hundreds of cycles latency). Merging allocas from inlined callees with the caller's allocations reduces total local memory consumption. Lower local memory usage directly improves occupancy (more concurrent thread blocks per SM). The default is to enable merging.
Model B: LLVM Standard InlineCostAnalysis
The standard LLVM InlineCostAnalysis::analyzeCall at 0x30DC7E0 (51 KB) is compiled into CICC from upstream LLVM sources. Its knobs are registered in ctor_625_0 / ctor_715_0 at 0x58FAD0 (27 KB of option registration, an unusually large constructor due to the 40+ individual cost parameter registrations).
Key upstream LLVM knobs present in CICC:
| Knob | Default | Purpose |
|---|---|---|
| `inline-threshold` | 225 | Base inlining threshold |
| `inlinedefault-threshold` | 225 | Default when no hint/profile |
| `inlinehint-threshold` | 325 | Threshold for functions carrying the `inlinehint` attribute (the `inline` keyword hint) |
| `inline-cold-callsite-threshold` | 45 | Threshold for cold callsites |
| `inlinecold-threshold` | 45 | Threshold for functions with the `cold` attribute |
| `hot-callsite-threshold` | 3000 | Threshold for hot callsites (PGO) |
| `locally-hot-callsite-threshold` | 525 | Threshold for locally hot callsites |
| `inline-instr-cost` | 5 | Cost per instruction |
| `inline-call-penalty` | 25 | Penalty per callsite in callee |
| `inline-memaccess-cost` | 0 | Cost per load/store |
| `inline-savings-multiplier` | 8 | Multiplier for cycle savings |
| `inline-savings-profitable-multiplier` | 4 | Multiplier for profitability check |
| `inline-size-allowance` | 100 | Max callee size inlined without savings proof |
| `inline-cost-full` | false | Compute full cost even when over threshold |
| `inline-enable-cost-benefit-analysis` | false | Enable cost-benefit analysis |
| `inline-deferral` | (PGO) | Defer inlining in cold paths |
| `inline-remark-attribute` | (off) | Emit inline remarks |
The LLVM model fundamentally counts instructions (at inline-instr-cost = 5 units each) and subtracts savings from constant propagation, dead code elimination after argument specialization, and simplified control flow. This instruction-counting approach is appropriate for CPUs where call overhead is small and code size is the primary concern. It is inadequate for GPUs where argument marshaling dominates.
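A schematic rendering of the upstream model using the defaults from the table above. The real `analyzeCall` simplifies the callee instruction by instruction against the callsite's constant arguments; this sketch collapses all of that into a single savings term.

```python
# Defaults from the knob table: 5 units per instruction, 25 per nested
# callsite, 225 base threshold. The savings term stands in for constant
# propagation / dead-code simplification credited by the real analysis.
INSTR_COST, CALL_PENALTY, THRESHOLD = 5, 25, 225

def llvm_should_inline(num_instrs, num_calls, simplification_savings=0):
    cost = num_instrs * INSTR_COST + num_calls * CALL_PENALTY
    return cost - simplification_savings <= THRESHOLD

assert llvm_should_inline(40, 1)        # 200 + 25 = 225, exactly at threshold
assert not llvm_should_inline(50, 0)    # 250 > 225
assert llvm_should_inline(50, 0, 30)    # savings pull it back under
```

Note how the model is blind to argument shape: a callee with 8 struct arguments and a callee with 2 scalar arguments price identically if their instruction counts match, which is precisely the GPU-inadequacy the text describes.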
Model C: New PM CGSCC Inliner
The New Pass Manager inliner at 0x2613930 (69 KB) handles recursive SCC processing and integrates with LLVM's InlineAdvisor framework. Its key differentiation is the function-inline-cost-multiplier knob that penalizes recursive function inlining -- a scenario the NVIDIA custom inliner (Model A) does not handle.
The InlineAdvisor at sub_2609820 (57 KB) supports three modes:
| Mode | Pipeline String | Behavior |
|---|---|---|
| default | "inline-advisor" | Heuristic-based (uses Model B cost analysis) |
| development | (training path) | Feature extraction for ML model training |
| release | "inliner-ml-advisor-release" | ML model inference via sub_29B2CD0 / sub_29B4290 |
The ML inference path extracts features from the callsite and callee (instruction count, call depth, loop nesting, etc.) and feeds them through a model to produce an inline/no-inline decision. This is standard upstream LLVM ML inlining infrastructure compiled into CICC; there is no evidence of NVIDIA-custom ML model weights, though NVIDIA could supply custom weights via the enable-ml-inliner knob (registered as an enum: {default, development, release}).
NVPTX Opcode Tag 9 Bonus (+2000)
Model D at sub_38576C0 modifies inline costs based on NVPTX-specific opcode analysis. The key logic:
for each instruction in callee:
tag = getOpcodeTag(instruction)
if ((tag >> 4) & 0x3FF) == 9:
inline_cost += 2000
// ... accumulate other per-instruction costs
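The same check in runnable form, making the bit extraction explicit: `(tag >> 4) & 0x3FF` reads a 10-bit tag field from bits [13:4] of the opcode word. The opcode words below are synthetic values constructed to exercise the check, not real NVPTX encodings.

```python
TAG9_BONUS = 2000

def opcode_tag(opcode_word):
    """Extract the 10-bit tag field at bits [13:4] of the opcode word."""
    return (opcode_word >> 4) & 0x3FF

def nvptx_cost_bonus(opcode_words):
    """Sum the +2000 bonus over every tag-9 instruction in the callee."""
    return sum(TAG9_BONUS for w in opcode_words if opcode_tag(w) == 9)

# Synthetic body: two words carrying tag 9 in bits [13:4], one that doesn't.
body = [9 << 4, (9 << 4) | 0x3, 0x12345]
assert nvptx_cost_bonus(body) == 4000
```

In the binary this sum feeds the accumulated cost at offset +72 before the comparison against the threshold at offset +76.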
The state layout of the cost analyzer object:
| Offset | Field | Purpose |
|---|---|---|
| +72 | Accumulated cost | Running sum of per-instruction costs |
| +76 | Threshold | Budget for this callsite |
| +120 | Per-instruction cost (lo) | Cost array element (low) |
| +128 | Per-instruction cost (hi) | Cost array element (high) |
The +2000 bonus for tag 9 opcodes encourages inlining of functions containing specific GPU operations -- likely tensor core instructions, warp-level intrinsics, or other operations that benefit significantly from being visible to the register allocator and instruction scheduler within the caller's scope. The bonus is large enough (equivalent to inlining ~400 regular LLVM instructions at cost 5 each) to override most size-based objections.
NVIDIA vs. LLVM: Complete Comparison
| Feature | NVIDIA (Model A) | LLVM (Model B) |
|---|---|---|
| Default threshold | 20,000 | 225 |
| Aggressive threshold | 40,000 | Varies by -O level |
| Primary cost metric | Argument type-size coercion | Instruction count |
| Cost per instruction | N/A (not instruction-based) | 5 units |
| Struct handling | Deep field-by-field walk (depth limit 20) | Aggregate flat cost |
| GPU opcode bonus | +2000 for tag 9 | N/A |
| Load x store bail-out | > 100 combinations | N/A |
| Switch heuristics | 3 dedicated knobs | 1 (case-cluster-penalty) |
| Budget system | Per-caller + module total + adjusted | Per-callsite only |
| Diagnostic knob | profuseinline | inline-remark-attribute |
| Force-all mode | nv-inline-all | inline-all-viable-calls (hidden) |
| ML-based advisor | No (separate path via Model C) | Yes (InlineAdvisor) |
| Recursive cost multiplier | No | function-inline-cost-multiplier |
| Alloca merging control | disable-inlined-alloca-merging | N/A |
| Call penalty | Implicit (.param marshaling cost) | 25 units per callsite |
| PGO integration | No evidence | inline-deferral, hot-callsite-threshold |
Decision Flowchart
The complete inlining decision flow through Model A:
CallSite arrives at sub_186CA00
|
sub_186B510: check remarks
|
sub_1864060: shouldInline
|
+--------+--------+
| |
Name is LLVM Name is user
intrinsic? function
| |
NEVER INLINE Init analysis state
sub_1ACF5D0
|
Pre-analyze callee
sub_1ACF600
|
+-------+-------+
| |
Returns 0 Returns != 0
(analyzable) (cannot analyze)
| |
Check linkage NEVER INLINE
(7=linkonce_odr
8=weak_odr)
|
+-----------+-----------+
| |
Eligible Not eligible
| (wrong linkage,
Check noinline, declaration,
optnone attrs protected vis)
| |
+-----+-----+ NEVER INLINE
| |
Has attr No attr
| |
NEVER INLINE Walk callee body
collect loads/stores
|
loads * stores > 100?
+-----+-----+
| |
Yes No
| |
BAIL OUT Type-size computation
(60% of function)
|
Compute per-argument
coercion cost
|
Total cost < inline-budget?
+-----+-----+
| |
Yes No
| |
INLINE DO NOT INLINE
Transform callsite
sub_1648A60 / sub_15F8BC0
Call Graph
sub_186CA00 Inliner::inlineCallsImpl (CGSCC SCC walk)
+-> sub_186B510 Inline decision with remarks
+-> sub_1864060 shouldInline / cost computation (THIS)
+-> sub_1ACF5D0 Inline analysis state init
+-> sub_1ACF600 Pre-analysis callee walk
| +-> sub_1ACF0B0 Metric collection
+-> sub_185FD30 Argument materialization cost (5 KB)
+-> sub_185E850 Post-inline cleanup assessment (9 KB)
+-> sub_185B2A0 Struct layout walk, depth limit 20 (3 KB)
+-> sub_185D7C0 Argument matching / coercion (9 KB)
+-> sub_185B9F0 Recursive operand simplification (5 KB)
+-> sub_185CCC0 Type compatibility check (4 KB)
+-> sub_18612A0 GlobalOpt integration (65 KB, conditional)
+-> sub_1868880 Inline threshold adjustment (12 KB)
+-> sub_1866840 Post-inline callsite update (42 KB)
Why 89x the LLVM Budget
The 20,000 vs. 225 ratio sounds extreme, but the economics are different:
CPU call overhead is approximately 5-20 cycles (push/pop registers, branch prediction handles the rest). A function with 50 instructions that is not inlined costs perhaps 60-70 cycles total. Inlining saves ~15 cycles. The savings must justify the I-cache pressure increase.
GPU call overhead includes: (1) declaring .param variables for every argument, (2) st.param for each argument value, (3) ld.param in the callee for each argument, (4) register save/restore to local memory (device DRAM, 200-800 cycle latency) if the callee's register demand exceeds what is available, (5) loss of instruction scheduling across the call boundary, (6) branch divergence at call/return. For a function with 8 arguments, the .param overhead alone is 16+ memory operations. With register spilling, a single function call can cost 1000+ cycles.
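The comparison above can be made concrete. The 5-20 cycle CPU range and the 200-800 cycle DRAM latency are the document's figures; the 20-cycle cost per `.param` memory operation is an illustrative assumption, not a measurement.

```python
PARAM_OP_CYCLES = 20    # assumed average cost of one st.param/ld.param
CPU_CALL_CYCLES = 20    # upper end of the 5-20 cycle CPU overhead range

def gpu_call_cycles(num_args, spill_cycles=0):
    """Marshaling cost of one GPU call: an st.param + ld.param pair per
    argument, plus any register save/restore traffic to device DRAM."""
    marshaling = 2 * num_args * PARAM_OP_CYCLES
    return marshaling + spill_cycles

# 8 arguments: 16 .param ops -> 320 cycles of marshaling alone, and
# over 1000 cycles once a spill round-trip to device DRAM is added.
assert gpu_call_cycles(8) == 320
assert gpu_call_cycles(8, spill_cycles=800) == 1120
assert gpu_call_cycles(8) // CPU_CALL_CYCLES == 16   # ~16x the CPU overhead
```

Even under these rough assumptions the ratio between GPU and CPU call overhead lands in the same order of magnitude as the 89x budget ratio.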
Furthermore, GPU functions tend to be small (typically 10-100 instructions for device helper functions). The NVIDIA cost model does not count instructions at all -- it counts the argument marshaling cost. A function with 200 instructions but 2 scalar arguments is cheap to call; a function with 10 instructions but 8 struct arguments is expensive. The 20,000 budget reflects this: it is not 89x more aggressive in inlining large functions; it is calibrated for a cost model where the per-argument coercion cost dominates rather than instruction count.
With -aggressive-inline (budget 40,000, i.e., 178x the LLVM default), NVIDIA targets workloads like OptiX where complete flattening is desired but nv-inline-all is too blunt (it ignores all cost analysis).
What Upstream LLVM Gets Wrong for GPU
Upstream LLVM's inliner cost model was built for x86/AArch64 where function call overhead is small and code size is the primary inlining constraint. On GPU, every assumption is wrong:
- Upstream assumes a 225-instruction budget is sufficient. The default `inline-threshold` of 225 reflects CPU economics where a function call costs 5-20 cycles (register push/pop + branch). On GPU, a single function call with 8 struct arguments generates 16+ `.param`-space memory operations, potential register spills to device DRAM (200-800 cycle latency), loss of cross-boundary scheduling, and branch divergence hazards. NVIDIA's 20,000-unit budget (89x upstream) is calibrated for this reality, not because GPU code is more aggressive about inlining large functions.
- Upstream counts instructions as the primary cost metric. LLVM prices each instruction at 5 units and subtracts savings from constant propagation and dead code elimination. NVIDIA's custom inliner (Model A) does not count instructions at all -- 60% of its 75 KB body computes byte-level argument type-size coercion costs, because on GPU the dominant cost of a function call is `.param` address-space marshaling, not instruction count.
- Upstream has no concept of `.param`-space argument passing cost. CPU calling conventions pass arguments in registers (nearly free) or via L1-cached stack (3-5 cycles). On GPU, every argument requires explicit `DeclareParam` + `st.param` (caller) + `ld.param` (callee) sequences. A function with 10 instructions but 8 struct arguments is more expensive to call than one with 200 instructions and 2 scalar arguments. Upstream's model gets this exactly backwards.
- Upstream uses a single per-callsite budget. NVIDIA uses a three-level system: per-caller budget (`inline-budget`), module-wide total budget (`inline-total-budget`), and a dynamically adjusted secondary budget (`inline-adj-budget1`) that can give kernel entry points higher limits. This multi-level approach prevents any single caller from bloating while still allowing aggressive inlining where it matters most.
- Upstream has no GPU intrinsic awareness. NVIDIA's Model D applies a +2000 cost bonus for functions containing opcode tag 9 instructions (likely tensor core or warp-level intrinsics), because these operations benefit enormously from being visible to the register allocator and scheduler within the caller's scope. Upstream LLVM has no mechanism to express "this function contains operations that are disproportionately valuable to inline."
Key Addresses
| Address | Size | Function |
|---|---|---|
| 0x1864060 | 75 KB | shouldInline / inline cost computation |
| 0x186CA00 | 61 KB | Inliner::inlineCallsImpl (CGSCC core) |
| 0x186B510 | 20 KB | Inline decision with remarks |
| 0x1866840 | 42 KB | Post-inline callsite update |
| 0x1868880 | 12 KB | Inline threshold adjustment |
| 0x185FD30 | 5 KB | Argument materialization |
| 0x185E850 | 9 KB | Post-inline cleanup |
| 0x185B2A0 | 3 KB | Struct layout walk (depth 20) |
| 0x185D7C0 | 9 KB | Argument coercion check |
| 0x185B9F0 | 5 KB | Recursive operand simplification |
| 0x185CCC0 | 4 KB | Type compatibility check |
| 0x18612A0 | 65 KB | GlobalOpt integration |
| 0x1ACF5D0 | -- | Inline analysis state init |
| 0x1ACF600 | -- | Pre-analysis callee walk |
| 0x30DC7E0 | 51 KB | InlineCostAnalysis::analyzeCall (LLVM) |
| 0x2613930 | 69 KB | New PM CGSCC inliner |
| 0x2609820 | 57 KB | Inline advisor / ML inliner |
| 0x38576C0 | 58 KB | NVPTX target-specific cost modifier |
| 0x4DBEC0 | 14 KB | NVIDIA inliner knob registration |
| 0x58FAD0 | 27 KB | LLVM InlineCost option registration |
Reimplementation Checklist
- Type-size-based cost model (60% of the inliner). Implement the argument coercion cost engine that walks NVVM IR type nodes (16 type tags: half through opaque/token) to compute byte-level sizes for both callsite actuals and callee formals, using the formula `byte_size = (multiplier * bit_width + 7) >> 3`. Flag arguments where `callee_arg_size > callee_formal_size` as requiring `.param`-space widening.
- 20,000-unit budget system. Implement the three-level budget: per-caller `inline-budget` (default 20,000), module-wide `inline-total-budget`, and dynamically adjusted `inline-adj-budget1` (kernel entry points may receive higher limits). Include the `-aggressive-inline` mapping to budget 40,000 and the `nv-inline-all` force-all mode.
- Early bail-out chain. Implement the eligibility checks in order: LLVM intrinsic name prefix rejection, pre-analysis callee walk (instruction/call/block counts), linkage check (linkonce_odr/weak_odr only), visibility check, noinline/optnone attribute rejection, and the `loads * stores > 100` combinatorial bail-out.
- Struct layout walk (depth limit 20). Implement the stack-based DFS walk of struct type trees to count fields for coercion cost, handling pointer types (tag 15), struct types (tag 13/14), and array types (tag 16), with a hard depth limit of 20 levels.
- Switch statement heuristics. Implement the three GPU-specific switch knobs (`inline-switchctrl`, `inline-numswitchfunc`, `inline-maxswitchcases`) that penalize switch-heavy callees where branch divergence, absent branch prediction, and reconvergence overhead make inlining particularly costly.
- NVPTX opcode tag 9 bonus (+2000). Implement the target-specific cost modifier that scans callee instructions for opcode tag 9 (likely tensor core/warp intrinsics) and adds a +2000 bonus to encourage inlining functions containing GPU operations that benefit from cross-boundary register allocation and scheduling.
ThinLTO Function Import
CICC v13.0 implements LLVM's ThinLTO function import pipeline with GPU-specific modifications to the threshold computation, candidate filtering, and provenance tracking. The core of the system lives in two functions -- sub_1854A20 (the import driver, 4,326 bytes) and sub_1853180 (the threshold computation engine, 5,059 bytes) -- with an entry point at sub_1855B10 that parses the -summary-file / -function-import command line and orchestrates the whole-module import flow. The fundamental difference from CPU ThinLTO is that GPU compilation operates in a closed-world model: there are no shared libraries, no dynamic linking, and no PLT/GOT indirection. Every device function will be statically linked into the final PTX. This means CICC can afford far more aggressive import thresholds than CPU compilers, because the code size cost of importing is paid once per GPU binary rather than once per shared-object load.
The import subsystem reads NVModuleSummary data (built by sub_D7D4E0, see Module Summary) to make summary-guided decisions about which functions to pull from other translation units. Each candidate is evaluated against a floating-point threshold that incorporates callsite hotness, linkage type, and a per-priority-class multiplier. A global import budget caps the total number of imports to prevent compile-time explosion. After import, each materialized function receives thinlto_src_module metadata so downstream passes (particularly the inliner) know its origin module.
| Import driver | sub_1854A20 (0x1854A20, 4,326 B) |
| Threshold computation | sub_1853180 (0x1853180, 5,059 B) |
| Threshold comparison gate | sub_18518A0 (0x18518A0) |
| Import execution | sub_15E4B20 (0x15E4B20) |
| Import candidate evaluator | sub_1852CC0 (0x1852CC0) |
| Entry point | sub_1855B10 (0x1855B10, 10,503 B) |
| Whole-module processing | sub_1858B90 (0x1858B90, 31,344 B) |
| Type metadata propagation | sub_185E850 (0x185E850, 24,263 B) |
| Pipeline registration | "function-import" (slot 43, Module pass) |
| Knob constructor (primary) | ctor_184_0 (0x4DA920, 13,693 B) |
| Knob constructor (supplementary) | ctor_029 (0x489C80, 1,120 B) |
| Knob constructor (pass-level) | ctor_420_0 (0x532010, 11,787 B) |
Why GPU ThinLTO Differs from CPU ThinLTO
Upstream LLVM's ThinLTO was designed for CPU executables and shared libraries where import decisions must balance code size (impacts disk, cache, page faults) against optimization opportunity (cross-module inlining, constant propagation). The default import-instr-limit is 100 instructions, the cold multiplier is 0, and the hot multiplier is 10x. These conservative defaults reflect a world where over-importing bloats .text sections shared across address spaces.
GPU compilation inverts these tradeoffs:
- No shared libraries. Device code is statically linked into a fatbinary. There is no dynamic linker, no GOT, no PLT. Importing a function costs compile time but has zero runtime overhead beyond instruction cache pressure.
- Function calls are expensive. As documented in the inliner cost model, every GPU function call marshals arguments through `.param` address space via `st.param` / `ld.param` sequences. Inlining (which requires importing first) eliminates this overhead entirely.
- Closed-world optimization. The compiler sees all device code. There are no opaque DSOs. This means aggressive import cannot break ABI contracts that don't exist.
Register pressure is the real constraint. On GPU, the limiting factor is not code size but register count, which determines occupancy. Import + inline can actually reduce register pressure by enabling cross-function register allocation and eliminating
.param-space spills.
These factors push CICC toward much more aggressive import thresholds. The priority-class multiplier system (section below) allows CICC to tune import aggressiveness per-callsite rather than using a single global threshold.
What Gets Imported and What Does Not
The NVModuleSummary builder (sub_D7D4E0) assigns a 4-level import priority to every global value when building the module summary index:
| Priority | Meaning | Import behavior |
|---|---|---|
| 0 | Not importable | Local/hidden linkage, never imported |
| 1 | Importable, not preferred | Will import only if threshold is generous |
| 2 | Standard importable | Normal import candidate |
| 3 | Force-import | Highest priority, always imported if budget allows |
The priority is determined by querying the ImportPriorityTable (parameter a4 of sub_D7D4E0) via sub_D84370, sub_D84440 (force-import check), and sub_D84450 (importable check). A global override at dword_4F87C60 can force all symbols to priority 1 or higher.
Functions that are imported:
- __device__ functions with internal or linkonce_odr linkage (template instantiations, inline functions)
- Math library implementations (libdevice functions) called from device code
- Helper functions from header-only libraries (Thrust, CUB, cutlass templates)
- Constant global variables with initializers (import-constants-with-refs = true by default)
Functions that are NEVER imported:
- Kernels (__global__ functions). These are entry points. They are never candidates for cross-module import because they represent the root of execution; they are called from host code, not from other device functions. The summary builder marks them as non-importable.
- Host functions. Host code is handled by the host compiler (gcc/clang), not cicc. They never appear in the device module summary.
- Functions in address space 25. The summary builder at lines 1388-1395 explicitly skips functions whose type resolves to address space 25, with a goto LABEL_495 that bypasses the import-eligible path. The raw report notes: "device functions can't be cross-module imported in ThinLTO" -- this refers specifically to functions that are declarations only with device-memory address space linkage, meaning they reference device-side symbols without a definition in the current TU.
- Functions with the "not importable" flag. Bit 4 (0x10) of the linkage byte at offset +0x0C in the function summary entry. The import driver checks test byte [entry+0Ch], 0x10 and skips on set.
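The per-entry rejection tests and the linkage dispatch above can be condensed into a short sketch. This is a model of the documented bit tests, not a transcription of the binary; the function and constant names are ours:

```python
FUNCTION_SUMMARY = 2     # entry[+0x08]: entry-type code for a function summary
NOT_IMPORTABLE   = 0x10  # bit 4 of the linkage byte at entry[+0x0C]

def is_import_candidate(entry_type: int, linkage_byte: int) -> bool:
    """The two rejection tests applied before the linkage dispatch."""
    if entry_type != FUNCTION_SUMMARY:
        return False
    return not (linkage_byte & NOT_IMPORTABLE)

def dispatch_path(linkage_byte: int) -> str:
    """Which of the three paths the 11-case linkage switch selects."""
    linkage = linkage_byte & 0x0F
    if linkage in (0, 1, 3, 5, 6):   # External/AvailExt/Internal/ExtWeak/Common
        return "standard"
    if linkage in (7, 8):            # WeakAny/WeakODR: memcmp name check first
        return "weak-name-check"
    return "special"                 # Appending/Private/LinkOnceAny/LinkOnceODR

print(is_import_candidate(2, 0x10))   # False: not-importable flag set
print(dispatch_path(0x08))            # weak-name-check
```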
Import Algorithm: Complete Pseudocode
Complexity. Let C = number of import candidates across all modules and L = total number of name entries across all candidates.
- Stage 1 (threshold computation, sub_1853180) iterates every candidate once: O(C). The GUID dedup hash table (slot = GUID * 37 & (size - 1)) provides O(1) amortized lookup with linear probing. The name array scan is up to 4-level unrolled, giving O(L) total across all candidates. The 11-case linkage dispatch via jump table, the priority-class threshold adjustment (a single float multiply), and the global budget check are each O(1) per candidate. Overall Stage 1: O(C + L).
- Stage 2 (triple-pass driver, sub_1854A20) processes three priority-ordered linked lists, each in a single pass: O(C) total. Per-candidate import execution (sub_15E4B20) is O(I_f), where I_f = instructions in the imported function (bitcode materialization).
- Whole-module processing (sub_1858B90, 31 KB) is O(F * I_avg), where F = total functions and I_avg = average instruction count. The dedup hash table grows by doubling at a 75% load factor, maintaining O(1) amortized operations.
Total: O(C + L + sum(I_imported)).
The import process runs in two major stages. Stage 1 (sub_1853180) builds a prioritized list of qualifying candidates by evaluating each against a computed threshold. Stage 2 (sub_1854A20) materializes candidates via a triple-pass sweep over three priority-ordered linked lists, executing the actual cross-module function import.
Stage 1: Threshold Computation Engine (sub_1853180)
Address range: 0x1853180--0x1854543 (5,059 bytes). Six parameters, 0xB8-byte stack frame. Uses a jump table at dword_42BA140 for the 11-case linkage-type dispatch.
// sub_1853180 -- Threshold computation with GUID dedup and priority-class multipliers
//
// Evaluates every candidate in summary_ctx against base_threshold adjusted by
// priority class. Emits qualifying candidates to result_array as 24-byte
// entries {GUID, threshold, import_record_ptr}. Tracks already-evaluated
// GUIDs via guid_hash_table to prevent duplicate work.
//
// Binary: 0x1853180, 5059 bytes. Stack: 0xB8.
// Jump table: dword_42BA140 (11 entries, linkage dispatch).
//
// Globals read:
// dword_4FAAE80 hot_multiplier (float, default 10.0)
// dword_4FAACC0 cold_multiplier (float, default 0.0)
// dword_4FAADA0 critical_multiplier (float, default 100.0)
// dword_4FAB040 default_multiplier (float, default 1.0)
// dword_4FAB120 global_import_budget (int, default -1 = unlimited)
// dword_4FAA770 running_import_count (int, reset per module)
fn threshold_compute(
summary_ctx, // rdi -> [rbp-0x88]: candidate arrays and metadata
module_info, // rsi -> [rbp-0x58]: source module summary
base_threshold, // edx -> [rbp-0x7C]: integer base threshold (import-instr-limit)
guid_hash_table, // rcx -> [rbp-0x50]: DenseMap<uint64_t, metadata> for dedup
result_array, // r8 -> [rbp-0x60]: growable output array
visited_set, // r9 -> [rbp-0xA0]: tracks already-evaluated GUIDs
):
candidate_begin = summary_ctx[+0x28] // r12: start of candidate pointer array
candidate_end = summary_ctx[+0x30] // r14: one-past-end
// ---- Outer loop: iterate every candidate ----
while candidate_begin != candidate_end: // 0x18531C4
candidate_ptr = *candidate_begin
guid = candidate_ptr & ~0x7 // mask low 3 tag bits
// ---- GUID dedup via multiplicative-hash table ----
table_size = guid_hash_table[+0x18]
if table_size > 0: // 0x18531D0
table_data = guid_hash_table[+0x00]
raw_guid = candidate_ptr[+0x00] // 8-byte GUID
// Hash: slot = (GUID * 37) & (table_size - 1)
// Implemented as: lea edx,[rsi+rsi*8] -> edx=GUID*9
// lea edx,[rsi+rdx*4] -> edx=GUID+GUID*36=GUID*37
slot = (raw_guid * 37) & (table_size - 1) // 0x18531E8
// 16-byte slots: {GUID (8B), metadata (8B)}
probe_ptr = table_data + slot * 16
stored_guid = probe_ptr[+0x00]
if stored_guid == raw_guid:
goto next_candidate // already evaluated
// Linear probing on collision
probe_step = 1
while stored_guid != 0xFFFFFFFFFFFFFFFF: // -1 = empty sentinel
slot = (slot + probe_step) & (table_size - 1)
probe_step += 1
probe_ptr = table_data + slot * 16
stored_guid = probe_ptr[+0x00]
if stored_guid == raw_guid:
goto next_candidate // found: already seen
// GUID not in table -- fall through to evaluation
// ---- Name array scan ----
// When dedup table is absent, scan name components directly
name_begin = candidate_ptr[+0x18] // 0x1853250
name_end = candidate_ptr[+0x20]
// Up-to-4-level unrolled name comparison (0x1853670-0x18538BA):
// Level 1: entry = [name_ptr - 8]
// Level 2: entry = [name_ptr + 0]
// Level 3: entry = [name_ptr + 8]
// Level 4: entry = [name_ptr + 0x10]
// Each level checks:
// visibility flag at [r14+0xB0] -> if set: test byte [entry+0Ch], 0x20
// entry type: entry[+0x08] must == 2 (function summary)
// not-importable: test byte [entry+0Ch], 0x10 -> skip if set
// linkage: entry[+0x0C] & 0x0F -> 11-case switch
for each name_entry in name_begin..name_end:
entry = *name_entry
if entry[+0x08] != 2: // not a function summary
continue
linkage_byte = entry[+0x0C]
if linkage_byte & 0x10: // "not importable" flag
continue
linkage = linkage_byte & 0x0F // 0x185324E
// ---- Linkage-type dispatch (11 cases via jump table) ----
switch linkage: // dword_42BA140
case 0: // ExternalLinkage
case 1: // AvailableExternallyLinkage
case 3: // InternalLinkage
case 5: // ExternalWeakLinkage
case 6: // CommonLinkage
goto standard_threshold_path // loc_18536E8
case 7: // WeakAnyLinkage
case 8: // WeakODRLinkage
// Weak linkage requires name verification via memcmp
// to confirm the candidate matches the expected symbol
// before allowing import.
expected_name = resolve_name(candidate_ptr)
actual_name = resolve_name(entry)
if memcmp(expected_name, actual_name, name_len) != 0:
continue // 0x1853A71: name mismatch
goto standard_threshold_path
case 2: // AppendingLinkage
case 4: // PrivateLinkage
case 9: // LinkOnceAnyLinkage
case 10: // LinkOnceODRLinkage
goto special_handling_path // loc_1853928
// ---- Standard threshold path ----
standard_threshold_path:
// Dereference alias chain for external linkage
if entry.function_type == 0: // external
entry = entry[+0x40] // follow alias pointer
linkage = entry[+0x0C] & 0x0F // re-extract
// ---- Priority-class threshold adjustment ----
// 0x1853441: convert base_threshold to float
threshold_f = (float)base_threshold // cvtsi2ss xmm2, eax
priority_class = entry[+0x08] & 0x7 // 3-bit field, al=[r15+8]&7
switch priority_class:
case 3: // HOT callsite
threshold_f *= dword_4FAAE80 // hot_multiplier (10.0)
// mulss xmm0, cs:dword_4FAAE80
break
case 1: // COLD callsite
threshold_f *= dword_4FAACC0 // cold_multiplier (0.0)
// mulss xmm0, cs:dword_4FAACC0
break
case 4: // CRITICAL callsite
threshold_f *= dword_4FAADA0 // critical_multiplier (100.0)
// mulss xmm0, cs:dword_4FAADA0
break
default: // no priority match
threshold_f *= dword_4FAB040 // default_multiplier (1.0)
// mulss xmm0, cs:dword_4FAB040
adjusted_threshold = (int)threshold_f // cvttss2si rax, xmm0
// Stored to [rbp-0x78] and r11d for comparison
// ---- Cost comparison (0x1853AA8) ----
function_cost = entry[+0x40] // IR instruction count
if adjusted_threshold < function_cost: // cmp r11d, [rcx+40h]
continue // jb not_eligible
// ---- "Not importable" double-check ----
if entry[+0x0C] & 0x10: // test byte [rcx+0Ch], 0x10
continue
// ---- Max-threshold-wins for duplicates (0x18534C2) ----
if guid already in result_array:
existing_record = result_slot[+0x10]
if existing_record != NULL:
existing_threshold = result_slot[+0x08]
if (float)existing_threshold >= threshold_f:
continue // existing is better; skip
result_slot[+0x08] = adjusted_threshold // update to higher
goto next_candidate
// ---- Global budget check (0x185340A) ----
budget = dword_4FAB120 // global_import_budget
if budget >= 0: // test eax,eax; js proceed
if dword_4FAA770 >= budget: // cmp counter vs budget
continue // jge skip: budget exhausted
// ---- Allocate dedup hash table node (0x1853953) ----
node = malloc(16) // 0x22077B0: edi=0x10
if node != NULL:
node[+0x00] = 0 // clear forward pointer
node[+0x08] = guid
sub_1851560( // hash table insert
guid_hash_table[+0x08], // insert point
bucket_index, // slot
guid, // key
1 // insert_mode
)
// ---- Emit to result array (0x1853517) ----
count = result_array[+0x08] // current count
capacity = result_array[+0x0C]
if count >= capacity:
grow_result_array(result_array) // realloc path
// 24-byte entry: offset = count * 24
entry_ptr = result_array.base + count * 24 // lea rax,[rax+rax*2]; shl rax,3
entry_ptr[+0x00] = guid // 8 bytes: function GUID
entry_ptr[+0x08] = adjusted_threshold // 4 bytes: threshold value
entry_ptr[+0x10] = import_record_ptr // 8 bytes: import record
result_array[+0x08] = count + 1 // increment count
// ---- Increment global counter (0x1853510) ----
dword_4FAA770 += 1 // add cs:dword_4FAA770, 1
next_candidate:
candidate_begin += 8 // advance to next candidate
Threshold computation arithmetic in detail. The four multiplier constants live in .data as IEEE 754 single-precision floats. The SSE scalar path is:
; At 0x1853441 -- convert integer base threshold to float
pxor xmm2, xmm2
cvtsi2ss xmm2, rax ; xmm2 = (float)base_threshold
; Priority dispatch -- one of four paths selected:
; HOT (priority 3):
movss xmm0, cs:dword_4FAAE80 ; xmm0 = 10.0f
mulss xmm0, xmm2 ; xmm0 = 10.0 * base
; COLD (priority 1):
mulss xmm0, cs:dword_4FAACC0 ; xmm0 = 0.0 * base = 0.0
; CRITICAL (priority 4):
mulss xmm0, cs:dword_4FAADA0 ; xmm0 = 100.0 * base
; DEFAULT (all others):
mulss xmm0, cs:dword_4FAB040 ; xmm0 = 1.0 * base
; Convert back to integer for comparison
cvttss2si rax, xmm0 ; rax = (int)threshold_f (truncation)
The cvttss2si truncation means threshold values are floored, not rounded. For base_threshold=100 and hot_multiplier=10.0, the adjusted threshold is exactly 1000. The cold path with multiplier 0.0 always produces threshold 0, meaning cold functions are never imported unless the multiplier is overridden.
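A minimal Python model of this arithmetic, rounding through single precision via `struct` to mimic the SSE scalar path. The multiplier table and function names are ours; the multiplier values and priority codes are the documented defaults:

```python
import struct

def f32(x: float) -> float:
    """Round a Python float to IEEE 754 single precision."""
    return struct.unpack('f', struct.pack('f', x))[0]

# Priority classes from the summary entry's 3-bit field:
# 3 = hot, 1 = cold, 4 = critical; anything else takes the default path.
MULTIPLIERS = {3: 10.0, 1: 0.0, 4: 100.0}

def adjusted_threshold(base: int, priority_class: int) -> int:
    m = MULTIPLIERS.get(priority_class, 1.0)   # default_multiplier = 1.0
    # cvtsi2ss, mulss, then cvttss2si: truncation toward zero, not rounding
    return int(f32(f32(base) * f32(m)))

print(adjusted_threshold(100, 3))   # 1000 (hot)
print(adjusted_threshold(100, 1))   # 0    (cold: never imported by default)
print(adjusted_threshold(100, 4))   # 10000 (critical)
```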
Stage 2: Triple-Pass Import Driver (sub_1854A20)
Address range: 0x1854A20--0x1855B06 (4,326 bytes). Four parameters, 0x278-byte stack frame. Callee-saved: r15, r14, r13, r12, rbx.
The driver processes candidates across three priority-ordered linked lists embedded in the guid_import_map structure. Each list covers a different import priority class. The three passes guarantee that high-priority candidates are imported (and consume budget) before lower-priority ones get a chance.
// sub_1854A20 -- Triple-pass import driver
//
// Materializes cross-module function bodies for candidates that pass
// threshold evaluation. Processes three linked lists in priority order:
// Pass 1: primary list at [import_map + 0x00] (highest priority)
// Pass 2: secondary list at [import_map + 0x10] (medium priority)
// Pass 3: tertiary list at [import_map + 0x30] (lowest priority)
//
// For each candidate: check importable flag, evaluate threshold via
// sub_18518A0, execute import via sub_15E4B20, optionally attach
// thinlto_src_module metadata.
//
// Binary: 0x1854A20, 4326 bytes. Stack: 0x278.
//
// Globals read:
// byte_4FAAA20 enable_import_metadata (bool)
fn import_driver(
import_ctx, // rdi -> [rbp-0x258]: import state object
module_summary_idx, // rsi -> [rbp-0x260]: combined summary index
source_module_info, // rdx -> [rbp-0x278]: source module descriptor
guid_import_map, // rcx -> [rbp-0x268]: hash map of GUID -> import lists
// also saved to rbx
):
// ---- Initialize resolved-summary storage (0x1854A45) ----
sub_1674380(
&local_resolved_storage, // rdi = [rbp-0x290]
source_module_info // rsi = rdx
)
// ---- Check if import map is empty (0x1854A6C) ----
entry_count = guid_import_map[+0x08]
if entry_count == 0:
goto empty_import_path // 0x1854AB3
// ======================================================================
// PASS 1: PRIMARY CANDIDATE LIST (0x1854B99 -- 0x1854F3B)
// List head: [guid_import_map + 0x00]
// Importable flag: byte [node - 0x21] & 0x20
// Summary ptr: [node - 0x38]
// ======================================================================
primary_list = guid_import_map[+0x00] // rsi = [rbx]
// Scan to first valid entry (skip sentinels -8 and NULL)
cursor = primary_list[+0x00]
if cursor == 0xFFFFFFFFFFFFFFF8 || cursor == NULL:
scan forward through primary_list[+0x08], [+0x10], ...
// Inner scan: load qword, test for NULL, cmp against -8
// Stop at first non-null, non-sentinel entry
end_of_candidates = primary_list + entry_count * 8 // r12
while cursor != end_of_candidates: // 0x1854BF0
// ---- Load candidate descriptor ----
desc = *cursor // rax = [r14]
summary_data = desc[+0x00] // rdx = [rax]
cost_info = desc + 0x40 // threshold/cost at +0x40
// ---- Evaluate candidate (0x1854C02) ----
sub_1852CC0(&local_buf, guid_import_map) // import candidate evaluator
// ---- Advance to next valid entry ----
next = cursor[+0x08]
// Scan forward: skip NULL and sentinel -8 entries
while next == NULL || next == 0xFFFFFFFFFFFFFFF8:
next += 8
// ---- Per-node import decision loop (0x1854E39) ----
for each node in candidate.linked_nodes:
if node == NULL:
continue // test r15, r15
// Importable flag check
importable = node[-0x21] & 0x20 // test byte [r15-0x21], 0x20
if !importable:
continue // jz skip
// Extract function summary (stored 0x38 bytes before node)
func_summary = node[-0x38] // r13 = [r15-0x38]
// Resolve function name/info
sub_15E4EB0(cursor, func_summary) // 0x1854E61
// ---- Format import remark (diagnostic output) ----
resolved_threshold = [rbp-0x1D8]
resolved_info = [rbp-0x1E0]
sub_16C1840(guid_import_map, resolved_info, resolved_threshold)
// cost component remark
sub_16C1A90(guid_import_map, resolved_info, resolved_threshold)
// threshold component remark
sub_16C1AA0(guid_import_map, [rbp-0x210]) // finalize remark string
free([rbp-0x1E0]) // cleanup temp string
// ---- Threshold comparison gate (0x1854EE3) ----
cost = cursor[+0x10] // estimated function cost
hot_count = cursor[+0x08] // call frequency / hotness
qualifies = sub_18518A0(hot_count, cost) // THRESHOLD GATE
if !qualifies: // test rax,rax; jz skip
continue
// ---- Execute import (0x1854EF7) ----
sub_15E4B20(import_ctx, func_summary) // MATERIALIZE FUNCTION
// Check abort signal
status = [rbp-0xD0]
if status & 0xFFFFFFFFFFFFFFFE: // caller requested abort
goto early_return
// ---- Attach provenance metadata (0x1854F0D) ----
if byte_4FAAA20 != 0: // enable-import-metadata
source_name = sub_161FF10(func_summary) // resolve source module name
// Create optimization remark
sub_1627350(remark_ctx, 1) // edx=1: enabled
// Attach metadata string (0x1855261):
// lea rsi, "thinlto_src_module" ; 0x42BA2F8, length 0x12
sub_1627100(
func_summary, // target function
"thinlto_src_module", // metadata key (18 chars)
source_name // metadata value
)
// ======================================================================
// PASS 2: SECONDARY CANDIDATE LIST (0x1854F41 -- 0x1855074)
// List head: [guid_import_map + 0x10]
// Same importable-flag check: byte [node - 0x21] & 0x20
// Same summary extraction: [node - 0x38]
// ======================================================================
secondary_list = guid_import_map[+0x10] // r15 = [rcx+10h]
secondary_sentinel = guid_import_map[+0x08]
// Identical processing pattern:
// - Iterate linked-list nodes
// - Check importable flag: byte [r15-0x21] & 0x20
// - Extract summary: [r15-0x38]
// - sub_18518A0 threshold gate
// - sub_15E4B20 import execution
// - Conditional thinlto_src_module metadata attachment
for each node in secondary_list:
if node[-0x21] & 0x20 == 0:
continue
summary = node[-0x38]
if !sub_18518A0(node.hot_count, node.cost):
continue
sub_15E4B20(import_ctx, summary)
if byte_4FAAA20:
attach_provenance_metadata(summary)
// ======================================================================
// PASS 3: TERTIARY CANDIDATE LIST (0x1855074 -- 0x1855190)
// List head: [guid_import_map + 0x30]
// Different offsets:
// Summary extraction: [node - 0x30] (not -0x38)
// Importable flag: byte [node - 0x19] & 0x20 (not -0x21)
// ======================================================================
tertiary_list = guid_import_map[+0x30]
// Same processing pattern but with adjusted offsets:
for each node in tertiary_list:
if node[-0x19] & 0x20 == 0: // note: -0x19, not -0x21
continue
summary = node[-0x30] // note: -0x30, not -0x38
if !sub_18518A0(node.hot_count, node.cost):
continue
sub_15E4B20(import_ctx, summary)
if byte_4FAAA20:
attach_provenance_metadata(summary)
// ======================================================================
// POST-IMPORT: Result materialization (0x1854B3C -- 0x1854B97)
// ======================================================================
result_count = [rbp-0x100]
if result_count > 0:
import_source = sub_16704E0() // r13: source module handle
import_dest = sub_16704F0() // r14: destination module handle
result_base = [rbp-0x110]
result_end = result_base + result_count * 8
for each result_entry in result_base..result_end: // 0x1854B7D
func = *result_entry
// Skip if function already exists in source module
if sub_1670560(func, import_source): // test al,al; jnz next
continue
// Materialize into destination module
sub_1670560(func, import_dest)
// ======================================================================
// CLEANUP (0x1854AE7 -- 0x1854B22)
// ======================================================================
// Release import list entries (16-byte stride)
cleanup_base = [rbp-0xF0]
cleanup_count = eax
cleanup_end = cleanup_base + cleanup_count * 16
for each entry in cleanup_base..cleanup_end (stride=16):
value = entry[+0x00]
if value == 0xFFFFFFFFFFFFFFF8: // sentinel -8: empty
continue
if value == 0xFFFFFFFFFFFFFFFC: // sentinel -4: deleted
continue
sub_161E7C0(entry[+0x08]) // release associated data
free(cleanup_base) // j___libc_free_0
// ---- Empty-import finalization ----
empty_import_path: // 0x1854AB3
import_ctx.status = 0 // clear status byte
flags = import_ctx[+0x08]
flags = (flags & 0xFC) | 0x02 // set "import complete, no imports"
import_ctx[+0x08] = flags
sub_1851C60(&local_import_list) // finalize empty path cleanup
Why three passes with different offsets. The three linked lists represent three structural layers in the guid_import_map:
| Pass | List head offset | Summary offset | Importable-flag offset | Interpretation |
|---|---|---|---|---|
| 1 (primary) | [map+0x00] | node[-0x38] | node[-0x21] & 0x20 | Direct call targets from the current module -- highest priority because they are on the critical path |
| 2 (secondary) | [map+0x10] | node[-0x38] | node[-0x21] & 0x20 | Transitively-reachable functions (callees of callees) -- import enables deeper inlining chains |
| 3 (tertiary) | [map+0x30] | node[-0x30] | node[-0x19] & 0x20 | Speculative candidates (address-taken functions, indirect call targets inferred from devirtualization) -- lowest confidence |
The different offsets in pass 3 (-0x30 instead of -0x38, -0x19 instead of -0x21) indicate a different node layout for speculative candidates. These nodes carry less metadata (8 fewer bytes between the summary pointer and the node base, and the importable flag is 8 bytes closer to the node).
Threshold Comparison Gate (sub_18518A0)
The gate function takes two arguments -- hot_count (rdi) and cost (rsi) -- and returns nonzero if the candidate qualifies for import. The driver calls it at three points (once per pass). This function encapsulates the final accept/reject decision after the per-priority-class threshold adjustment has already been applied by sub_1853180.
// sub_18518A0 -- Threshold comparison gate
// Returns: nonzero if candidate should be imported, zero otherwise
//
// rdi = hot_count (call frequency from profile or summary)
// rsi = cost (adjusted threshold value from Stage 1)
fn threshold_gate(hot_count, cost) -> bool:
// The exact comparison logic depends on whether profile data
// is available. With profile data, hot_count is a raw call
// count; the gate compares the cost against a profile-weighted
// threshold. Without profile data, this degenerates to a
// direct comparison: cost <= threshold.
return hot_count > 0 || cost <= current_threshold
Threshold Multiplier Constants
The four floating-point multiplier constants are stored in the .data section and are set by the corresponding cl::opt registrations in ctor_184_0:
| Address | Knob | Default | Purpose |
|---|---|---|---|
| dword_4FAAE80 | import-hot-multiplier | 10.0 | Multiplier for hot callsites |
| dword_4FAACC0 | import-cold-multiplier | 0.0 | Multiplier for cold callsites |
| dword_4FAADA0 | import-critical-multiplier | 100.0 | Multiplier for critical callsites |
| dword_4FAB040 | (default path) | 1.0 | Multiplier when no priority class matches |
With the upstream default import-instr-limit of 100, a hot callsite gets threshold 1,000 instructions and a critical callsite gets threshold 10,000. The cold multiplier of 0.0 means cold functions are never imported by default -- the threshold evaluates to zero.
Effective threshold table (for import-instr-limit=100):
| Priority class | Multiplier | Effective threshold | Typical candidates |
|---|---|---|---|
| Critical (4) | 100.0x | 10,000 instructions | Manually annotated hot paths, PGO-identified critical edges |
| Hot (3) | 10.0x | 1,000 instructions | Profile-guided hot callsites, frequently-called templates |
| Default (0,2) | 1.0x | 100 instructions | Standard callsites without profile data |
| Cold (1) | 0.0x | 0 instructions | Provably cold paths -- never imported at default settings |
The evolution factors control how thresholds decay as imports cascade through the call graph:
| Knob | Default | Effect |
|---|---|---|
| import-instr-evolution-factor | 0.7 | Each transitive import level reduces the threshold to 70% of the previous |
| import-hot-evolution-factor | 1.0 | Hot callsite chains do not decay (threshold stays constant through transitive imports) |
The evolution factor is applied by the caller of sub_1853180 before passing base_threshold. For a chain A -> B -> C -> D where A is the root module:
- Import B into A: threshold = import-instr-limit (100)
- Import C into A (transitively via B): threshold = 100 * 0.7 = 70
- Import D into A (transitively via C via B): threshold = 100 * 0.7 * 0.7 = 49
For hot chains with import-hot-evolution-factor=1.0, the threshold remains 1,000 at every transitive level, enabling arbitrarily deep import chains for hot call paths.
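The decay schedule above can be reproduced in a few lines. Exact rational arithmetic (`Fraction`) is used here only so the documented values 100/70/49 fall out exactly; the binary does this math in floating point:

```python
from fractions import Fraction

def transitive_threshold(base: int, depth: int,
                         evolution: Fraction = Fraction(7, 10)) -> int:
    """Threshold for a candidate `depth` transitive levels below the root."""
    t = Fraction(base)
    for _ in range(depth):
        t *= evolution        # import-instr-evolution-factor per level
    return int(t)             # truncating conversion

print(transitive_threshold(100, 0))   # 100: import B into A
print(transitive_threshold(100, 1))   # 70:  import C via B
print(transitive_threshold(100, 2))   # 49:  import D via C via B
# Hot chains with import-hot-evolution-factor = 1.0 never decay:
print(transitive_threshold(1000, 5, Fraction(1)))   # 1000
```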
Global Import Budget
Two globals control the total import count:
| Address | Role | Default |
|---|---|---|
| dword_4FAB120 | Maximum allowed imports | -1 (unlimited) |
| dword_4FAA770 | Running import counter | 0 (reset per module) |
The budget check at 0x185340A:
mov eax, cs:dword_4FAB120 ; load budget
test eax, eax
js proceed ; negative = unlimited
cmp cs:dword_4FAA770, eax ; counter vs budget
jge skip ; at or over budget -> skip
When the budget is -1 (the import-cutoff default), the js (jump-if-sign) branch is taken unconditionally, bypassing the budget check. Setting -import-cutoff=N limits the total number of imported functions to N, useful for debugging import-related miscompilations via bisection.
The counter increment at 0x1853510:
add cs:dword_4FAA770, 1 ; increment after successful import
This is a non-atomic add -- safe because ThinLTO import runs single-threaded per module in CICC (unlike CPU LLVM where the thin link runs in parallel). The counter resets to 0 at the start of each module's import phase.
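The budget gate and counter can be sketched as a small state object. The class and field names are ours; the two fields model dword_4FAB120 and dword_4FAA770:

```python
class ImportBudget:
    """Model of the gate at 0x185340A and the counter bump at 0x1853510."""
    def __init__(self, cutoff: int = -1):   # -import-cutoff default: -1
        self.budget = cutoff                # dword_4FAB120
        self.count = 0                      # dword_4FAA770, reset per module

    def try_import(self) -> bool:
        # test eax,eax / js proceed: a negative budget means unlimited
        if self.budget >= 0 and self.count >= self.budget:
            return False                    # jge skip: budget exhausted
        self.count += 1                     # add cs:dword_4FAA770, 1
        return True

b = ImportBudget(cutoff=2)
print([b.try_import() for _ in range(4)])   # [True, True, False, False]
```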
Integration with the 20,000-Budget Inliner
The import + inline pipeline in CICC works as a two-phase system:
1. Import phase (this page): ThinLTO brings cross-module function bodies into the current module based on summary-guided threshold decisions. The imported functions are marked with thinlto_src_module metadata.
2. Inline phase (inliner cost model): The NVIDIA custom inliner at sub_1864060 runs with a 20,000-unit per-caller budget. Imported functions are prime inlining candidates precisely because they were imported for being called from this module.
The inliner-function-import-stats knob (registered in ctor_186_0 at 0x4DBEC0, values: basic or verbose) tracks how many imported functions were actually inlined. This provides feedback on whether the import thresholds are well-calibrated: if functions are imported but then not inlined (because they exceed the inline budget), the import was wasted compile time.
The typical flow for a template-heavy CUDA library like CUB or cutlass:
1. Each .cu file compiles to a ThinLTO bitcode module with a summary index
2. The thin link step reads all summaries and builds a combined index
3. For each module, sub_1853180 evaluates import candidates using the combined index
4. Hot template instantiations (e.g., cub::DeviceReduce::Sum<float>) get threshold base * 10.0 (hot) or base * 100.0 (critical)
5. The imported function bodies arrive in the module and are immediately available to the 20,000-budget inliner
6. The inliner folds the imported template bodies into their callers, eliminating .param marshaling
Entry Point: sub_1855B10
Address: 0x1855B10, 10,503 bytes. This is the runOnModule entry for the "function-import" pass (pipeline slot 43). It orchestrates the entire import flow:
fn function_import_pass_entry(module):
// Parse required options
if summary_file_path is empty:
error("error: -function-import requires -summary-file")
return
summary_index = load_summary_file(summary_file_path)
if summary_index is error:
error("Error loading file")
return
// Build GUID-to-import map from summary index
guid_import_map = build_import_map(module, summary_index)
// Stage 1: threshold computation
sub_1853180(summary_ctx, module_info, import_instr_limit,
guid_hash_table, result_array, visited_set)
// Stage 2: triple-pass import
sub_1854A20(import_ctx, summary_index, source_module, guid_import_map)
// Post-import: attribute propagation (if enabled)
if propagate_attrs:
propagate_summary_attributes(module, summary_index)
Knob Inventory
All knobs are registered across three constructors:
ctor_184_0 at 0x4DA920 (13,693 B -- ThinLTO Function Import options):
| Knob | Type | Default | Effect |
|---|---|---|---|
| import-instr-limit | unsigned | 100 | Base instruction count threshold |
| import-cutoff | int | -1 | Max total imports (-1 = unlimited) |
| import-instr-evolution-factor | float | 0.7 | Threshold decay per transitive level |
| import-hot-evolution-factor | float | 1.0 | Hot chain decay (1.0 = no decay) |
| import-hot-multiplier | float | 10.0 | Threshold multiplier for hot callsites |
| import-critical-multiplier | float | 100.0 | Threshold multiplier for critical callsites |
| import-cold-multiplier | float | 0.0 | Threshold multiplier for cold callsites |
| print-imports | bool | false | Print names of imported functions |
| print-import-failures | bool | false | Print rejected candidates with reasons |
| compute-dead | bool | true | Strip dead symbols from index |
| enable-import-metadata | bool | false | Attach thinlto_src_module / thinlto_src_file metadata |
| summary-file | string | (none) | Summary file path for -function-import |
| import-all-index | bool | false | Import every external function in the index |
ctor_420_0 at 0x532010 (11,787 B -- pass-level ThinLTO options):
| Knob | Type | Default | Effect |
|---|---|---|---|
| force-import-all | bool | false | Import even noinline functions |
| import-declaration | bool | false | Import function declarations as fallback |
| thinlto-workload-def | string | (none) | JSON file mapping root functions to import lists |
ctor_029 at 0x489C80 (1,120 B -- supplementary ThinLTO options):
| Knob | Type | Default | Effect |
|---|---|---|---|
| propagate-attrs | bool | true | Propagate attributes through the summary index |
| import-constants-with-refs | bool | true | Import constant globals that have references |
ctor_419 at 0x531850 (6,358 B -- FunctionAttrs inference):
| Knob | Type | Default | Effect |
|---|---|---|---|
| disable-thinlto-funcattrs | bool | false | Disable function attribute inference from ThinLTO summaries |
Data Structures
Import Candidate Linked List
Each of the three priority lists in the guid_import_map is a singly-linked list with 8-byte node entries:
| Offset | Content |
|---|---|
| [node+0x00] | Entry value (pointer to candidate descriptor, or GUID) |
| [node+0x08] | Next slot / next node pointer |
Sentinels: 0xFFFFFFFFFFFFFFF8 (-8) = empty slot, 0xFFFFFFFFFFFFFFFC (-4) = deleted slot. These sentinel values are standard open-addressing hash map markers repurposed for the linked-list traversal.
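The sentinel-skipping traversal the driver's forward scans perform can be sketched as follows (helper and constant names are ours):

```python
EMPTY   = 0xFFFFFFFFFFFFFFF8   # -8 as u64: empty slot
DELETED = 0xFFFFFFFFFFFFFFFC   # -4 as u64: deleted slot

def live_entries(slots):
    """Yield non-sentinel entries, as the driver's forward scans do."""
    for value in slots:
        if value in (EMPTY, DELETED):
            continue
        yield value

print(list(live_entries([EMPTY, 0x1000, DELETED, 0x2000])))   # [4096, 8192]
```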
GUID Import Map Layout
The guid_import_map structure (parameter rcx of sub_1854A20) contains the three priority lists:
| Offset | Size | Content |
|---|---|---|
| +0x00 | 8 | Primary list head (direct call targets) |
| +0x08 | 8 | Entry count / secondary sentinel |
| +0x10 | 8 | Secondary list head (transitive callees) |
| +0x18 | 8 | (reserved / alignment) |
| +0x20 | 8 | (reserved / alignment) |
| +0x28 | 8 | (reserved / alignment) |
| +0x30 | 8 | Tertiary list head (speculative candidates) |
GUID Dedup Hash Table
| Field | Size | Description |
|---|---|---|
| Slot size | 16 bytes | {GUID (8B), metadata (8B)} |
| Hash function | multiplicative | slot = (GUID * 37) & (table_size - 1) |
| Collision resolution | linear probing | Increment slot by 1, wrap at table_size |
| Empty sentinel | -1 | 0xFFFFFFFFFFFFFFFF |
| Size field | offset +0x18 | Number of slots in table (always power of 2) |
The multiplication constant 37 produces reasonable distribution for GUIDs that are typically MD5 hashes of mangled names. The linear probing is adequate because the table is sized to maintain a low load factor.
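The insert path can be sketched in C under the parameters above (power-of-two table, -1 empty sentinel, multiplier 37, linear probing); the type and function names are illustrative, not recovered symbols:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TABLE_SIZE 64                    /* must be a power of two */
#define SLOT_EMPTY 0xFFFFFFFFFFFFFFFFULL /* -1 empty sentinel */

typedef struct { uint64_t guid; uint64_t meta; } Slot;  /* 16-byte slot */

/* slot = (GUID * 37) & (table_size - 1); linear probe on collision,
   wrapping at table_size. Returns the slot index; duplicate GUIDs
   return the existing slot (dedup). */
static uint64_t insert_guid(Slot *table, uint64_t guid, uint64_t meta) {
    uint64_t slot = (guid * 37) & (TABLE_SIZE - 1);
    while (table[slot].guid != SLOT_EMPTY) {
        if (table[slot].guid == guid)
            return slot;                      /* dedup hit */
        slot = (slot + 1) & (TABLE_SIZE - 1); /* linear probe */
    }
    table[slot].guid = guid;
    table[slot].meta = meta;
    return slot;
}
```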
Result Array
Growable array with 24-byte entries:
| Offset | Size | Content |
|---|---|---|
| +0x00 | 8 | Function GUID |
| +0x08 | 4 | Adjusted threshold value |
| +0x10 | 8 | Import record pointer |
Header: [+0x08] = current count, [+0x0C] = capacity. Growth is handled by a realloc path when count >= capacity.
Per-Function Summary Entry (import-relevant fields)
| Offset | Size | Content |
|---|---|---|
| +0x08 | 4 | Entry type (2 = function summary) |
| +0x0C | 1 | Linkage byte: low 4 bits = linkage type, bit 4 = not-importable flag, bit 5 = importable flag |
| +0x40 | 4 | Function cost (IR instruction count, used for threshold comparison) |
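Decoding the linkage byte at +0x0C reduces to three bit extractions; this is a small sketch with illustrative helper names:

```c
#include <assert.h>
#include <stdint.h>

/* Summary-entry linkage byte at +0x0C:
   low 4 bits = linkage type, bit 4 = not-importable, bit 5 = importable. */
static uint8_t linkage_type(uint8_t b)   { return b & 0x0F; }
static int     not_importable(uint8_t b) { return (b >> 4) & 1; }
static int     importable(uint8_t b)     { return (b >> 5) & 1; }
```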
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| ThinLTO import driver (triple-pass candidate processing) | sub_1854A20 | 4,326 B | -- |
| Threshold computation with GUID dedup and priority-class multipliers | sub_1853180 | 5,059 B | -- |
| Threshold comparison gate (returns nonzero if candidate qualifies) | sub_18518A0 | -- | -- |
| Import candidate evaluator (prepares candidate for threshold check) | sub_1852CC0 | -- | -- |
| Import list builder (called by sub_1853180) | sub_1852FB0 | -- | -- |
| Import list node allocator (called by sub_1853180) | sub_1852A30 | -- | -- |
| Import list initialization (called by sub_1853180) | sub_1851200 | -- | -- |
| Execute import decision (materialize function into destination) | sub_15E4B20 | -- | -- |
| Resolve function name/info from summary | sub_15E4EB0 | -- | -- |
| Entry point (parses -function-import / -summary-file) | sub_1855B10 | 10,503 B | -- |
| Whole-module ThinLTO processing | sub_1858B90 | 31,344 B | -- |
| Type metadata propagation during import | sub_185E850 | 24,263 B | -- |
| Attach named metadata (used for thinlto_src_module) | sub_1627100 | -- | -- |
| Create optimization remark (import diagnostic) | sub_1627350 | -- | -- |
| Resolve source module name string | sub_161FF10 | -- | -- |
| Check if function exists in a given module | sub_1670560 | -- | -- |
| Get "import source" module handle | sub_16704E0 | -- | -- |
| Get "import destination" module handle | sub_16704F0 | -- | -- |
| Format import remark (cost component) | sub_16C1840 | -- | -- |
| Format import remark (threshold component) | sub_16C1A90 | -- | -- |
| Finalize import remark string | sub_16C1AA0 | -- | -- |
| Hash table insert (GUID dedup table) | sub_1851560 | -- | -- |
| Initialize resolved function summary storage | sub_1674380 | -- | -- |
| Finalize empty-import path cleanup | sub_1851C60 | -- | -- |
| Release import list entry data | sub_161E7C0 | -- | -- |
| malloc wrapper (used for 16-byte dedup node allocation) | sub_22077B0 | -- | -- |
Cross-References
- Inliner Cost Model -- the downstream consumer of imported functions. Import brings bodies into the module; the 20,000-budget inliner decides whether to fold them into callers.
- Module Summary -- sub_D7D4E0 builds the NVModuleSummary that drives import decisions. The 4-level priority system, complexity budget, and CUDA-specific filtering all originate here.
- Pipeline & Ordering -- function-import is registered as pipeline slot 43, a Module-level pass.
- IP Memory Space Propagation -- after import, cross-module functions may carry address-space annotations that IPMSP must reconcile.
- Hash Infrastructure -- the GUID dedup table uses the same DenseMap pattern documented there.
GlobalOpt for GPU
CICC implements a custom GlobalOpt pass (sub_18612A0, 65 KB, 2179 decompiled lines) that replaces LLVM's stock GlobalOptPass with GPU-aware global variable transformations. The pass operates on NVIDIA's internal IR representation rather than LLVM IR directly, and adds address-space-aware logic that stock LLVM lacks entirely: it extracts the CUDA address space from the global's flags byte ((flags >> 2) & 7), preserves that address space through all generated replacement globals, and applies promotion thresholds calibrated for the GPU memory hierarchy. The pass runs at pipeline position 30 in the tier-2 and tier-3 optimization sequences (via wrapper sub_196A2B0), immediately after GlobalDCE / ConstantProp (sub_1968390) and before LoopVectorize; it runs at -O2 and above, and tier-1 does not include it. The inliner cost model also calls into sub_18612A0 as a subroutine when evaluating whether a callee's globals can be folded after inlining, creating a tight coupling between inlining decisions and global optimization.
The pass implements four transformation strategies with decreasing priority: small-constant promotion for globals under 2047 bits, scalar replacement of aggregates (SRA) for struct globals with up to 16 fields, malloc/free elimination for heap-allocated globals with single-unit access, and a hash-table-driven deduplication cleanup pass. Each strategy preserves the original global's NVPTX address space, which is critical -- a __device__ global in address space 1 must remain in AS 1 after splitting, not silently migrate to AS 0 (generic). The generated IR uses distinctive suffixes (.body, .init, .val, .notinit, .f0...f15, .isneg, .isnull) that survive through to PTX emission and are visible in cuobjdump output.
| Core transform | sub_18612A0 (0x18612A0, 65 KB, 2179 lines) |
| Pipeline wrapper | sub_196A2B0 (0x196A2B0) |
| Recursive re-application | sub_185B1D0 (0x185B1D0) |
| Pre-SRA setup | sub_185B7E0 (0x185B7E0) |
| Hash table rehash | sub_1860410 (0x1860410) |
| Per-user SRA rewrite | sub_1860BE0 (0x1860BE0) |
| Pipeline position | Step 30 (tier 2/3), after GlobalDCE, before LoopVectorize |
| Minimum opt level | -O2 (tier 2) |
| Pass registration | "globalopt" in pipeline parser at slot 45 |
| IR node allocation | 88 bytes per global, 64 bytes per basic block, 56 bytes per instruction |
Address Space Handling
Every transformation in this pass must respect CUDA address spaces. The global's address space is extracted at line 577 of the decompilation:
uint8_t addr_space = (*(uint8_t*)(global + 33) >> 2) & 7;
The NVPTX address spaces relevant here are 0 (generic), 1 (global/__device__), 3 (shared/__shared__), 4 (constant/__constant__), and 5 (local). See Address Spaces for the complete table with hardware mapping, pointer widths, and latency numbers.
When sub_18612A0 creates replacement globals via sub_15E51E0, it passes the extracted address space to the constructor. The created global inherits the same address space, linkage (always internal, linkage code 7), and metadata (copied via sub_15E6480). This is the key delta from stock LLVM: upstream GlobalOpt does not consider address space when splitting globals because host-side address spaces are trivial. On GPU, promoting a __shared__ struct global to per-field __shared__ globals preserves the 10x latency advantage over DRAM, while accidentally demoting to generic would force the hardware to resolve address space at runtime via the generic-to-specific address resolution unit.
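The extraction expression can be isolated as a one-line helper; the address-space values asserted below follow the NVPTX numbering listed above (the helper name is illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* CUDA address space lives in bits 2-4 of the flags byte at global+33:
   (flags >> 2) & 7. */
static uint8_t extract_addr_space(uint8_t flags) {
    return (flags >> 2) & 7;
}
```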
Entry Guard: Type Filtering
Before attempting any transformation, the pass filters on the global's type tag (byte at type + 8). The acceptance bitmask is 0x8A7E:
// Bits set: 1,2,3,4,5,6,9,11,15
uint16_t bitmask = 0x8A7E;
if ((1 << type_tag) & bitmask) {
// accepted: i16, i32, i64, x86_fp80, i128, fp128, double, custom-width iN, opaque ptr
}
Additionally, struct (tag 13), vector (tag 14), and array (tag 16) types are accepted if sub_16435F0(type, 0) returns true -- this is the isAnalyzableType predicate that recursively checks whether the type's leaf elements are all scalars or pointers.
After type filtering, the pass walks the global's use-list. Every user must be either a store (opcode tag 54) or a load (opcode tag 55). If any user is an arithmetic instruction (tag <= 23), a GEP used in a non-trivial way, or any other instruction kind, the global is rejected -- it cannot be optimized because its address escapes or is used in a way the pass cannot model.
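The mask test is easy to sanity-check in isolation; this sketch uses the tag values from the size-computation table in the next section (the helper name is illustrative). Note that struct (tag 13) is not in the direct mask -- it is only accepted via the sub_16435F0 predicate:

```c
#include <assert.h>
#include <stdint.h>

/* Entry-guard type filter: a type tag is directly accepted iff its
   bit is set in the 0x8A7E acceptance mask. */
static int tag_directly_accepted(uint8_t type_tag) {
    const uint16_t bitmask = 0x8A7E;
    return ((1u << type_tag) & bitmask) != 0;
}
```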
Path A: Small-Constant Promotion
When the global's initializer is a struct constant and its total bit-size (including alignment padding) fits within 2047 bits (0x7FF), the pass promotes it into a function-local value with a separate initializer function. This threshold is NVIDIA-specific -- upstream LLVM uses different heuristics based on TargetData layout considerations. The 2047-bit ceiling corresponds roughly to 64 32-bit registers, aligning with the per-thread register budget on most SM architectures where promoting beyond that limit would spill to local memory and negate the benefit.
Size Computation
The pass walks the type tree recursively to compute total bit-size. The implementation at lines 499-570 of the decompilation uses a switch on the type tag byte at type + 8:
| Type tag | Type | Bits |
|---|---|---|
| 0x1 | i16 / half | 16 |
| 0x2 | i32 / float | 32 |
| 0x3 | i64 | 64 |
| 0x4 | x86_fp80 | 80 |
| 0x5 | i128 | 128 |
| 0x6 | fp128 / ppc_fp128 | 128 |
| 0x7 | pointer | sub_15A9520(target, 0) * 8 |
| 0x9 | double | 64 |
| 0xB | iN (custom width) | from type word >> 8 |
| 0xD | struct | 8 * field_count (via sub_15A9930) |
| 0xE | vector | 8 * alignment * num_elements * padded_size |
| 0xF | opaque ptr | sub_15A9520(target, addr_space) * 8 |
| 0x0, 0x8, 0xA, 0xC, 0x10 | array variants | element_size * array_length (recursive) |
Note that opaque pointers (tag 0xF) use getPointerSizeInBits(target, addr_space) -- the pointer size varies by address space on NVPTX (64-bit for AS 0/1, potentially 32-bit for AS 3/5 on some targets). Tags 0x0, 0x8 (label/token), 0xA (metadata), and 0xC (bfloat) all fall into the array-multiplier path -- they extract an element count and recurse, which handles the case where these type wrappers contain inner array types.
The pseudocode for the size computation:
// sub_18612A0, lines 499-570
uint64_t compute_total_bits(Type *type, TargetInfo *target, uint8_t addr_space) {
uint8_t tag = *(uint8_t *)(type + 8);
switch (tag) {
case 0x1: return 16; // i16 / half
case 0x2: return 32; // i32 / float
case 0x3: return 64; // i64
case 0x4: return 80; // x86_fp80
case 0x5: return 128; // i128
case 0x6: return 128; // fp128 / ppc_fp128
case 0x7: return sub_15A9520(target, 0) * 8; // generic pointer
case 0x9: return 64; // double
case 0xB: return *(uint32_t *)(type + 8) >> 8; // iN custom-width
case 0xD: { // struct
uint64_t layout = sub_15A9930(target, type); // getStructLayout
return 8 * *(uint32_t *)(layout + 12); // 8 * element_count
}
case 0xE: { // vector
uint64_t align = sub_15A9FE0(target, type); // getAlignment
uint64_t n_elts = *(uint32_t *)(type + 12);
uint64_t elem_bits = compute_total_bits(
sub_16463B0(type, 0), target, addr_space); // getArrayElementType
return 8 * align * n_elts * ((elem_bits + align - 1) / align);
}
case 0xF: return sub_15A9520(target, addr_space) * 8; // opaque ptr (AS-aware)
default: { // 0x0,0x8,0xA,0xC,0x10: array
uint64_t n_elts = *(uint32_t *)(type + 12);
Type *elem = sub_16463B0(type, 0); // getArrayElementType
return n_elts * compute_total_bits(elem, target, addr_space);
}
}
}
The acceptance check at line 570:
if (total_elements * alignment * ceil_div(total_bits, alignment) > 0x7FF)
goto path_b; // too large, try SRA instead
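A simplified, self-contained model of the size walk and the 0x7FF ceiling can make the threshold concrete. The Type encoding below is a stand-in for illustration, covering only the scalar and array cases -- it omits the binary's struct/vector layout queries (sub_15A9930 and friends):

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model: scalar tags map to fixed widths; the array path
   multiplies element size by element count, recursively. */
typedef struct Ty { uint8_t tag; uint32_t n; struct Ty *elem; } Ty;

static uint64_t total_bits(const Ty *t) {
    switch (t->tag) {
    case 0x1: return 16;   /* i16 / half */
    case 0x2: return 32;   /* i32 / float */
    case 0x3: return 64;   /* i64 */
    case 0x9: return 64;   /* double */
    default:  return t->n * total_bits(t->elem); /* array path */
    }
}

/* Path A ceiling: anything above 0x7FF (2047) bits falls through to SRA. */
static int fits_small_constant(uint64_t bits) { return bits <= 0x7FF; }
```

For example, a [16 x i32] global (512 bits) qualifies for promotion, while [64 x i32] (2048 bits) is one bit over the ceiling and falls through to Path B.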
Generated IR Pattern
For a qualifying global, the pass generates three components:
; Original: @my_global = addrspace(1) global { i32, i32 } { i32 42, i32 7 }
; After promotion:
@my_global.body = internal addrspace(1) global { i32, i32 } { i32 42, i32 7 }
define internal void @my_global.init() {
store { i32, i32 } { i32 42, i32 7 }, ptr addrspace(1) @my_global.body
ret void
}
; All loads of @my_global replaced with: load ptr addrspace(1) @my_global.body
; ExtractValue users get ".val" accessors
; Uninitialized code paths get "notinit" sentinel via sub_15FB630
The .body global is created via sub_15E51E0 with the same address space and internal linkage (code 7). The .init function is created via sub_15E5070. The pass then walks all users of the original global: loads (tag 55) get redirected to the .body global, GEPs (tag 71) get RAUW'd via sub_164D160, and extractvalue instructions (tag 75) get specialized .val accessors. Sub-opcodes on the extractvalue determine further handling: codes 0x20/0x25/0x29 produce notinit sentinels, 0x24/0x28 extract terminal types via sub_159C540, and 0x21-0x23/0x26-0x27 pass through unchanged.
The full promotion pseudocode covering body creation, init creation, and use rewriting:
// sub_18612A0, lines 577-805 — Path A: small-constant promotion
void promote_small_constant(Global *global, Module *module, Value *init_val,
Type *type, TargetInfo *target) {
// --- Extract address space from global flags ---
uint8_t addr_space = (*(uint8_t *)(global + 33) >> 2) & 7;
// --- Create ".body" global in same address space ---
void *node = sub_1648A60(88, 1); // IRBuilder::create
Global *body_gv = sub_15E51E0(
get_scope(module), type, /*init=*/0, /*linkage=*/7,
concat_name(global, ".body"), addr_space); // createGlobalVar
sub_15E6480(global, body_gv); // copyMetadata
// --- Rewrite all users of original global ---
Use *use = *(Use **)(global + 8); // use-list head
while (use != NULL) {
Instruction *inst = sub_1648700(use); // getInstruction
uint8_t opcode = *(uint8_t *)(inst + 16);
if (opcode == 71) { // GEP
// If GEP references old global, RAUW to body
sub_164D160(inst, body_gv); // RAUW
sub_15F20C0(inst); // eraseFromParent
} else {
// Create local variable referencing body
Value *local = sub_15FD590(inst, get_scope(module),
"newgv", module); // createLocalVar
sub_1648780(use, local); // replaceUseWith
}
use = *(Use **)(use + 8); // next use
}
// --- Create ".init" function ---
Function *init_fn = sub_15E5070(
get_scope(module), type, /*linkage=*/7,
init_val, concat_name(global, ".init")); // createFunction
int init_user_count = 0;
// Walk users again for extractvalue and load rewriting
use = *(Use **)(body_gv + 8);
while (use != NULL) {
Instruction *inst = sub_1648700(use);
uint8_t opcode = *(uint8_t *)(inst + 16);
if (opcode == 55) { // load
sub_15F9480(init_val, init_fn); // createStoreInit
init_user_count++;
} else if (opcode == 75) { // extractvalue
Value *val_acc = sub_15F8F80(inst, type, init_fn,
concat_name(global, ".val")); // createExtractValue
uint8_t sub_opcode = *(uint8_t *)(inst + 24);
switch (sub_opcode) {
case 0x20: case 0x25: case 0x29:
// Uninitialized path: create "notinit" sentinel
sub_15FB630(val_acc, "notinit", inst); // createNotInit
break;
case 0x24: case 0x28:
// Terminal type extraction
sub_159C540(val_acc); // getTerminalType
break;
default: // 0x21-0x23, 0x26-0x27
break; // pass-through
}
sub_164D160(inst, val_acc); // RAUW
sub_15F20C0(inst); // eraseFromParent
init_user_count++;
}
use = *(Use **)(use + 8);
}
// --- Finalize ---
if (init_user_count > 0) {
sub_1631BE0(module_fn_list, init_fn); // insertIntoFnList
// Patch metadata chain at global+56
*(void **)(global + 56) = init_fn;
} else {
// Dead init function: destroy
sub_15E5530(init_fn); // destroyFunctionBody
sub_159D9E0(init_fn); // destroyFunction
sub_164BE60(init_fn); // dropAllReferences
sub_1648B90(init_fn); // markDead (flags |= 1)
}
sub_15E55B0(global); // erase original global
sub_15F20C0(module_entry); // erase module-level ref
// --- Recursive re-application to newly created .body ---
sub_185B1D0(body_gv, target); // recursiveGlobalOpt
}
After rewriting all uses, if the .init function has users, it is linked into the module's function list via sub_1631BE0. If it has zero users (the initializer was never needed), the function body is destroyed and marked dead. The original global is erased via sub_15E55B0. Finally, sub_185B1D0 recursively re-applies GlobalOpt to the newly created .body global, enabling cascaded optimizations.
Path B: Scalar Replacement of Aggregates (SRA)
When a global is too large for constant promotion, the pass attempts SRA -- exploding a struct global into per-field scalar globals. This path has stricter preconditions:
- The caller's flag parameter (a4) must be zero -- when set, SRA is disabled.
- The initializer must be the unique initializer for this global (verified via sub_15A0680).
- The type must be a struct (tag 13) with 1 to 16 fields: field_count - 1 <= 0xF.
- Every user must reference only this global -- no cross-global pointer arithmetic.
The 16-field limit is a hardcoded constant at line 822 of the decompilation. It prevents combinatorial explosion in the null-check and free chains that follow: each field generates one icmp eq (null check), one or, one conditional branch, one free_it block, and one next block. Beyond 16 fields the cost of the generated guard code would exceed the benefit of splitting.
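The 1-to-16 gate is a single unsigned comparison; because the subtraction wraps, it also rejects a zero-field struct. A sketch with an illustrative helper name:

```c
#include <assert.h>
#include <stdint.h>

/* SRA field-count gate: field_count - 1 <= 0xF, evaluated unsigned.
   Accepts 1..16; field_count == 0 wraps to UINT32_MAX and is rejected. */
static int sra_field_count_ok(uint32_t field_count) {
    return (field_count - 1u) <= 0xF;
}
```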
Use Analysis: Store Value Collection
Before field explosion, the pass collects all stored values into a hash set to determine which initializers are live. For each store (tag 54) user of the global, sub_185CAF0 inserts the stored value into a hash/set structure at v432. The scratch buffer starts with capacity 32 and grows via sub_16CC920 when full. This collection serves two purposes: it validates that all stores write analyzable values (no opaque function pointers or computed addresses), and it builds the value set used later to initialize the per-field globals.
// sub_18612A0, lines 823-868 — Store value collection for SRA
void collect_store_values(Global *global, Module *module,
HashSet *store_set, Buffer *scratch) {
Use *use = *(Use **)(global + 8);
int store_count = 0;
while (use != NULL) {
Instruction *inst = sub_1648700(use); // getInstruction
uint8_t opcode = *(uint8_t *)(inst + 16);
if (opcode == 54) { // store
sub_185CAF0(use, store_set, scratch); // collectStoredValue
store_count++;
// Grow scratch if full
if (scratch->size >= scratch->capacity) {
if (scratch->capacity < 64)
memset(scratch->data, 0xFF, scratch->capacity * 8);
else
sub_16CC920(scratch); // growScratchBuffer
}
}
use = *(Use **)(use + 8);
}
}
Global-Only-Use Validation
After collection, lines 878-1017 validate that every user of every collected global references only the target global -- no cross-global pointer arithmetic is allowed. The validation walks the use chain of each collected global. For each operand slot (24-byte stride, count from *(uint32_t *)(global + 20) & 0xFFFFFFF):
- If the operand is the module itself: accepted.
- If the opcode tag is <= 0x17 (arithmetic/comparison): rejected -- the global's address is used in computation.
- If the opcode is 77 (GEP): the pass calls sub_16CC9F0 (find in sorted set) to verify the GEP's base pointer is the same global being split.
- If the opcode is 54 (store): the pass checks that the store's parent basic block (at offset -24 from the operand) belongs to the global being analyzed.
If any operand fails validation, a flag v17 is set to zero and the entire SRA path is abandoned for this global.
Field Explosion
For each field index 0 through field_count - 1, the pass creates a replacement global variable in the same address space with internal linkage. The full pseudocode at lines 1084-1476:
// sub_18612A0, lines 1084-1476 — SRA field explosion
typedef struct {
Global **data;
uint64_t size;
uint64_t capacity;
} FieldVec;
void sra_explode_fields(Global *global, Module *module, Type *struct_type,
Value *init_val, TargetInfo *target, FieldVec *fields) {
uint8_t addr_space = (*(uint8_t *)(global + 33) >> 2) & 7;
const char *global_name = sub_1649960(global); // getName
uint32_t field_count = *(uint32_t *)(struct_type + 12);
uint64_t ptr_bits = sub_15A9520(target, addr_space); // getPointerSizeInBits
for (uint32_t i = 0; i < field_count; i++) {
// --- Extract field type and offset ---
Type *field_type = sub_1646BA0(struct_type, ptr_bits); // getStructFieldType
uint64_t field_offset = sub_15A06D0(struct_type, i); // computeFieldOffset
// --- Generate name: "my_global.f0", "my_global.f1", ... ---
char name[256];
snprintf(name, sizeof(name), "%s.f%d", global_name, i);
// --- Extract field initializer from parent init ---
Value *field_init = sub_15FEBE0(module, init_val, field_type); // createBitcast/GEP
// --- Create field global in same address space, internal linkage ---
Global *field_gv = sub_15E51E0(
get_scope(module), field_type, field_init,
/*linkage=*/7, name, addr_space); // createGlobalVar
// --- Copy metadata from parent to field global ---
sub_15E6480(global, field_gv); // copyMetadata
// --- Store into dynamically-grown field vector ---
if (fields->size >= fields->capacity) {
// Realloc growth: double capacity (lines 1161-1220)
uint64_t new_cap = fields->capacity * 2;
if (new_cap < 8) new_cap = 8;
fields->data = realloc(fields->data, new_cap * sizeof(Global *));
fields->capacity = new_cap;
}
fields->data[fields->size++] = field_gv;
// --- Compute field bit-size (same type switch as Path A) ---
uint64_t field_bits = compute_total_bits(field_type, target, addr_space);
uint64_t alignment;
if (*(uint8_t *)(field_type + 8) == 0xD) { // struct
uint64_t layout = sub_15A9930(target, field_type);
alignment = *(uint64_t *)(layout + 8);
} else {
alignment = sub_15A9FE0(target, field_type); // getAlignment
}
uint64_t padded = alignment * ((field_bits + alignment - 1) / alignment);
// --- Create GEP replacement and store initializer ---
Value *gep = sub_15FEBE0(module, field_gv, field_type); // createBitcast/GEP
sub_15F9660(field_offset, field_gv, global); // createFieldStore
}
}
The field globals are stored in a dynamically-grown std::vector with realloc growth strategy (lines 1161-1220 of the decompilation). The growth factor is 2x with a minimum initial capacity of 8 entries.
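The growth policy can be sketched directly from the pseudocode above (2x doubling, minimum capacity 8); this is an illustrative model, not the recovered code:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct { void **data; uint64_t size; uint64_t capacity; } FieldVec;

/* Growth policy from the decompiled SRA path: double the capacity on
   overflow, never dropping below 8 entries. */
static void fieldvec_push(FieldVec *v, void *p) {
    if (v->size >= v->capacity) {
        uint64_t new_cap = v->capacity * 2;
        if (new_cap < 8) new_cap = 8;
        v->data = realloc(v->data, new_cap * sizeof(void *));
        v->capacity = new_cap;
    }
    v->data[v->size++] = p;
}
```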
Null/Negative Guards
After field explosion, the pass generates safety checks for the original global's pointer value. This pattern handles the case where the global was heap-allocated via malloc -- the original pointer might be null or negative (indicating allocation failure on some platforms). The guard chain is constructed at lines 1478-1535:
// sub_18612A0, lines 1478-1535 — Null/negative guard chain generation
Value *build_guard_chain(Global *global, FieldVec *fields,
Module *module, TargetInfo *target) {
// --- Create %isneg = icmp slt <ptr>, 0 ---
// Opcode 51 = ICmp, predicate 40 = SLT (signed less than zero)
Value *isneg = sub_15FEC10(
/*dest=*/NULL, /*type_id=*/1, /*opcode=*/51, /*pred=*/40,
get_module_sym(module), /*offset=*/0,
concat_name(global, ".isneg"), get_current_bb(module)); // createICmp
Value *chain = isneg;
// --- For each field: %isnullI = icmp eq <field_ptr>, null ---
for (uint64_t i = 0; i < fields->size; i++) {
Global *field_gv = fields->data[i];
uint64_t field_offset = sub_15A06D0(
get_type(global), i); // computeFieldOffset
// Predicate 32 = EQ (equal to null)
Value *isnull = sub_15FEC10(
/*dest=*/NULL, /*type_id=*/1, /*opcode=*/51, /*pred=*/32,
field_gv, field_offset,
concat_name(global, ".isnull"), get_current_bb(module));
// Chain with OR: %tmpI = or i1 %chain, %isnullI
// Opcode 27 = OR
char tmp_name[16];
snprintf(tmp_name, sizeof(tmp_name), "tmp%lu", i);
chain = sub_15FB440(/*opcode=*/27, chain, isnull,
tmp_name, module); // createBinOp(OR)
}
return chain; // final chained predicate
}
The generated IR for a 3-field struct:
%isneg = icmp slt ptr @original_global, null ; predicate 40 = SLT
%isnull0 = icmp eq ptr @my_global.f0, null ; predicate 32 = EQ
%tmp0 = or i1 %isneg, %isnull0
%isnull1 = icmp eq ptr @my_global.f1, null
%tmp1 = or i1 %tmp0, %isnull1
%isnull2 = icmp eq ptr @my_global.f2, null
%tmp2 = or i1 %tmp1, %isnull2
br i1 %tmp2, label %malloc_ret_null, label %malloc_cont
The .isneg guard is created by sub_15FEC10 with opcode 51 (ICmp), predicate 40 (SLT with zero). Per-field .isnull guards use predicate 32 (EQ with null). The guards are chained with OR instructions (opcode 27) via sub_15FB440. The chain evaluation is linear in the number of fields -- for the maximum 16 fields, this produces 17 icmp instructions and 16 or instructions, plus one terminal conditional branch.
Malloc/Free Decomposition Algorithm
This is the core of NVIDIA's per-field malloc/free elimination, covering lines 1537-1640 of the decompilation. When the chained null check indicates a valid allocation, the pass generates a multi-block control flow that replaces the original single malloc/free pair with per-field conditional frees. This is the key divergence from upstream LLVM: stock tryToOptimizeStoreOfMallocToGlobal treats the malloc/free as an atomic pair, replacing it with a single static allocation. NVIDIA decomposes to per-field granularity, generating 2N+2 basic blocks for an N-field struct (one malloc_ret_null, one malloc_cont, and for each field one free_it plus one next block).
The complete pseudocode:
// sub_18612A0, lines 1537-1640 — Malloc/free decomposition
void decompose_malloc_free(Global *global, Module *module, Function *fn,
FieldVec *fields, Value *guard_chain,
TargetInfo *target) {
uint8_t addr_space = (*(uint8_t *)(global + 33) >> 2) & 7;
// === Step 1: Create control flow skeleton ===
// "malloc_cont" — continuation after successful allocation check
BasicBlock *malloc_cont_bb = sub_157FBF0(
fn, get_global_chain(module), "malloc_cont"); // createBB
// "malloc_ret_null" — failure path returning null
BasicBlock *ret_null_body = sub_157E9C0(fn); // createReturnBB
BasicBlock *malloc_ret_null_bb = sub_157FB60(
NULL, ret_null_body, "malloc_ret_null", NULL); // createBBWithPred
// === Step 2: Emit conditional branch on guard chain ===
// br i1 %guard_chain, label %malloc_ret_null, label %malloc_cont
sub_15F8650(
get_terminator(fn), // insertion point
malloc_ret_null_bb, // true target (fail)
malloc_cont_bb, // false target (success)
guard_chain, // condition (isneg|isnull)
fn); // createCondBr
// === Step 3: Per-field conditional free and reinitialization ===
BasicBlock *current_bb = malloc_cont_bb;
for (uint64_t i = 0; i < fields->size; i++) {
Global *field_gv = fields->data[i];
uint64_t field_offset = sub_15A06D0(get_type(global), i);
Type *field_type = sub_1646BA0(get_type(global),
sub_15A9520(target, addr_space));
// 3a. Create "tmp" alloca in current block
Value *tmp_alloca = sub_15F9330(
NULL, field_type, "tmp", current_bb); // createAlloca
// 3b. Create non-null check: %condI = icmp ne <field_ptr>, null
// Opcode 51 = ICmp, predicate 33 = NE (not equal to null)
char cond_name[64];
snprintf(cond_name, sizeof(cond_name), "%s.f%lu.nonnull",
sub_1649960(global), i);
Value *cond = sub_15FED60(
NULL, /*type_id=*/1, /*opcode=*/51, /*pred=*/33,
field_gv, field_offset, cond_name, current_bb); // createICmpNE
// 3c. Create "free_it" block — frees this field if non-null
char free_name[64];
snprintf(free_name, sizeof(free_name), "free_it%lu", i);
BasicBlock *free_it_bb = sub_157FB60(
NULL, NULL, free_name, NULL); // createBBWithPred
// 3d. Create "next" block — fallthrough after conditional free
char next_name[64];
snprintf(next_name, sizeof(next_name), "next%lu", i);
BasicBlock *next_bb = sub_157FB60(
NULL, NULL, next_name, NULL); // createBBWithPred
// 3e. Conditional branch: non-null → free, null → skip
// br i1 %condI, label %free_itI, label %nextI
sub_15F8650(
get_terminator_of(current_bb),
free_it_bb, // true: free
next_bb, // false: skip
cond, fn); // createCondBr
// 3f. In free_it block: wire field into use-def chain, then branch to next
sub_15FDB00(field_gv, get_use_chain(free_it_bb),
i, free_it_bb); // wireDef
// Unconditional branch: free_it → next
sub_15F8590(NULL, next_bb, free_it_bb); // createBr
// 3g. In next block: store field initializer into the new field global
sub_15F9850(field_offset, tmp_alloca, next_bb); // createStoreToField
current_bb = next_bb;
}
// === Step 4: Wire entry into malloc_cont, erase original ===
// Unconditional branch from entry into malloc_cont
sub_15F8590(NULL, malloc_cont_bb, get_entry_bb(fn)); // createBr
// Erase the original global
sub_15F20C0(get_module_entry(module)); // eraseFromParent
}
The generated CFG for a 2-field struct { i32, float }:
entry:
br i1 %tmp1, label %malloc_ret_null, label %malloc_cont
malloc_ret_null:
ret null
malloc_cont:
%cond0 = icmp ne ptr @g.f0, null
br i1 %cond0, label %free_it0, label %next0
free_it0:
; free(@g.f0) — conditional per-field deallocation
br label %next0
next0:
store i32 <init0>, ptr addrspace(1) @g.f0
%cond1 = icmp ne ptr @g.f1, null
br i1 %cond1, label %free_it1, label %next1
free_it1:
; free(@g.f1)
br label %next1
next1:
store float <init1>, ptr addrspace(1) @g.f1
; ... continuation
Each free_it block is conditionally entered only when the field pointer is non-null, preventing double-free on fields that were never successfully allocated. The next blocks store the field initializer after the conditional free, ensuring the field global is properly initialized regardless of whether freeing occurred. This per-field decomposition enables a critical optimization that upstream LLVM cannot perform: if a later pass (dead store elimination, constant propagation) determines that only some fields of the struct are actually used, the unused field globals and their associated free_it/next blocks become dead code and are trivially eliminated by GlobalDCE.
Address-Space-Aware Splitting
The address space preservation logic is woven throughout both the field explosion and the malloc/free decomposition. Every call to sub_15E51E0 (createGlobalVar) passes the extracted address space from the parent global. The extraction point is always the same: (*(uint8_t *)(global + 33) >> 2) & 7. This is critical for three reasons:
- Shared memory splitting: A __shared__ struct global (AS 3) split into per-field globals must keep each field in AS 3. If any field migrated to AS 0 (generic), the hardware would resolve the address at runtime via the generic-to-specific resolution unit, adding 10-20 cycles of latency per access and defeating the purpose of placing data in shared memory.
- Constant memory splitting: A __constant__ struct (AS 4) split into fields must remain in AS 4 to benefit from the constant cache's broadcast capability. A single warp reading the same constant field hits the cache once and broadcasts to all 32 threads. In AS 0 (generic), this broadcast would not occur.
- Pointer size consistency: On some NVPTX targets, pointers in AS 3 (shared) and AS 5 (local) are 32-bit, while AS 0 and AS 1 pointers are 64-bit. The size computation for opaque pointers (tag 0xF) calls sub_15A9520(target, addr_space) -- if the address space were lost during splitting, the pointer size calculation would be wrong, producing incorrect field offsets and corrupted stores.
The per-field null checks in the guard chain also respect address space: the icmp eq with null uses a null pointer of the correct address space width. A 32-bit null in AS 3 is not the same bit pattern as a 64-bit null in AS 1.
Hash Table for Processed Globals
After field explosion and malloc rewrite, the pass uses a custom hash table (open addressing, 32-byte entries) to track which globals and their transitive users have been processed. This is an instance of the NVIDIA-original hash table variant (sentinel pair -8/-16) as documented in the hash infrastructure page.
| Offset | Field | Description |
|---|---|---|
| +0 | key | Pointer to global (sentinel: -8 = empty, -16 = tombstone) |
| +8 | data | Pointer to field-global vector |
| +16 | size | Current vector size |
| +24 | cap | Vector capacity |
Hash function, quadratic probing with triangular numbers, and 75% load factor / 12.5% tombstone compaction thresholds all follow the standard DenseMap infrastructure; see Hash Table and Collection Infrastructure for details.
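Triangular-number probing has the property that, in a power-of-two table, the probe sequence visits every slot before repeating -- which is what makes the 75% load factor safe. A small self-contained demonstration (names illustrative):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NSLOTS 16   /* power of two, as DenseMap-style tables require */

/* Triangular-number probing: offsets 0, 1, 3, 6, 10, ... from the home
   slot. Counts how many distinct slots the sequence reaches. */
static uint64_t probe_sequence_coverage(uint64_t start) {
    uint8_t seen[NSLOTS];
    memset(seen, 0, sizeof seen);
    uint64_t slot = start & (NSLOTS - 1), covered = 0;
    for (uint64_t step = 1; covered < NSLOTS; step++) {
        if (!seen[slot]) { seen[slot] = 1; covered++; }
        slot = (slot + step) & (NSLOTS - 1); /* next triangular offset */
        if (step > 2 * NSLOTS) break;        /* safety bound */
    }
    return covered;
}
```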
The processing loop (lines 1710-1812) iterates remaining users of the original global and rewrites them to reference the new field globals:
// sub_18612A0, lines 1710-1812 — Post-SRA user rewriting via hash table
void rewrite_remaining_users(Global *global, FieldVec *fields,
HashTable *table, Module *module,
TargetInfo *target) {
Use *use = *(Use **)(global + 8);
while (use != NULL) {
Use *next_use = *(Use **)(use + 8);
Instruction *inst = sub_1648700(use);
uint8_t opcode = *(uint8_t *)(inst + 16);
if (opcode == 54) { // store
// Walk the store's own use-chain
Use *store_use = *(Use **)(inst + 8);
while (store_use != NULL) {
Use *next_su = *(Use **)(store_use + 8);
// Per-user SRA rewrite: replaces GEP+store/load sequences
// with direct accesses to the appropriate field global
sub_1860BE0(store_use, table, fields, target); // rewriteUserForSRA
store_use = next_su;
}
// If store has no remaining uses, erase it
if (*(Use **)(inst + 8) == NULL) {
sub_15F20C0(inst); // eraseFromParent
// Remove from hash table (mark as tombstone)
HashEntry *entry = sub_1860630(
inst, 0, table, NULL); // lookupInTable
if (entry != NULL)
entry->key = (void *)(-16); // tombstone
}
} else {
// For non-store users (loads, etc.): create direct stores
// to the appropriate field global
for (uint64_t i = 0; i < fields->size; i++) {
uint64_t offset = sub_15A06D0(
get_type(global), i); // computeFieldOffset
sub_15F9660(offset, fields->data[i], inst); // createFieldStore
}
}
use = next_use;
}
}
After all users are rewritten, cleanup proceeds in two phases: first, operand lists of dead GEP (tag 77) and store (tag 54) instructions are unlinked from the use chain (nulling out 24-byte-stride operand slots at lines 2004-2079); second, the dead instructions are erased via sub_15F20C0 at lines 2081-2117. Finally, the original global declaration is erased via sub_15E55B0, and all temporary data structures (hash table backing array, field vectors, scratch buffers) are freed at lines 2119-2161.
Top-Level Driver: sub_18612A0
The complete control flow of the core transform function, integrating all four strategies. This pseudocode corresponds to the entire 2179-line decompilation:
// sub_18612A0 — Core GlobalOpt transform for a single global variable
// Returns: 1 if transformed, 0 if no transformation applied
int globalopt_transform(Global *global, Module *module, Type *type,
int flag, TargetInfo *target, TargetInfo *target2) {
// === Phase 1: Type filter (lines 444-451) ===
uint8_t type_tag = *(uint8_t *)(type + 8);
uint16_t bitmask = 0x8A7E; // bits 1-6, 9, 11, 15 set (struct tag 13 handled below)
if (!((1 << type_tag) & bitmask)) {
// Additional acceptance for struct(13), vector(14), array(16)
if (type_tag == 13 || type_tag == 14 || type_tag == 16) {
if (!sub_16435F0(type, 0)) // isAnalyzableType
return 0;
} else {
return 0;
}
}
// === Phase 2: Use validation — all users must be store/load (lines 452-481) ===
Buffer scratch = { .data = alloca(8 * sizeof(void *)), .size = 0, .capacity = 8 };
Use *use = *(Use **)(global + 8);
while (use != NULL) {
Instruction *inst = sub_1648700(use); // getInstruction
uint8_t opcode = *(uint8_t *)(inst + 16);
if (opcode <= 0x17) return 0; // arithmetic: reject
if (opcode == 54) { // store
if (!sub_185C920(inst, &scratch)) // analyzeStore
return 0;
} else if (opcode != 55) { // not load either
return 0;
}
use = *(Use **)(use + 8);
}
// === Phase 3: Collect store values and evaluate initializer (lines 482-493) ===
Buffer store_buf = { .data = calloc(32, sizeof(void *)), .size = 0, .capacity = 32 };
sub_185C560(module, global, &store_buf); // collectStoreValues
Value *init_val = sub_140B2F0(module, target, global, 1); // evaluateInitializer
// === Phase 4: Try Path A — small-constant promotion (lines 494-805) ===
uint8_t init_tag = *(uint8_t *)(init_val + 16);
if (init_tag == 13) { // struct constant
uint8_t addr_space = (*(uint8_t *)(global + 33) >> 2) & 7;
uint64_t total_bits = compute_total_bits(type, target, addr_space);
uint64_t alignment = sub_15A9FE0(target, type);
uint64_t padded = alignment * ((total_bits + alignment - 1) / alignment);
if (padded <= 0x7FF) { // <= 2047 bits
promote_small_constant(global, module, init_val, type, target);
free(store_buf.data);
return 1;
}
}
// === Phase 5: Try Path B — SRA of struct globals (lines 807-2177) ===
if (flag != 0) { free(store_buf.data); return 0; } // SRA disabled by caller
// Verify unique initializer
if (init_val != sub_15A0680(get_module_sym(module), 1, 0)) {
free(store_buf.data); return 0;
}
// Check struct with 1-16 fields
if (type_tag == 14) type = unwrap_vector(type); // vector peeling
if (*(uint8_t *)(type + 8) != 13) { free(store_buf.data); return 0; }
uint32_t field_count = *(uint32_t *)(type + 12);
if (field_count - 1 > 0xF) { free(store_buf.data); return 0; } // 0 or > 16 fields (unsigned wrap)
// Collect stored values into hash set (lines 823-868)
HashSet store_set;
init_hashset(&store_set);
collect_store_values(global, module, &store_set, &scratch);
// Validate all users reference only this global (lines 878-1017)
if (!validate_global_only_uses(global, &store_set)) {
free(store_buf.data); return 0;
}
// Optional vector type peeling (lines 1026-1083)
if (*(uint8_t *)(type + 8) == 14) {
peel_vector_type(global, module, type, target);
}
// Field explosion (lines 1084-1476)
FieldVec fields = { .data = NULL, .size = 0, .capacity = 0 };
sra_explode_fields(global, module, type, init_val, target, &fields);
// Null/negative guard chain (lines 1478-1535)
Value *guard = build_guard_chain(global, &fields, module, target);
// Malloc/free decomposition (lines 1537-1640)
Function *fn = get_parent_function(global);
decompose_malloc_free(global, module, fn, &fields, guard, target);
// Hash-table-driven user rewriting (lines 1642-2161)
HashTable processed;
init_hashtable(&processed);
rewrite_remaining_users(global, &fields, &processed, module, target);
// Cleanup: unlink dead operands, erase dead instructions
cleanup_dead_instructions(&processed); // lines 2004-2117
// Erase original global and free temporaries
sub_15E55B0(global); // lines 2119-2161
free(fields.data);
free(store_buf.data);
destroy_hashtable(&processed);
destroy_hashset(&store_set);
return 1;
}
LTO Interaction
GlobalOpt benefits significantly from LTO's whole-program visibility. In single-compilation mode, a __device__ global with external linkage cannot be optimized because the compiler cannot prove it is unused by other translation units. With ThinLTO, the NVModuleSummary builder records per-global reference edges, and the ThinLTO importer pulls definitions across module boundaries. After import, GlobalOpt can see all users of a global across the entire program and make decisions that are impossible in per-module compilation:
- Internalization: A global referenced only within one module (after import) can be marked internal (linkage 7), enabling all four transformation paths.
- Dead global elimination: A global with zero users after import is trivially dead and erased. The NVModuleSummary builder's address-space tracking ensures that __device__ globals referenced by kernels are not prematurely killed -- a kernel's reference counts as a use even when no host-side code touches the global.
- Cross-module constant propagation: After import, if a __device__ global is stored exactly once (from a host-side cudaMemcpyToSymbol) and loaded many times across multiple device functions, the single store can be propagated as a constant, unlocking Path A's small-constant promotion.
The pass wrapper sub_196A2B0 is also called from the inliner cost model (the sub_18612A0 address is shared by both -- the inliner calls the GlobalOpt transform function to evaluate whether post-inline global folding would pay for the inline cost). This creates a feedback loop: inlining a caller that references a global may expose the global for optimization, which reduces code size, which makes further inlining cheaper.
Recursion
After completing either Path A or Path B, the pass recursively calls sub_185B1D0 on the newly created replacement globals. This handles cascading opportunities: splitting a struct global into fields may expose one of the field globals for further small-constant promotion (if a field is a small struct itself), or for dead elimination (if one field is never used). The recursion terminates when no further transformations apply -- each recursive call runs the same type filter and use validation, so it will return 0 for leaf scalars or globals with non-store/load users.
Knobs and Thresholds
| Threshold | Value | Source | Effect |
|---|---|---|---|
| Max bits for Path A | 2047 (0x7FF) | Hardcoded | Globals exceeding this fall through to SRA |
| Max struct fields for SRA | 16 | Hardcoded | Structs with >16 fields are not split |
| Hash table load factor | 75% (3/4) | Hardcoded | Triggers rehash of processed-globals table |
| Tombstone threshold | 12.5% (1/8) | Hardcoded | Triggers compacting rehash |
| Initial scratch buffer | 8 entries | Hardcoded | For use analysis; grows via sub_16CC920 |
| Store collection buffer | 32 entries | Hardcoded | For store value collection; grows dynamically |
| SRA disable flag (a4) | Caller-set | Runtime | When set, Path B is bypassed entirely |
| Pipeline gate | opts[1440] | Config array | When set, the sub_196A2B0 wrapper is skipped |
| Optimization tier | >= 2 | Pipeline config | GlobalOpt not run at tier 1 |
The pipeline parser registers "globalopt" at slot 45 in the pass name table, mapping to llvm::GlobalOptPass. The NVIDIA wrapper sub_196A2B0 is gated by the config array at offset 1440 -- when opts[1440] is set, the wrapper skips the pass entirely. At tier 2, GlobalOpt runs unconditionally at pipeline position 30. At tier 3, it runs with the same parameters but benefits from more aggressive SCCP and GlobalDCE having run upstream.
There are no user-facing CLI flags that directly control the 2047-bit threshold or the 16-field SRA limit. These are compile-time constants in the binary. The only external control is the tier-level gate and the opts[1440] kill switch.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
sub_18612A0 | 0x18612A0 | -- | Core transform: type filter, Path A, Path B |
sub_196A2B0 | 0x196A2B0 | -- | Pipeline wrapper (calls core after GlobalDCE) |
sub_185B1D0 | 0x185B1D0 | -- | Recursive re-application to split globals |
sub_185B7E0 | 0x185B7E0 | -- | Pre-SRA setup |
sub_1860410 | 0x1860410 | -- | Hash table rehash |
sub_1860630 | 0x1860630 | -- | Hash table lookup |
sub_1860BE0 | 0x1860BE0 | -- | Per-user SRA rewrite |
sub_185C560 | 0x185C560 | -- | Collect all store values for a global |
sub_185C920 | 0x185C920 | -- | Analyze single store for optimizability |
sub_185CAF0 | 0x185CAF0 | -- | Collect stored value into hash set |
sub_15E51E0 | 0x15E51E0 | -- | Create global variable (88 bytes, with AS) |
sub_15E5070 | 0x15E5070 | -- | Create init function |
sub_164D160 | 0x164D160 | -- | RAUW (Replace All Uses With) |
sub_15F20C0 | 0x15F20C0 | -- | Erase instruction from parent |
sub_15E55B0 | 0x15E55B0 | -- | Erase global declaration |
sub_15A9520 | 0x15A9520 | -- | getPointerSizeInBits(target, addr_space) |
sub_15A9930 | 0x15A9930 | -- | getStructLayout (field offsets) |
sub_15A06D0 | 0x15A06D0 | -- | computeFieldOffset |
sub_1646BA0 | 0x1646BA0 | -- | getStructFieldType |
sub_16435F0 | 0x16435F0 | -- | isAnalyzableType(type, depth) |
sub_140B2F0 | 0x140B2F0 | -- | evaluateInitializer(module, target, ..., 1) |
sub_15FB630 | 0x15FB630 | -- | Create notinit sentinel |
sub_15FB440 | 0x15FB440 | -- | Create binary OR (opcode 27) |
sub_15FEC10 | 0x15FEC10 | -- | Create ICmp instruction |
sub_15F8650 | 0x15F8650 | -- | Create conditional branch |
sub_15F8590 | 0x15F8590 | -- | Create unconditional branch |
sub_157FBF0 | 0x157FBF0 | -- | Create basic block |
sub_15FED60 | 0x15FED60 | -- | Create ICmp NE (opcode 51, predicate 33) |
sub_15F9330 | 0x15F9330 | -- | Create alloca ("tmp" variable in block) |
sub_15FDB00 | 0x15FDB00 | -- | Wire def into use-def chain |
sub_15F9850 | 0x15F9850 | -- | Create store-to-field-global |
sub_157E9C0 | 0x157E9C0 | -- | Create return basic block (null-return) |
sub_157FB60 | 0x157FB60 | -- | Create basic block with predecessor |
sub_15F55D0 | 0x15F55D0 | -- | Grow operand list |
sub_1648700 | 0x1648700 | -- | getInstruction(use) from use-chain |
sub_1649960 | 0x1649960 | -- | getName(global/fn) returns C string |
sub_1648A60 | 0x1648A60 | -- | IRBuilder::create(size, kind) allocates IR node |
sub_15E5530 | 0x15E5530 | -- | Destroy function body |
sub_159D9E0 | 0x159D9E0 | -- | Destroy function |
sub_164BE60 | 0x164BE60 | -- | Drop all references |
sub_1648B90 | 0x1648B90 | -- | Mark dead (flags or-equals 1) |
sub_1631BE0 | 0x1631BE0 | -- | Insert into function list |
sub_15A9FE0 | 0x15A9FE0 | -- | getAlignment(target, type) ABI alignment |
sub_15A0680 | 0x15A0680 | -- | lookupSymbol(module_sym, idx, flags) |
sub_16463B0 | 0x16463B0 | -- | getArrayElementType(ptr, idx) |
sub_159C540 | 0x159C540 | -- | getTerminalType(type) |
sub_1752100 | 0x1752100 | -- | Collect use-def chain |
sub_15E6480 | 0x15E6480 | -- | Copy metadata from global to global |
sub_15F8F80 | 0x15F8F80 | -- | Create extractvalue instruction |
sub_15F9480 | 0x15F9480 | -- | Create store-init (initializer store) |
sub_15F9660 | 0x15F9660 | -- | Create field store (offset + field global) |
sub_15FD590 | 0x15FD590 | -- | Create local variable ("newgv") |
sub_15FEBE0 | 0x15FEBE0 | -- | Create bitcast/GEP for field extraction |
sub_1648780 | 0x1648780 | -- | Replace use with value |
sub_16CC920 | 0x16CC920 | -- | Grow scratch buffer |
sub_16CC9F0 | 0x16CC9F0 | -- | Find in sorted set |
sub_1968390 | 0x1968390 | -- | GlobalDCE / ConstantProp (runs before GlobalOpt) |
Differences from Upstream LLVM GlobalOpt
Stock LLVM's GlobalOptPass (in lib/Transforms/IPO/GlobalOpt.cpp) performs similar high-level transformations: SRA of globals, shrink-to-bool, constant marking, dead global elimination, malloc/free removal, static constructor evaluation, calling convention optimization (fastcc), and alias resolution. The NVIDIA implementation diverges in these concrete ways:
- Internal IR, not LLVM IR. The pass operates on NVIDIA's custom IR node format with 88-byte global nodes, 24-byte operand stride, and type tags at offset +8/+16 of type/instruction nodes. A reimplementation targeting upstream LLVM would use GlobalVariable, StoreInst, LoadInst, and GetElementPtrInst directly.
- 2047-bit constant promotion threshold. LLVM does not have a single bit-count gate for constant promotion. NVIDIA's threshold likely targets the GPU register file: 2047 bits is approximately 64 32-bit registers, close to the per-thread register budget on many SM architectures.
- Per-field malloc decomposition. Stock LLVM's tryToOptimizeStoreOfMallocToGlobal handles malloc/free as a single pair. NVIDIA generates per-field null checks, conditional frees, and continuation blocks -- a more aggressive decomposition.
- Custom hash table. LLVM uses DenseMap/SmallPtrSet. NVIDIA uses a hand-rolled open-addressing hash table with 32-byte entries (see Hash Table and Collection Infrastructure for the hash function and sentinel values).
- Address-space preservation. Every created global explicitly receives the source global's address space. Stock LLVM does not special-case address spaces in GlobalOpt.
- Recursive re-application. After splitting, NVIDIA calls sub_185B1D0 to re-run GlobalOpt on the results. Upstream LLVM relies on the pass manager to schedule re-runs via its invalidation mechanism.
- Inliner integration. The inliner cost model at the same address range calls into GlobalOpt to evaluate post-inline global folding benefit. This tight coupling does not exist in upstream LLVM, where inlining and GlobalOpt are independent passes.
Cross-References
- NVModuleSummary Builder -- builds the global reference edges that determine which globals are live across modules
- Inliner Cost Model -- calls GlobalOpt's transform function to evaluate post-inline global optimization benefit
- ThinLTO Function Import -- imports functions across module boundaries, exposing globals for cross-module optimization
- Alias Analysis & NVVM AA -- address-space-aware alias analysis that informs which memory operations can alias globals in different address spaces
- MemorySpaceOpt -- resolves generic pointers to specific address spaces; runs before GlobalOpt and may expose globals that were previously behind generic pointers
- Pipeline & Ordering -- full pass ordering showing GlobalOpt's position at step 30
- Type Translation, Globals & Special Vars -- how EDG frontend assigns address spaces to global variables during IR generation
- Hash Infrastructure -- hash function, sentinel values, and probing strategy used by the processed-globals table
- Struct Splitting -- the NewPM lower-aggr-copies pass that handles similar aggregate decomposition at a different pipeline stage
- Address Spaces -- complete NVPTX address space reference including pointer sizes and latency characteristics
Whole-Program Devirtualization
CICC v13.0 includes LLVM's WholeProgramDevirtPass at sub_2703170 (13,077 bytes), which replaces indirect virtual calls with direct calls using whole-program type information. On GPU this optimization is far more consequential than on CPU: an indirect call in PTX compiles to a call.uni through a register, which prevents the backend from inlining the callee, forces all live registers across the call boundary into local memory spills, destroys instruction scheduling freedom, and creates a warp-divergence hazard if threads in the same warp resolve the function pointer to different targets. A single devirtualized call site in a hot kernel loop can therefore improve performance by an order of magnitude -- the direct call enables inlining by the inliner cost model, which in turn eliminates .param-space marshaling, enables cross-boundary register allocation, and restores the instruction scheduler's ability to interleave memory and arithmetic operations.
CICC's devirtualization operates in a privileged position: GPU compilation is inherently a closed-world model. Every function that can be called on the device must be visible at link time -- there is no dynamic loading, no shared libraries, and no dlopen on GPU. This means the set of possible implementations for any virtual function is fully known, making single-implementation devirtualization almost always profitable and branch funnels rare. The pass runs as a module-level pass (pipeline parser slot 121, registered as "wholeprogramdevirt") during the LTO phase, after the NVModuleSummary builder has computed type test metadata and before GlobalDCE eliminates dead virtual methods.
| Entry point | sub_2703170 (0x2703170, 13,077 bytes) |
| Address range | 0x2703170--0x2706485 |
| Stack frame | 856 bytes (0x358) |
| Pass name | "wholeprogramdevirt" (pipeline slot 121) |
| Pass type | Module pass |
| Callee-saved | r15, r14, r13, r12, rbx |
| Return value | 1 = module modified, 0 = no changes |
| Remark category | "wholeprogramdevirt" / "Devirtualized" |
| Helper range | sub_2700B00--sub_2708220 (branch funnel helpers, summary I/O) |
The Closed-World GPU Advantage
Upstream LLVM's WholeProgramDevirt is designed primarily for LTO pipelines where some modules may not be visible (ThinLTO import/export split, shared libraries with hidden visibility). The pass must therefore be conservative: it can only devirtualize when !type metadata proves that the vtable set is complete. On GPU, this conservatism is unnecessary. All device code is statically linked into a single fatbinary -- there are no device-side shared libraries, no runtime code loading (the driver JIT compiles PTX, but does not add new device functions), and __device__ virtual functions cannot escape to host code. The entire class hierarchy is visible.
CICC exploits this by running WPD in regular LTO mode (not ThinLTO export/import split), where the pass directly resolves virtual calls against the merged module. The NVModuleSummary builder records type_test metadata for all device vtables, and the pass consumes this metadata to build a complete picture of every virtual call site and every possible target. In practice, GPU programs rarely have deep polymorphic hierarchies in device code (the hardware penalties discourage it), so most virtual call sites resolve to a single implementation.
The Formal Closed-World Argument
The closed-world guarantee on GPU rests on five architectural invariants, each of which eliminates a source of conservatism that forces upstream LLVM to leave calls indirect:
| # | Invariant | What upstream LLVM must worry about | Why GPU is immune |
|---|---|---|---|
| 1 | No device-side shared libraries | A .so loaded at runtime could add a new vtable entry for a class. LTO must mark !vcall_visibility metadata linkage-unit to prove the vtable set is closed within the link unit. | The CUDA driver loads PTX/SASS as a monolithic blob. cuModuleLoad does not support incremental symbol addition. There is no dl_iterate_phdr on device. |
| 2 | No dlopen on device | Host-side dlopen can inject new implementations of virtual functions. Upstream must check !vcall_visibility for translation-unit scope. | Device code has no equivalent of dlopen. The only way to add device code is to recompile and reload the entire module. |
| 3 | No device-side RTTI | dynamic_cast and typeid on host can defeat devirtualization by requiring the vtable to contain RTTI pointers that reference external type_info objects. | CUDA explicitly prohibits dynamic_cast and typeid in __device__ functions. Device vtables contain no RTTI pointers. The NVVM IR verifier (sub_12DD660) rejects code that attempts dynamic_cast in device context. |
| 4 | No exceptions on device | Virtual destructors in exception-handling code create additional vtable entries and __cxa_throw unwinding paths that must be considered. | CUDA does not support exceptions in device code. Virtual destructors are simple (no EH cleanup), and the compiler can see every destructor call site. |
| 5 | Complete link-time visibility | ThinLTO's import/export split means some modules may not be available during WPD. The pass must use summary-based resolution with wholeprogramdevirt-summary-action=import/export. | CICC uses wholeprogramdevirt-summary-action=none (direct resolution on the merged module). All device functions, including those from separate compilation units, are linked by nvlink into a single merged module before the LTO pipeline runs. |
The practical consequence: CICC sets whole-program-visibility effectively to true for all device code. The !vcall_visibility metadata that upstream uses to distinguish "linkage-unit" from "translation-unit" scope becomes irrelevant -- every device vtable is within a single, complete, closed translation unit.
How NVModuleSummary Feeds WPD
The NVModuleSummary builder at sub_D7D4E0 (2,571 decompiled lines, 74KB) produces the type metadata that WPD consumes. The interaction is:
-
NVModuleSummary walks every
GlobalValuein the module (linked list atModule+72). For each function (opcode0x3D), it extracts attribute groups #34 (reference edges with type metadata) and #35 (direct call targets) viasub_B91C10. -
For reference edges with type info (attribute #34), the builder decodes MDNode operands (lines 1193-1228 of the decompilation): each parameter position >= 2 yields a type node (opcode range 5-36), walked to a parent MDTuple (opcode 17) containing the type name string at offset 24 (indirect through pointer if length > 64).
-
These type-metadata edges are packed into the
FunctionSummaryrecord bysub_D77220as thev378(type-checked references) argument. The resulting metadata lands in the module asllvm.type.test/type_test_assumenamed metadata nodes. -
WPD reads these nodes back via
sub_B6AC80(module, 0x166)at its entry point, completing the producer-consumer chain.
DevirtSCCRepeatedPass: The Outer Loop
WPD at the module level is one of two devirtualization mechanisms. The other operates at CGSCC granularity: DevirtSCCRepeatedPass at sub_2284BC0 (16KB) wraps the CGSCC pipeline in a fixed-point iteration loop that re-runs until no new devirtualization opportunities are discovered or a maximum iteration count is reached. On reaching the limit, the pass emits "Max devirtualization iterations reached". The abort-on-max-devirt-iterations-reached knob (registered at constructor 378) controls whether this is a fatal error or a warning. The iteration count at O1-O3 is 1; at tier 3 (maximum optimization) it is 5, giving the inliner and devirtualizer multiple rounds to discover indirect-to-direct call conversions that expose further inlining opportunities.
The two mechanisms are complementary: module-level WPD resolves virtual calls using global type hierarchy information (vtable metadata), while CGSCC-level devirtualization catches cases where inlining reveals new constant function pointers that can be resolved without type metadata.
Algorithm
The pass executes in seven phases:
Phase 1: Metadata Extraction (0x2703170--0x27031CA)
The entry point fetches four named metadata nodes from the module using sub_B6AC80 (getNamedMetadata):
| Enum ID | Metadata Node | Purpose |
|---|---|---|
0x166 (358) | llvm.type.test / type_test_assume | Records of @llvm.assume(@llvm.type.test(%ptr, %typeID)) intrinsic results |
0x164 (356) | llvm.type.checked.load | Call sites using type-checked vtable loads |
0x165 (357) | llvm.type.checked.load.relative | Relative vtable pointer variant (compact vtables) |
0x0B (11) | Module-level type metadata | Type summaries describing vtable layouts |
If neither type_test_assume nor module-level type metadata is present, the pass checks for type_checked_load and type_checked_load_relative as fallbacks. If none exist, the pass returns 0 immediately.
The assembly sequence at the entry point:
; 0x2703170: entry
mov esi, 0x166 ; enum ID = 358 (type_test_assume)
call sub_B6AC80 ; rbx = getNamedMetadata(module, 0x166)
mov esi, 0x164 ; enum ID = 356 (type_checked_load)
call sub_B6AC80 ; r13 = getNamedMetadata(module, 0x164)
mov esi, 0x165 ; enum ID = 357 (type_checked_load_relative)
call sub_B6AC80 ; [rbp-0x338] = result
mov esi, 0x0B ; enum ID = 11 (module-level type metadata)
call sub_B6AC80 ; r12 = result
Phase 2: Type Test Record Iteration (0x2703296--0x2703383)
Type test records are stored in an array at offset +0xA0 of the metadata state, with count at +0xA8. Each record is 144 bytes (0x90):
struct TypeTestRecord { // 0x90 = 144 bytes per record
uint8_t *type_value; // +0x00: pointer to type test value
// ... call site references, metadata links ...
};
// Iteration pattern at 0x2703296:
TypeTestRecord *base = state->records; // [state + 0xA0]
uint32_t count = state->record_count; // [state + 0xA8]
TypeTestRecord *end = base + count; // stride = 0x90
// Address computation in binary:
// lea rax, [rax+rax*8] ; count * 9
// shl rax, 4 ; count * 144 = count * 0x90
// add rax, rdx ; end pointer
for (TypeTestRecord *rec = base; rec != end; rec++) {
if (rec->type_value[0] != 0) continue; // skip already-processed
// ... look up type in hierarchy ...
}
For each record whose type byte is 0 (unprocessed), the pass computes a string hash of the type name via sub_B91420 (get type name) and sub_B2F650 (string hash), then looks up the type in a red-black tree rooted at offset +0xE0 of the module state.
Phase 3: Hash Table Construction (0x2703589--0x2703AE2)
Unique type test values are tracked in an open-addressed hash table with 56-byte entries. The hash function combines bit-shifted fields to reduce clustering:
uint32_t hash(uint32_t val, uint32_t mask) {
return ((val >> 4) ^ (val >> 9)) & mask;
}
The table uses power-of-2 sizing with LLVM-layer sentinels (empty = 0xFFFFFFFFE000, deleted = 0xFFFFFFFFF000). See Hash Table and Collection Infrastructure for the probing and growth policy.
Each 56-byte hash table entry stores:
| Offset | Size | Field |
|---|---|---|
+0x00 | 8 | Type test value (key) |
+0x08 | 8 | Flags / padding |
+0x10 | 8 | Type info pointer |
+0x18 | 8 | Associated data (resolution result) |
+0x20 | 8 | Red-black tree node (self-referential on init) |
+0x28 | 8 | Link pointer |
+0x30 | 8 | Count / size |
Slot addressing uses the identity slot_index * 7 * 8 = slot_index * 56:
; At 0x27035A0:
lea rdx, ds:0[rsi*8] ; rsi = slot index, rdx = slot*8
sub rdx, rsi ; rdx = slot*8 - slot = slot*7
mov rsi, [rdi+rdx*8] ; load from table base + slot*56
Table growth is handled by sub_2702540, which reallocates and rehashes all entries using the same (val >> 4) ^ (val >> 9) function against the new mask. Entry initialization at 0x2703A33:
; Insert new entry:
add [rbp-0x2D0], 1 ; increment unique type count
call sub_2702540 ; grow table if needed (returns new entry ptr in rax)
mov dword [rax+10h], 0 ; clear type info
mov qword [rax+18h], 0 ; clear data
mov [rax], rdx ; store type test value
lea rdx, [rax+10h]
mov [rax+20h], rdx ; self-referential link (RB tree node init)
mov [rax+28h], rdx ; self-referential link
mov qword [rax+30h], 0 ; zero count
Phase 4: Type Hierarchy Lookup via Red-Black Tree (0x27032F7--0x2703362, 0x2704183--0x2704267)
For each unique type, the pass searches a red-black tree keyed by hashed type name. The tree is rooted at offset +0xE0 of the module state, with the sentinel node at +0xD8. The search is a two-phase process with a three-field comparison:
Phase 4a: Compute Type Name Hash
// At 0x27032F7:
char *name = sub_B91420(type_value); // returns (name_ptr, name_len)
uint64_t hash = sub_B2F650(name, len); // string hash
// Tree root and sentinel:
RBTreeNode *root = module_state[+0xE0]; // root pointer
RBTreeNode *sentinel = module_state + 0xD8; // sentinel node address
sub_B2F650 (stringHash) is LLVM's standard xxHash-style string hasher. It produces a 64-bit hash that is stored at node[+0x20] for each type in the tree.
Phase 4b: Descend Tree by Hash
// At 0x270330C:
RBTreeNode *current = root;
RBTreeNode *best = sentinel; // rcx = sentinel initially
while (current != NULL) {
uint64_t node_hash = current[+0x20]; // hash stored in node
if (target_hash < node_hash) {
best = current; // track nearest greater
current = current[+0x10]; // left child
} else if (target_hash > node_hash) {
current = current[+0x18]; // right child
} else {
// hash matches -- proceed to Phase 4c
break;
}
}
if (current == NULL) goto not_found;
The binary encodes this as:
compare_node:
cmp rsi, [r15+20h] ; compare target hash vs node hash
ja go_right ; target > node -> right child
jnb hash_match ; target == node -> verify
mov rcx, r15 ; track best (left-leaning)
mov r15, [r15+10h] ; r15 = left child
test r15, r15
jnz compare_node
jmp not_found
go_right:
mov r15, [r15+18h] ; r15 = right child
test r15, r15
jnz compare_node
jmp not_found
Phase 4c: Verify Full Match (Hash Collision Resolution)
On hash match, the pass performs a two-step verification to handle collisions:
// At 0x2704200:
// Step 1: compare string lengths
if (current[+0x30] != target_length) {
// Length mismatch -- this is a hash collision, not a real match.
// Continue tree traversal to the next candidate.
goto next_candidate;
}
// Step 2: compare actual type name strings
char *node_name = (char *)current[+0x28]; // node's type name data
char *target_name = target_string; // from sub_B91420
int cmp = memcmp(node_name, target_name, target_length);
if (cmp != 0) goto next_candidate;
// Verified match -- read vtable data
The binary at 0x2704200--0x2704240:
cmp r12, [r15+30h] ; compare string length
jnz next_candidate ; length mismatch
mov rdi, [r15+28h] ; s1 = node's string data
mov rsi, [rbp-0x348] ; s2 = target string data
mov rdx, r12 ; n = length
call _memcmp
test eax, eax
jz found_match
Phase 4d: Extract Vtable Data
After verifying the type match, the pass reads the vtable descriptor from the type node:
// At 0x2704248:
void *vtable_start = current[+0x68]; // vtable start address
void *vtable_data = current[+0x70]; // vtable data pointer (function pointers)
if (vtable_data == NULL) goto skip_type; // no vtable -> nothing to devirtualize
The vtable_data pointer leads to an array of function pointers representing the virtual method implementations for this type. The pass iterates this array comparing each entry against call site signatures to identify devirtualization candidates.
Phase 5: Virtual Call Resolution (0x2703974--0x27039BA)
For each call site on a matched type, the pass calls sub_26FEE10 (resolveVirtualCall):
bool resolveVirtualCall(
void *module_state, // rdi: r15 (module/pass state)
void *target_candidates, // rsi: candidates vector from [rbp-0x230]
void *hash_entry, // rdx: r12 (pointer to hash table entry + 8)
uint32_t candidate_count, // rcx: from [rbp-0x228]
void *call_site_info // r8: r13 (call site from [r15+0x28])
);
// Returns: al = 1 if unique resolution found, 0 otherwise
The resolution algorithm within sub_26FEE10 works by comparing the vtable offset encoded in each call site's llvm.type.test / llvm.type.checked.load intrinsic against the vtable slot offsets of all candidate implementations. When exactly one candidate matches, the resolution succeeds with strategy 1 (direct call). When multiple candidates exist but all return the same constant or can be distinguished by a single offset, strategy 2 (unique member) is chosen. When multiple distinct targets exist, strategy 3 (branch funnel) is produced.
The resolution result is written to hash_entry[+0x28] as a strategy selector:
| Value | Strategy | Upstream LLVM counter |
|---|---|---|
| 1 | Direct call (single implementation) | NumSingleImpl |
| 2 | Unique member dispatch | NumUniformRetVal / NumUniqueRetVal |
| 3 | Branch funnel | NumBranchFunnel |
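The three-way outcome can be modeled as a small selector. This is a hypothetical reconstruction of the decision logic described above, not decompiled code; the candidate record and its field are invented for illustration:

```cpp
#include <cassert>
#include <vector>

enum Strategy { kIndirect = 0, kDirectCall = 1, kUniqueMember = 2, kBranchFunnel = 3 };

struct Candidate { int vtable_slot; };   // illustrative candidate record

// Hypothetical model of the selector value written to hash_entry[+0x28].
Strategy selectStrategy(const std::vector<Candidate> &cands,
                        unsigned funnel_threshold = 10) {
    if (cands.empty()) return kIndirect;
    if (cands.size() == 1) return kDirectCall;        // NumSingleImpl
    // all candidates distinguishable by a single offset -> unique member
    bool single_offset = true;
    for (const Candidate &c : cands)
        if (c.vtable_slot != cands[0].vtable_slot) single_offset = false;
    if (single_offset) return kUniqueMember;          // NumUniformRetVal-style
    if (cands.size() <= funnel_threshold) return kBranchFunnel;
    return kIndirect;                                 // left as an indirect call
}
```

The threshold default of 10 mirrors wholeprogramdevirt-branch-funnel-threshold (see the Knobs table).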
Before calling sub_26FEE10, the pass checks two preconditions:
// At 0x2703974:
void *call_site_list = module_state[+0x28]; // r13 = [r15+0x28]
if (call_site_list == NULL) goto skip;
if (type_value[0] != 0) goto skip; // byte check: direct type info only
void *existing = hash_entry[+0x28];
if (existing != 0) goto already_resolved; // skip if previously resolved
Phase 6: Strategy Application (0x2703BA3--0x27046F0)
Strategy 1 -- Direct Call Replacement (0x27044DA)
When only one class implements the virtual function (the common case on GPU), the indirect call is replaced with a direct call to the resolved function. This is handled by sub_26F9AB0 (rewriteCallToDirectCall):
// At 0x27044DA:
void rewriteCallToDirectCall(
void *type_entry, // rdi: r12
void *call_site, // rsi: [r15+0x38]
uint64_t vtable_data, // rdx: byte_3F871B3 (vtable offset data)
uint32_t flags, // ecx: 0
void *resolved_function // r8: [rbx+0x40]
);
This is the simplest and most common optimization: the call.reg becomes call.direct, enabling downstream inlining. On GPU this is by far the dominant strategy. Consider a CUDA kernel with a virtual method call inside a loop:
; Before devirtualization (PTX):
ld.global.u64 %rd1, [%rd0]; // load vtable ptr
ld.global.u64 %rd2, [%rd1+16]; // load function ptr at vtable slot 2
call.uni %rd2, (%args); // indirect call -- full scheduling barrier
; After devirtualization (PTX):
call.uni _ZN7Derived4workEv, (%args); // direct call -- inlinable
The direct call then becomes an inlining candidate with CICC's 20,000-unit budget (89x the upstream LLVM default of 225), and the inliner typically eliminates it entirely, producing fully-inlined code with no call overhead.
Strategy 2 -- Unique Member Dispatch (0x27045C9)
When multiple classes exist but the call can be dispatched through a unique member offset, the pass rewrites via sub_26F9080 (rewriteToUniqueMember), passing the diagnostic string "unique_member" (13 chars). The member offset is read from hash_entry[+0x60] and the base type from hash_entry[+0x00].
; At 0x27045D9:
mov r11, [r12] ; type info (base type)
mov rsi, [r12+60h] ; member offset
lea rax, "unique_member" ; diagnostic string (13 chars)
call sub_26F9080 ; rewriteToUniqueMember
; rdx = r14 (type test record)
; rcx = r13 (call site)
; r9 = rdi (vtable byte offset / 8)
; "unique_member" + length 0x0D pushed on stack
After the initial rewrite, sub_26FAF90 performs call-site-specific fixup, checking [rbx+0x40] to determine if additional adjustment is needed (e.g., adjusting this pointer offset for multiple inheritance).
Upstream LLVM's equivalent covers two sub-strategies: uniform return value optimization (all implementations return the same constant -- replace the call with that constant) and unique return value optimization (for i1 returns, compare the vptr against the one vtable that returns a different value). Both are folded under the "unique_member" label in CICC's implementation.
Strategy 3 -- Branch Funnel (0x27043B5)
When multiple possible targets exist and cannot be reduced to a single dispatch, the pass creates a branch funnel -- a compact conditional dispatch sequence that checks the vtable pointer and branches to the correct target. This is handled by three functions:
- sub_26F78E0 -- create branch funnel metadata (with diagnostic string "branch_funnel", 13 chars)
- sub_BCF480 -- build the conditional dispatch structure
- sub_BA8C10 -- emit the indirect branch sequence
; At 0x27043B5:
mov r12, [rbx] ; vtable pointer
mov rdi, [r12] ; function pointer from vtable
call sub_BCB120 ; get function declaration
; At 0x27043D3:
lea rax, "branch_funnel" ; 13 chars at 0x42BCB92
call sub_26F78E0 ; create branch funnel metadata
call sub_BCF480 ; build dispatch structure
call sub_BA8C10 ; emit indirect branch sequence
The branch funnel supports two dispatch granularities:
| Granularity | String | Function | Description |
|---|---|---|---|
| Byte | "byte" (4 chars, at 0x3F8C256) | sub_26F9120 | Check byte offset into vtable to select target |
| Bit | "bit" (3 chars, at 0x43ADFE0+0xE) | sub_26F9120 | Check bit offset for single-bit discrimination |
The emission sequence at 0x270450C--0x27045BF:
; Byte-granularity dispatch:
lea rcx, "byte" ; at 0x3F8C256
mov [rbp-0x318], 4 ; string length
call sub_26F9120 ; emit byte-offset check
; Bit-granularity dispatch:
lea rbx, "bit" ; at 0x43ADFE0 + 0xE
mov [rbp-0x328], 3 ; string length
call sub_26F9120 ; emit bit-offset check
; Finalize:
call sub_26FB610 ; r8=byte_result, r9=bit_result
; rdi=r12, rdx=byte_3F871B3
The finalization call sub_26FB610 receives both byte and bit results and produces the final dispatch sequence. On GPU, branch funnels are rare because device code hierarchies are typically shallow, but the infrastructure exists for cases like thrust/CUB polymorphic iterators.
Upstream LLVM gates branch funnels behind the wholeprogramdevirt-branch-funnel-threshold knob (default: 10 targets per call site). CICC inherits this threshold.
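Conceptually, the emitted funnel replaces one opaque indirect call with a chain of vtable-pointer comparisons feeding direct, inlinable calls. A hand-written C++ analogue (types and names invented for illustration, not the pass's actual output):

```cpp
#include <cassert>

int implA() { return 1; }          // e.g. A::work
int implB() { return 2; }          // e.g. B::work

using Fn = int (*)();
const Fn vtblA[1] = { implA };     // candidate vtables, known at compile time
const Fn vtblB[1] = { implB };

// Indirect form would be `vptr[0]()` -- a scheduling barrier, never inlinable.
// The funnel compares against each known vtable and branches to a direct call.
int funnelDispatch(const Fn *vptr) {
    if (vptr == vtblA) return implA();   // direct call, inlinable
    if (vptr == vtblB) return implB();   // direct call, inlinable
    return vptr[0]();                    // fallback: genuine indirect call
}
```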
Phase 7: Cleanup (0x2704144--0x270342E)
After processing all types, the pass performs four cleanup operations:
- Function attribute cleanup (0x2704144): iterates the module's function list (red-black tree at [rax+10h]), calling sub_B98000 with parameter 0x1C (attribute cleanup enum) on each function.
- Import list cleanup (0x270416C): processes entries at module[+0x110..+0x118], calling sub_B43D60 to release function metadata for imported declarations.
- Type hierarchy destruction: sub_26F92C0 releases all type hierarchy data structures.
- Hash table deallocation (0x27033C3): iterates all non-sentinel entries, calls sub_26F75B0 to release per-entry resolution data, then sub_C7D6A0 to free the table buffer. Type test result vectors (0x70-byte elements with sub-vectors at offsets +0x10, +0x28, +0x40, +0x58) are freed element by element.
Hash table cleanup detail:
// At 0x27033C3:
uint32_t count = hash_table_entry_count; // [rbp-0x2B8]
if (count == 0) goto skip_cleanup;
void *base = hash_table_base; // [rbp-0x2C8]
void *end = base + count * 56; // count * 7 * 8
for (void *entry = base; entry < end; entry += 56) {
uint64_t key = *(uint64_t *)entry;
if (key == 0xFFFFFFFFE000) continue; // empty sentinel
if (key == 0xFFFFFFFFF000) continue; // deleted sentinel
sub_26F75B0(entry[+0x18]); // release resolution data
}
sub_C7D6A0(base); // free table buffer
GPU-Specific Constraints
Virtual Functions in Device Code
CUDA allows __device__ virtual functions, but with restrictions that simplify devirtualization:
- No RTTI on device. There is no typeid or dynamic_cast on GPU. This means vtable layouts do not contain RTTI pointers, simplifying vtable reconstruction. The NVVM IR verifier rejects code that attempts dynamic_cast in device context.
- No exceptions on device. Virtual destructors do not need to handle __cxa_throw unwinding paths.
- Closed world. No device-side shared libraries, no dlopen, no runtime code generation. All virtual targets are known at compile time.
- No separate compilation for virtual dispatch. Device linking (nvlink) resolves all symbols before PTX emission, so the merged module always has complete type information.
- Simplified vtable layout. Without RTTI pointers and exception tables, device vtables are a flat array of function pointers at known offsets. This makes vtable slot arithmetic straightforward for the WPD pass.
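The common case these constraints enable looks like the following (shown as host C++ for brevity; in CUDA the methods would carry __device__). With a closed world and exactly one override visible in the merged module, WPD can resolve the call at compile time:

```cpp
#include <cassert>

struct Shape {
    virtual float area() const { return 0.0f; }
    virtual ~Shape() = default;
};

// The only concrete implementation visible in the merged module.
struct Square final : Shape {
    float s;
    explicit Square(float s) : s(s) {}
    float area() const override { return s * s; }
};

// With a single implementation, the indirect call through the vtable is
// replaced by a direct call to Square::area, which the inliner then absorbs.
float totalArea(const Shape *sh) { return sh->area(); }
```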
Cost of Unresolved Indirect Calls
If devirtualization fails, the PTX backend must emit a call.uni or call through a register. This has several penalties:
- No inlining. The callee is unknown, so the inliner cannot evaluate it.
- Full .param marshaling. Every argument must be written to .param space; no copy elision is possible. The call ABI (opcodes 510-513: CallDirect, CallDirectNoProto, CallIndirect, CallIndirectNoProto) forces .param-space round-tripping.
- Register pressure spike. All live registers across the call must be spilled to local memory (device DRAM, ~400 cycle latency on SM 70-90).
- Scheduling barrier. The call is a full fence for instruction scheduling -- no operations can be reordered across it.
- Divergence hazard. If different threads in a warp resolve the pointer to different functions, execution serializes both paths. In the worst case (32 different targets), this is a 32x slowdown.
- Occupancy reduction. The register spills increase per-thread local memory usage, reducing occupancy and thus hiding less memory latency.
This is why CICC's default inlining budget of 20,000 (89x the upstream LLVM default) makes sense in combination with aggressive devirtualization: the pass converts expensive indirect calls into direct calls, and the inliner then eliminates them entirely.
Relationship to LowerTypeTests
The LowerTypeTests pass (sub_188C730, 96,984 bytes at 0x188C730; also sub_2638ED0 at 70KB) is the other half of the type-test infrastructure. While WPD consumes type test metadata to resolve virtual calls, LowerTypeTests produces the runtime type-checking implementation. The interaction:
| Pass | Role | When |
|---|---|---|
| NVModuleSummary (sub_D7D4E0) | Produces type metadata in function summaries | During summary construction |
| WholeProgramDevirt (sub_2703170) | Consumes type metadata, resolves virtual calls | LTO phase, after summary, before GlobalDCE |
| LowerTypeTests (sub_188C730) | Lowers remaining @llvm.type.test intrinsics to runtime bit tests | After WPD, if CFI is active |
On GPU, LowerTypeTests is largely dead code -- CUDA does not use Control-Flow Integrity (CFI), and WPD resolves most type tests statically. The sweep at 0x1880000 confirms: "WPD/CFI/LowerTypeTests cluster is also upstream-only; CUDA does not use CFI or type-based devirtualization" in the sense of runtime CFI checks. The type metadata is consumed entirely by WPD's compile-time resolution.
LowerTypeTests validates its input with: "Second argument of llvm.type.test must be metadata" and "Second argument of llvm.type.test must be a metadata string". These error paths are unreachable in normal CUDA compilation but exist because CICC links the full upstream LLVM IPO library.
Optimization Remarks
When a call site is successfully devirtualized, the pass emits an optimization remark through the diagnostic handler. The remark is constructed at 0x2703EDA using three components:
| Component | String | Address |
|---|---|---|
| Remark name | "Devirtualized" (13 chars) | 0x42BCBEE |
| Pass name | "wholeprogramdevirt" (18 chars) | 0x42BC950 |
| Body prefix | "devirtualized " (14 chars) | 0x42BCBE2 |
| Attribute key | "FunctionName" (12 chars) | 0x42BC980 |
The remark construction sequence:
// At 0x2703EDA:
sub_B17560(&remark, "Devirtualized", 13, "wholeprogramdevirt", 18);
sub_B18290(&remark, "devirtualized ", 14); // append body
sub_B16430(&remark, "FunctionName", 12); // create named attribute
sub_26F69E0(&remark, resolved_function); // attach target name
sub_B180C0(&remark); // finalize
sub_1049740(diag_handler, &remark); // publish to handler
The remark is visible via -Rpass=wholeprogramdevirt and includes the name of the resolved target function (obtained from the function's name metadata or via sub_26F69E0 for unnamed functions).
After remark emission, extensive cleanup of small-string-optimized (SSO) std::string objects is performed -- each remark component checks if the string buffer was heap-allocated (compare pointer vs stack buffer address) and frees if necessary.
Knobs
| Knob | Type | Default | Effect |
|---|---|---|---|
| wholeprogramdevirt-branch-funnel-threshold | unsigned | 10 | Maximum number of call targets per call site for branch funnel emission. Beyond this threshold, the call site is left indirect. |
| whole-program-visibility | bool | false | Force enable whole-program visibility even without !vcall_visibility metadata. On GPU this is effectively always true. |
| disable-whole-program-visibility | bool | false | Force disable whole-program visibility for debugging. |
| wholeprogramdevirt-summary-action | enum | none | Controls summary interaction: none, import, export. CICC uses none (direct resolution on merged module). |
| wholeprogramdevirt-read-summary | string | empty | Read type resolutions from a bitcode/YAML file. |
| wholeprogramdevirt-write-summary | string | empty | Write type resolutions to a bitcode/YAML file. |
| wholeprogramdevirt-skip | string list | empty | Comma-separated list of function names to exclude from devirtualization. |
| wholeprogramdevirt-check | enum | none | Runtime checking mode: none, trap (abort on incorrect devirt), fallback (fall back to indirect call). |
| wholeprogramdevirt-keep-unreachable-function | bool | true | Keep unreachable functions as possible devirt targets (conservative default). |
| wholeprogramdevirt-print-index-based | bool | false | Print index-based devirtualization messages for debugging. |
| wholeprogramdevirt-cutoff | signed | -1 | Maximum number of devirtualization actions to perform. -1 = unlimited. Useful for bisecting devirtualization-induced miscompiles. |
| abort-on-max-devirt-iterations-reached | bool | false | When DevirtSCCRepeatedPass at sub_2284BC0 hits its iteration limit, abort instead of warning. Registered at constructor 378. |
Complexity
| Operation | Complexity | Notes |
|---|---|---|
| Hash table insert/lookup | O(1) amortized, O(n) worst case | Linear probing with sentinel-based open addressing |
| Type hierarchy lookup | O(log n) | Red-black tree keyed by type name hash, with memcmp verification |
| Per-type call resolution | O(call_sites * candidates) | For each type, check every call site against every candidate target |
| Branch funnel emission | O(vtable_entries) per site | Linear in number of possible targets |
| String hash (sub_B2F650) | O(name_length) | One-pass hash of the type name string |
| Total pass | O(T * S * C * log T) | T = types, S = call sites per type, C = candidates. Typically sparse on GPU. |
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| WholeProgramDevirtPass::run | sub_2703170 | 13,077 | Pass entry point |
| buildTypeTestInfo | sub_2702830 | ~2,600 | Build type test records from metadata |
| growHashTable | sub_2702540 | ~740 | Grow and rehash the type test hash table |
| resolveVirtualCall | sub_26FEE10 | ~3,200 | Attempt single-target resolution for a call site |
| rewriteCallToDirectCall | sub_26F9AB0 | ~1,600 | Strategy 1: replace indirect call with direct call |
| rewriteToUniqueMember | sub_26F9080 | ~640 | Strategy 2: unique member dispatch rewrite |
| finalizeUniqueMember | sub_26FAF90 | ~1,700 | Strategy 2: call-site-specific fixup |
| createBranchFunnelMeta | sub_26F78E0 | ~1,100 | Strategy 3: create branch funnel metadata |
| buildBranchFunnel | sub_BCF480 | ~6,400 | Strategy 3: build conditional dispatch structure |
| emitIndirectBranch | sub_BA8C10 | ~8,200 | Strategy 3: emit indirect branch sequence |
| emitDispatchCheck | sub_26F9120 | ~500 | Branch funnel byte/bit offset check |
| finalizeBranchFunnel | sub_26FB610 | ~1,800 | Branch funnel finalization |
| destroyTypeHierarchy | sub_26F92C0 | ~400 | Release type hierarchy data structures |
| releaseResolutionData | sub_26F75B0 | ~300 | Free per-entry resolution data |
| attachFunctionName | sub_26F69E0 | ~240 | Attach function name to optimization remark |
| branchFunnelHelper | sub_2700B00 | ~9,800 | Branch funnel main helper (called from sub_2703170) |
| summaryIO | sub_2706490 | ~7,600 | WPD summary read/write (-wholeprogramdevirt-read-summary) |
| DevirtSCCRepeatedPass::run | sub_2284BC0 | 16,000 | CGSCC devirtualization iteration loop |
| getNamedMetadata | sub_B6AC80 | ~200 | Fetch named metadata node from module |
| getTypeInfoName | sub_B91420 | ~300 | Compute type info name string |
| stringHash | sub_B2F650 | ~180 | Hash a type name string (xxHash-style) |
| createRemarkHeader | sub_B17560 | ~250 | Create optimization remark header |
| appendRemarkBody | sub_B18290 | ~200 | Append body text to remark |
| createNamedAttribute | sub_B16430 | ~200 | Create named attribute for remark |
| publishRemark | sub_1049740 | ~100 | Publish remark to diagnostic handler |
Cross-References
- NVModuleSummary Builder -- produces the type_test metadata consumed by this pass; records devirtualization-relevant type GUIDs in per-function summaries via sub_D7D4E0.
- Inliner Cost Model -- devirtualized direct calls become inlining candidates with a 20,000-unit budget; the entire value of devirtualization on GPU depends on the inliner subsequently eliminating the call.
- ThinLTO Function Import -- in ThinLTO mode the pass would operate in export/import phases, but CICC primarily uses regular LTO for device code.
- Pipeline & Ordering -- WPD is registered at pipeline parser slot 121 as a module pass; it runs during the LTO phase after summary construction and before GlobalDCE.
- NVPTX Call ABI -- describes the .param-space calling convention that makes indirect calls so expensive (opcodes 510-513: CallDirect, CallDirectNoProto, CallIndirect, CallIndirectNoProto).
- LazyCallGraph & CGSCC -- devirtualization converts ref edges to call edges in the call graph, triggering SCC re-computation via switchInternalEdgeToCall. The DevirtSCCRepeatedPass at sub_2284BC0 wraps the CGSCC pipeline in a fixed-point loop.
- GPU Execution Model -- explains why indirect calls are so expensive on GPU (warp divergence, scheduling barriers, register spilling to local memory).
- Hash Infrastructure -- the type test hash table uses the same sentinel-based open-addressing pattern as CICC's universal DenseMap infrastructure.
GPU Execution Model
This page is the single authoritative reference for the GPU hardware properties that drive cicc's optimization decisions. Every other wiki page that mentions register pressure, occupancy cliffs, memory coalescing, warp divergence, or the .param calling convention should cross-reference this page rather than re-explaining the concepts inline. The page exists because these properties shape literally every pass in the compiler, from SROA (which exists to avoid .local memory) through register allocation (which trades register count for occupancy) to LTO inlining (which eliminates .param marshaling). Understanding the execution model is a prerequisite for understanding any cicc optimization decision that differs from upstream LLVM.
The material below describes the hardware model as cicc sees it -- the properties that are visible in the binary through TTI hooks, threshold constants, cost model comparisons, and diagnostic strings. Where specific numbers vary by SM generation, the sm_70+ (Volta through Blackwell) values are given unless otherwise noted.
SIMT Warp Execution
NVIDIA GPUs execute threads in groups of 32 called warps. All 32 threads in a warp share a single program counter under the SIMT (Single Instruction, Multiple Threads) model. The hardware issues one instruction per clock to all 32 threads simultaneously -- there is no per-thread instruction decode, fetch, or issue overhead. Each thread has its own register state and can execute a different data path, but they all advance through the program in lockstep.
This is not SIMD in the CPU sense. On a CPU with AVX-512, the programmer (or compiler) explicitly packs 16 floats into a vector register and issues a single vector instruction. On a GPU, the programmer writes scalar code for one thread, and the hardware transparently replicates it across 32 threads. The distinction matters for cicc because vectorization on GPU does not fill SIMD lanes -- it produces wide loads (ld.v2, ld.v4) within a single thread's scalar stream to improve memory transaction width and reduce instruction count. The VF returned by TTI::getRegisterBitWidth(Vector) is 32 bits (one scalar register), not 512 or 1024.
Divergence
When a branch condition evaluates differently across threads in a warp, the hardware serializes both paths. First the "taken" subset executes while the others are masked off; then the "not-taken" subset executes. The warp reconverges at a point determined by the hardware's reconvergence stack (pre-Volta) or independent thread scheduling (Volta+). Both paths execute regardless of how many threads take each side, so a divergent branch in a hot loop can halve throughput even if only one thread disagrees.
Divergence is the primary reason cicc includes the StructurizeCFG pass (which converts irreducible control flow to reducible form), the CSSA pass (which repairs SSA across divergent join points), the Loop Index Split pass (which eliminates index-dependent branches that cause per-iteration divergence), and the Branch Distribution pass (which separates uniform from divergent computation).
The constant warpSize = 32 is hardcoded in cicc's SCEV range analysis (intrinsic ID ~370, range [32, 33)) and is the architectural constant behind every power-of-two factor enforcement in the loop unroller and loop vectorizer.
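The serialization rule can be stated as a tiny cost model: each side of a branch executes if at least one of the 32 threads takes it. A simplified sketch that ignores reconvergence overhead (names and structure are illustrative):

```cpp
#include <cassert>
#include <cstdint>

// taken_mask: bit i is set if thread i of the warp takes the "then" side.
// Both sides execute under a mask whenever the warp is not uniform.
unsigned warpBranchCycles(uint32_t taken_mask,
                          unsigned then_cycles, unsigned else_cycles) {
    unsigned cycles = 0;
    if (taken_mask != 0)           cycles += then_cycles;  // someone takes "then"
    if (taken_mask != 0xFFFFFFFFu) cycles += else_cycles;  // someone takes "else"
    return cycles;
}
```

A warp with even one disagreeing thread pays for both paths, which is the asymmetry the divergence-related passes are designed to avoid.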
Register Pressure and Occupancy
The register file is the single most constrained resource on an NVIDIA GPU and the single most important factor in cicc's optimization heuristics. Understanding the relationship between register count, occupancy, and performance is essential to understanding why cicc makes the decisions it does.
The Register Budget
Each Streaming Multiprocessor (SM) has a fixed 32-bit register file:
| SM Generation | Registers per SM | Max Registers per Thread |
|---|---|---|
| SM 70 (Volta) | 65,536 | 255 |
| SM 75 (Turing) | 65,536 | 255 |
| SM 80 (Ampere) | 65,536 | 255 |
| SM 86 (Ampere GA10x) | 65,536 | 255 |
| SM 89 (Ada) | 65,536 | 255 |
| SM 90 (Hopper) | 65,536 | 255 |
| SM 100 (Blackwell) | 65,536 | 255 |
These 65,536 registers are shared among all resident threads. The hardware partitions them at kernel launch time based on the per-thread register count reported by ptxas. The partition is coarse-grained -- registers are allocated in units of warp groups, not individual threads.
Occupancy Cliffs
The relationship between per-thread register count and achievable occupancy is a step function with sharp discontinuities:
Registers/thread Max warps/SM Max threads/SM Occupancy
32 64 2048 100%
33-40 48 1536 75%
41-48 32 1024 50% <-- cliff
49-64 32 1024 50%
65-80 24 768 37.5% <-- cliff
81-96 20 640 31.3%
97-128 16 512 25% <-- cliff
129-168 12 384 18.8%
169-255 8 256 12.5% <-- cliff
(Exact thresholds vary by SM generation and block size; these are representative for sm_70+ with standard block configurations.)
Adding a single register -- from 32 to 33 registers per thread -- drops maximum occupancy from 64 warps to 48 warps, a 25% reduction. These are the occupancy cliffs that cicc's heuristics are designed to avoid. The cost is asymmetric: the 33rd register provides trivial benefit (one fewer spill), but the occupancy loss costs 25% of the SM's latency-hiding capacity.
This is why:
- The loop unroller uses conservative thresholds that balance ILP against register growth
- The loop vectorizer limits VF to 2 or 4 even though wider vectors are legal
- LSR has an lsr-rp-limit knob that hard-rejects formulae exceeding a register pressure ceiling
- LICM runs twice -- once to hoist, once to sink back values whose extended live ranges hurt occupancy
- The rematerialization pass recomputes values rather than keeping them live across long ranges
- The register allocator uses -maxreg (default 70) as a pressure cap rather than a physical assignment constraint
The cicc binary contains no explicit occupancy table -- it delegates final register assignment and occupancy computation to ptxas. But the thresholds in the optimization passes (LSR's lsr-rp-limit, the unroller's PartialThreshold, the vectorizer's register-pressure-bounded interleave count) are all calibrated to stay below known cliff boundaries.
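A simplified model consistent with the representative step table above (the allocation granularities here are assumptions; ptxas's actual computation also accounts for block size, shared memory usage, and per-SM thread limits):

```cpp
#include <algorithm>
#include <cassert>

// Max resident warps per SM as limited by registers alone (sm_70-style).
unsigned maxWarpsPerSM(unsigned regs_per_thread) {
    const unsigned kRegFile  = 65536;  // 32-bit registers per SM
    const unsigned kWarpSize = 32;
    const unsigned kMaxWarps = 64;
    const unsigned kRegGran  = 8;      // assumed per-thread allocation granularity
    const unsigned kWarpGran = 4;      // assumed warp-count rounding
    unsigned alloc = ((regs_per_thread + kRegGran - 1) / kRegGran) * kRegGran;
    unsigned warps = kRegFile / (alloc * kWarpSize);
    warps = (warps / kWarpGran) * kWarpGran;   // round down to warp granularity
    return std::min(kMaxWarps, warps);
}
// maxWarpsPerSM(32) -> 64 (100%); maxWarpsPerSM(33) -> 48 (75%)
```

The discontinuity at 33 registers falls out of the rounding: one extra register rounds the per-thread allocation up to 40, and the resulting warp count rounds down past the cliff.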
PTX Virtual Registers
PTX has no fixed physical register file from the compiler's perspective. cicc emits virtual registers in nine typed classes (%p, %rs, %r, %rd, %f, %fd, %h, %hh, %rq -- see Register Classes). The ptxas assembler performs the actual register allocation from virtual to physical registers, using the SM's register file as the constraint. cicc's job is to minimize the number of simultaneously live virtual registers so that ptxas can produce a low register-count assignment.
The typed register model means that a 32-bit integer (%r) and a 32-bit float (%f) occupy separate register namespaces -- they never alias. A 64-bit value (%rd, %fd) occupies two 32-bit register slots. An Int128Regs value (%rq) occupies four. This is why the type legalization pass aggressively scalarizes vector types and the IV demotion pass narrows 64-bit induction variables to 32-bit: every bit of width reduction directly saves register pressure.
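Because every class is built from 32-bit physical slots, width reduction translates directly into register savings. A trivial model of the slot cost that IV demotion and vector scalarization are minimizing:

```cpp
#include <cassert>

// Physical 32-bit register slots consumed by one virtual register.
unsigned regSlots(unsigned bit_width) { return (bit_width + 31) / 32; }
// %p/%rs/%r/%f (<=32-bit) -> 1 slot, %rd/%fd (64-bit) -> 2, %rq (128-bit) -> 4
```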
Memory Hierarchy
GPU memory is organized into physically disjoint address spaces with radically different performance characteristics. On a CPU, the entire address space is a flat virtual memory with uniform-latency cache hierarchy. On a GPU, choosing the wrong address space for an access can cost 100x in latency. This section summarizes the performance-relevant properties; for complete address space encoding, aliasing rules, and data layout strings, see Address Spaces.
Latency Table
| Memory | LLVM AS | PTX Qualifier | Latency (cycles) | Scope | Capacity |
|---|---|---|---|---|---|
| Registers | -- | %r, %f, etc. | 0 | Per-thread | 255 per thread (SM 70+) |
| Shared | 3 | .shared | 20-30 | Per-CTA (block) | 48-228 KB per SM |
| Constant cache | 4 | .const | 4-8 (hit) | Read-only, device-wide | 64 KB per SM |
| Parameter | 101 | .param | 4-8 | Per-kernel launch | Mapped to constant bank |
| Local (L1 hit) | 5 | .local | ~30 | Per-thread stack | L1 partition |
| Local (L2 hit) | 5 | .local | ~200 | Per-thread stack | L2 partition |
| Global (L2 hit) | 1 | .global | 32-128 | Device-wide | L2 cache |
| Global (DRAM) | 1 | .global | 200-800 | Device-wide | Device DRAM |
| Generic | 0 | .generic | +4-8 over resolved | Virtual | Runtime-resolved |
| Shared cluster | 7 | .shared::cluster | 30-50 | Cross-CTA (SM 90+) | Cluster shared pool |
The 200-800 cycle range for global DRAM access is the defining constraint of GPU performance. It means that a single cache-missing load stalls the executing warp for hundreds of cycles. The hardware hides this latency through warp-level multithreading (see next section), but only if enough warps are resident -- which brings us back to register pressure and occupancy.
Why Each Memory Matters for cicc
Registers vs. .local: Every alloca that SROA fails to promote becomes a .local allocation backed by DRAM. A .local access that misses L1 costs 200-400 cycles versus zero for a register. This is why SROA runs twice in the pipeline and why cicc's inline budget (20,000 vs upstream 225) is so aggressive -- inlining eliminates allocas from byval parameter copies.
Shared memory (AS 3): On-chip SRAM with 20-30 cycle latency, shared across all threads in a CTA (thread block). Uses 32-bit pointers (when +sharedmem32bitptr is active), saving one register per pointer compared to 64-bit global pointers. This is why LSR has disable-lsr-for-sharedmem32-ptr -- strength-reducing a 32-bit shared pointer can produce 64-bit intermediates that defeat the optimization.
Constant memory (AS 4): Hardware-cached read-only memory with 4-8 cycle latency on cache hit. The NVVM AA marks AS 4 as NoModRef, enabling LICM to hoist constant loads without checking for intervening stores.
.param space (AS 101): Used for function argument passing (see the calling convention section below). Read-only from device code. Mapped to the constant cache path, so reads are 4-8 cycles.
Generic (AS 0): The performance killer. A generic pointer forces a runtime address-space lookup (+4-8 cycles per access) and destroys alias analysis precision (every generic pointer MayAlias with everything). This is why MemorySpaceOpt exists -- resolving generic pointers to specific address spaces is one of the highest-impact optimizations in cicc.
Memory Coalescing
The GPU memory subsystem services warp-wide requests in 128-byte transactions (or 32-byte sectors on some architectures). When 32 threads in a warp access 32 consecutive 4-byte values (128 bytes total), the hardware coalesces the 32 individual requests into a single transaction. This is the stride-1 access pattern -- the ideal case.
Thread 0 loads addr+0 ┐
Thread 1 loads addr+4 │
Thread 2 loads addr+8 │ One 128-byte transaction
... │
Thread 31 loads addr+124 ┘
When threads access non-consecutive addresses (stride > 1, scattered, or misaligned), the hardware must issue multiple transactions to satisfy the warp's requests. In the worst case (32 threads accessing 32 different cache lines), a single warp load generates 32 separate transactions, reducing effective bandwidth by 32x.
Coalescing is why the loop vectorizer targets VF=2 or VF=4 on GPU: vectorizing a per-thread loop with ld.v4.f32 loads four consecutive elements per thread in a single wide transaction, improving bytes-per-transaction. It is also why the loop unroller enforces power-of-two factors -- non-power-of-two unroll factors create asymmetric access patterns that interact poorly with the 128-byte transaction boundary.
The memory coalescing model also explains why cicc's SLP vectorizer pairs adjacent scalar loads into ld.v2 / ld.v4 instructions -- not for SIMD parallelism (there is none) but for transaction width optimization.
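The transaction math can be sketched directly: count the distinct 128-byte segments touched by one warp-wide 4-byte load at a given element stride. This is a simplified model that ignores alignment and 32-byte sector granularity:

```cpp
#include <cassert>
#include <cstdint>
#include <set>

unsigned transactionsForWarpLoad(unsigned stride_elems, uint64_t base = 0) {
    const unsigned kWarp = 32, kElemBytes = 4, kSegBytes = 128;
    std::set<uint64_t> segments;   // distinct 128-byte segments touched
    for (unsigned t = 0; t < kWarp; ++t)
        segments.insert((base + uint64_t(t) * stride_elems * kElemBytes) / kSegBytes);
    return (unsigned)segments.size();
}
// stride 1 -> 1 transaction (fully coalesced); stride 32 -> 32 (one per thread)
```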
No Out-of-Order Execution
GPU warps execute instructions strictly in program order. There is no out-of-order execution, no speculative execution, no branch prediction, and no reorder buffer. A warp that encounters a long-latency operation (global memory load, texture fetch) simply stalls until the result is available.
The sole latency-hiding mechanism is warp-level multithreading. Each SM maintains multiple warps in flight simultaneously. When one warp stalls on a memory access, the hardware switches to another ready warp in the same clock cycle (zero-cost context switch, because each warp has its own register state). This is why occupancy matters -- more resident warps means more opportunities to hide latency through interleaving.
The absence of OOO execution has profound implications for cicc:
ILP must be compiler-created. On a CPU, the hardware reorder buffer discovers and exploits instruction-level parallelism dynamically. On a GPU, the compiler (cicc + ptxas) must explicitly schedule independent instructions adjacent to each other so the hardware can overlap them. This is why loop unrolling is so valuable on GPU -- it creates independent instructions from different iterations that the scheduler can interleave -- and why the interleave count in the loop vectorizer exists (it replicates the vectorized body to expose more ILP).
Every stall is a stall. There is no store buffer to absorb write latency, no load queue to speculatively issue reads. The scheduling passes (instruction scheduling, block placement) must model this accurately.
Instruction issue width bounds throughput. Each SM has a fixed number of instruction schedulers (typically 4 per SM on sm_70+), each issuing one instruction per clock to one warp. The total instruction throughput of an SM is schedulers * clock_rate. The TTI scheduling info at TTI+56 (issue width at +32, latency at +36 within the sub-structure) encodes this model and feeds the vectorizer's interleave count cap.
The .param Calling Convention
Function calls on NVIDIA GPUs are expensive in a way that has no CPU equivalent. On x86, a function call passes arguments in registers or on the stack (a cached memory region), executes CALL, and the callee reads them back. Total overhead: 5-20 cycles. On GPU, there is no hardware call stack for registers. The PTX calling convention works through the .param address space:
Call Sequence
// Caller side:
.param .align 8 .b8 param0[16]; // DeclareParam
st.param.b64 [param0+0], %rd1; // Store arg 0, field 0
st.param.b64 [param0+8], %rd2; // Store arg 0, field 1
.param .b32 param1; // DeclareScalarParam
st.param.b32 [param1+0], %r5; // Store arg 1
call.uni (retval0), callee, (param0, param1); // The actual call
// Callee side:
ld.param.b64 %rd10, [param0+0]; // Load arg 0, field 0
ld.param.b64 %rd11, [param0+8]; // Load arg 0, field 1
ld.param.b32 %r20, [param1+0]; // Load arg 1
// ... function body ...
st.param.b32 [retval0+0], %r30; // Store return value
ret;
// Back in caller:
ld.param.b32 %r6, [retval0+0]; // Load return value
Each function call generates O(n) st.param + O(n) ld.param instructions where n is the total number of argument fields (not just argument count -- structs are marshaled field-by-field). A function with 8 struct arguments containing 4 fields each generates 32 stores + 32 loads + the call instruction itself. At shared/constant-cache latency (4-8 cycles per access), this is 256-512 cycles of pure marshaling overhead.
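That arithmetic can be sketched as a back-of-envelope cost function (the helper and its names are illustrative, not cicc's actual inliner cost model):

```c
/* Back-of-envelope .param marshaling cost per the text: each argument
 * field costs one st.param (caller) plus one ld.param (callee), at an
 * assumed 4-8 cycles per access. Illustrative, not cicc code. */
typedef struct { int lo, hi; } CycleRange;

CycleRange param_marshal_cycles(int num_args, int fields_per_arg) {
    int accesses = 2 * num_args * fields_per_arg;  /* stores + loads */
    CycleRange r = { accesses * 4, accesses * 8 }; /* 4-8 cycles each */
    return r;
}
```

For the example above, 8 arguments of 4 fields each give 64 accesses and a 256-512 cycle marshaling range, dwarfing the 5-20 cycle CPU call overhead.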
Additionally:
- Call boundaries destroy scheduling freedom. The hardware cannot overlap instructions across a call/return boundary.
- Call boundaries force register save/restore. If the callee needs more registers than are available in the caller's allocation, the hardware spills to .local memory (DRAM, 200-800 cycles).
- Indirect calls are catastrophic. An indirect call (call.uni through a register) prevents all of the above from being optimized statically. No inlining, no cross-function register allocation, no dead argument elimination.
This is why:
- cicc's custom inliner uses a 20,000-unit budget (89x upstream LLVM's 225) -- the .param marshaling cost for a typical function easily exceeds the 225-unit threshold
- LTO is dramatically more valuable on GPU than on CPU -- cross-module inlining eliminates .param overhead for functions in separate translation units
- Whole-program devirtualization is critical -- converting indirect calls to direct calls enables inlining and eliminates the worst-case register spill scenario
- 60% of the NVIDIA custom inliner's code computes type-size comparisons for argument coercion cost, because the .param marshaling cost dominates the inlining decision
The SelectionDAG Encoding
The SelectionDAG backend uses opcodes DeclareParam (505), DeclareScalarParam (506), StoreV1/V2/V4 (571-573), and LoadRetParam / LoadV1/V2/V4 (515-516, 568-570) for the param passing convention. The .param space is encoded as SelectionDAG code 5 in sub_33B0210. For complete opcode details, see NVPTX Machine Opcodes.
Address Space Semantics
GPU memory is partitioned into physically disjoint hardware regions. Pointers in different non-generic address spaces can never reference the same byte -- a property that NVVM AA exploits for O(1) NoAlias determination. The generic address space (AS 0) is a virtual overlay resolved at runtime by the hardware's address translation unit, which tests whether the address falls in the shared, local, or global window.
The following properties have direct optimization impact:
| Property | Global (AS 1) | Shared (AS 3) | Local (AS 5) | Constant (AS 4) |
|---|---|---|---|---|
| Pointer width | 64-bit | 32-bit* | 32-bit (effective) | 64-bit |
| Read-only | No | No | No | Yes |
| Cross-CTA visible | Yes | No | No | Yes |
| Hardware addressing modes | Base + offset | Base + offset, banked | Frame pointer + offset | Indexed constant cache |
| Coalescing | 128-byte transactions | 32 banks, 4-byte stride | Per-thread (no coalescing) | Broadcast to warp |
* 32-bit when +sharedmem32bitptr target feature is active (the default for sm_70+).
The 32-bit pointer optimization for shared memory saves one register per shared-memory pointer and reduces all address arithmetic from 64-bit to 32-bit operations. This is encoded in the NVPTX data layout string as p3:32:32:32 and is the reason the IV Demotion pass exists -- it narrows 64-bit induction variables to 32-bit when the loop operates entirely in shared memory.
For the complete address space reference -- including aliasing rules, the MemorySpaceOpt bitmask encoding, cvta intrinsic mapping, isspacep folding, and per-SM shared memory sizes -- see Address Spaces.
Compiler Implications Summary
Every major cicc optimization decision traces back to one or more of the properties above. The following table maps each hardware property to the compiler passes it shapes:
| Hardware Property | Compiler Impact | Key Passes |
|---|---|---|
| Warp divergence serializes both paths | Minimize control flow in hot loops | StructurizeCFG, CSSA, Loop Index Split, Branch Distribution |
| Register count determines occupancy | All transforms must minimize live values | Register Allocation, LSR, LICM, Rematerialization, IV Demotion |
| Occupancy cliffs are discrete | Threshold-driven heuristics with cliff awareness | Loop Unroll, Loop Vectorize, LSR lsr-rp-limit |
| No OOO execution | Compiler must create ILP | Loop Unroll (ILP via body replication), Scheduling, vectorizer interleave count |
| .local spill costs 200-800 cycles | Aggressively promote allocas | SROA (runs twice), Inliner (20K budget eliminates byval copies) |
| .param marshaling is O(n) per call | Aggressively inline | Inliner, LTO, Devirtualization |
| 128-byte coalescing transactions | Optimize memory access stride | Loop Vectorize (VF=2/4 for ld.v2/ld.v4), SLP Vectorizer |
| Address spaces are disjoint | NoAlias for cross-space pairs | NVVM AA, MemorySpaceOpt |
| Generic pointers destroy alias precision | Resolve to specific space | MemorySpaceOpt, IPMSP |
| Shared memory uses 32-bit pointers | Narrow IV and address width | IV Demotion, LSR disable-lsr-for-sharedmem32-ptr |
| Closed-world compilation model | Full-program visibility | LTO, Dead Kernel Elimination, Devirtualization |
| Constant cache is 4-8 cycles | Hoist constant loads freely | LICM, NVVM AA NoModRef for AS 4 |
What Upstream LLVM Gets Wrong
Upstream LLVM's NVPTX backend correctly implements the PTX virtual register model and the basic address space numbering. But the optimization passes assume CPU-like economics:
- Inline threshold of 225 assumes function calls cost 5-20 cycles. GPU calls cost hundreds of cycles due to .param marshaling. NVIDIA overrides to 20,000.
- LSR cost model compares formulae by counting registers and instructions with equal weight. On GPU, one extra register can cost 25% occupancy; one extra instruction costs nearly nothing. NVIDIA replaces the formula solver entirely.
- LICM assumes hoisting is always profitable. On CPU, moving an operation from loop body to preheader is strictly beneficial. On GPU, it extends the live range of the hoisted value across the entire loop, consuming a register for all iterations. NVIDIA runs LICM twice (hoist then sink) and relies on rematerialization to undo unprofitable hoists.
- Vectorization targets SIMD lane width. TTI::getRegisterBitWidth(Vector) returns 256 (AVX2) or 512 (AVX-512) on CPU. NVPTX returns 32 -- there are no SIMD lanes. Vectorization targets memory transaction width, not ALU parallelism.
- No occupancy model exists in upstream. CPU register allocation minimizes spill cost. GPU register allocation must minimize total register count to maximize occupancy. These are different objective functions.
- Address spaces are an afterthought. Upstream LLVM treats address spaces as metadata annotations. On GPU, they are physically disjoint hardware memory partitions with different pointer widths, latencies, and aliasing properties. Every pass that touches pointers must be address-space-aware.
Cross-References
- Address Spaces -- complete encoding, aliasing rules, MemorySpaceOpt bitmask, data layout strings
- Register Classes -- nine typed register classes, encoding scheme, coalescing rules
- Register Allocation -- greedy RA, -maxreg constraint, pressure tracking
- Loop Vectorize -- VF selection, memory coalescing motivation, register-pressure-bounded IC
- Loop Unroll -- ILP vs register pressure tradeoff, power-of-two enforcement
- LSR (NVIDIA Custom) -- occupancy-aware formula solver, register pressure gating
- LICM -- hoist/sink dual invocation, register pressure tension
- SROA -- .local elimination, dual-invocation pipeline position
- Inliner Cost Model -- 20K budget, .param marshaling cost, four parallel models
- LTO & Module Optimization -- closed-world model, dead kernel elimination
- MemorySpaceOpt -- generic-to-specific address space resolution
- StructurizeCFG -- divergence-safe control flow restructuring
- CSSA -- conventional SSA for SIMT divergence correctness
- Rematerialization -- register pressure reduction via recomputation
- IV Demotion -- 64-bit to 32-bit IV narrowing for shared memory
- Instruction Scheduling -- in-order scheduling, MRPA pressure tracking
- NVPTX Target Infrastructure -- TTI hooks, data layout, target features
Address Spaces
This page is the single source of truth for NVPTX address space numbering, hardware mapping, pointer widths, aliasing rules, and the internal bitmask encoding used by MemorySpaceOpt. It supersedes all inline address space tables elsewhere in the wiki -- those pages should cross-reference this one rather than maintaining their own copies.
NVPTX defines eight address spaces in cicc v13.0, six of which correspond to physically disjoint hardware memory partitions. The generic (flat) address space is a virtual overlay resolved at runtime by the GPU's address translation unit. The eighth, tensor memory (AS 6), is a Blackwell-era addition accessible only through TMA intrinsics. A ninth, AS 25, is used internally within NVVM IR for device-linkage annotations and never reaches PTX emission. A tenth, AS 53, appears in MemorySpaceOpt initialization as an internal annotation space for global variable tracking.
Master Address Space Table
| LLVM AS | Name | PTX Qualifier | Hardware | Pointer Width | Typical Latency | CUDA Qualifier |
|---|---|---|---|---|---|---|
| 0 | Generic (flat) | .generic | Virtual -- address translation unit maps to physical space at runtime | 64-bit | +4-8 cycles over resolved (translation overhead) | Default for unresolved pointers |
| 1 | Global | .global | Device DRAM, L2 cached, optionally L1 cached | 64-bit | 200-800 cycles (DRAM); 32-128 cycles (L2 hit) | __device__, cudaMalloc |
| 3 | Shared | .shared | Per-CTA on-chip scratchpad SRAM (48-228 KB per SM) | 32-bit (when p3:32:32:32 active) or 64-bit | 20-30 cycles (bank-conflict-free) | __shared__ |
| 4 | Constant | .const | Read-only constant cache (64 KB per SM) | 64-bit | 4-8 cycles (cache hit); DRAM latency on miss | __constant__ |
| 5 | Local | .local | Per-thread private stack in DRAM, L1 cached | 32-bit (effective) or 64-bit | Same as global (backed by DRAM) | Stack allocations (alloca) |
| 6 | Tensor Memory | N/A (TMA intrinsics only) | Blackwell tensor memory (SM 100+) | 64-bit | Varies (TMA pipeline) | N/A -- accessed via cp.async.bulk intrinsics |
| 7 | Shared Cluster | .shared::cluster | Distributed shared memory across CTAs in a cluster (SM 90+) | 32-bit or 64-bit | ~30-50 cycles (cross-CTA penalty over AS 3) | __shared__ with cluster scope |
| 25 | Internal device linkage | N/A | Not a physical memory -- NVVM IR annotation for __device__ linkage | N/A | N/A | Used internally by module summary for extern device resolution |
| 53 | Internal annotation | N/A | Not a physical memory -- used by MemorySpaceOpt for global tracking | N/A | N/A | Internal to cicc pipeline |
| 101 | Param | .param | Kernel parameter window (mapped into constant bank or global memory) | 64-bit | 4-8 cycles (constant cache path) | Kernel parameters (__global__ function args) |
Address space 2 is not used by NVPTX. The numbering gap between global (1) and shared (3) is inherited from upstream LLVM NVPTX conventions. The NVVM verifier's valid-AS check uses the formula ((AS + ~2) & 0xFFFFFF) > 2, which accepts AS values 0, 1, and 3 unconditionally; AS 2 is sometimes valid depending on context.
Aliasing Rules
The core property exploited by NVVM AA is hardware address space disjointness: pointers in different non-generic address spaces can never reference the same byte. NVVM AA (nvptx-aa) encodes this as a NoAlias rule for every cross-space pointer pair, with the following exceptions.
| Pointer A | Pointer B | Alias Result | Reason |
|---|---|---|---|
| AS 0 (generic) | Any | MayAlias | Generic can map to any physical space at runtime |
| AS X (same) | AS X (same) | MayAlias | Same space -- further analysis needed (BasicAA, TBAA) |
| AS 1 (global) | AS 101 (param) | MayAlias | cvta.param on SM 70+ makes param addressable as global |
| AS 3 (shared) | AS 7 (shared cluster) | MayAlias | Cluster shared memory overlaps with regular shared |
| AS X (non-generic) | AS Y (non-generic, X != Y) | NoAlias | Physically disjoint hardware memory partitions |
The NVVM AA algorithm (pseudocode from NVPTXAAResult::alias in cicc):
AliasResult alias(Loc1, Loc2):
AS1 = getAddressSpace(Loc1.Ptr, TraverseLimit) // walk through casts
AS2 = getAddressSpace(Loc2.Ptr, TraverseLimit)
if AS1 == 0 or AS2 == 0: return MayAlias // generic kills precision
if (AS1==3 and AS2==7) or (AS1==7 and AS2==3): return MayAlias
if (AS1==1 and AS2==101) or (AS1==101 and AS2==1): return MayAlias // cvta.param
if AS1 == AS2: return MayAlias // same space, need deeper AA
return NoAlias // different non-generic spaces
The getAddressSpace helper walks backward through getUnderlyingObject (stripping GEPs, bitcasts, PHIs) up to nvptx-traverse-address-aliasing-limit (default 6) levels deep, resolving generic pointers that were produced by addrspacecast from a specific space.
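For a quick sanity check of the rule table, the cross-space decision can be transcribed into a small C function. This is a sketch of the documented rules only, not cicc's NVPTXAAResult implementation, and it deliberately omits the getAddressSpace traversal:

```c
/* C transcription of the documented NVVM AA cross-space rules.
 * AS numbering: 0 = generic, 1 = global, 3 = shared, 4 = constant,
 * 5 = local, 7 = shared cluster, 101 = param. Sketch only. */
typedef enum { NoAlias = 0, MayAlias = 1 } AliasResult;

AliasResult nvvm_cross_space_alias(int as1, int as2) {
    if (as1 == 0 || as2 == 0)
        return MayAlias;                         /* generic kills precision */
    if ((as1 == 3 && as2 == 7) || (as1 == 7 && as2 == 3))
        return MayAlias;                         /* shared vs shared cluster */
    if ((as1 == 1 && as2 == 101) || (as1 == 101 && as2 == 1))
        return MayAlias;                         /* cvta.param on SM 70+ */
    if (as1 == as2)
        return MayAlias;                         /* same space: deeper AA */
    return NoAlias;                              /* disjoint partitions */
}
```

Running the table's rows through this function reproduces the documented results: shared vs. constant is NoAlias, anything involving generic is MayAlias, and the param/global and shared/cluster exceptions stay MayAlias.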
ModRef Rules
| Address Space | ModRef Mask | Meaning |
|---|---|---|
| AS 4 (constant) | NoModRef | Read-only -- never modified |
| AS 101 (param) | NoModRef | Kernel params are read-only from device code |
| All others | ModRef | May be both read and written |
These masks enable DSE to skip constant/param stores entirely, and LICM to hoist loads from constant memory without checking for intervening stores.
MemorySpaceOpt Internal Bitmask
MemorySpaceOpt (sub_1C70910) encodes address spaces as single-bit positions in a byte-wide bitmask for efficient dataflow computation. The mapping is performed in sub_1CA8CD0 via a switch on the LLVM address space ID:
| Bit | Value | LLVM AS | Name |
|---|---|---|---|
| 0 | 0x01 | 1 | Global |
| 1 | 0x02 | 3 | Shared |
| 2 | 0x04 | 4 | Constant |
| 3 | 0x08 | 5 | Local |
| 4 | 0x10 | 101 | Param |
| 0-3 | 0x0F | N/A | Unknown (union of global + shared + constant + local) |
// sub_1CA8CD0 — address space to bitmask
switch (addrspace) {
case 1: return 0x01; // global
case 3: return 0x02; // shared
case 4: return 0x04; // constant
case 5: return 0x08; // local
case 101: return 0x10; // param
default: return 0x0F; // unknown = union of all non-param
}
When multiple pointer sources contribute different address spaces (e.g., through PHI nodes or function arguments receiving pointers from different call sites), the bitmask is OR'd. A singleton bit (popcount == 1) means the space is fully resolved; multiple bits set means the pointer is ambiguous and requires either runtime isspacep or a conservative default to global.
Resolution Decision
Once the bitmask is computed for a pointer:
- Single bit set: Resolved. The pass inserts an addrspacecast from generic to the target space and replaces all uses.
- Multiple bits set, param bit included: If param-always-point-to-global is true (default), resolve to global. The rationale: kernel parameters always point into global device memory.
- Multiple bits set, no param: Ambiguous. Emit warning "Cannot tell what pointer points to, assuming global memory space" and default to global.
- Zero bits: Unreachable code or analysis error.
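The merge-and-decide logic can be sketched as follows. The resolve_space helper and its return convention are hypothetical; only the bit values mirror the documented sub_1CA8CD0 mapping:

```c
/* Bit values per the documented sub_1CA8CD0 mapping. */
enum { MS_GLOBAL = 0x01, MS_SHARED = 0x02, MS_CONST = 0x04,
       MS_LOCAL  = 0x08, MS_PARAM  = 0x10 };

static int popcount8(unsigned m) {
    int n = 0;
    for (; m; m >>= 1) n += m & 1;
    return n;
}

/* Hypothetical decision helper: returns the resolved LLVM AS for a
 * singleton mask, otherwise 1 (global) -- covering both the
 * param-always-point-to-global rule and the conservative default. */
int resolve_space(unsigned mask) {
    if (popcount8(mask) == 1) {
        switch (mask) {
        case MS_GLOBAL: return 1;
        case MS_SHARED: return 3;
        case MS_CONST:  return 4;
        case MS_LOCAL:  return 5;
        case MS_PARAM:  return 101;
        }
    }
    return 1;  /* ambiguous: default to global per the text */
}
```

Merging two pointer sources is a bitwise OR of their masks before calling resolve_space, so a PHI of a shared-derived and a local-derived pointer yields MS_SHARED | MS_LOCAL and falls into the ambiguous-to-global path.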
Relationship to EDG Frontend Encoding
The EDG frontend uses a separate encoding in the symbol table entry at offset +156/+157:
| EDG Bit | Value | Memory Space |
|---|---|---|
| +156 bit 0 | 0x01 | __device__ (any device placement) |
| +156 bit 1 | 0x02 | __shared__ |
| +156 bit 2 | 0x04 | __constant__ |
| +156 bit 4 | 0x10 | Read-only linkage flag |
| +157 bit 0 | 0x01 | __managed__ |
The EDG memory_space_code at offset +136 maps to LLVM address spaces during IR generation: code 1 (__device__) maps to AS 1, code 2 (__shared__) maps to AS 3, code 3 (__constant__) maps to AS 4.
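The code-to-AS mapping is small enough to state directly. This helper is a sketch of the documented mapping only; codes outside 1-3 are treated as unmapped here, which is an assumption for illustration:

```c
/* Sketch of the documented EDG memory_space_code -> LLVM AS mapping
 * applied during IR generation. Codes other than 1-3 return -1 here
 * (an assumption; the recovered mapping covers only these three). */
int edg_memspace_to_llvm_as(int code) {
    switch (code) {
    case 1:  return 1;   /* __device__   -> global   (AS 1) */
    case 2:  return 3;   /* __shared__   -> shared   (AS 3) */
    case 3:  return 4;   /* __constant__ -> constant (AS 4) */
    default: return -1;  /* not covered by the recovered mapping */
    }
}
```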
The Generic Address Space Problem
The generic (flat, AS 0) address space is the fundamental obstacle to alias precision on GPUs. When the EDG frontend or NVVM IR generator cannot determine which physical memory a pointer targets, it emits the pointer in AS 0. The hardware resolves generic addresses at runtime by checking whether the address falls within the shared memory window, the local memory window, or defaults to global -- a process that adds 4-8 cycles of latency per access.
For NVVM AA, a generic pointer forces MayAlias against every other pointer, destroying the disjointness guarantee and blocking optimizations in DSE, LICM, GVN, and MemorySSA. Three mechanisms address this:
1. MemorySpaceOpt (compile-time conversion). The two-phase inter-procedural pass resolves generic pointers by tracing them back to their allocation sites through use-def chains. When a generic pointer always derives from a __shared__ variable, the pass inserts addrspacecast to AS 3 and rewrites all uses. When different call sites disagree on the address space for the same argument, the pass clones the function into space-specialized versions. Every generic pointer resolved gives NVVM AA an additional NoAlias edge. Disabling this pass (-disable-MemorySpaceOptPass) causes 2-20x performance regressions.
2. AA address-space traversal. Even without MemorySpaceOpt, NVVM AA's getAddressSpace helper walks through addrspacecast chains. If %p was produced by addrspacecast i8 addrspace(3)* %s to i8*, the traversal discovers AS 3 despite %p being in AS 0 at the use site.
3. !noalias.addrspace metadata (kind 42). cicc attaches this metadata to instructions when address space information is known but the pointer itself remains generic. The AA evaluator detects this via opcode byte 0x4E ('N') and sets bit 2 in a pointer-tagged value (OR with 4), propagating disambiguation information through to AAResults::alias. This is a cicc-specific extension not found in upstream LLVM.
Data Layout Strings
The NVPTX data layout string encodes pointer widths and alignment for each address space. cicc produces three variants based on pointer width and shared memory pointer mode.
64-bit with shared memory specialization (most common production mode)
e-p:64:64:64-p3:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-
i128:128:128-f16:16:16-f32:32:32-f64:64:64-v16:16:16-v32:32:32-n16:32:64
64-bit without shared memory specialization
e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-
i128:128:128-f16:16:16-f32:32:32-f64:64:64-v16:16:16-v32:32:32-n16:32:64
32-bit mode
e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-
i128:128:128-f16:16:16-f32:32:32-f64:64:64-v16:16:16-v32:32:32-n16:32:64
Field-by-Field Breakdown
| Field | Meaning | NVIDIA Note |
|---|---|---|
| e | Little-endian | All NVIDIA GPUs |
| p:64:64:64 | Default pointer: 64-bit size, 64-bit ABI align, 64-bit preferred align | Applies to AS 0 (generic), AS 1 (global), AS 4 (constant), AS 101 (param) |
| p3:32:32:32 | AS 3 pointer: 32-bit size, 32-bit ABI align, 32-bit preferred align | Shared memory is on-chip, addressable with 32 bits even in 64-bit mode |
| i1:8:8 | Booleans stored as 8-bit | Standard |
| i128:128:128 | 128-bit integers: 128-bit aligned | Used by cmpxchg on global/shared |
| n16:32:64 | Native integer widths | PTX has 16-bit, 32-bit, and 64-bit register files |
| v16:16:16 / v32:32:32 | Vector alignment: natural | 16-bit vectors at 16-bit, 32-bit vectors at 32-bit |
Shared Memory 32-bit Pointer Optimization
The p3:32:32:32 entry is the most impactful NVIDIA delta in the data layout. Shared memory lives in 48-228 KB of on-chip SRAM per SM, addressable with 32-bit pointers even when the rest of the address space is 64-bit. Using 32-bit pointers for shared memory saves register pressure (one register instead of two for every shared pointer) and instruction count (32-bit arithmetic instead of 64-bit for every address calculation).
The optimization is controlled by three knobs that alias the same underlying global (unk_4D0461C):
| Knob | Source |
|---|---|
| nvptx-short-ptr | Backend option (ctor_609_0 at 0x585D30) |
| nvptx-32-bit-smem | Backend option (same constructor) |
| +sharedmem32bitptr | Target feature string (passed via -arch processing) |
When any of these is active, the data layout gains the p3:32:32:32 entry, and LLVM's type system treats all addrspace(3)* pointers as 32-bit. This is transparent to the rest of the compiler -- DataLayout queries like getPointerSizeInBits(3) return 32 automatically, and all pointer arithmetic in shared memory is lowered to 32-bit operations.
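The effect is visible by parsing the layout strings above. The pointer_size_bits helper below is a hypothetical illustration of the lookup order (per-AS entry, then default entry), not LLVM's DataLayout parser:

```c
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

/* Extract the pointer size in bits for an address space from an LLVM
 * data layout string: a "p<as>:" entry wins, else the default "p:"
 * entry, else 64. Hypothetical sketch, not LLVM's DataLayout class. */
int pointer_size_bits(const char *dl, int as) {
    char key[16];
    snprintf(key, sizeof key, "-p%d:", as);
    const char *hit = strstr(dl, key);
    if (hit) return atoi(hit + strlen(key));
    hit = strstr(dl, "-p:");       /* default pointer spec, e.g. "e-p:64..." */
    if (hit) return atoi(hit + 3);
    return 64;
}
```

With the shared-specialized string, pointer_size_bits(dl, 3) yields 32 while AS 1 falls through to the 64-bit default; without the p3:32:32:32 entry, shared pointers come out 64-bit as well.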
The same 32-bit treatment applies to local memory (AS 5) in practice: local stack addresses are within the per-thread frame and always fit in 32 bits. However, the data layout does not carry an explicit p5:32:32:32 entry -- the 32-bit treatment is enforced by the SelectionDAG lowering which uses AS 7 for stack operations.
Known-Bits Implications
The 32-bit address spaces have direct implications for the known-bits analysis (sub_BD5420):
| Address Space | Pointer Width | Known Bits Effect |
|---|---|---|
| AS 0 (generic) | 64-bit | Pointer alignment only |
| AS 1 (global) | 64-bit | Low 4 bits often known-zero (16-byte alignment typical) |
| AS 3 (shared) | 32-bit | Low 2 bits known-zero (4-byte minimum), bits [32,63] irrelevant |
| AS 4 (constant) | 64-bit | Low 2 bits known-zero (4-byte alignment) |
| AS 5 (local) | 32-bit effective | Low 2 bits known-zero (stack alignment), bits [32,63] irrelevant |
DemandedBits exploits the 32-bit address spaces to eliminate zero-extensions and truncations around shared/local address calculations, keeping all pointer arithmetic in 32-bit ALU operations. This interacts with IV Demotion (sub_18B1DE0), which narrows 64-bit induction variables to 32-bit where shared memory address calculations permit.
Data Layout Validation
The NVVM verifier (sub_2C80C90) validates the data layout string at multiple pipeline points:
- If empty: "Empty target data layout, must exist"
- If invalid: prints "Example valid data layout:" with reference strings from off_4C5D0A0 (32-bit) and off_4C5D0A8 (64-bit)
- A shortened compatibility form e-i64:64-v16:16-v32:32-n16:32:64 is used in the IR linker (sub_106AB30) to verify that two modules being linked share the same NVPTX target data layout.
Address Space Casts
NVPTX has strict rules for addrspacecast instructions, enforced by the NVVM verifier:
- At least one side must be generic (AS 0). Casting between two non-generic address spaces is prohibited: "Cannot cast non-generic pointer to different non-generic pointer". You must go through generic: addrspace(3) -> addrspace(0) -> addrspace(1).
- Source and target must be valid. The verifier rejects invalid address space IDs with "Invalid target address space" / "Invalid source address space".
- Alloca must be in generic. "Allocas are not supported on address spaces except Generic" -- alloca produces AS 0 pointers; MemorySpaceOpt later promotes them to AS 5.
- Tensor memory (AS 6) rejects load/store. "Tensor Memory loads/stores are not supported" -- AS 6 memory must be accessed through TMA intrinsics (cp.async.bulk.*), not regular load/store instructions.
- cmpxchg is restricted. "cmpxchg pointer operand must point to generic, global, or shared address space" -- atomic compare-exchange only supports AS 0, AS 1, and AS 3, with i32/i64/i128 operand types.
cvta Intrinsic Mapping
The PTX cvta (Convert Virtual Address) instructions are lowered through intrinsic IDs in the EDG frontend (sub_94A030):
| Intrinsic ID Range | Direction | Address Space |
|---|---|---|
| 0xC1 (193) | Generic -> Specific | Shared (AS 3) |
| 0xC2 (194) | Generic -> Specific | Constant (AS 4) |
| 0xC3 (195) | Generic -> Specific | Local (AS 5) |
| 0xC4 (196) | Generic -> Specific | Global (AS 1) |
| 0xC5 (197) | Specific -> Generic | Shared (AS 3) |
| 0xC6 (198) | Specific -> Generic | Constant (AS 4) |
| 0xC7 (199) | Specific -> Generic | Local (AS 5) |
| 0xC8 (200) | Specific -> Generic | Global (AS 1) |
The specific-to-generic direction emits addrspacecast (opcode 0x30). The generic-to-specific direction uses a store-to-temp followed by a load with the target address space annotation.
SelectionDAG Address Space Encoding
The SelectionDAG backend uses a secondary address space encoding for the .param passing convention. In sub_33B0210 (intrinsic lowering within the SelectionDAG), pointer arguments use this mapping:
| SelectionDAG Code | LLVM AS | PTX Space |
|---|---|---|
| 1 | 1 (global) | .global |
| 2 | 3 (shared) | .shared |
| 3 | 4 (constant) | .const |
| 4 | 5 (local) | .local |
| 5 | -- | .param (not a real AS, lowered to param window) |
| 7 | 7 (shared cluster) | .shared::cluster |
Stack operations (SelectionDAG opcode 16, StackAlloc) explicitly use AS 7 for the .param-like space when lowering stack frames via sub_33FF780(dag, ..., 7, 0, 1, 0).
Internal Address Spaces (Non-Physical)
AS 25 -- Device Linkage Annotation
Address space 25 is used by the module summary pass (sub_1C28690 in p2-H01-nvmodule-summary.txt) to tag functions and variables with __device__ linkage during inter-module resolution. When a function's type resolves to AS 25, it indicates the symbol has device-side linkage and requires device-side extern resolution. This address space never appears in emitted PTX -- it is consumed during linking and stripped before codegen.
AS 53 -- MemorySpaceOpt Global Annotation
During pass initialization (sub_1CAB590), MemorySpaceOpt filters module globals that carry address space 53 and registers them into internal tracking structures. This appears to be an annotation mechanism for marking globals that require special address space analysis. Like AS 25, this address space is internal and does not survive to PTX emission.
Shared Memory Specializations by SM Generation
| SM | Shared Memory Size | Cluster Support | AS 7 Available | Shared Memory Pointer |
|---|---|---|---|---|
| SM 70 (Volta) | 96 KB configurable with L1 | No | No | 32-bit (when +sharedmem32bitptr) |
| SM 80 (Ampere) | 164 KB configurable | No | No | 32-bit |
| SM 86 (Ampere GA10x) | 100 KB configurable | No | No | 32-bit |
| SM 89 (Ada) | 100 KB configurable | No | No | 32-bit |
| SM 90 (Hopper) | 228 KB configurable | Yes | Yes | 32-bit |
| SM 100 (Blackwell) | 228 KB configurable | Yes | Yes | 32-bit |
With SM 90+, __shared__ variables accessed with cluster scope use .shared::cluster (AS 7), which provides cross-CTA access within a cooperative thread array cluster. Regular intra-CTA shared access remains on AS 3 (.shared). The EarlyCSE pass (sub_2781BB6) detects AS 7 stores and applies conservative aliasing to prevent CSE across shared cluster barriers.
isspacep Intrinsics
The PTX isspacep instruction tests at runtime whether a generic pointer points to a specific address space. cicc represents these as intrinsics with builtin IDs 0xFD0-0xFD5:
| Builtin ID | PTX | Tests for |
|---|---|---|
| 0xFD0 | isspacep.global | Global (AS 1) |
| 0xFD1 | isspacep.shared | Shared (AS 3) |
| 0xFD2 | isspacep.local | Local (AS 5) |
| 0xFD3 | isspacep.const | Constant (AS 4) |
| 0xFD4 | isspacep.shared::cta | Shared CTA-local (AS 3, SM 90+) |
| 0xFD5 | isspacep.shared::cluster | Shared cluster (AS 7, SM 90+) |
MemorySpaceOpt's second-time resolver (sub_1CA9E90) folds these to compile-time constants when the pointer's address space is already known: isspacep.shared(%p) where %p is proven to be AS 3 folds to true. This eliminates runtime address space checks from conditional code patterns like:
if (__isShared(p))
atomicAdd_shared(p, val);
else
atomicAdd(p, val);
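The folding rule reduces to a three-way decision. The fold_isspacep helper below is a hypothetical transcription of the behavior described above, not the sub_1CA9E90 code:

```c
/* isspacep constant folding per the text: if the pointer's address
 * space is proven, the runtime test becomes a compile-time constant;
 * if it is still generic, the test must remain. Hypothetical sketch. */
typedef enum { FoldUnknown = -1, FoldFalse = 0, FoldTrue = 1 } FoldResult;

FoldResult fold_isspacep(int proven_as, int queried_as) {
    if (proven_as == 0)
        return FoldUnknown;   /* still generic: keep the runtime isspacep */
    return proven_as == queried_as ? FoldTrue : FoldFalse;
}
```

In the snippet above, once %p is proven to be AS 3, the __isShared guard folds to true, the else-arm becomes dead, and the branch disappears entirely.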
Configuration Knobs Affecting Address Spaces
| Knob | Default | Effect |
|---|---|---|
| nvptx-short-ptr | -- | Enable 32-bit pointers for shared/const/local |
| nvptx-32-bit-smem | -- | Same effect as above (alias) |
| param-always-point-to-global | true | Resolve ambiguous param pointers to global |
| mem-space-alg | 2 | Algorithm selection for MemorySpaceOpt (2 = default, others select alternate impl at sub_2CBBE90) |
| track-indir-load | true | Track pointers loaded from memory during address space analysis |
| track-int2ptr | true | Track inttoptr casts during analysis |
| nvptx-traverse-address-aliasing-limit | 6 | Max depth for NVVM AA getAddressSpace traversal |
| do-clone-for-ip-msp | -1 (unlimited) | Max function clones for inter-procedural specialization |
| process-alloca-always | true | Treat alloca as definite local (AS 5) |
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| MemorySpaceOpt pass entry | sub_1C70910 | -- | Mode dispatch, IP-MSP worklist driver |
| Per-BB instruction scanner | sub_1CA8CD0 | -- | AS-to-bitmask mapping switch |
| Use-def chain walker | sub_1CA5350 | -- | Backward pointer origin tracking |
| First-time resolver | sub_1CA2920 | -- | Conservative address space resolution |
| Second-time resolver | sub_1CA9E90 | -- | Hash-table-based resolution, isspacep folding |
| MemorySpaceCloning engine | sub_2CBBE90 | -- | Inter-procedural function cloning (71KB) |
| IPMSP module pass variant | sub_1C6A6C0 | -- | LIBNVVM path (54KB) |
| EDG cvta lowering | sub_94A030 | -- | Address space cast intrinsic generation |
| EDG decl-side memspace processing | sub_6582F0 | -- | CUDA attribute to memory space code resolution |
| EDG def-side memspace processing | sub_65F400 | -- | Definition validation and initializer handling |
| NVVMModuleVerifier | sub_2C80C90 | -- | Data layout and address space validation |
| NVVMIntrinsicVerifier | sub_2C7B6A0 | -- | Per-intrinsic address space constraint checking |
| SelectionDAG intrinsic lowering | sub_33B0210 | -- | Backend AS mapping for param passing |
| getPointerAlignmentBits | sub_BD5420 | -- | Known-bits for address space pointer widths |
| NVIDIA intrinsic known-bits oracle | sub_F0C4B0 | -- | Special register ranges |
Cross-References
- Memory Space Optimization -- Two-phase address space resolver, bitmask dataflow, function cloning
- IPMSP -- Inter-procedural memory space propagation, worklist algorithm
- Alias Analysis & NVVM AA -- Address space disjointness, AA chain, !noalias.addrspace
- NVPTX Target Infrastructure -- Data layout strings, +sharedmem32bitptr feature, TTI hooks
- KnownBits & DemandedBits -- Address space pointer width in known-bits, DemandedBits narrowing
- NVVM Verifier -- addrspacecast rules, tensor memory restriction, cmpxchg constraints
- EDG Frontend -- CUDA memory space attributes (__shared__, __constant__, __device__)
- SelectionDAG -- Backend address space encoding for param passing
- IV Demotion -- Exploits 32-bit shared memory pointers for induction variable narrowing
- EarlyCSE -- Shared cluster (AS 7) store handling
NVPTX Register Classes
This page is the single authoritative reference for the nine NVPTX register classes used throughout cicc v13.0. Register class tables previously duplicated in Register Allocation, Register Coalescing, PTX Emission, and AsmPrinter are consolidated here. When those pages reference register classes, they should cross-reference this page rather than maintaining inline copies.
| Register encoding | sub_21583D0 (4.6KB) |
| PTX type suffix map | sub_2163730 (1.7KB) |
| PTX prefix map | sub_21638D0 (1.6KB) |
| Copy opcode dispatch | sub_2162350 (3.0KB) |
| Register info init (legacy) | sub_2163AB0 / sub_2149CD0 |
| Register info init (new PM) | sub_30590F0 / sub_301F0C0 |
| Register decl emission | sub_2158E80 (17KB) |
| Internal-only class vtable | off_4A026E0 |
The Nine Register Classes
NVPTX defines nine register classes that participate in PTX code generation. Each class is identified at runtime by its vtable pointer, which sub_2163730 and sub_21638D0 use as a switch key to produce the PTX type suffix and register prefix respectively. The encoding function sub_21583D0 maps each class to a 4-bit tag that occupies bits [31:28] of the 32-bit encoded register ID.
| Tag | Vtable | Class Name | PTX Type | Prefix | Encoded ID | Width | Description |
|---|---|---|---|---|---|---|---|
| 1 | off_4A027A0 | Int1Regs | .pred | %p | 0x10000000 | 1 | Predicate (boolean) |
| 2 | off_4A02720 | Int16Regs | .b16 | %rs | 0x20000000 | 16 | Short integer |
| 3 | off_4A025A0 | Int32Regs | .b32 | %r | 0x30000000 | 32 | General-purpose integer |
| 4 | off_4A024A0 | Int64Regs | .b64 | %rd | 0x40000000 | 64 | Double-width integer |
| 5 | off_4A02620 | Float32Regs | .f32 | %f | 0x50000000 | 32 | Single-precision float |
| 6 | off_4A02520 | Float64Regs | .f64 | %fd | 0x60000000 | 64 | Double-precision float |
| 7 | off_4A02760 | Int16HalfRegs | .b16 | %h | 0x70000000 | 16 | Half-precision float (f16, bf16) |
| 8 | off_4A026A0 | Int32HalfRegs | .b32 | %hh | 0x80000000 | 32 | Packed pair (v2f16, v2bf16, v2i16, v4i8) |
| 9 | off_4A02460 | Int128Regs | .b128 | %rq | 0x90000000 | 128 | 128-bit wide (tensor core) |
Naming Discrepancy
Two naming conventions exist in the codebase, depending on whether the name was recovered from the emission functions or from the register allocator context:
| Vtable | Emission name (sub_2163730/sub_21638D0) | RA-context name (sub_2162350) | Resolution |
|---|---|---|---|
| off_4A02760 | Int16HalfRegs | Float16Regs | Same class. The emission functions use the TableGen-derived name Int16HalfRegs; the RA raw report uses the semantic alias Float16Regs. Both refer to off_4A02760. |
| off_4A026A0 | Int32HalfRegs | Float16x2Regs | Same class. Int32HalfRegs is the TableGen name; Float16x2Regs is the semantic alias. Both refer to off_4A026A0. |
| off_4A02460 | Int128Regs | SpecialRegs | Different raw reports assigned different names to off_4A02460. The emission report identifies it as Int128Regs (based on the .b128 type and %rq prefix). The earlier RA sweep report labeled it SpecialRegs. The emission-derived name Int128Regs is more accurate: .b128 / %rq is used for 128-bit tensor-core values (i128 on SM 70+), not for special/environment registers. |
The tenth vtable off_4A026E0 is present in the binary but returns "!Special!" from both sub_2163730 and sub_21638D0. It is never assigned an encoded ID and never participates in register declaration emission. It is an internal-only sentinel class used within NVPTXRegisterInfo initialization (string "ENVREG10" at register info offset +72).
Throughout this wiki, the emission-derived names (Int16HalfRegs, Int32HalfRegs, Int128Regs) are canonical. Pages written before this consolidation may use the RA-context aliases.
Register Encoding Scheme -- sub_21583D0
Every virtual register in the NVPTX backend is encoded as a 32-bit value that packs the register class and a per-class index into a single integer. The encoding function at sub_21583D0 (4.6KB) implements this:
encoded_register = class_tag | (register_index & 0x0FFFFFFF)
The bit layout:
31 28 27 0
+------+-------------------------------+
| class| register index |
| tag | (28 bits) |
+------+-------------------------------+
- Bits [31:28] -- 4-bit class tag, values 0x1 through 0x9 as listed in the table above.
- Bits [27:0] -- 28-bit register index within that class, supporting up to 268 million registers per class.
The function operates in two modes:
- Physical register (register_id >= 0): Returns the raw index directly (low 28 bits). Physical registers on NVPTX are a vestigial concept -- the target has no fixed register file -- but LLVM's infrastructure requires them for reserved registers like %SP and %SPL.
- Virtual register (register_id < 0, i.e., bit 31 set in LLVM's internal convention): Looks up the register class from the MachineRegisterInfo register map, matches the class vtable against the nine known vtable addresses, and returns class_encoded_id | (register_index & 0x0FFFFFFF).
If the vtable does not match any of the nine known classes, the function triggers a fatal error:
"Bad register class"
This is a hard abort, not a recoverable diagnostic. It indicates that either a new register class was added without updating the encoding function, or memory corruption produced an invalid vtable pointer.
Why Bits [31:28] and Not Bits [31:29]
LLVM's standard convention uses bit 31 (0x80000000) to distinguish physical from virtual registers internally. The NVPTX encoding reclaims this bit as part of the class tag because after encoding, the distinction between physical and virtual is no longer meaningful -- all registers in emitted PTX are virtual. Tag value 0x8 (Int32HalfRegs) has bit 31 set, which would collide with LLVM's virtual-register marker. This works because the encoding is applied only during emission, after register allocation is complete and the physical/virtual distinction is irrelevant.
Complete Class Separation
The nine register classes are completely disjoint. There is no cross-class interference: an Int32Regs register (%r) never conflicts with a Float32Regs register (%f) even though both are 32 bits wide. This is a fundamental consequence of PTX's typed register model. In PTX, .reg .b32 %r0 and .reg .f32 %f0 are distinct storage locations from ptxas's perspective. Two implications follow:
- No cross-class coalescing. The register coalescer at sub_34AF4A0 enforces a same-class check on every coalescing candidate. Cross-class copies (e.g., a bitcast from i32 to f32) must survive as explicit mov instructions in the emitted PTX.
- Per-class pressure accounting. The greedy register allocator at sub_2F5A640 tracks register pressure per class independently. The -maxreg limit bounds total live registers across all classes combined, but interference within any single class never spills over to another.
This is unlike CPU targets (x86, AArch64) where integer and floating-point registers can alias through sub-register relationships, or where a single physical register appears in multiple register classes.
Copy Opcodes -- sub_2162350
The function sub_2162350 (3.0KB, "Copy one register into another with a different width") dispatches copy instruction emission based on the source and destination register classes. Each class has two opcodes: one for same-class copies (e.g., mov.b32 %r1, %r0) and one for cross-class copies (e.g., bitcasting between Int32Regs and Float32Regs):
| Class | Same-Class Opcode | Cross-Class Opcode | Notes |
|---|---|---|---|
| Int1Regs | 39424 | 39424 | No distinct cross-class path |
| Int16Regs | 39296 | 39296 | No distinct cross-class path |
| Int32Regs | 39552 | 10816 | Cross = mov.b32 bitcast to float |
| Int64Regs | 39680 | 11008 | Cross = mov.b64 bitcast to double |
| Float32Regs | 30656 | 10880 | Cross = mov.b32 bitcast to integer |
| Float64Regs | 30784 | 11072 | Cross = mov.b64 bitcast to integer |
| Int16HalfRegs | 30528 | 10688 | Cross = mov.b16 half-to-short |
| Int32HalfRegs | 39552 | 39552 | Uses same opcode as Int32Regs same-class |
| Int128Regs | 39168 | 39168 | No distinct cross-class path |
Classes where both opcodes are identical (Int1Regs, Int16Regs, Int32HalfRegs, Int128Regs) have no meaningful cross-class copy path. For predicates (Int1Regs), this is because there is no other 1-bit type. For 128-bit registers, tensor-core values have no peer class to bitcast into. The Int32HalfRegs class shares its same-class opcode (39552) with Int32Regs because both emit .b32 copies -- the packed v2f16 value is simply treated as a 32-bit bitpattern for copying.
The five classes with distinct cross-class opcodes (Int32Regs, Int64Regs, Float32Regs, Float64Regs, Int16HalfRegs) are exactly those that participate in bitcast operations between integer and floating-point interpretations of the same bit width.
Register Declaration Emission -- sub_2158E80
During function body emission, sub_2158E80 (17KB) emits .reg declarations for every register class used by the function. The process:
- Iterate the register map at this+800 in the AsmPrinter state.
- Deduplicate classes using a hash table at this+808..832.
- Track the maximum index per class across all virtual registers.
- Emit one declaration per class in the format:
.reg .pred %p<5>; // 5 predicate registers (indices 0..4)
.reg .b16 %rs<12>; // 12 short integer registers
.reg .b32 %r<47>; // 47 general-purpose 32-bit
.reg .b64 %rd<8>; // 8 double-width integer
.reg .f32 %f<20>; // 20 single-precision float
.reg .f64 %fd<3>; // 3 double-precision float
.reg .b16 %h<4>; // 4 half-precision float
.reg .b32 %hh<2>; // 2 packed-pair registers
.reg .b128 %rq<1>; // 1 tensor-core 128-bit register
The count for each class is max_register_index + 1. The PTX declaration syntax %prefix<N> declares registers %prefix0 through %prefix(N-1).
Note that Int16HalfRegs and Int16Regs share the same PTX type suffix (.b16) but have different prefixes (%h vs %rs). Similarly, Int32HalfRegs and Int32Regs share .b32 but use %hh vs %r. The PTX assembler ptxas treats these as completely separate register namespaces -- the prefix, not the type, determines the namespace.
Stack pointer registers (%SP, %SPL) are emitted before the class declarations when the function has a non-zero local frame. These use .b64 in 64-bit mode or .b32 in 32-bit mode.
Per-Class Detail
Int1Regs -- Predicates
| Property | Value |
|---|---|
| Vtable | off_4A027A0 |
| PTX type | .pred |
| Prefix | %p |
| Tag | 0x1 |
| Width | 1 bit |
| Legal MVTs | i1 |
| Same-class copy | 39424 |
Predicate registers hold boolean values used for conditional branches (@%p1 bra target), select instructions (selp), and set-predicate results (setp). They are the only 1-bit registers in PTX. There is no cross-class copy path because no other class holds 1-bit values. The coalescer excludes predicates from cross-class analysis entirely.
Int16Regs -- Short Integers
| Property | Value |
|---|---|
| Vtable | off_4A02720 |
| PTX type | .b16 |
| Prefix | %rs |
| Tag | 0x2 |
| Width | 16 bits |
| Legal MVTs | i16 |
| Same-class copy | 39296 |
Short integer registers hold 16-bit integer values. PTX .param space widens all scalars below 32 bits to .b32, so %rs registers appear primarily in computation, not in function signatures. The prefix %rs (register-short) distinguishes these from %h (Int16HalfRegs) even though both declare as .b16.
Int32Regs -- General-Purpose 32-bit
| Property | Value |
|---|---|
| Vtable | off_4A025A0 |
| PTX type | .b32 |
| Prefix | %r |
| Tag | 0x3 |
| Width | 32 bits |
| Legal MVTs | i32 |
| Same-class copy | 39552 |
| Cross-class copy | 10816 |
The workhorse register class. Holds 32-bit integers, addresses in 32-bit mode, loop indices, and general computation results. Cross-class copy opcode 10816 handles bitcast to Float32Regs (%f).
Int64Regs -- Double-Width Integer
| Property | Value |
|---|---|
| Vtable | off_4A024A0 |
| PTX type | .b64 |
| Prefix | %rd |
| Tag | 0x4 |
| Width | 64 bits |
| Legal MVTs | i64 |
| Same-class copy | 39680 |
| Cross-class copy | 11008 |
Holds 64-bit integers and device pointers in 64-bit mode (the common case). Cross-class copy opcode 11008 handles bitcast to Float64Regs (%fd).
Float32Regs -- Single-Precision Float
| Property | Value |
|---|---|
| Vtable | off_4A02620 |
| PTX type | .f32 |
| Prefix | %f |
| Tag | 0x5 |
| Width | 32 bits |
| Legal MVTs | f32 |
| Same-class copy | 30656 |
| Cross-class copy | 10880 |
Holds IEEE 754 single-precision floats. Note the .f32 type suffix rather than .b32 -- PTX distinguishes float from bitwise register types even at the same width. Cross-class copy opcode 10880 handles bitcast to Int32Regs (%r).
Float64Regs -- Double-Precision Float
| Property | Value |
|---|---|
| Vtable | off_4A02520 |
| PTX type | .f64 |
| Prefix | %fd |
| Tag | 0x6 |
| Width | 64 bits |
| Legal MVTs | f64 |
| Same-class copy | 30784 |
| Cross-class copy | 11072 |
Holds IEEE 754 double-precision floats. Cross-class copy opcode 11072 handles bitcast to Int64Regs (%rd).
Int16HalfRegs -- Half-Precision Float
| Property | Value |
|---|---|
| Vtable | off_4A02760 |
| PTX type | .b16 |
| Prefix | %h |
| Tag | 0x7 |
| Width | 16 bits |
| Legal MVTs | f16, bf16 |
| Same-class copy | 30528 |
| Cross-class copy | 10688 |
Despite the Int16 in the TableGen-derived name, this class holds half-precision floating-point values (f16 and bf16). The .b16 PTX type (bitwise 16-bit) is used rather than a hypothetical .f16 because PTX's type system uses .b16 for all 16-bit values that are not short integers. The %h prefix distinguishes these registers from %rs (Int16Regs). Cross-class copy opcode 10688 handles conversion to Int16Regs.
The semantic alias Float16Regs appears in some wiki pages and is equally valid.
Int32HalfRegs -- Packed Half-Precision Pairs
| Property | Value |
|---|---|
| Vtable | off_4A026A0 |
| PTX type | .b32 |
| Prefix | %hh |
| Tag | 0x8 |
| Width | 32 bits |
| Legal MVTs | v2f16, v2bf16, v2i16, v4i8 |
| Same-class copy | 39552 |
| Cross-class copy | 39552 |
This is the only register class for vector types on NVPTX. It holds exactly 32 bits of packed data: two f16 values, two bf16 values, two i16 values, or four i8 values. The %hh prefix distinguishes it from %r (Int32Regs). Both same-class and cross-class copy opcodes are 39552 (identical to Int32Regs same-class), because copies of packed values are simple 32-bit bitwise moves.
All vector types wider than 32 bits (v4f32, v2f64, v8i32, etc.) are illegal on NVPTX and must be split or scalarized during type legalization. See the vector legalization documentation for the split/scalarize dispatch.
The semantic alias Float16x2Regs appears in some wiki pages.
Int128Regs -- 128-bit Tensor Core Values
| Property | Value |
|---|---|
| Vtable | off_4A02460 |
| PTX type | .b128 |
| Prefix | %rq |
| Tag | 0x9 |
| Width | 128 bits |
| Legal MVTs | i128 (SM 70+) |
| Same-class copy | 39168 |
| Cross-class copy | 39168 |
The widest register class, introduced for tensor core operations on Volta (SM 70) and later architectures. Holds 128-bit values used as operands and accumulators in mma and wmma instructions. The %rq prefix stands for "register quad" (4x32 bits). There is no cross-class copy path because no other class holds 128-bit values.
During register coalescing, 128-bit values are tracked as wide register pairs (two 64-bit halves). The coalescer at sub_3497B40 handles paired-register decomposition: when coalescing the low half, the high half inherits corresponding constraints.
An earlier raw report (p2c.5-01-register-alloc.txt) labeled off_4A02460 as SpecialRegs. This was an error in that report's identification. The vtable off_4A02460 emits .b128 / %rq, which is the 128-bit class for tensor core values, not a class for special/environment registers.
The Internal-Only Class -- off_4A026E0
| Property | Value |
|---|---|
| Vtable | off_4A026E0 |
| PTX type | "!Special!" |
| Prefix | "!Special!" |
| Encoded ID | None |
A tenth vtable address appears in the register info initialization path (sub_2163AB0). Both sub_2163730 and sub_21638D0 return the sentinel string "!Special!" for this vtable. It has no encoded ID, no PTX declaration, and never produces emitted registers. The string "ENVREG10" at register info offset +72 (alongside "Int1Regs" at offset +80) suggests this class is associated with environment registers -- hardware-defined read-only registers like %tid, %ctaid, %ntid, etc. These are emitted by dedicated special-register emission functions (sub_21E86B0, sub_21E9060) rather than through the register class encoding path.
Register Info Initialization
NVPTXRegisterInfo objects are created by two factory functions corresponding to the two pass manager generations:
| Legacy PM | New PM | |
|---|---|---|
| Factory | sub_2149CD0 | sub_301F0C0 |
| Init | sub_2163AB0 | sub_30590F0 |
| Object size | 224 bytes | 248 bytes |
Both call sub_1F4A910 (TargetRegisterInfo::InitMCRegisterInfo) with the register descriptor table at off_49D26D0 and register unit data at unk_4327AF0. Key fields in the initialized structure:
| Offset | Content |
|---|---|
| +44 | NumRegs (total register count) |
| +72 | "ENVREG10" (environment register class name) |
| +80 | "Int1Regs" (first register class name) |
| +96 | numRegClasses (initially 1, expanded during init) |
Coalescing Constraints
The register coalescer imposes these constraints based on register class:
| Class | Coalesceable | Constraint Flag (offset +3, mask 0x10) |
|---|---|---|
| Int1Regs | Same class only | Set |
| Int16Regs | Same class only | Set |
| Int32Regs | Same class only | Set (type code 12) |
| Int64Regs | Same class only | Set (type code 13) |
| Float32Regs | Same class only | Set (type code 15) |
| Float64Regs | Same class only | Set |
| Int16HalfRegs | Same class only | Set |
| Int32HalfRegs | Same class only | Set |
| Int128Regs | Never coalesced | Cleared |
Int128Regs (the class at off_4A02460, previously mislabeled SpecialRegs in the coalescing page) has its constraint flag cleared, excluding it from the coalescing worklist entirely. This makes sense: tensor-core 128-bit values have specific register-pair relationships that the coalescer must not disturb.
Cross-class copies between Int32Regs/Float32Regs and between Int64Regs/Float64Regs are bitcasts that the coalescer never eliminates -- they must survive as explicit PTX mov instructions because the source and destination live in different register namespaces.
Differences from Upstream LLVM NVPTX
The upstream LLVM NVPTX backend (as of LLVM 20.0.0) defines these register classes in NVPTXRegisterInfo.td:
- Int1Regs, Int16Regs, Int32Regs, Int64Regs -- identical.
- Float16Regs, Float16x2Regs -- upstream names for cicc's Int16HalfRegs / Int32HalfRegs. The rename reflects NVIDIA's preference for the TableGen-derived integer-typed names.
- Float32Regs, Float64Regs -- identical.
- Int128Regs -- present in upstream, matches cicc.
- No SpecialRegs class in upstream. Special registers are handled through dedicated physical registers, not a register class.
- No off_4A026E0 internal-only class in upstream.
The encoding scheme (4-bit tag in [31:28], 28-bit index in [27:0]) and the fatal "Bad register class" error path are NVIDIA additions not present in upstream LLVM's NVPTX backend, which relies on standard MCRegisterInfo encoding.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| Register class encoding (class tag OR index) | sub_21583D0 | 4.6KB | -- |
| Register class -> PTX type suffix (.pred, .b32, .f32, ...) | sub_2163730 | 1.7KB | -- |
| Register class -> PTX prefix (%p, %r, %f, ...) | sub_21638D0 | 1.6KB | -- |
| Copy opcode dispatch by register class | sub_2162350 | 3.0KB | -- |
| Stack frame + register declaration emission | sub_2158E80 | 17KB | -- |
| NVPTXRegisterInfo init (legacy PM) | sub_2163AB0 | 1.1KB | -- |
| NVPTXRegisterInfo factory (legacy PM) | sub_2149CD0 | -- | -- |
| NVPTXRegisterInfo init (new PM) | sub_30590F0 | -- | -- |
| NVPTXRegisterInfo factory (new PM) | sub_301F0C0 | -- | -- |
| TargetRegisterInfo::InitMCRegisterInfo | sub_1F4A910 | -- | -- |
| Special register emission (%tid, %ctaid, %ntid, %nctaid) | sub_21E86B0 | -- | -- |
| Cluster register emission (SM 90+) | sub_21E9060 | -- | -- |
Cross-References
- Register Allocation -- greedy RA that operates on these classes; pressure tracking and the -maxreg constraint
- Register Coalescing -- same-class-only coalescing policy, copy opcode classification
- PTX Emission -- function header orchestrator that calls the register declaration emitter
- AsmPrinter -- per-instruction emission that calls the encoding function
- Type Legalization -- vector type legalization driven by the Int32HalfRegs-only vector model
- NVPTX Target Infrastructure -- NVPTXTargetMachine that owns the register info objects
NVPTX Machine Opcode Reference
This page is the master reference for NVPTX MachineInstr opcodes as they exist in cicc v13.0. These are the target-specific opcode numbers assigned during instruction selection and consumed by register allocation, instruction scheduling, the AsmPrinter, and every other machine-level pass. They are distinct from both LLVM IR opcodes (which live in the Instruction hierarchy) and from ISD/NVPTXISD SelectionDAG node opcodes (which exist only during lowering and are erased by ISel). A MachineInstr's opcode field is the 16-bit value at MachineInstr offset +68, and it indexes into the MCInstrDesc table to obtain operand counts, constraint classes, implicit defs/uses, and scheduling information.
| Constraint table | word_3F3E6C0 (static .data array of 16-bit entries) |
| Constraint emitter | sub_B612D0 (104KB, 179-case switch) |
| Copy-type mapper | sub_3494EA0 (12.7KB, maps opcodes 1--0x12 to families 440--503) |
| Register class builder | sub_B5BA00 (21KB, 111 cases) |
| Operand type classifier | sub_34961A0 (26.6KB, reads byte_444C4A0) |
| ISel entry | sub_3090F90 (91KB, NVPTXDAGToDAGISel::Select) |
| Intrinsic lowering switch | sub_33B0210 (343KB, hundreds of NVVM intrinsics) |
Opcode Numbering Scheme
Opcodes 0--approximately 430 correspond to generic LLVM TargetOpcode values and standard LLVM machine pseudo-instructions (COPY, PHI, IMPLICIT_DEF, INLINEASM, etc.). These are identical to upstream LLVM 20.0.0. NVPTX target-specific opcodes begin around opcode 440 and extend into the thousands. The highest confirmed opcode numbers are in the 4900+ range (tcgen05 tensor core instructions for Blackwell).
The opcode numbering is generated by TableGen from the NVPTX .td instruction definitions and compiled into the MCInstrDesc table. Since cicc is a stripped binary, the symbolic names are lost. The identifications below come from behavioral analysis: matching the constraint table patterns, AsmPrinter string emission, and SelectionDAG lowering code against known PTX instruction semantics.
The Constraint Table: word_3F3E6C0
Every NVPTX machine opcode has an entry in the global constraint table at word_3F3E6C0. This is a flat array of 16-bit words, indexed by (opcode - 1). Each word packs two fields:
| Bits | Field | Purpose |
|---|---|---|
| [7:0] (low byte) | constraint_class | Index into the 179-case switch in sub_B612D0 |
| [15:8] (high byte) | register_class_id | Target register class for the instruction's primary result |
The access pattern, decompiled from sub_B612D0:
uint16_t entry = word_3F3E6C0[opcode - 1];
uint8_t constraint_class = entry & 0xFF; // low byte
uint8_t register_class = (entry >> 8) & 0xFF; // high byte
switch (constraint_class) {
case 0x00: ... // simple 2-input ALU
case 0x01: ... // 3-input FMA
...
case 0xB2: ... // maximum observed class
}
The constraint class determines how many operands the instruction has, what register class each operand belongs to, and which operands are tied. Each case in the switch constructs a stack-allocated array of 16-byte constraint descriptors (see Pattern Database for the full descriptor layout) and calls sub_A78010 to emit them.
179 Constraint Classes
The constraint classes range from 0x00 through 0xB2 (179 values). Each class represents a distinct operand signature. Representative patterns:
| Class Range | Pattern | Descriptor Count | Typical Instructions |
|---|---|---|---|
| 0x00--0x0F | Simple ALU (2 inputs, 1 output) | 3 | add, sub, mul, and, or, xor |
| 0x10--0x1F | Ternary (3 inputs, 1 output) | 4 | fma, madc, selp |
| 0x20--0x3F | Load/store variants | 2--5 | ld, st with address space and vector width |
| 0x40--0x5F | Conversion and move | 2--3 | cvt, mov, bitcast |
| 0x60--0x7F | Atomic and barrier | 3--6 | atom.*, membar, fence |
| 0x80--0x9F | Texture/surface | 4--12 | tex.*, sust.*, suld.* |
| 0xA0--0xAF | Tensor core (MMA) | 6--16 | hmma, imma, wmma, mma |
| 0xB0 | Maximum operand (17 inputs) | 18 | Complex intrinsic (opcode 176) |
| 0xB1--0xB2 | Miscellaneous high-operand-count | variable | Specialized instructions |
The maximum observed operand count is 17 (constraint class 0xB0, associated with opcode 176), requiring 18 descriptor entries (17 inputs + 1 output) and 288 bytes of stack space in the constraint emitter's frame.
Register Class IDs in the High Byte
The high byte of each word_3F3E6C0 entry identifies the register class for the instruction's result. These IDs map to NVPTX's typed virtual register files:
| ID | Register Class | PTX Type | PTX Prefix | Vtable Address |
|---|---|---|---|---|
| 14 | Int32Regs | .b32 | %r | off_4A025A0 |
| 22 | Int16Regs | .b16 | %rs | off_4A02720 |
| 40 | Float32Regs | .f32 | %f | off_4A02620 |
| 43 | Float16Regs | .b16 | %h | off_4A02760 |
| 50 | Int64Regs | .b64 | %rd | off_4A024A0 |
| 51 | Float64Regs | .f64 | %fd | off_4A02520 |
| 52 | Int128Regs | .b128 | %rq | off_4A02460 |
| 78 | PredRegs | .pred | %p | off_4A027A0 |
| 86 | SpecialRegs | (varies) | (varies) | off_4A026E0 |
Additional register class IDs observed in the constraint table (24, 27, 29, 32, 36, 39, 41, 67, 72, 76) likely correspond to sub-classes or aliased classes (e.g., Int32HalfRegs with ID related to 32 and prefix %hh), but their exact mappings have not been recovered. Instructions that produce no register result (stores, barriers, calls) have a zero or don't-care value in the high byte.
Identified Opcode Families
The following sections catalog every opcode range where the binary-to-PTX mapping has been confirmed. Opcodes are grouped by functional family. Where an opcode's identity is uncertain, it is marked with a question mark.
Copy and Move Family (440--503)
These are the NVPTX-specific copy instructions that the NVPTX register coalescer at sub_34AF4A0 processes. The standard LLVM RegisterCoalescer handles only the generic COPY pseudo (a generic TargetOpcode, not in this range); the NVPTX coalescer handles these target-specific copy families in a second pass.
The mapping function sub_3494EA0 contains a switch statement that classifies internal opcode IDs (1--0x12) into copy families:
| Opcode Range | Family | Description |
|---|---|---|
| 440--443 | Type-preserving moves | Same-class copies: i32-to-i32, i64-to-i64, f32-to-f32, f64-to-f64. These map from operand type codes 12, 13, 15 in the byte_444C4A0 classification table. |
| 444--470 (approx.) | Cross-class moves | Bitcasting copies between register classes (e.g., i32 to f32). These survive coalescing as explicit mov instructions in PTX because the source and destination register types differ. |
| 471--490 (approx.) | Paired/wide moves | 128-bit register pair copies for tensor core paths. The low and high halves are tracked jointly by sub_3497B40. |
| 491--503 (approx.) | ABI parameter copies | .param-related copies at call boundaries. These arise from the calling convention and are prime targets for coalescing. |
The byte_444C4A0 operand-type classification table (16-byte entries, indexed by MVT enum) feeds the coalescer's type check:
struct OperandTypeEntry { // 16 bytes at byte_444C4A0[16 * mvt - 16]
uint8_t type_code; // +0: 12=i32, 13=i64, 15=f32, etc.
uint8_t size_class; // +1: size in register-width units
uint8_t register_bank; // +2: bank identifier
uint8_t constraint_flags; // +3: bit 0x10 = participates in coalescing
uint8_t reserved[12]; // +4: padding
};
The constraint flag at offset +3 (mask 0x10) gates whether an operand participates in coalescing. Operands without this bit set (e.g., SpecialRegs) are excluded from the coalescer's worklist entirely.
Call ABI Family (505--573)
These opcodes implement the PTX .param-space calling convention. They are emitted by NVPTXTargetLowering::LowerCall (sub_3040BF0, 88KB) and form the backbone of every device-function call sequence.
| Opcode | Name | PTX Equivalent | Operands |
|---|---|---|---|
| 315 | CallSeqBegin | (pseudo) call frame setup | chain, seq_id, zero |
| 316 | CallSeqEnd_Outer | (pseudo) outer call frame teardown | chain, glue, callee_ref, callee_ref_hi |
| 505 | DeclareParam | .param .align A .b8 param[N] | chain, alignment, param_index, byte_size |
| 506 | DeclareScalarParam | .param .bW paramN | chain, alignment, param_index, widened_size |
| 507 | DeclareRetParam | .param .align A .b8 retval[N] | chain, alignment, byte_size, zero |
| 508 | DeclareRetScalarParam | .param .bW retval | chain, 1, widened_size, zero |
| 510 | CallDirect | call (retval), func, (params) | chain, callee, params... |
| 511 | CallDirectNoProto | call func, (params) (old-style) | chain, callee, params... |
| 512 | CallIndirect | call (retval), %rd, (params) | chain, func_ptr, params... |
| 513 | CallIndirectNoProto | call %rd, (params) | chain, func_ptr, params... |
| 514 | CallStart | (pseudo) actual call emission point | CallProto result |
| 515 | LoadRetParam | ld.param.bW retvalN | call_result, 1, element_index |
| 516 | LoadRetParamLast | ld.param.bW retvalN (last) | call_result, 1, element_index |
| 517 | CallSeqEnd | (pseudo) inner call frame teardown | last_load, chain, flag |
| 518 | CallProto | .callprototype | chain, callee, proto_string |
| 521 | DeclareRetParam_Ext | .param for return (ext path) | CallSeqEnd result, seq_id |
| 527 | StoreCalleeRetAddr | (pseudo) callee return addr | chain, proto_symbol |
| 528 | StoreRetValToParam | st.param.bW retvalN (return) | chain, value, offset |
The call sequence follows a strict emission order:
CallSeqBegin(315)
for each argument:
DeclareParam(505) or DeclareScalarParam(506)
StoreV1/V2/V4(571/572/573) — store argument values
DeclareRetParam(507) or DeclareRetScalarParam(508) [if callee returns]
CallProto(518)
CallStart(514) [actual call point]
for each return value:
LoadRetParam(515) or LoadRetParamLast(516)
CallSeqEnd(517)
DeclareRetParam_Ext(521) [if prototype present]
CallSeqEnd_Outer(316)
Vector Load/Store Family (568--573)
These opcodes handle vectorized .param-space data movement, emitted during argument passing and return value extraction:
| Opcode | Name | PTX Equivalent | Vector Width |
|---|---|---|---|
| 568 | LoadV1 | ld.param.b32 / ld.param.b64 | 1 element |
| 569 | LoadV2 | ld.param.v2.b32 / ld.param.v2.b64 | 2 elements |
| 570 | LoadV4 | ld.param.v4.b32 / ld.param.v4.b64 | 4 elements |
| 571 | StoreV1 | st.param.b32 / st.param.b64 | 1 element |
| 572 | StoreV2 | st.param.v2.b32 / st.param.v2.b64 | 2 elements |
| 573 | StoreV4 | st.param.v4.b32 / st.param.v4.b64 | 4 elements |
The vector width selection logic in LowerCall (sub_3040BF0, lines 1429--1440):
accumulated_operand_count == 3 -> StoreV1 (571), width=1
accumulated_operand_count == 4 -> StoreV2 (572), width=2
accumulated_operand_count == 6 -> StoreV4 (573), width=4
other -> fatal error (unreachable)
The same pattern applies to LoadV1/V2/V4 on the return path. These opcodes are also used for by-value struct argument decomposition, where the struct is stored element-by-element into .param space using 8-byte chunks via StoreV1(571).
Atomic Family (294--317, 462)
Atomic opcodes are emitted by sub_20BED60 during DAG legalization and emitted as PTX by sub_21E5E70 (base) and sub_21E6420 (L2-hinted variant for SM 80+):
| Opcode Range | PTX Instruction | Types |
|---|---|---|
| 294--297 | atom.add | f32, f64, i32, i64 |
| 302--305 | atom.min | s32, s64, u32, u64 |
| 314--317 | atom.max | s32, s64, u32, u64 |
| 462 | atom.cas | generic (compare-and-swap) |
Within the PTX emission layer, the atomic operation is encoded in a packed operand word:
| Bits | Field | Values |
|---|---|---|
| [7:4] | scope | 0=gpu (default), 1=cta, 2=sys |
| [23:16] (BYTE2) | operation | 0x00=exch, 0x01=add.u, 0x03=and, 0x05=or, 0x06=xor, 0x07=max.s, 0x08=min.s, 0x09=max.u, 0x0A=min.u, 0x0B=add.f, 0x0C=inc, 0x0D=dec, 0x0E=cas |
Note that operation codes 0x02 and 0x04 are unused -- there is no signed atomic add and no variant between and and or, matching the PTX ISA's set of atomic operations.
On Ampere (SM 80+), each atomic operation has an L2 cache-hinted variant emitted by sub_21E6420. The PTX format becomes atom[.scope].op.L2::cache_hint.type, instructing the GPU to retain or evict data in L2 after the atomic completes.
Barrier and Fence Family (287--290)
| Opcode | PTX Instruction | Scope |
|---|---|---|
| 287 | membar.gpu | GPU |
| 288 | membar.cta | CTA (thread block) |
| 289 | membar.sys | System |
| 290 | fence.sc.cluster | Cluster (SM 90+) |
The emission function sub_21E94F0 dispatches on the low 4 bits of the operand word. The fence.sc.cluster instruction requires SM 90 (Hopper) and provides sequentially-consistent fence semantics at cluster scope.
Cluster barrier instructions (SM 90+, emitted by sub_21E8EA0):
| Operand Encoding | PTX Instruction |
|---|---|
| bits[3:0]=0, bits[7:4]=0 | barrier.cluster.arrive |
| bits[3:0]=0, bits[7:4]=1 | barrier.cluster.arrive.relaxed |
| bits[3:0]=1, bits[7:4]=0 | barrier.cluster.wait |
| bits[3:0]=1, bits[7:4]=1 | barrier.cluster.wait.relaxed |
NVPTXISD Custom DAG Opcodes (22--499)
These are SelectionDAG-level opcodes used during lowering. After instruction selection, they are replaced by concrete MachineInstr opcodes. They are documented here because the DAG opcode numbers appear in the binary's lowering functions and serve as the conceptual identity of each instruction family:
| DAG Opcode | Identity | Notes |
|---|---|---|
| 22 | NVPTXISD::TargetAddr | Data pointer computation |
| 24 | NVPTXISD::Wrapper | Global address wrapping |
| 149 | NVPTXISD::ATOMIC_LOAD | Atomic load (lowered from IR atomic) |
| 152 | NVPTXISD::SELECT_CC | Conditional select (ternary) |
| 189 | NVPTXISD::MoveParam | Thread index and parameter moves |
| 193--196 | NVPTXISD::MIN/MAX | Min/max variants (2- and 3-source) |
| 197 | NVPTXISD::CTPOP | Population count |
| 198--204 | NVPTXISD::ConstantPool | Constant pool entry variants |
| 208 | NVPTXISD::CMPXCHG | Compare-and-exchange |
| 213--214 | NVPTXISD::STORE_SIGNED | Store with sign-extension flag |
| 215 | NVPTXISD::AddrSpaceCast | Address space conversion (within lowering) |
| 230 | NVPTXISD::DeclareLocal | Declare local variable / address of param |
| 233--234 | NVPTXISD::AddrSpaceCast pair | Two-step address space cast |
| 245--274 | NVPTXISD::MathOp_RN/RZ/RM/RP | Rounded math (add, mul, sqrt, div, fma) |
| 310 | NVPTXISD::Annotation | PTX .pragma annotation |
| 321 | NVPTXISD::StackRestore | Stack pointer restore |
| 322 | NVPTXISD::StackAlloc | Dynamic stack allocation |
| 330 | NVPTXISD::FunctionAddr | Function address (for indirect calls) |
| 335 | NVPTXISD::BinaryArith | Two-operand arithmetic |
| 371 | NVPTXISD::DynAreaOffset | Dynamic alloca offset |
| 499 | NVPTXISD::ConditionalBranch | Conditional branch with .param alloc |
The rounded math opcodes (245--274) follow a systematic pattern. The intrinsic lowering switch at sub_33B0210 maps NVVM intrinsic IDs to NVPTXISD opcodes:
| Intrinsic ID | NVPTXISD Opcode | PTX Operation |
|---|---|---|
| 63 | 249 | add.rz |
| 64 | 255 | mul.rz |
| 89 | 267 | fma.rz |
| 170 | 245 | add.rm |
| 172 | 274 | mul.rm |
| 250 | 271 | fma.rm |
| 308 | 270 | add.rp |
| 309 | 272 | mul.rp |
| 310 | 273 | fma.rp |
| 325 | 248 | sqrt.rz |
| 328 | 254 | sqrt.rm |
| 335 | 246 | sqrt.rp |
| 348 | 250 | div.rz |
| 349 | 256 | div.rm |
| 355 | 269 | div.rp |
MMA / Tensor Core Opcodes
Tensor core MachineInstr opcodes occupy a large range and are organized by generation. The central MMA instruction builder at sub_21E74C0 reads a packed 64-bit descriptor to determine the specific instruction variant.
Pre-Blackwell (SM 70--90) families:
| Function | Family | PTX Base | Min SM |
|---|---|---|---|
| sub_21E0360 | HMMA load A/B | wmma.load.a / wmma.load.b | 70 |
| sub_21E0630 | HMMA load C | wmma.load.c | 70 |
| sub_21DFBF0 | HMMA store C | wmma.store.c | 70 |
| sub_21E0870 | HMMA MMA | wmma.mma / mma | 70 |
| sub_21E1280 | IMMA load A/B | wmma.load.a (int) | 72 |
| sub_21E15D0 | IMMA load C | wmma.load.c (int) | 72 |
| sub_21E1830 | IMMA store C | wmma.store.c (int) | 72 |
| sub_21E1D20 | IMMA MMA | mma (integer, with saturation) | 72 |
| sub_21E2280 | BMMA MMA | mma (binary, b1.and.popc / b1.xor.popc) | 75 |
Each family exists in two copies: the AsmPrinter-side at 0x21Dxxxx--0x21Exxxx and the NVPTX backend-side at 0x36Exxxx.
Blackwell tcgen05 (SM 100+):
Opcodes 4905--4940 cover 10 shape variants of tcgen05.mma. The packed descriptor encodes:
| Bit | Field | Values |
|---|---|---|
| 0 | scaleD | 0 or 1 |
| 1 | negA | 0=positive, 1=negative |
| 2 | negB | 0=positive, 1=negative |
| 3 | transA | 0=normal, 1=transposed |
| 4 | transB | 0=normal, 1=transposed |
| 5 | sparsity | structured sparsity enable |
| [8:6] | type encoding | mxf4nvf4, i8, mxf8f6f4, f16, tf32, fp4, mxf4, bf16 |
Modifiers include block_scale, weight_stationary, and scaleInputAccumulator. The architecture gate is subtarget+340 >= 0x3E8 (decimal 1000, the internal encoding of SM 100).
MMA Shape and Type Encoding
The MMA instruction builder uses enumerated shape and type codes embedded in the packed descriptor:
Shape codes (bits [39:32]):
| Code | Shape | PTX String | Min SM |
|---|---|---|---|
| 0x01 | m8n8k4 | "m8n8k4" | 70 |
| 0x02 | m8n8k16 | "m8n8k16" | 72 |
| 0x03 | m8n8k32 | "m8n8k32" | 75 |
| 0x04 | m8n8k64 | "m8n8k64" | 75 |
| 0x05 | m8n8k128 | "m8n8k128" | 75 |
| 0x10 | m16n8k4 | "m16n8k4" | 80 |
| 0x11 | m16n8k8 | "m16n8k8" | 75 |
| 0x12 | m16n8k16 | "m16n8k16" | 80 |
| 0x13 | m16n8k32 | "m16n8k32" | 75 |
| 0x14 | m16n8k64 | "m16n8k64" | 75 |
| 0x15 | m16n8k128 | "m16n8k128" | 75 |
| 0x16 | m16n8k256 | "m16n8k256" | 75 |
| 0x17 | m16n16k16 | "m16n16k16" | 90 |
| 0x18 | m32n8k16 | "m32n8k16" | 90? |
| 0x19 | m16n16k8 | "m16n16k8" | 70 |
Data type codes (in aty/bty fields):
| Code | Type | Bits | PTX |
|---|---|---|---|
| 1 | b1 | 1 | "b1" |
| 2 | s4 | 4 | "s4" |
| 3 | u4 | 4 | "u4" |
| 4 | s8 | 8 | "s8" |
| 5 | u8 | 8 | "u8" |
| 6 | f16 | 16 | "f16" |
| 7 | bf16 | 16 | "bf16" |
| 8 | tf32 | 19 | "tf32" |
| 9 | f64 | 64 | "f64" |
| 10 | f32 | 32 | "f32" |
| 11 | s32 | 32 | "s32" |
Special Register Access
Special register read instructions map to PTX special registers. The AsmPrinter function sub_21E86B0 dispatches on a single-byte operand:
| Operand | Register | Description |
|---|---|---|
| 0x26 | %tid.x | Thread ID, X |
| 0x27 | %tid.y | Thread ID, Y |
| 0x28 | %tid.z | Thread ID, Z |
| 0x29 | %ntid.x | Block dimension, X |
| 0x2A | %ntid.y | Block dimension, Y |
| 0x2B | %ntid.z | Block dimension, Z |
| 0x2C | %ctaid.x | Block ID, X |
| 0x2D | %ctaid.y | Block ID, Y |
| 0x2E | %ctaid.z | Block ID, Z |
| 0x2F | %nctaid.x | Grid dimension, X |
| 0x30 | %nctaid.y | Grid dimension, Y |
| 0x31 | %nctaid.z | Grid dimension, Z |
| 0x5E | (dynamic) | %warpid / %laneid (via sub_3958DA0) |
| 0x5F | (dynamic) | %nwarpid or similar (via sub_3958DA0) |
Cluster special registers (SM 90+, sub_21E9060) add 15 registers: %is_explicit_cluster, %cluster_ctarank, %cluster_nctarank, %cluster_ctaid.{x,y,z}, %cluster_nctaid.{x,y,z}, %clusterid.{x,y,z}, %nclusterid.{x,y,z}.
Address Space Conversion
The cvta instruction family is emitted by sub_21E7FE0:
| Operand Value | Suffix | Full Instruction |
|---|---|---|
| 0 | (none) | cvta (generic) |
| 1 | .global | cvta.to.global / cvta.global |
| 3 | .shared | cvta.to.shared / cvta.shared |
| 4+ | .local | cvta.to.local / cvta.local |
Direction is determined by a separate operand: value 0 emits "a" (to-generic), value 1 emits "b" (to-specific).
Constraint Emission Pipeline
The full path from opcode to emitted constraint:
```
sub_B612D0(emitter_state, opcode):
    // Step 1: Table lookup
    entry = word_3F3E6C0[opcode - 1]
    reg_class = entry >> 8
    constraint_class = entry & 0xFF

    // Step 2: Build descriptor array on stack
    switch (constraint_class):
      case 0x00:
        // Simple 2-input ALU: {op0=RC, op1=RC, result=RC}
        desc[0] = {kind=0,  value=sub_A778C0(state, reg_class, flags)}
        desc[1] = {kind=1,  value=sub_A778C0(state, reg_class, flags)}
        desc[2] = {kind=-1, value=sub_B5BA00(state, reg_class)}
        sub_A78010(state, desc, 3)
      case 0x01:
        // Ternary FMA: {op0, op1, op2, result}
        desc[0..2] = three input constraints
        desc[3] = {kind=-1, value=sub_B5BA00(state, reg_class)}
        sub_A78010(state, desc, 4)
      ...
      case 0xB0:
        // 17-input complex: 17 input constraints + 1 output
        for i in 0..16:
            desc[i] = {kind=i, value=...}
        desc[17] = {kind=-1, value=sub_B5BA00(state, reg_class)}
        sub_A78010(state, desc, 18)
```
Key helper functions:
| Address | Function | Purpose |
|---|---|---|
| sub_A778C0 | createRegClassConstraint(state, regclass, flags) | Build input operand constraint for a specific register class |
| sub_A77AD0 | createAnyRegConstraint(state, flags) | Build an unconstrained ("any register") input constraint |
| sub_A79C90 | composeConstraints(state, desc, N) | Merge N descriptors into a single composite constraint |
| sub_B5BA00 | createOutputConstraint(state, regclass_id) | Build the output/result constraint |
| sub_A78010 | emitConstraint(state, desc_array, N) | Finalize and emit the constraint with N entries |
| sub_B612D0 | emitInstrConstraint(state, opcode) | Top-level entry: table lookup + switch + emit |
The constraint descriptors are purely stack-allocated within sub_B612D0's approximately 0x160-byte frame. No heap allocation occurs during constraint emission.
Complete Identified Opcode Summary
The following table consolidates every opcode where the binary-to-PTX mapping has been confirmed or strongly inferred. This represents a partial inventory -- the total opcode space extends to at least 4940, and many opcodes in the gaps (particularly in the load/store, texture, surface, and extended intrinsic ranges) remain unidentified.
| Opcode | Identity | Family | Evidence Source |
|---|---|---|---|
| 0--~430 | Generic LLVM TargetOpcode | LLVM standard | upstream LLVM 20.0.0 |
| 440--443 | Type-preserving moves | Copy | register coalescer (sub_3494EA0) |
| 444--503 | Cross-class / wide / ABI copies | Copy | register coalescer (sub_3494EA0) |
| 294--297 | atom.add (f32/f64/i32/i64) | Atomic | DAG legalization (sub_20BED60) |
| 302--305 | atom.min (s32/s64/u32/u64) | Atomic | DAG legalization (sub_20BED60) |
| 314--317 | atom.max (s32/s64/u32/u64) | Atomic | DAG legalization (sub_20BED60) |
| 315 | CallSeqBegin | Call ABI | LowerCall (sub_3040BF0) |
| 316 | CallSeqEnd_Outer | Call ABI | LowerCall |
| 462 | atom.cas | Atomic | DAG legalization |
| 499 | ConditionalBranch | Control | intrinsic lowering |
| 505 | DeclareParam | Call ABI | LowerCall |
| 506 | DeclareScalarParam | Call ABI | LowerCall |
| 507 | DeclareRetParam | Call ABI | LowerCall |
| 508 | DeclareRetScalarParam | Call ABI | LowerCall |
| 510 | CallDirect | Call ABI | LowerCall |
| 511 | CallDirectNoProto | Call ABI | LowerCall |
| 512 | CallIndirect | Call ABI | LowerCall |
| 513 | CallIndirectNoProto | Call ABI | LowerCall |
| 514 | CallStart | Call ABI | LowerCall |
| 515 | LoadRetParam | Call ABI | LowerCall |
| 516 | LoadRetParamLast | Call ABI | LowerCall |
| 517 | CallSeqEnd | Call ABI | LowerCall |
| 518 | CallProto | Call ABI | LowerCall |
| 521 | DeclareRetParam_Ext | Call ABI | LowerCall |
| 527 | StoreCalleeRetAddr | Call ABI | LowerCall |
| 528 | StoreRetValToParam | Call ABI | LowerCall |
| 568 | LoadV1 | Vector Param | LowerCall |
| 569 | LoadV2 | Vector Param | LowerCall |
| 570 | LoadV4 | Vector Param | LowerCall |
| 571 | StoreV1 | Vector Param | LowerCall |
| 572 | StoreV2 | Vector Param | LowerCall |
| 573 | StoreV4 | Vector Param | LowerCall |
| 4905--4940 | tcgen05.mma (10 shape variants) | Tensor Core | Blackwell emission (sub_21E8CD0) |
Gaps and Unknown Ranges
The following opcode ranges are known to contain NVPTX instructions but have not been fully mapped:
| Range | Likely Contents | Evidence |
|---|---|---|
| 430--439 | Transition zone (generic-to-target boundary) | Adjacent to copy family |
| 574--~800 | Global/shared/local loads and stores | Large gap between param-store and first identified general opcode |
| 800--~1500 | Texture and surface instructions | sub_33B0210 intrinsic switch references hundreds of tex/surf intrinsics |
| 1500--~3000 | Shuffle, vote, match, redux | Warp-level intrinsic families |
| 3000--~4000 | WGMMA, TMA, bulk operations | Hopper-era instruction families |
| 4000--4904 | Additional tensor/cluster instructions | Bridging pre-Blackwell and tcgen05 |
Recovering these ranges requires systematic analysis of the sub_33B0210 intrinsic lowering switch (343KB, the single largest function in the binary) and correlation with the AsmPrinter's printInstruction dispatch table.
Function Map
| Role | Function | Size |
|---|---|---|
| Constraint emission (179-case switch on word_3F3E6C0) | sub_B612D0 | 104KB |
| Register class set builder (111 cases) | sub_B5BA00 | 21KB |
| Operand type decoder (101 cases) | sub_B6B200 | 44KB |
| createRegClassConstraint(state, regclass, flags) | sub_A778C0 | -- |
| createAnyRegConstraint(state, flags) | sub_A77AD0 | -- |
| composeConstraints(state, desc, N) | sub_A79C90 | -- |
| emitConstraint(state, desc_array, N) | sub_A78010 | -- |
| Opcode-to-copy-type mapping (switch, families 440--503) | sub_3494EA0 | 12.7KB |
| Operand-type classification (reads byte_444C4A0) | sub_34961A0 | 26.6KB |
| Register-pair decomposition (wide/paired registers) | sub_3497B40 | 16.5KB |
| NVPTXTargetLowering::LowerCall (call ABI opcodes) | sub_3040BF0 | 88KB |
| Intrinsic lowering switch (NVVM intrinsic to opcode) | sub_33B0210 | 343KB |
| NVPTXDAGToDAGISel::Select (ISel entry) | sub_3090F90 | 91KB |
| MMA instruction builder (packed descriptor) | sub_21E74C0 | 17KB |
| Atomic operation PTX emission (base) | sub_21E5E70 | -- |
| L2 cache-hinted atomic PTX emission (SM 80+) | sub_21E6420 | -- |
| Memory barrier PTX emission | sub_21E94F0 | -- |
| Cluster barrier PTX emission (SM 90+) | sub_21E8EA0 | -- |
| Special register PTX emission | sub_21E86B0 | -- |
| Cluster special register PTX emission (SM 90+) | sub_21E9060 | -- |
| Address space conversion (cvta) PTX emission | sub_21E7FE0 | -- |
| tcgen05 Blackwell MMA emission (SM 100+) | sub_21E8CD0 | -- |
| Register class to encoded ID mapping | sub_21583D0 | -- |
| Register class to PTX type suffix | sub_2163730 | -- |
| Register class to PTX register prefix | sub_21638D0 | -- |
Global Data References
| Symbol | Address | Purpose |
|---|---|---|
| word_3F3E6C0 | 0x3F3E6C0 | Constraint table (16-bit entries, indexed by opcode-1) |
| byte_444C4A0 | 0x444C4A0 | MVT/operand type table (16-byte entries, indexed by MVT enum) |
| word_4456340 | 0x4456340 | MVT to vector element count (16-bit entries) |
| word_4456580 | 0x4456580 | MVT to scalarized MVT (16-bit entries) |
| byte_3F252E0 | 0x3F252E0 | Constraint type classification table |
| qword_502A920 | 0x502A920 | SM processor table (45 entries, stride-2) |
Cross-References
- Pattern Database -- detailed constraint descriptor layout and emission sub-functions
- Register Coalescing -- the NVPTX-specific coalescer that processes copy family opcodes 440--503
- Code Generation -- pipeline overview including ISel, RA, and machine-level passes
- InstrEmitter -- how SDNodes become MachineInstrs with these opcodes
- Register Allocation -- greedy RA that consumes constraint table data
- AsmPrinter -- the PTX emission layer that converts these opcodes to text
CLI Flag Inventory
cicc v13.0 accepts approximately 111 unique flag keys across five parsing sites, expanding to ~142 flag+value combinations when counting value variants, and ~169 when including all architecture triplets. Flags are parsed in sub_8F9C90 (real main), sub_900130 (LibNVVM path A), sub_12CC750/sub_9624D0 (LibNVVM option processors), and sub_12C8DD0 (flag catalog builder with 65 registered configurations).
The flag system is architecturally split into two layers: a hardcoded dispatch layer in the top-level parsers (sub_8F9C90, sub_900130, sub_12CC750/sub_9624D0) that handles mode selection, pass-through, LTO, and structural flags via strcmp/prefix-match chains; and a BST-backed catalog layer (sub_12C8DD0 + sub_95EB40/sub_12C8B40) that handles all flags whose effect is purely "store a value and forward strings to output vectors."
The Four Output Vectors
Every flag ultimately routes its effects into one or more of four output std::vector<std::string> buffers. These vectors are the sole interface between the CLI parser and the downstream pipeline stages:
| Vector | Seed | Output args | Downstream stage |
|---|---|---|---|
| v324 (lnk) | "lnk" | a5/a6 | Phase 1: Linker / IR-link (sub_906xxx) |
| v327 (opt) | "opt" | a7/a8 | Phase 2: Optimizer (LLVM opt / sub_12E54A0) |
| v330 (lto) | (none) | a9/a10 | Phase 3: LTO passes |
| v333 (llc) | "llc" | a11/a12 | Phase 4: LLC codegen |
Each vector element is a 32-byte std::string with SSO. At function exit (lines ~1462-1553 of sub_9624D0), each vector is serialized: count = (end - begin) >> 5, then malloc(8 * count) for the char** array, with each string individually malloc(len+1) + memcpy + null-terminated.
The lto vector receives no seed string and is only populated by explicit LTO flags (-Xlto, -olto, -gen-lto, -link-lto, --device-c, --force-device-c, host-ref flags) and the architecture string.
Mode Selection
The top-level entry point sub_8F9C90 sets a mode variable v263 that selects the compilation pipeline:
| Flag | Mode | Description |
|---|---|---|
-lgenfe | 1 | EDG C++ frontend (legacy genfe path) |
-libnvvm | 2 | LibNVVM API path |
-lnk | 3 | Linker path (forces keep=true) |
-opt | 4 | Optimizer-only path (forces keep=true) |
-llc | 6 | LLC backend-only path |
Within the LibNVVM option processors (sub_12CC750/sub_9624D0), the first argument is checked as a 4-byte or 8-byte integer for phase routing. Phase routing is stored at a1+240:
| argv[0] hex | String | Phase ID | a1+240 |
|---|---|---|---|
| 0x6B6E6C2D | -lnk | 1 | 1 |
| 0x74706F2D | -opt | 2 | 2 |
| 0x636C6C2D | -llc | 3 | 3 |
| 0x63766E2D | -nvc | 3 | 3 (alias) |
| 0x6D76766E62696C2D | -libnvvm | 4 | 4 |
When phase routing is active (a1+240 != 0), sub_95C880(phase_id, argc, argv, &count, &mode_flags) returns the allocated argv array for that single phase, stored directly into the corresponding output pair. When a1+240 == 0, mode flags default to 7 (all phases), and the full multi-phase option parsing loop runs.
The BST-Backed Flag Catalog
Catalog construction: sub_95EB40 / sub_12C8DD0
The function sub_95EB40(a1, cl_mode_flag) (standalone path) or sub_12C8DD0 (LibNVVM path) builds a std::map<std::string, OptionEntry> at a1+248. The underlying data structure is a C++ red-black tree (the standard library std::map implementation), with the tree root at a1+248, the sentinel/end node at a1+256, and the node count at a1+288.
Registration is performed by 65 calls to sub_95E8B0 + sub_95BF90 (standalone) or sub_12C8B40 (LibNVVM). Each call inserts one BST node.
BST node layout (168 bytes)
Each node in the red-black tree has the following layout:
| Offset | Size | Content |
|---|---|---|
| +0 | 24 | RB-tree metadata (color, parent, left, right pointers) |
| +32 | 32 | Key: flag name string (std::string with SSO) |
| +64 | 32 | lnk forwards: space-separated flags for lnk vector |
| +96 | 32 | opt forwards: space-separated flags for opt vector |
| +128 | 32 | llc forwards: space-separated flags for llc vector |
| +160 | 8 | Value pointer: points to the offset in the options structure where the flag's current value is stored |
BST lookup: sub_95D600 / sub_12C8530
When the main parsing loop encounters a flag string, it calls sub_95D600 (standalone) or sub_12C8530 (LibNVVM) to perform a standard std::map::lower_bound-style traversal of the red-black tree. The lookup compares the input flag string against registered key strings at node offset +32 using strcmp semantics. On match, the node's three forwarding strings (lnk/opt/llc) are split on spaces and appended to their respective output vectors.
Duplicate detection
Each BST node's value pointer points into the options structure. If the value storage already has a non-zero sentinel (the QWORD immediately following the 32-byte STR32 slot), the flag was already set. On duplicate:
"libnvvm : error: <flag> defined more than once"
Flags NOT in the catalog
The following flag categories are handled by hardcoded strcmp/prefix-match chains in the main parsing loop BEFORE the catalog lookup, and therefore bypass the BST entirely:
- Mode selection flags (-lnk, -opt, -llc, -nvc, -libnvvm)
- -Ofast-compile=<level> (parsed at lines ~690-833)
- Pass-through flags (-Xopt, -Xllc, -Xlnk, -Xlto)
- LTO flags (-lto, -gen-lto, -gen-lto-and-llc, -link-lto, -olto, -gen-opt-lto, --trace-lto)
- Device compilation flags (--device-c, --force-device-c, --partial-link)
- Host reference flags (-host-ref-{ec,eg,ek,ic,ig,ik})
- -maxreg=<N> (has its own duplicate-check logic at a1+1200)
- -split-compile=<N>, -split-compile-extended=<N> (at a1+1480 / a1+1488)
- -opt-passes=<pipeline> (at a1+1512 / a1+1520)
- -discard-value-names=<0|1> (complex multi-phase interaction)
- -time-passes (must be sole flag; unsupported in LibNVVM API path)
- -cl-mode (sets v278=1, affects routing for -prec-div, -fast-math, -prec-sqrt)
- -jump-table-density=<N> (forwarded directly to llc)
- -jobserver (forwarded to opt)
- --emit-optix-ir (disables ip-msp + licm, sets a13=0x43)
- --nvvm-64, --nvvm-32 (handled in sub_95C230)
If none of the hardcoded checks match and the BST lookup also fails, the flag falls through to the catchall entry at options structure offset +1256, which triggers:
"libnvvm : error: <flag> is an unsupported option"
Complete Flag-to-Pipeline Vector Routing Table
The table below documents every flag's routing from user input to the four output vectors. "Store" indicates the options structure offset where the value is recorded. Flags marked with [BST] are registered in the catalog; flags marked with [HC] are hardcoded in the parsing loop.
Architecture Flags [BST]
All 23 architecture entries share options structure offset +552 and follow the same 3-column pattern:
| User flag | lnk vector | opt vector | llc vector |
|---|---|---|---|
| -arch=compute_75 | -R __CUDA_ARCH=750 | -opt-arch=sm_75 | -mcpu=sm_75 |
| -arch=compute_80 | -R __CUDA_ARCH=800 | -opt-arch=sm_80 | -mcpu=sm_80 |
| -arch=compute_86 | -R __CUDA_ARCH=860 | -opt-arch=sm_86 | -mcpu=sm_86 |
| -arch=compute_87 | -R __CUDA_ARCH=870 | -opt-arch=sm_87 | -mcpu=sm_87 |
| -arch=compute_88 | -R __CUDA_ARCH=880 | -opt-arch=sm_88 | -mcpu=sm_88 |
| -arch=compute_89 | -R __CUDA_ARCH=890 | -opt-arch=sm_89 | -mcpu=sm_89 |
| -arch=compute_90 | -R __CUDA_ARCH=900 | -opt-arch=sm_90 | -mcpu=sm_90 |
| -arch=compute_90a | -R __CUDA_ARCH=900 | -opt-arch=sm_90a | -mcpu=sm_90a |
| -arch=compute_100 | -R __CUDA_ARCH=1000 | -opt-arch=sm_100 | -mcpu=sm_100 |
| -arch=compute_100a | -R __CUDA_ARCH=1000 | -opt-arch=sm_100a | -mcpu=sm_100a |
| -arch=compute_100f | -R __CUDA_ARCH=1000 | -opt-arch=sm_100f | -mcpu=sm_100f |
| -arch=compute_103 | -R __CUDA_ARCH=1030 | -opt-arch=sm_103 | -mcpu=sm_103 |
| -arch=compute_103a | -R __CUDA_ARCH=1030 | -opt-arch=sm_103a | -mcpu=sm_103a |
| -arch=compute_103f | -R __CUDA_ARCH=1030 | -opt-arch=sm_103f | -mcpu=sm_103f |
| -arch=compute_110 | -R __CUDA_ARCH=1100 | -opt-arch=sm_110 | -mcpu=sm_110 |
| -arch=compute_110a | -R __CUDA_ARCH=1100 | -opt-arch=sm_110a | -mcpu=sm_110a |
| -arch=compute_110f | -R __CUDA_ARCH=1100 | -opt-arch=sm_110f | -mcpu=sm_110f |
| -arch=compute_120 | -R __CUDA_ARCH=1200 | -opt-arch=sm_120 | -mcpu=sm_120 |
| -arch=compute_120a | -R __CUDA_ARCH=1200 | -opt-arch=sm_120a | -mcpu=sm_120a |
| -arch=compute_120f | -R __CUDA_ARCH=1200 | -opt-arch=sm_120f | -mcpu=sm_120f |
| -arch=compute_121 | -R __CUDA_ARCH=1210 | -opt-arch=sm_121 | -mcpu=sm_121 |
| -arch=compute_121a | -R __CUDA_ARCH=1210 | -opt-arch=sm_121a | -mcpu=sm_121a |
| -arch=compute_121f | -R __CUDA_ARCH=1210 | -opt-arch=sm_121f | -mcpu=sm_121f |
Note: the a and f sub-variants share the base SM number for __CUDA_ARCH (e.g., sm_100a and sm_100f both emit __CUDA_ARCH=1000) but get distinct -opt-arch= and -mcpu= strings. The architecture string is also stored into the lto vector via sub_95D700, preserving the full -arch=compute_XX string.
Architecture validation bitmask
Architecture is validated at a1+8 using bitmask 0x60081200F821:
```
offset = SM_number - 75
if (offset > 0x2E || !_bittest64(&mask, offset))   // mask = 0x60081200F821
    -> ERROR: "is an unsupported option"
```
Valid bit positions:
| Bit | SM | Generation |
|---|---|---|
| 0 | 75 | Turing |
| 5 | 80 | Ampere |
| 11 | 86 | Ampere |
| 12 | 87 | Jetson Orin |
| 13 | 88 | Ada |
| 14 | 89 | Ada Lovelace |
| 15 | 90 | Hopper |
| 25 | 100 | Blackwell |
| 28 | 103 | Blackwell+ |
| 35 | 110 | Post-Blackwell |
| 45 | 120 | Next-gen |
| 46 | 121 | Next-gen |
Maximum offset: 0x2E = 46 (SM 121). All pre-Turing architectures (SM 70 and below) are rejected.
Architecture specification forms
Architecture can be specified in many forms, all converging to a numeric SM value. Trailing a or f suffixes are stripped before numeric parsing. On parse failure: "Unparseable architecture: <val>".
| Form | Example | Source |
|---|---|---|
| -arch <val> | -arch sm_90 | sub_8F9C90 |
| -arch<val> | -archsm_90 | sub_8F9C90 (compact) |
| --nv_arch <val> | --nv_arch sm_100a | sub_8F9C90 |
| -mcpu=sm_<N> | -mcpu=sm_90 | LLVM-style |
| -opt-arch=sm_<N> | -opt-arch=sm_90 | Optimizer |
| -arch=compute_<N> | -arch=compute_100 | Compute capability |
| __CUDA_ARCH=<N> | __CUDA_ARCH=900 | Raw define |
Hex-encoded flag checks in sub_8F9C90:
- 0x6D733D7570636D2D = -mcpu=sm
- 0x6372612D74706F2D = -opt-arc
- 0x6F633D686372612D = -arch=co
- 0x6372615F766E2D2D = --nv_arc
Optimization Level Flags
| User flag | Type | Store | lnk | opt | llc | Default |
|---|---|---|---|---|---|---|
| -opt=0 | [BST] | +392 | -- | -- | -- | |
| -opt=1 | [BST] | +392 | -- | -- | -- | |
| -opt=2 | [BST] | +392 | -- | -- | -- | |
| -opt=3 | [BST] | +392 | -- | -- | -- | default |
| -Osize | [BST] | +488 | -- | -Osize | -Osize | off |
| -Om | [BST] | +520 | -- | -Om | -Om | off |
| -disable-allopts | [BST] | +424 | -lnk-disable-allopts | -opt-disable-allopts | -llc-disable-allopts | off |
| -disable-llc-opts | [BST] | +840 | -- | -- | -- | off |
The -opt=<N> flags do not directly emit to any vector at registration time. Instead, at the routing stage (lines 1444-1563 of sub_9624D0), the optimization level drives one of three code paths:
- Custom pipeline set (a1+1520 != 0): emits -passes=<pipeline_string> to opt vector
- Normal mode (a1+1520 == 0, a1+1640 == 0): emits -O<level> to opt vector
- Fast-compile mode (a1+1640 != 0): emits -optO<level> + -llcO2 to llc vector
Floating Point Control Flags
| User flag | Type | Store | lnk | opt | llc | Default |
|---|---|---|---|---|---|---|
| -ftz=0 | [BST] | +584 | -- | -- | -- | default |
| -ftz=1 | [BST] | +584 | -R __CUDA_FTZ=1 | -nvptx-f32ftz | -nvptx-f32ftz | |
| -prec-sqrt=0 | [BST] | +616 | -- | -- | -nvptx-prec-sqrtf32=0 | CL default |
| -prec-sqrt=1 | [BST] | +616 | -R __CUDA_PREC_SQRT=1 | -- | -nvptx-prec-sqrtf32=1 | CUDA default |
| -prec-div=0 (CL) | [BST] | +648 | -- | -opt-use-prec-div=false | -nvptx-prec-divf32=0 | |
| -prec-div=0 (CUDA) | [BST] | +648 | -- | -opt-use-prec-div=false | -nvptx-prec-divf32=1 | |
| -prec-div=1 (CL) | [BST] | +648 | -- | -opt-use-prec-div=true | -nvptx-prec-divf32=1 | |
| -prec-div=1 (CUDA) | [BST] | +648 | -R __CUDA_PREC_DIV=1 | -opt-use-prec-div=true | -nvptx-prec-divf32=2 | default |
| -prec-div=2 | [BST] | +648 | -- | -- | -nvptx-prec-divf32=3 | |
| -fma=0 | [BST] | +680 | -- | -- | -nvptx-fma-level=0 | |
| -fma=1 | [BST] | +680 | -- | -- | -nvptx-fma-level=1 | default |
| -enable-mad | [BST] | +712 | -- | -- | -nvptx-fma-level=1 | off |
| -opt-fdiv=0 | [BST] | +456 | -- | -opt-fdiv=0 | -- | default |
| -opt-fdiv=1 | [BST] | +456 | -- | -opt-fdiv=1 | -- | |
| -no-signed-zeros | [BST] | +1160 | -- | -opt-no-signed-zeros | -- | off |
Note on -prec-div: the CUDA vs CL distinction is controlled by the magic cookie a4 (0xABBA = CUDA, 0xDEED = OpenCL). CUDA -prec-div=1 maps to -nvptx-prec-divf32=2 (IEEE-correct division), while CL maps to level 1 (software approximation). When -prec-div=0 is set under CUDA, it still maps to -nvptx-prec-divf32=1 (not 0), because CUDA never drops below software approximation.
Fast Math Aggregate Flags
| User flag | Type | Store | lnk | opt | llc |
|---|---|---|---|---|---|
| -unsafe-math | [BST] | +744 | -R FAST_RELAXED_MATH=1 -R __CUDA_FTZ=1 | -opt-use-fast-math -nvptx-f32ftz | -nvptx-fma-level=1 -nvptx-f32ftz |
| -fast-math (CL) | [BST] | +776 | -R FAST_RELAXED_MATH=1 -R __CUDA_FTZ=1 | -opt-use-fast-math -nvptx-f32ftz | -nvptx-f32ftz |
| -fast-math (CUDA) | [BST] | +776 | -R __CUDA_USE_FAST_MATH=1 | -opt-use-fast-math | -- |
-unsafe-math always sets FTZ in the backend (-nvptx-f32ftz), while CUDA -fast-math does not touch the backend FTZ flag -- it only sets the preprocessor define and the optimizer flag.
Debug and Diagnostic Flags
| User flag | Type | Store | lnk | opt | llc | Default |
|---|---|---|---|---|---|---|
| -g | [BST] | +296 | -debug-compile | -debug-compile | -- | off |
| -generate-line-info | [BST] | +328 | -- | -generate-line-info | -- | off |
| -no-lineinfo-inlined-at | [BST] | +360 | -- | -- | -line-info-inlined-at=0 | off |
| -show-src | [BST] | +808 | -- | -- | -nvptx-emit-src | off |
| -enable-verbose-asm | [BST] | +1224 | -- | -- | -asm-verbose | off |
| -w | [BST] | +872 | -- | -w | -w | off |
| -Werror | [BST] | +904 | -- | -Werror | -Werror | off |
| -debug-compile | [BST] | +296 | -- | -debug-compile | -- | off |
| -line-info-inlined-at=0 | alias | -- | -- | -- | -line-info-inlined-at=0 | off |
| -inline-info | [HC] | -- | -- | -pass-remarks=inline -pass-remarks-missed=inline -pass-remarks-analysis=inline | -- | off |
Inlining and Function Flags
| User flag | Type | Store | lnk | opt | llc | Default |
|---|---|---|---|---|---|---|
| -disable-inlining | [BST] | +1064 | -- | -disable-inlining | -- | off |
| -aggressive-inline | [BST] | +1608 | -- | -inline-budget=40000 | -- | off |
| -restrict | [BST] | +1096 | -- | -- | -nvptx-kernel-params-restrict | off |
| -allow-restrict-in-struct | [BST] | +1128 | -- | -allow-restrict-in-struct | -allow-restrict-in-struct | off |
| -enable-opt-byval | [BST] | +1032 | -- | -enable-opt-byval | -- | off |
Optimization Control Flags
| User flag | Type | Store | lnk | opt | llc | Default |
|---|---|---|---|---|---|---|
| -opt-disable-allopts | derived | -- | -- | -opt-disable-allopts | -- | off |
| -lnk-disable-allopts | derived | -- | -lnk-disable-allopts | -- | -- | off |
| -llc-disable-allopts | derived | -- | -- | -- | -llc-disable-allopts | off |
These three are emitted by -disable-allopts (see above); they do not exist as independent user flags.
Rematerialization Flags
| User flag | Type | Store | lnk | opt | llc |
|---|---|---|---|---|---|
| -vasp-fix | [BST] | +1352 | -- | -- | -vasp-fix1=true -vasp-fix2=true |
| -new-nvvm-remat | [BST] | +1384 | -- | -- | -enable-new-nvvm-remat=true -nv-disable-remat=true -rp-aware-mcse=true |
| -disable-new-nvvm-remat | [BST] | +1416 | -- | -- | -enable-new-nvvm-remat=false -nv-disable-remat=false -rp-aware-mcse=false |
| -disable-nvvm-remat | [BST] | +1448 | -- | -- | -enable-new-nvvm-remat=false -nv-disable-remat=true -rp-aware-mcse=false |
These are multi-flag compound emissions. Note the subtle difference: -disable-nvvm-remat disables both the new and classic rematerializers (-nv-disable-remat=true), while -disable-new-nvvm-remat disables only the new rematerializer and leaves the classic one enabled (-nv-disable-remat=false); both additionally turn off register-pressure-aware MCSE (-rp-aware-mcse=false).
Analysis and Transform Control Flags
| User flag | Type | Store | lnk | opt | llc |
|---|---|---|---|---|---|
| -no-aggressive-positive-stride-analysis | [BST] | +1544 | -- | -aggressive-positive-stride-analysis=false | -- |
| disable-load-select-transform | [BST] | +1576 | -- | -disable-load-select-transform=true | -- |
Note: disable-load-select-transform is registered WITHOUT a leading - in the catalog.
Pass-Through (Forwarding) Flags [HC]
| Flag | Target vector | Special handling |
|---|---|---|
-Xopt <arg> | opt | If <arg> starts with -opt-discard-value-names=, extracts value; if "1", sets v276=false |
-Xllc <arg> | llc | None |
-Xlnk <arg> | lnk | If <arg> starts with -lnk-discard-value-names=, extracts value; if "1", sets v275=false |
-Xlto <arg> | lto | If <arg> starts with -lto-discard-value-names=, extracts value; if "1", sets v282=false |
Each consumes the next argument from argv.
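The consume-next-argument behavior can be restated as a minimal scanner. This is an illustrative sketch under our own naming, not decompiled code, and it omits the -discard-value-names special cases listed above:

```python
# Illustrative sketch of the -Xopt/-Xllc/-Xlnk/-Xlto pass-through handling.
# Each flag consumes the following argv entry and appends it verbatim
# to the matching phase vector.
def route_passthrough(argv):
    vectors = {"opt": [], "llc": [], "lnk": [], "lto": []}
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg in ("-Xopt", "-Xllc", "-Xlnk", "-Xlto") and i + 1 < len(argv):
            phase = arg[2:].lower()          # "opt", "llc", "lnk", or "lto"
            vectors[phase].append(argv[i + 1])  # consume the next argument
            i += 2
        else:
            i += 1
    return vectors
```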
LTO Flags [HC]
| User flag | a13 bitmask effect | lto vector | Notes |
|---|---|---|---|
-lto | (a13 & 0x300) \| 0x23 | -- | Full LTO mode |
-gen-lto | (a13 & 0x300) \| 0x21 | -gen-lto | Emit LTO bitcode |
-gen-lto-and-llc | a13 \|= 0x20 | -gen-lto | Emit LTO + run LLC |
-link-lto | (a13 & 0x300) \| 0x26 | -link-lto | Link LTO modules |
-olto | -- | -olto + argv[i+1] | Takes next arg as LTO opt level |
-gen-opt-lto | sets v280=1 | -- | Affects lowering at end of parsing |
--trace-lto | -- | --trace | LTO tracing |
Device Compilation Flags [HC]
| User flag | lto vector |
|---|---|
--device-c | --device-c |
--force-device-c | --force-device-c |
--partial-link | (no-op, consumed but not forwarded) |
Host Reference Flags [HC]
| User flag | lto vector |
|---|---|
-host-ref-ek=<val> | -host-ref-ek=<val> |
-host-ref-ik=<val> | -host-ref-ik=<val> |
-host-ref-ec=<val> | -host-ref-ec=<val> |
-host-ref-ic=<val> | -host-ref-ic=<val> |
-host-ref-eg=<val> | -host-ref-eg=<val> |
-host-ref-ig=<val> | -host-ref-ig=<val> |
-has-global-host-info | -has-global-host-info |
Pipeline Control Flags [HC]
| User flag | Store | Routing | Default |
|---|---|---|---|
-opt-passes=<pipeline> | +1512 | opt: -passes=<pipeline> (overrides -O<N>) | unset |
-passes=<pipeline> | -- | opt: -passes=<pipeline> (sub_9624D0 only) | unset |
-lsa-opt=0 | -- | opt: -lsa-opt=0 | generated by -Ofast-compile=max or CL-mode |
-memory-space-opt=0 | -- | opt: -memory-space-opt=0 | generated by -Ofast-compile=max |
-memory-space-opt=1 | -- | opt: -memory-space-opt=1 | generated when opt level allows |
-rox-opt=0 | -- | opt: -rox-opt=0 | generated when -prec-div=0 or -prec-sqrt=0 (non-CL) |
-do-ip-msp=<0\|1> | -- | opt: -do-ip-msp=<val> | |
-do-licm=<0\|1> | -- | opt: -do-licm=<val> | |
-optimize-unused-variables | -- | lto: -optimize-unused-variables | off |
Ofast-compile Levels [HC]
Stored at a1+1640. Only ONE -Ofast-compile= is allowed; a second triggers "libnvvm : error: -Ofast-compile specified more than once".
| Level string | a1+1640 | Description | Side effects |
|---|---|---|---|
"0" | 1 (then reset to 0) | Disabled | opt: fast-compile=off string |
"min" | 4 | Minimal speedup | opt: -fast-compile=min |
"mid" | 3 | Medium speedup | opt: -fast-compile=mid + second flag |
"max" | 2 | Maximum speedup | opt: -fast-compile=max; forces -lsa-opt=0, -memory-space-opt=0 |
When -Ofast-compile is active (level >= 1), the -passes=/-O routing is bypassed. Instead: -optO<level> and -llcO2 are emitted to the llc vector (lines 1453-1460).
Miscellaneous Flags [HC]
| User flag | Store | Routing | Notes |
|---|---|---|---|
-maxreg=<N> | +1192 | opt: -maxreg=<N>, llc: -maxreg=<N> | Error on duplicate |
-split-compile=<N> | +1480 | opt: -split-compile=<N> | Error on duplicate |
-split-compile-extended=<N> | +1480 | opt: -split-compile-extended=<N>, sets a1+1644=1 | Same storage as -split-compile |
-jump-table-density=<N> | -- | llc: -jump-table-density=<N> | |
-jobserver | -- | opt: -jobserver | |
-cl-mode | -- | No forwarding; sets v278=1 | Affects -prec-div, -prec-sqrt, -fast-math routing |
-time-passes | -- | Unsupported in LibNVVM API (error if a14 != NULL) | Must be sole flag |
--emit-optix-ir | -- | opt: -do-ip-msp=0, opt: -do-licm=0; a13 = (a13 & 0x300) \| 0x43 | |
--nvvm-64 | -- | a13 \|= 0x100 | 64-bit NVVM mode |
--nvvm-32 | -- | a13 \|= 0x200 | 32-bit NVVM mode |
Discard-Value-Names [HC]
This flag has the most complex interaction logic in the parser. Seven boolean tracking variables control its behavior:
| Variable | Meaning |
|---|---|
v275 | lnk-discard-value-names override (from -Xlnk) |
v276 | opt-discard-value-names override (from -Xopt) |
v277 | global discard-value-names flag was used |
v278 | CL-mode detected |
v279 | -Xlnk was used for discard-value-names |
v281 | -Xlto was used for discard-value-names |
v282 | lto-discard-value-names override (from -Xlto) |
v283 | -Xopt was used for discard-value-names |
When a4 == 0xABBA (CUDA) and no explicit -discard-value-names:
- Default: discard (a1+232 = 1)
- Emits: -lnk-discard-value-names=1 to lnk, -opt-discard-value-names=1 to opt, -lto-discard-value-names=1 to lto -- UNLESS overridden by per-phase -X flags
When a4 == 0xDEED (OpenCL): only applies if (a13 & 0x20) is set.
Error on conflicting definitions: "libnvvm : error: -discard-value-names defined more than once, or defined for both libnvvm and sub-phase".
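The CUDA-path defaulting described above can be sketched as follows. This is a hedged reconstruction: the function name, the `explicit` flag, and the `overridden` set are ours, loosely mirroring the v27x trackers rather than reproducing them:

```python
# Sketch of the CUDA-path (cookie 0xABBA) discard-value-names defaulting:
# when the user gave no explicit -discard-value-names, each phase receives
# "=1" unless a per-phase -Xlnk/-Xopt/-Xlto override was already seen.
def emit_discard_defaults(cookie, explicit, overridden):
    """overridden: set of phase names already covered by -X flags."""
    emitted = {}
    if cookie == 0xABBA and not explicit:
        for phase in ("lnk", "opt", "lto"):
            if phase not in overridden:
                emitted[phase] = "-%s-discard-value-names=1" % phase
    return emitted
```

The 0xDEED (OpenCL) path is intentionally omitted here; per the text it applies only when (a13 & 0x20) is set.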
I/O and General Flags
| Flag | Effect |
|---|---|
-o <file> | Output file (fatal if missing) |
-v | Verbose mode |
-dryrun | Do not execute compilation |
-keep | Keep intermediate files |
-irversion | Print IR version and exit |
-nvvmir-library <f> | NVVM IR library file (also = form) |
-m64 | 64-bit mode flag (sets *a8 = 1) |
Recognized input extensions: .bc, .ci, .i, .cup, .optixir, .ii. The .cup extension triggers --orig_src_path_name / --orig_src_file_name handling.
Options Structure Layout
The options structure passed as a1 to sub_9624D0/sub_12CC750 is roughly 1,648 bytes (its highest documented fields sit at offsets +1640 and +1644). Key offsets:
| Offset | Size | Content | Default |
|---|---|---|---|
| +8 | DWORD | SM architecture number | 75 |
| +232 | BYTE | discard-value-names master (0=keep, 1=discard) | 0 |
| +240 | DWORD | Phase routing mode (0=full, 1-4=single) | 0 |
| +248 | PTR | BST root (std::map red-black tree) | |
| +256 | PTR | BST sentinel/end node | |
| +288 | QWORD | BST node count | |
| +296 | STR32 | -g / -debug-compile value | |
| +328 | STR32 | -generate-line-info value | |
| +360 | STR32 | -no-lineinfo-inlined-at value | |
| +392 | STR32 | Optimization level (0/1/2/3) | "3" |
| +400 | QWORD | opt-level already-set sentinel | |
| +424 | STR32 | -disable-allopts value | |
| +456 | STR32 | -opt-fdiv value | "0" |
| +464 | QWORD | opt-fdiv already-set sentinel | |
| +488 | STR32 | -Osize value | |
| +520 | STR32 | -Om value | |
| +552 | STR32 | Architecture defines | compute_75 |
| +560 | QWORD | arch already-set sentinel | |
| +584 | STR32 | -ftz value | "0" |
| +592 | QWORD | ftz already-set sentinel | |
| +616 | STR32 | -prec-sqrt value | "1" (CUDA) / "0" (CL) |
| +624 | QWORD | prec-sqrt already-set sentinel | |
| +648 | STR32 | -prec-div value | "1" |
| +656 | QWORD | prec-div already-set sentinel | |
| +680 | STR32 | -fma value | "1" |
| +688 | QWORD | fma already-set sentinel | |
| +712 | STR32 | -enable-mad value | |
| +744 | STR32 | -unsafe-math value | |
| +776 | STR32 | -fast-math value | |
| +808 | STR32 | -show-src value | |
| +840 | STR32 | -disable-llc-opts value | |
| +872 | STR32 | -w value | |
| +904 | STR32 | -Werror value | |
| +1032 | STR32 | -enable-opt-byval value | |
| +1064 | STR32 | -disable-inlining value | |
| +1096 | STR32 | -restrict value | |
| +1128 | STR32 | -allow-restrict-in-struct value | |
| +1160 | STR32 | -no-signed-zeros value | |
| +1192 | STR32 | -maxreg value string | |
| +1200 | QWORD | maxreg already-set sentinel | |
| +1224 | STR32 | -enable-verbose-asm value | |
| +1256 | STR32 | Catchall (unrecognized flag) | |
| +1352 | STR32 | -vasp-fix value | |
| +1384 | STR32 | -new-nvvm-remat value | |
| +1416 | STR32 | -disable-new-nvvm-remat value | |
| +1448 | STR32 | -disable-nvvm-remat value | |
| +1480 | STR32 | -split-compile value | |
| +1488 | QWORD | split-compile already-set sentinel | |
| +1512 | STR32 | -opt-passes pipeline string | |
| +1520 | QWORD | opt-passes already-set sentinel | |
| +1544 | STR32 | -no-aggressive-positive-stride-analysis | |
| +1576 | STR32 | disable-load-select-transform | |
| +1608 | STR32 | -aggressive-inline value | |
| +1640 | DWORD | Ofast-compile level (0-4) | 0 |
| +1644 | BYTE | split-compile-extended flag | 0 |
Each STR32 is a 32-byte std::string with SSO (small string optimization). The QWORD "already-set sentinel" fields serve as duplicate-detection guards.
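The STR32-plus-sentinel pairing amounts to a simple duplicate guard. A minimal sketch, with our own class name and the error text taken from the catalog below:

```python
# Minimal sketch of the already-set sentinel pattern: a QWORD stored next
# to each STR32 field guards against a flag being defined twice.
class OptSlot:
    def __init__(self):
        self.value = ""      # the STR32 payload
        self.sentinel = 0    # 0 = never set; nonzero = already set

    def store(self, flag_name, value):
        if self.sentinel:
            raise ValueError(
                "libnvvm : error: %s defined more than once" % flag_name)
        self.value = value
        self.sentinel = 1
```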
Compilation Mode Bitmask (a13)
The a13 parameter is an in/out bitmask that controls which pipeline phases execute and what LTO mode is active:
| Bit/Mask | Meaning |
|---|---|
0x07 | Phase control (default = 7 = all phases) |
0x10 | Debug compile or line-info enabled |
0x20 | LTO generation enabled |
0x21 | gen-lto mode |
0x23 | Full LTO mode |
0x26 | link-lto mode |
0x43 | emit-optix-ir mode |
0x80 | gen-opt-lto lowering flag |
0x100 | --nvvm-64 (64-bit mode) |
0x200 | --nvvm-32 (32-bit mode) |
0x300 | Mask for 64/32-bit mode bits |
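A decoder for the individual bit fields might look like this; the constant and key names are ours, chosen to match the table:

```python
# Sketch decoder for the a13 in/out bitmask documented above.
PHASE_MASK  = 0x07   # default 7 = all phases
DEBUG_BIT   = 0x10   # debug compile or line-info
LTO_GEN_BIT = 0x20   # LTO generation enabled

def describe_a13(a13):
    return {
        "phases":  a13 & PHASE_MASK,
        "debug":   bool(a13 & DEBUG_BIT),
        "lto_gen": bool(a13 & LTO_GEN_BIT),
        "nvvm64":  bool(a13 & 0x100),   # --nvvm-64
        "nvvm32":  bool(a13 & 0x200),   # --nvvm-32
    }
```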
Magic Cookie Values (a4)
| Value | Meaning | Effects |
|---|---|---|
0xABBA (43962) | CUDA compilation | -prec-div routing uses CUDA levels; -fast-math uses CUDA defines; discard-value-names defaults to on |
0xDEED (57069) | OpenCL compilation | -prec-sqrt defaults to 0; -fast-math/-prec-div use CL routing; -cl-mode scanning active |
Default Values When Flags Are Absent
When a registered flag is not found in the user's arguments, sub_9624D0 checks whether the stored-value sentinel is zero and applies defaults:
| Flag | Sentinel | Default applied |
|---|---|---|
-opt= | a1+400 == 0 | -opt=3 (optimization level 3) |
-arch=compute_ | a1+560 == 0 | -arch=compute_75 (SM 75 Turing) |
-ftz= | a1+592 == 0 | -ftz=0 (no flush-to-zero) |
-prec-sqrt= | a1+624 == 0 | -prec-sqrt=1 (CUDA) or -prec-sqrt=0 (CL) |
-prec-div= | a1+656 == 0 | -prec-div=1 (precise division) |
-fma= | a1+688 == 0 | -fma=1 (FMA enabled) |
-opt-fdiv= | a1+464 == 0 | -opt-fdiv=0 |
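The defaulting sweep can be restated as a runnable sketch. The dict structure is ours; the values come from the table above (note the cookie-dependent -prec-sqrt default):

```python
# Sketch of the post-scan defaulting pass: any flag whose sentinel is
# still zero after argument parsing gets its documented default appended.
DEFAULTS = {
    "opt":      "3",           # optimization level 3
    "arch":     "compute_75",  # SM 75 Turing
    "ftz":      "0",
    "prec-div": "1",
    "fma":      "1",
    "opt-fdiv": "0",
}

def apply_defaults(sentinels, is_opencl=False):
    out = []
    for name, default in DEFAULTS.items():
        if not sentinels.get(name):
            out.append("-%s=%s" % (name, default))
    # -prec-sqrt defaults differ by compilation mode (CUDA vs OpenCL)
    if not sentinels.get("prec-sqrt"):
        out.append("-prec-sqrt=" + ("0" if is_opencl else "1"))
    return out
```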
Differences Between sub_12CC750 and sub_9624D0
The two option processors are near-identical. Key differences:
| Aspect | sub_12CC750 | sub_9624D0 |
|---|---|---|
| Binary size | 87KB decompiled | 75KB decompiled |
-memory-space-opt default | 0 | 1 |
-passes= flag | absent | present |
-disable-struct-lowering | present | absent |
-prec-sqrt CL default | 0 | 1 |
| Pipeline | LibNVVM entry path | Standalone/generic path |
| Companion builder | sub_12C8DD0 | sub_95EB40 |
| BST lookup | sub_12C8530 | sub_95D600 |
Error Handling
All error strings follow the pattern "libnvvm : error: <message>":
| Error | Trigger |
|---|---|
<flag> is an unsupported option | Flag not matched by hardcoded checks or BST lookup |
<flag> defined more than once | Duplicate -maxreg, or duplicate BST-registered flag |
-arch=compute_<N> is an unsupported option | Architecture fails bitmask validation |
-Ofast-compile specified more than once | Second -Ofast-compile= encountered |
-Ofast-compile called with unsupported level, only supports 0, min, mid, or max | Invalid level string |
split compilation defined more than once | Duplicate -split-compile or -split-compile-extended |
-discard-value-names defined more than once, or defined for both libnvvm and sub-phase | Conflicting discard-value-names |
<value> is an unsupported value for option: <flag> | From sub_95C230 extended parser |
Function Address Map
| Address | Function | Role |
|---|---|---|
0x8F9C90 | sub_8F9C90 | Real main entry point (argc/argv from OS) |
0x900130 | sub_900130 | LibNVVM Path A CLI parser |
0x9624D0 | sub_9624D0 | LibNVVM option processor (standalone variant) |
0x9685E0 | sub_9685E0 | Pipeline orchestrator (wraps sub_9624D0) |
0x967070 | sub_967070 | Post-option-parse pipeline setup |
0x95EB40 | sub_95EB40 | BST option map builder (standalone) |
0x95E8B0 | sub_95E8B0 | Flag template registration (standalone) |
0x95D600 | sub_95D600 | BST option map lookup (standalone) |
0x95CB50 | sub_95CB50 | Prefix-match string comparison |
0x95CA80 | sub_95CA80 | Value extraction after = |
0x95C880 | sub_95C880 | Single-phase delegator |
0x95C230 | sub_95C230 | Extended flag parser (--nvvm-64/--nvvm-32) |
0x95BF90 | sub_95BF90 | BST node insertion helper |
0x95BC80 | sub_95BC80 | String storage into options struct |
0x12CC750 | sub_12CC750 | LibNVVM option processor (LibNVVM variant) |
0x12C8DD0 | sub_12C8DD0 | BST option map builder (LibNVVM, 65 entries) |
0x12C8B40 | sub_12C8B40 | Individual flag registration (LibNVVM) |
0x12C8530 | sub_12C8530 | BST option map lookup (LibNVVM) |
0x12C7B30 | sub_12C7B30 | Pass name registration into pipeline ordering |
0x12C6E90 | sub_12C6E90 | Sub-argument splitter for mode flags |
0x12C6910 | sub_12C6910 | Flag filter (-debug-compile, -g, -generate-line-info) |
0x8FD0D0 | sub_8FD0D0 | Key-value parser (used by sub_900130) |
0x8FD6D0 | sub_8FD6D0 | String concatenation builder |
Cross-References
- Optimization Levels -- O-level pipeline builders and fast-compile tiers
- Configuration Knobs -- 1,496 cl::opt knobs set by the flags documented here
- NVVMPassOptions -- 222-slot struct that receives CLI-routed values
- Environment Variables -- environment-based configuration (parallel to CLI)
- Pipeline Overview -- how the four output vectors feed into pipeline stages
- nvcc Interface -- how nvcc constructs the argv passed to cicc
- Architecture Targets -- SM feature gating driven by -arch=compute_<N>
Optimization Levels
cicc v13.0 supports four standard optimization levels (O0 through O3) and three fast-compile tiers (Ofcmin, Ofcmid, Ofcmax). These are mutually exclusive with the custom --passes= interface. The pipeline name is selected in the new-PM driver sub_226C400 and assembled by sub_12E54A0. The full optimization pipeline builder is sub_12DE330, with tier-specific insertion handled by sub_12DE8F0.
Pipeline Name Selection
The new-PM driver at sub_226C400 selects a pipeline name string based on boolean flags in the config struct:
| Config Offset | Flag | Pipeline Name |
|---|---|---|
| byte[888] | O0 | nvopt<O0> |
| byte[928] | O1 | nvopt<O1> |
| byte[968] | O2 | nvopt<O2> |
| byte[1008] | O3 | nvopt<O3> |
| qw[131..132] | fc="max" | nvopt<Ofcmax> |
| qw[131..132] | fc="mid" | nvopt<Ofcmid> |
| qw[131..132] | fc="min" | nvopt<Ofcmin> |
Selection logic in sub_226C400 (lines 828--874):
if (O1_flag) -> "nvopt<O1>"
else if (O2_flag) -> "nvopt<O2>"
else if (O3_flag) -> "nvopt<O3>"
else if (fc_len == 3) {
if (fc == "max") -> "nvopt<Ofcmax>"
if (fc == "mid") -> "nvopt<Ofcmid>"
if (fc == "min") -> "nvopt<Ofcmin>"
}
else -> "nvopt<O0>"
Combining -O# with --passes= is an error:
"Cannot specify -O#/-Ofast-compile=<min,mid,max> and --passes=/--foo-pass, use -passes='default<O#>,other-pass'"
The pipeline name is passed to sub_2277440 (new-PM text parser), which constructs the actual PassManager. The nvopt prefix is registered as a pipeline element in sub_225D540 (new PM) and sub_12C35D0 (legacy PM), with vtables at 0x4A08350 / 0x49E6A58.
Fast-Compile Level Encoding
The fast-compile level is stored as an integer at offset 1640 (or 1648 in the clone) of the compilation context:
| Value | CLI Source | Behavior |
|---|---|---|
| 0 | (no flag, or -Ofast-compile=0) | Normal O-level pipeline |
| 1 | -Ofast-compile=0 | Forwarded then reset to 0 |
| 2 | -Ofast-compile=max / -Ofc=max | Minimal pipeline, fastest compile |
| 3 | -Ofast-compile=mid / -Ofc=mid | Medium pipeline |
| 4 | -Ofast-compile=min / -Ofc=min | Close to full optimization |
Any other value produces: "libnvvm : error: -Ofast-compile called with unsupported level".
When level=1, the flag is forwarded to the optimizer phase as a pass argument and then the level is reset to 0 at offset 1640 (so it becomes normal O-level optimization). When level=2 (max), the optimizer arg string -Ofast-compile=max is appended. When level=3 (mid), -Ofast-compile=mid is appended. When level=4 (min), -Ofast-compile=min is appended.
Tier Summary
| Pipeline | Approx Passes | LSA-Opt | MemSpaceOpt | Compile Speed |
|---|---|---|---|---|
nvopt<O0> | 5--8 | off | off | Fastest (no opt) |
nvopt<Ofcmax> | 12--15 | forced 0 | forced 0 | Fast |
nvopt<Ofcmid> | 25--30 | normal | enabled | Medium |
nvopt<Ofcmin> | 30--35 | normal | enabled | Slower |
nvopt<O1> | ~40 + tier-1 | normal | enabled | Normal |
nvopt<O2> | ~40 + tier-1/2 | normal | enabled | Normal |
nvopt<O3> | ~40 + tier-1/2/3 | normal | enabled | Slowest |
Pipeline Architecture: Tier 0 + Tiers 1/2/3
O1/O2/O3 share a common pipeline construction path. The key insight is that optimization happens in layers:
- Tier 0 (sub_12DE330): The full base pipeline of ~40 passes. Fires for ALL of O1, O2, and O3 when opts[4224] (optimization-enabled) is set.
- Tier 1 (sub_12DE8F0(PM, 1, opts)): Additional passes gated by opts[3528]. Fires for O1, O2, and O3.
- Tier 2 (sub_12DE8F0(PM, 2, opts)): Additional passes gated by opts[3568]. Fires for O2 and O3 only.
- Tier 3 (sub_12DE8F0(PM, 3, opts)): Additional passes gated by opts[3608]. Fires for O3 only.
The tier control fields live in the 4,512-byte NVVMPassOptions struct:
| Offset | Type | Meaning |
|---|---|---|
| 3528 | bool | Tier 1 enable (O1+) |
| 3532 | int | Tier 1 phase threshold |
| 3568 | bool | Tier 2 enable (O2+) |
| 3572 | int | Tier 2 phase threshold |
| 3608 | bool | Tier 3 enable (O3+) |
| 3612 | int | Tier 3 phase threshold |
| 4224 | bool | Tier 0 enable (any O-level) |
| 4228 | int | Tier 0 phase threshold |
The assembler loop in sub_12E54A0 (lines 481--553) iterates over the plugin/external pass list at opts[4488]. Each entry has a phase_id; when the phase_id exceeds a tier's threshold, that tier fires:
for each entry in opts[4488..4496]:
phase_id = entry[8..12]
if (opts[4224] && phase_id > opts[4228]):
sub_12DE330(PM, opts) // Tier 0
opts[4224] = 0 // one-shot
if (opts[3528] && phase_id > opts[3532]):
sub_12DE8F0(PM, 1, opts) // Tier 1
opts[3528] = 0
if (opts[3568] && phase_id > opts[3572]):
sub_12DE8F0(PM, 2, opts) // Tier 2
opts[3568] = 0
if (opts[3608] && phase_id > opts[3612]):
sub_12DE8F0(PM, 3, opts) // Tier 3
opts[3608] = 0
AddPass(PM, entry->createPass())
After the loop, any remaining unfired tiers fire unconditionally.
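The one-shot firing behavior can be made concrete with an executable sketch. Names and data layout are ours; each tier fires at most once, the first time a plugin pass's phase_id exceeds that tier's threshold, and any unfired tiers fire after the loop:

```python
# Executable sketch of the assembler loop's one-shot tier firing.
def assemble(entries, tiers):
    """entries: list of (phase_id, pass_name);
    tiers: {tier_number: [enabled, threshold]} (mutable, like opts[])."""
    fired = []
    for phase_id, name in entries:
        for tier in sorted(tiers):
            enabled, threshold = tiers[tier]
            if enabled and phase_id > threshold:
                fired.append("tier%d" % tier)
                tiers[tier][0] = False   # one-shot: clear the enable flag
        fired.append(name)               # AddPass(PM, entry->createPass())
    for tier in sorted(tiers):           # remaining tiers fire at the end
        if tiers[tier][0]:
            fired.append("tier%d" % tier)
            tiers[tier][0] = False
    return fired
```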
Tier 0: Full Base Pipeline (sub_12DE330)
sub_12DE330 at 0x12DE330 is called for all O1/O2/O3 compilations. It constructs the ~40-pass base pipeline:
| # | Factory | Pass | Guard | Notes |
|---|---|---|---|---|
| 1 | sub_1654860(1) | VerifierPass | always | |
| 2 | sub_1A62BF0(1,0,0,1,0,0,1) | CGSCC/Inliner | always | Pipeline EP 1, 1 iteration |
| 3 | sub_1B26330() | NVVMReflect | always | |
| 4 | sub_185D600() | SROA | always | |
| 5 | sub_1C6E800() | NVVMLowerArgs | always | |
| 6 | sub_1C6E560() | NVVMLowerAlloca | always | |
| 7 | sub_1857160() | SimplifyCFG | always | |
| 8 | sub_1842BC0() | InstCombine | always | |
| 9 | sub_17060B0(1,0) | GVN | opts[3160] | Debug-dump enabled |
| 10 | sub_12D4560() | NVVMVerify | always | |
| 11 | sub_18A3090() | LoopRotate | always | |
| 12 | sub_184CD60() | LICM | always | |
| 13 | sub_1869C50(1,0,1) | IndVarSimplify | !opts[1040] | |
| 14 | sub_1833EB0(3) | LoopUnroll | always | Factor = 3 |
| 15 | sub_17060B0(1,0) | GVN | always | |
| 16 | sub_1952F90(-1) | LoopIndexSplit/SCCP | always | Threshold = -1 (unlimited) |
| 17 | sub_1A62BF0(1,0,0,1,0,0,1) | CGSCC/Inliner | always | |
| 18 | sub_1A223D0() | DSE | always | |
| 19 | sub_17060B0(1,0) | GVN | always | |
| 20 | sub_1A7A9F0() | MemCpyOpt | always | |
| 21 | sub_1A62BF0(1,0,0,1,0,0,1) | CGSCC/Inliner | always | |
| 22 | sub_1A02540() | ADCE | always | |
| 23 | sub_198DF00(-1) | JumpThreading/CVP | always | Threshold = -1 |
| 24 | sub_1C76260() | NVVMDivergenceLowering | !opts[1320] | |
| 25 | sub_195E880(0) | Reassociate | opts[2880] | Default on (slot 143) |
| 26 | sub_19C1680(0,1) | SpeculativeExecution | !opts[1360] | |
| 27 | sub_17060B0(1,0) | GVN | opts[3160] | Debug-dump enabled |
| 28 | sub_19401A0() | SCCP | always | |
| 29 | sub_1968390() | GlobalDCE/ConstantProp | always | |
| 30 | sub_196A2B0() | GlobalOpt | always | |
| 31 | sub_19B73C0(2,-1,-1,-1,-1,-1,-1) | LoopVectorize/SLP | always | Width=2, thresholds=-1 |
| 32 | sub_17060B0(1,0) | GVN | always | |
| 33 | sub_190BB10(0,0) | EarlyCSE | always | |
| 34 | sub_1A13320() | TailCallElim | always | |
| 35 | sub_17060B0(1,1) | GVN (verified) | opts[3160] | Verify mode |
| 36 | sub_18F5480() | NewGVN | always | |
| 37 | sub_18DEFF0() | Sink | always | |
| 38 | sub_1A62BF0(1,0,0,1,0,0,1) | CGSCC/Inliner | always | |
| 39 | sub_18B1DE0() | Sinking2 | always | NVIDIA custom |
| 40 | sub_1841180() | LoopSimplify/LCSSA | always | |
After sub_12DE330 returns, opts[4224] is cleared (one-shot).
Tiers 1/2/3: Phase-Specific Sub-Pipeline (sub_12DE8F0)
sub_12DE8F0 at 0x12DE8F0 is a single function called with tier in {1, 2, 3}. The tier value is stored into qword_4FBB410 (phase tracker). When tier==3 and qword_4FBB370 byte4 is 0, the feature flags are set to 6 (enabling advanced barrier opt + memory space opt gates).
The following table lists every pass in sub_12DE8F0 with its tier-dependent guard condition. A pass runs only when ALL conditions in its Guard column are satisfied.
| # | Factory | Pass | Guard | O1 | O2 | O3 |
|---|---|---|---|---|---|---|
| 1 | sub_1CB4E40(1) | NVVMIntrinsicLowering | !opts[2000] | Y | Y | Y |
| 2 | sub_1A223D0() | NVVMIRVerification | !opts[2600] | Y | Y | Y |
| 3 | sub_1CB4E40(1) | NVVMIntrinsicLowering | !opts[2000] | Y | Y | Y |
| 4 | sub_18E4A00() | NVVMBarrierAnalysis | opts[3488] | Y | Y | Y |
| 5 | sub_1C98160(0) | NVVMLowerBarriers | opts[3488] | Y | Y | Y |
| 6 | sub_17060B0(1,0) | PrintModulePass | opts[3160] && !opts[1080] | Y | Y | Y |
| 7 | sub_12D4560() | NVVMVerifier | !opts[600] | Y | Y | Y |
| 8 | sub_185D600() | IPConstPropagation | opts[3200] && !opts[920] | Y | Y | Y |
| 9 | sub_1857160() | NVVMReflect | opts[3200] && !opts[880] | Y | Y | Y |
| 10 | sub_18A3430() | NVVMPredicateOpt | opts[3200] && !opts[1120] | Y | Y | Y |
| 11 | sub_1842BC0() | SCCP | opts[3200] && !opts[720] | Y | Y | Y |
| 12 | sub_17060B0(1,0) | PrintModulePass | !opts[1080] | Y | Y | Y |
| 13 | sub_12D4560() | NVVMVerifier | !opts[600] | Y | Y | Y |
| 14 | sub_18A3090() | NVVMPredicateOpt variant | opts[3200] && !opts[2160] | Y | Y | Y |
| 15 | sub_184CD60() | ConstantMerge | opts[3200] && !opts[1960] | Y | Y | Y |
| 16 | sub_190BB10(1,0) | SimplifyCFG | tier!=1 && !opts[1040] && !opts[1200] | - | Y | Y |
| 17 | sub_1952F90(-1) | LoopIndexSplit | (same as #16) && !opts[1160] | - | Y | Y |
| 18 | sub_12D4560() | NVVMVerifier | (same as #16) && !opts[600] | - | Y | Y |
| 19 | sub_17060B0(1,0) | PrintModulePass | (same as #16) && !opts[1080] | - | Y | Y |
| 20 | sub_195E880(0) | LICM | opts[3704] && opts[2880] && !opts[1240] | Y | Y | Y |
| 21 | sub_1C8A4D0(v12) | EarlyCSE | always; v12=1 if opts[3704] | Y | Y | Y |
| 22 | sub_1869C50(1,0,1) | Sink | tier!=1 && !opts[1040] | - | Y | Y |
| 23 | sub_1833EB0(3) | TailCallElim | tier==3 && !opts[320] | - | - | Y |
| 24 | sub_1CC3990() | NVVMUnreachableBlockElim | !opts[2360] | Y | Y | Y |
| 25 | sub_18EEA90() | CorrelatedValuePropagation | opts[3040] | Y | Y | Y |
| 26 | sub_12D4560() | NVVMVerifier | !opts[600] | Y | Y | Y |
| 27 | sub_1A223D0() | NVVMIRVerification | !opts[2600] | Y | Y | Y |
| 28 | sub_1CB4E40(1) | NVVMIntrinsicLowering | !opts[2000] | Y | Y | Y |
| 29 | sub_1C4B6F0() | Inliner | !opts[440] && !opts[480] | Y | Y | Y |
| 30 | sub_17060B0(1,0) | PrintModulePass | opts[3160] && !opts[1080] | Y | Y | Y |
| 31 | sub_1A7A9F0() | InstructionSimplify | !opts[2720] | Y | Y | Y |
| 32 | sub_12D4560() | NVVMVerifier | !opts[600] | Y | Y | Y |
| 33 | sub_1A02540() | GenericToNVVM | !opts[2200] | Y | Y | Y |
| 34 | sub_198DF00(-1) | LoopSimplify | !opts[1520] | Y | Y | Y |
| 35 | sub_1C76260() | ADCE | !opts[1320] && !opts[1480] | Y | Y | Y |
| 36 | sub_17060B0(1,0) | PrintModulePass | (same as #35) && !opts[1080] | Y | Y | Y |
| 37 | sub_12D4560() | NVVMVerifier | (same as #35) && !opts[600] | Y | Y | Y |
| 38 | sub_195E880(0) | LICM | opts[2880] && !opts[1240] | Y | Y | Y |
| 39 | sub_1C98160(0/1) | NVVMLowerBarriers | opts[3488] | Y | Y | Y |
| 40 | sub_19C1680(0,1) | LoopUnroll | !opts[1360] | Y | Y | Y |
| 41 | sub_17060B0(1,0) | PrintModulePass | !opts[1080] | Y | Y | Y |
| 42 | sub_19401A0() | InstCombine | !opts[1000] | Y | Y | Y |
| 43 | sub_196A2B0() | EarlyCSE | !opts[1440] | Y | Y | Y |
| 44 | sub_1968390() | SROA | !opts[1400] | Y | Y | Y |
| 45 | sub_19B73C0(tier,...) | LoopVectorize/SLP (1st) | tier!=1; params vary by SM | - | Y | Y |
| 46 | sub_17060B0(1,0) | PrintModulePass | opts[3160] && !opts[1080] | Y | Y | Y |
| 47 | sub_19B73C0(tier,...) | LoopVectorize/SLP (2nd) | !opts[2760] | Y | Y | Y |
| 48 | sub_1A62BF0(1,...) | LLVM standard pipeline | !opts[600] | Y | Y | Y |
| 49 | sub_1A223D0() | NVVMIRVerification | !opts[2600] | Y | Y | Y |
| 50 | sub_1CB4E40(1) | NVVMIntrinsicLowering | !opts[2000] | Y | Y | Y |
| 51 | sub_17060B0(1,0) | PrintModulePass | !opts[1080] | Y | Y | Y |
| 52 | sub_190BB10(0,0) | SimplifyCFG | !opts[960] | Y | Y | Y |
| 53 | sub_1922F90() | NVIDIA loop pass | opts[3080] | Y | Y | Y |
| 54 | sub_195E880(0) | LICM | opts[2880] && !opts[1240] | Y | Y | Y |
| 55 | sub_1A13320() | NVVMRematerialization | !opts[2320] | Y | Y | Y |
| 56 | sub_1968390() | SROA | !opts[1400] | Y | Y | Y |
| 57 | sub_17060B0(1,0) | PrintModulePass | opts[3160] && !opts[1080] | Y | Y | Y |
| 58 | sub_18EEA90() | CorrelatedValuePropagation | opts[3040] | Y | Y | Y |
| 59 | sub_18F5480() | DSE | !opts[760] | Y | Y | Y |
| 60 | sub_18DEFF0() | DCE | !opts[280] | Y | Y | Y |
| 61 | sub_1A62BF0(1,...) | LLVM standard pipeline | !opts[600] | Y | Y | Y |
| 62 | sub_1AAC510() | NVIDIA-specific pass | !opts[520] && !opts[560] | Y | Y | Y |
| 63 | sub_1A223D0() | NVVMIRVerification | !opts[2600] | Y | Y | Y |
| 64 | sub_1CB4E40(1) | NVVMIntrinsicLowering | !opts[2000] | Y | Y | Y |
| 65 | sub_1C8E680() | MemorySpaceOpt | !opts[2680]; param from opts[3120] | Y | Y | Y |
| 66 | sub_1A223D0() | NVVMIRVerification | opts[3120] && !opts[2600] | Y | Y | Y |
| 67 | sub_17060B0(1,0) | PrintModulePass | !opts[1080] | Y | Y | Y |
| 68 | sub_1CC71E0() | NVVMGenericAddrOpt | !opts[2560] | Y | Y | Y |
| 69 | sub_1C98270(1,opts[2920]) | NVVMLowerBarriers variant | opts[3488] | Y | Y | Y |
| 70 | sub_17060B0(1,0) | PrintModulePass | opts[3160] && !opts[1080] | Y | Y | Y |
| 71 | sub_1C6FCA0() | ADCE | opts[2840] && !opts[1840] | Y | Y | Y |
| 72 | sub_18B1DE0() | LoopOpt/BarrierOpt | opts[3200] && !opts[2640] | Y | Y | Y |
| 73 | sub_1857160() | NVVMReflect (late) | opts[3200] && tier==3 && !opts[880] | - | - | Y |
| 74 | sub_1841180() | FunctionAttrs | opts[3200] && !opts[680] | Y | Y | Y |
| 75 | sub_1C46000() | NVVMLateOpt | tier==3 && !opts[360] | - | - | Y |
| 76 | sub_1841180() | FunctionAttrs (2nd) | opts[3200] && !opts[680] | Y | Y | Y |
| 77 | sub_1CBC480() | NVVMLowerAlloca | !opts[2240] && !opts[2280] | Y | Y | Y |
| 78 | sub_1CB73C0() | NVVMBranchDist | !opts[2080] && !opts[2120] | Y | Y | Y |
| 79 | sub_1C7F370(1) | NVVMWarpShuffle | opts[3328] && !opts[1640] | Y | Y | Y |
| 80 | sub_1CC5E00() | NVVMReduction | opts[3328] && !opts[2400] | Y | Y | Y |
| 81 | sub_1CC60B0() | NVVMSinking2 | opts[3328] && !opts[2440] | Y | Y | Y |
| 82 | sub_1CB73C0() | NVVMBranchDist (2nd) | opts[3328] && !opts[2080] && !opts[2120] | Y | Y | Y |
| 83 | sub_17060B0(1,0) | PrintModulePass | opts[3328] && !opts[1080] | Y | Y | Y |
| 84 | sub_1B7FDF0(3) | Reassociate | opts[3328] && !opts[1280] | Y | Y | Y |
| 85 | sub_17060B0(1,0) | PrintModulePass (final) | opts[3160] && !opts[1080] | Y | Y | Y |
O1 vs O2 vs O3: Complete Diff
The three O-levels differ through exactly five mechanisms. Every pass that is NOT listed here runs identically at all three levels.
1. Tier guard: tier!=1 (O2/O3 only)
These passes are present in sub_12DE8F0 but skip when tier==1 (O1):
| Pass | Factory | Effect of skipping at O1 |
|---|---|---|
| SimplifyCFG | sub_190BB10(1,0) | No inter-tier CFG cleanup |
| LoopIndexSplit | sub_1952F90(-1) | No inter-tier loop splitting |
| NVVMVerifier (post-split) | sub_12D4560() | No verification after split |
| Sink | sub_1869C50(1,0,1) | No inter-tier instruction sinking |
| LoopVectorize/SLP (1st call) | sub_19B73C0(tier,...) | No aggressive vectorization |
At O1, the base pipeline (Tier 0) already includes one instance of LoopVectorize with sub_19B73C0(2,-1,-1,-1,-1,-1,-1) -- width 2, all thresholds at -1 (unlimited). The tier!=1 guard blocks a SECOND, more aggressive vectorization pass with SM-dependent parameters.
2. Tier guard: tier==3 (O3 only)
These passes run exclusively at O3:
| Pass | Factory | Purpose |
|---|---|---|
| TailCallElim | sub_1833EB0(3) | Additional tail call optimization pass |
| NVVMReflect (late) | sub_1857160() | Second-round __nvvm_reflect resolution |
| NVVMLateOpt | sub_1C46000() | O3-exclusive NVIDIA custom late optimization |
sub_1C46000 (NVVMLateOpt) is the most significant O3-exclusive pass. It runs only when !opts[360] (not disabled) and only at tier==3. This is a dedicated NVIDIA optimization pass that performs additional transformations after the main pipeline is complete.
3. Feature flag qword_4FBB370 escalation
When tier==3 and qword_4FBB370 byte4 is 0, the function sets qword_4FBB370 = 6 (binary 110). This enables two feature gates:
- Advanced barrier optimization (bit 1)
- Memory space optimization extensions (bit 2)
These gates affect behavior in downstream passes that read qword_4FBB370, such as sub_12EC4F0 (the machine pass pipeline executor).
4. LoopVectorize/SLP parameter differences
sub_19B73C0 is called with different parameters depending on context:
| Call site | Parameters | Tier |
|---|---|---|
Tier 0 (sub_12DE330 #31) | (2, -1, -1, -1, -1, -1, -1) | All O1/O2/O3 |
| Tier 1/2/3, 1st call (#45) | (tier, ...) SM-dependent | O2/O3 only |
| Tier 1/2/3, 2nd call (#47) | (tier, ...) | All tiers |
| Ofcmid language path | (3, -1, -1, 0, 0, -1, 0) | Fast-compile |
The 7 parameters to sub_19B73C0 control:
- arg1: Vector width factor (2 at Tier 0, tier at higher tiers)
- arg2..arg7: Thresholds for cost model, trip count, and SLP width. Value -1 means unlimited/auto; value 0 means conservative/disabled.
At O2, sub_19B73C0(2, ...) provides moderate vectorization. At O3, sub_19B73C0(3, ...) increases the vector width factor, enabling wider SIMD exploration. The SM-architecture-dependent parameters are resolved at runtime based on the target GPU.
5. CGSCC iteration count
sub_1A62BF0 is the CGSCC (Call Graph SCC) pass manager factory. The first argument is the pipeline extension point / iteration count:
| Context | Call | Iterations |
|---|---|---|
| Tier 0 (all O-levels) | sub_1A62BF0(1,0,0,1,0,0,1) | 1 |
| Ofcmid path | sub_1A62BF0(5,0,0,1,0,0,1) | 5 |
| Language "mid" path | sub_1A62BF0(8,0,0,1,1,0,1) | 8, with extra opt flag |
O1/O2/O3 all use 1-iteration CGSCC in their shared Tier 0 pipeline. The iteration count differences appear in the fast-compile and language-specific paths, not between O-levels.
Complete O-Level Comparison Matrix
| Feature | O0 | O1 | O2 | O3 |
|---|---|---|---|---|
| Tier 0 base pipeline (~40 passes) | - | Y | Y | Y |
| Tier 1 sub-pipeline | - | Y | Y | Y |
| Tier 2 sub-pipeline | - | - | Y | Y |
| Tier 3 sub-pipeline | - | - | - | Y |
| LoopVectorize (base, width=2) | - | Y | Y | Y |
| LoopVectorize (tier, SM-dependent) | - | - | Y | Y |
| SimplifyCFG (inter-tier) | - | - | Y | Y |
| LoopIndexSplit (inter-tier) | - | - | Y | Y |
| Sink (inter-tier) | - | - | Y | Y |
| TailCallElim (extra) | - | - | - | Y |
| NVVMReflect (late round) | - | - | - | Y |
| NVVMLateOpt (sub_1C46000) | - | - | - | Y |
| Feature flags escalation (6) | - | - | - | Y |
| NVVMDivergenceLowering | - | Y | Y | Y |
| SpeculativeExecution | - | Y | Y | Y |
| MemorySpaceOpt | - | Y | Y | Y |
| NVVMWarpShuffle | - | Y | Y | Y |
| NVVMReduction | - | Y | Y | Y |
| NVVMRematerialization | - | Y | Y | Y |
| NVVMBranchDist | - | Y | Y | Y |
| LSA optimization | off | on | on | on |
O0 Pipeline (Minimal)
When no O-level flag is set and no fast-compile level is active, the assembler falls through to LABEL_159 which calls:
sub_1C8A4D0(0) -- NVVMFinalCleanup or similar minimal pass
Then the common tail at LABEL_84 adds:
- MemorySpaceOpt (conditional, skipped at O0 since opts[3488] is typically unset)
- sub_1CEBD10() -- NVVMFinal / cleanup
- sub_1654860(1) -- VerifierPass
- sub_12DFE00() -- Codegen pass setup
The O0 pipeline does NOT call sub_12DE330 or sub_12DE8F0. It runs only the infrastructure passes (TargetLibraryInfo, TargetTransformInfo, BasicAA, AssumptionCacheTracker, ProfileSummaryInfo) plus minimal canonicalization.
Ofcmax Pipeline (Fastest Compile)
Ofcmax bypasses the full pipeline entirely. It forces two optimizer flags:
- -lsa-opt=0 (disables LSA optimization)
- -memory-space-opt=0 (disables MemorySpaceOpt pass)
This forcing happens in BOTH sub_9624D0 (lines 1358--1361) and sub_12CC750 (lines 2025--2079). The condition is:
if (!compare(lsa_opt_flag, "0") || fc_level == 2):
append("-lsa-opt=0")
append("-memory-space-opt=0")
Additionally, when fc_level == 2 AND lsa_opt is NOT already "0", the libnvvm path also injects -lsa-opt=0, mem2reg, -memory-space-opt=0.
The minimal pass sequence:
| # | Factory | Pass |
|---|---|---|
| 1 | sub_18B3080(1) | Sinking2Pass (fast mode, flag=1) |
| 2 | sub_1857160() | SimplifyCFG |
| 3 | sub_19CE990() | LoopStrengthReduce (if applicable) |
| 4 | sub_1B26330() | NVVMReflect |
| 5 | sub_12D4560() | NVVMVerify |
| 6 | sub_184CD60() | LICM |
| 7 | sub_1C4B6F0() | LowerSwitch |
| 8 | sub_12D4560() | NVVMVerify |
Ofcmid Pipeline (Medium)
Ofcmid runs ~25--30 passes without forcing LSA or MemorySpaceOpt off. The pass sequence from sub_12E54A0 (lines 814--861):
| # | Factory | Pass | Guard |
|---|---|---|---|
| 1 | sub_184CD60() | LICM | !opts[1960] |
| 2 | sub_1CB4E40(0) | AnnotationCleanup | always |
| 3 | sub_1B26330() | NVVMReflect | always |
| 4 | sub_198E2A0() | CorrelatedValuePropagation | always |
| 5 | sub_1CEF8F0() | NVVMPeephole | always |
| 6 | sub_215D9D0() | NVVMPeephole2/TcgenAnnotation | always |
| 7 | sub_17060B0(1,0) | GVN | !opts[1080] |
| 8 | sub_198DF00(-1) | JumpThreading/CVP | always |
| 9 | sub_17060B0(1,0) | GVN | !opts[1080] |
| 10 | sub_1C6E800() | NVVMLowerArgs | always |
| 11 | sub_1832270(1) | LoopSimplify | always |
| 12 | sub_1A62BF0(5,0,0,1,0,0,1) | CGSCC (5 iterations) | always |
| 13 | sub_1CB4E40(0) | AnnotationCleanup | always |
| 14 | sub_18FD350(0) | DCE | always |
| 15 | sub_1841180() | LCSSA | always |
| 16 | sub_18DEFF0() | Sink | always |
| 17 | sub_17060B0(1,0) | GVN | always |
| 18 | sub_184CD60() | LICM | always |
| 19 | sub_195E880(0) | Reassociate | always |
| 20 | sub_190BB10(0,0) | EarlyCSE | always |
| 21 | sub_19B73C0(3,-1,-1,0,0,-1,0) | LoopVectorize (conservative) | always |
| 22 | sub_1A223D0() | DSE | always |
| 23 | sub_1C98160(0) | MemorySpaceOpt | always |
| 24 | sub_1C8E680(0) | MemorySpaceOpt2 | always |
| 25 | sub_1B7FDF0(3) | BranchFolding/CFGSimplify | always |
| 26 | sub_18B1DE0() | Sinking2 | always |
Key differences from the O1+ pipeline: Ofcmid uses 5-iteration CGSCC (vs 1 at O1+), includes NVVMPeephole/Peephole2 early, uses conservative LoopVectorize parameters (3,-1,-1,0,0,-1,0) with some thresholds zeroed, and skips NVVMDivergenceLowering, SpeculativeExecution, NVVMBranchDist, NVVMRematerialization, and the entire tier sub-pipeline.
Ofcmin Pipeline (Closest to Full Optimization)
Ofcmin takes the same path as Ofcmid through LABEL_297 in sub_12E54A0 but with the v238 flag set differently, enabling more aggressive settings. The pipeline is essentially the Ofcmid sequence with:
- More aggressive loop optimizer thresholds
- Additional CGSCC framework passes
- Closer parameter alignment to the O2 full pipeline
Ofcmin does NOT force -lsa-opt=0 or -memory-space-opt=0. Like Ofcmid, it still skips the tier 1/2/3 sub-pipeline entirely, keeping compile time lower than O1.
Post-Optimization Common Tail
Regardless of pipeline tier, sub_12E54A0 always appends at LABEL_84 (lines 640--653):
| # | Factory | Pass | Guard |
|---|---|---|---|
| 1 | sub_1C98160(opts[2920]!=0) | MemorySpaceOpt | !v244 && opts[3488] |
| 2 | sub_1CEBD10() | NVVMFinal / cleanup | always |
| 3 | sub_1654860(1) | VerifierPass | !opts[2800] && !opts[4464] |
| 4 | sub_12DFE00(PM, v253, opts) | Codegen pass dispatch | always |
sub_12DFE00 (codegen dispatch) reads the optimization level from opts[200] to determine codegen aggressiveness. When opts[200] > 1, full dependency tracking is enabled across all codegen passes.
Always-Added Analysis Passes
Before any optimization, the pipeline assembler inserts (lines 396--420):
| # | Factory | Pass |
|---|---|---|
| 1 | sub_149CCE0 (368 bytes alloc) | TargetLibraryInfoWrapperPass |
| 2 | sub_1BFB520 (208 bytes alloc) | TargetTransformInfoWrapperPass |
| 3 | sub_14A7550() | VerifierPass / BasicAliasAnalysis |
| 4 | sub_1361950() | AssumptionCacheTracker |
| 5 | sub_1CB0F50() | ProfileSummaryInfoWrapperPass |
These five passes run at ALL optimization levels including O0.
NVVMPassOptions Offset-to-Guard Map
The following passes are gated by NVVMPassOptions boolean flags (the opts struct is 4,512 bytes). Slot defaults come from sub_12D6300:
| Offset | Slot | Default | Controls | Used By |
|---|---|---|---|---|
| 280 | 15 | off | DCE disable | Tier 0 #37, Tier 1/2/3 #60 |
| 320 | 17 | off | TailCallElim disable | Tier 1/2/3 #23 (O3 only) |
| 360 | 19 | on | NVVMLateOpt disable | Tier 1/2/3 #75 (O3 only) |
| 440 | 23 | off | Inliner flag A disable | Tier 1/2/3 #29 |
| 480 | 25 | on | Inliner flag B disable | Tier 1/2/3 #29 |
| 600 | 31 | off | NVVMVerifier disable | Tier 1/2/3 #7,13,18,26,32,37 |
| 680 | 35 | off | FunctionAttrs disable | Tier 1/2/3 #74,76 |
| 720 | 37 | off | SCCP disable | Tier 1/2/3 #11 |
| 760 | 39 | off | DSE disable | Tier 1/2/3 #59 |
| 880 | 45 | off | NVVMReflect disable | Tier 1/2/3 #9,73 |
| 920 | 47 | off | IPConstPropagation disable | Tier 1/2/3 #8 |
| 960 | 49 | off | SimplifyCFG disable | Tier 1/2/3 #52 |
| 1000 | 51 | off | InstCombine disable | Tier 1/2/3 #42 |
| 1040 | 53 | off | Sink/SimplifyCFG disable | Tier 0 #13, Tier 1/2/3 #16,22 |
| 1080 | 55 | off | PrintModulePass disable | many |
| 1120 | 57 | off | NVVMPredicateOpt disable | Tier 1/2/3 #10 |
| 1160 | 59 | off | LoopIndexSplit disable | Tier 1/2/3 #17 |
| 1200 | 61 | off | SimplifyCFG tier guard | Tier 1/2/3 #16 |
| 1240 | 63 | off | LICM disable | Tier 1/2/3 #20,38,54 |
| 1280 | 65 | off | Reassociate disable | Tier 1/2/3 #84 |
| 1320 | 65 | off | NVVMDivergenceLow disable | Tier 0 #24, Tier 1/2/3 #35 |
| 1360 | 67 | off | LoopUnroll disable | Tier 0 #26, Tier 1/2/3 #40 |
| 1400 | 69 | off | SROA disable | Tier 1/2/3 #44,56 |
| 1440 | 71 | off | EarlyCSE disable | Tier 1/2/3 #43 |
| 1480 | 73 | off | ADCE extra guard | Tier 1/2/3 #35 |
| 1520 | 75 | off | LoopSimplify disable | Tier 1/2/3 #34 |
| 1640 | 81 | off | NVVMWarpShuffle disable | Tier 1/2/3 #79 |
| 1760 | 87 | off | MemorySpaceOpt disable | Common tail, language paths |
| 1840 | 91 | off | ADCE variant disable | Tier 1/2/3 #71 |
| 1960 | 97 | off | ConstantMerge disable | Tier 1/2/3 #15 |
| 2000 | 101 | off | NVVMIntrinsicLowering disable | Tier 1/2/3 #1,3,28,50,64 |
| 2080 | 103 | off | NVVMBranchDist disable A | Tier 1/2/3 #78,82 |
| 2120 | 105 | off | NVVMBranchDist disable B | Tier 1/2/3 #78,82 |
| 2200 | 109 | off | GenericToNVVM disable | Tier 1/2/3 #33 |
| 2240 | 111 | off | NVVMLowerAlloca A disable | Tier 1/2/3 #77 |
| 2280 | 113 | off | NVVMLowerAlloca B disable | Tier 1/2/3 #77 |
| 2320 | 115 | off | NVVMRematerialization disable | Tier 1/2/3 #55 |
| 2360 | 117 | on | NVVMUnreachableBlockElim disable | Tier 1/2/3 #24 |
| 2400 | 119 | off | NVVMReduction disable | Tier 1/2/3 #80 |
| 2440 | 121 | off | NVVMSinking2 disable | Tier 1/2/3 #81 |
| 2560 | 127 | off | NVVMGenericAddrOpt disable | Tier 1/2/3 #68 |
| 2600 | 129 | off | NVVMIRVerification disable | Tier 1/2/3 #2,27,49,63,66 |
| 2640 | 131 | off | LoopOpt/BarrierOpt disable | Tier 1/2/3 #72 |
| 2680 | 133 | off | MemorySpaceOpt (2nd) disable | Tier 1/2/3 #65 |
| 2720 | 135 | off | InstructionSimplify disable | Tier 1/2/3 #31 |
| 2760 | 137 | off | LoopVectorize 2nd disable | Tier 1/2/3 #47 |
| 2840 | 141 | on | ADCE enable (reversed) | Tier 1/2/3 #71 |
| 2880 | 143 | on | LICM enable (reversed) | Tier 0 #25, Tier 1/2/3 #20,38,54 |
| 2920 | 145 | off | LowerBarriers parameter | Common tail |
| 3000 | 151 | on | Early pass guard | Pre-opt phase |
| 3040 | 153 | off | CorrelatedValueProp enable | Tier 1/2/3 #25,58 |
| 3080 | 155 | on | NVIDIA loop pass enable | Tier 1/2/3 #53 |
| 3120 | 155 | on | MemorySpaceOpt(2nd) enable | Tier 1/2/3 #65,66 |
| 3160 | 157 | on | PrintModulePass enable | Tier 0 #9,27,35; Tier 1/2/3 many |
| 3200 | 159 | on | Advanced NVIDIA passes group | Tier 1/2/3 #8-11,14-15,72-76 |
| 3328 | 165 | on | SM-specific late passes block | Tier 1/2/3 #79-84 |
| 3488 | 173 | off | NVVMBarrierAnalysis enable | Tier 1/2/3 #4,5,39,69 |
| 3528 | 175 | off | Tier 1 enable | Pipeline assembler |
| 3568 | 177 | off | Tier 2 enable | Pipeline assembler |
| 3608 | 179 | off | Tier 3 enable | Pipeline assembler |
| 3648 | 181 | "" | Language/fc-level string ptr | Pipeline name selection |
| 3704 | 183 | off | Late optimization flag | Tier 1/2/3 #20,21; Pipeline B |
| 3904 | 192 | off | Debug/naming mode flag | BB naming loop |
| 4064 | 201 | off | Concurrent compilation flag | Thread count decision |
| 4104 | 203 | -1 | Thread count (integer) | sub_12E7E70 |
| 4224 | 209 | off | Tier 0 enable (opt active) | Pipeline assembler loop |
| 4304 | 213 | off | Device-code / additional opt | Pipeline B; fc dispatch |
| 4384 | 217 | off | Fast-compile bypass flag | Pipeline A vs B branch |
| 4464 | 221 | off | Late CFG cleanup guard | Common tail #3 |
Codegen Optimization Level Propagation
The -optO and -llcO flags propagate the optimization level to the backend code generator. In sub_12E54A0 (lines 1451--1460):
if (lsa_opt == "0" && some_flag == "1"):
append("-optO<level>")
append("-llcO2")
The codegen dispatch sub_12DFE00 reads opts[200] (the integer optimization level):
- opts[200] == 0: Minimal codegen (no dependency tracking)
- opts[200] >= 1: Standard codegen
- opts[200] >= 2: Full dependency tracking enabled (v121 = true)
Cross-References
- NVVMPassOptions System -- complete 222-slot struct layout
- Pipeline Pass Registration -- 526-pass registration table
- Optimizer Architecture -- two-phase model, AddPass mechanism
- CLI Flags -- -O#, -Ofc=, --passes= routing
- Knobs Reference -- all 1496 cl::opt knobs
- Concurrent Compilation -- Phase I/II threading model
NVVMPassOptions
NVVMPassOptions is NVIDIA's proprietary per-pass configuration system -- a 4,512-byte flat struct containing 221 option slots that controls every aspect of the NVVM optimization pipeline. It has no upstream LLVM equivalent. Where LLVM uses scattered cl::opt<T> globals that each pass reads independently, NVIDIA consolidates all pass configuration into a single contiguous struct that is allocated once and threaded through the entire pipeline assembler as a parameter. This design allows the pipeline to make pass-enable decisions through simple byte reads at known offsets rather than hash-table lookups, and it ensures that the complete configuration state can be copied between Phase I and Phase II of the two-phase compilation model.
The struct is populated by a single 125KB function (sub_12D6300) that reads from a PassOptionRegistry hash table and flattens the results into 221 typed slots. The pipeline assembler (sub_12E54A0) and its sub-pipeline builders (sub_12DE330, sub_12DE8F0) then read individual slots by offset to decide which passes to insert and how to configure them.
| Initializer | sub_12D6300 (125KB, 4,786 lines) |
| Struct size | 4,512 bytes (sub_22077B0(4512)) |
| Slot count | 221 (1-based index: 1--221) |
| Slot types | 5: STRING (24B), BOOL_COMPACT (16B), BOOL_INLINE (16B), INTEGER (16B), STRING_PTR (28B) |
| Type breakdown | 114 string + 83 bool compact + 17 bool inline + 6 integer + 1 string pointer |
| Registry lookup | sub_12D6170 (hash table at registry+120) |
| PassDef resolver | sub_1691920 (64-byte stride table) |
| Bool parser | sub_12D6240 (triple: lookup + lowercase + char test) |
| Callers | sub_12E7E70 (Phase orchestrator), sub_12F4060 (TargetMachine creation) |
| Consumers | sub_12E54A0, sub_12DE330, sub_12DE8F0, sub_12DFE00 |
Struct Layout
The struct is heap-allocated as a single 4,512-byte block. The first 16 bytes contain header fields, followed by 221 option slots packed contiguously, and a 32-byte zero trailer:
Offset Size Field
────── ──── ─────
0 4 int opt_level (copied from registry+112)
4 4 (padding)
8 8 qword ptr to PassOptionRegistry
16 ~4464 221 option slots (variable-size, packed)
4480 32 zero trailer (4 qwords, sentinel)
Slot offsets are deterministic -- they depend on the type sequence hard-coded into sub_12D6300. String slots consume 24 bytes, boolean and integer slots consume 16 bytes, and the unique string-pointer slot at index 181 consumes 28 bytes. The initializer writes each slot at a compile-time-constant offset; there is no dynamic layout calculation.
Slot Types
Type A: String Option (24 bytes) -- sub_12D6090
114 slots. Stores a string value (pass name or parametric value) along with flags, optimization level, and pass ID.
struct StringOption { // 24 bytes, written by sub_12D6090
char* value; // +0: pointer to string data
int32_t option_index; // +8: 1-based slot index
int32_t flags; // +12: from PassDef byte 40
int32_t opt_level; // +16: from header opt_level
int32_t pass_id; // +20: resolved via sub_1691920
};
Type B: Boolean Compact (16 bytes) -- sub_12D6100
83 slots. The most common boolean representation. The helper encapsulates the lookup-parse-resolve sequence.
struct BoolCompactOption { // 16 bytes, written by sub_12D6100
uint8_t value; // +0: 0 or 1
uint8_t pad[3]; // +1: padding
int32_t option_index; // +4: 1-based slot index
int32_t flags; // +8: from PassDef byte 40
int32_t pass_id; // +12: resolved via sub_1691920
};
Type C: Boolean Inline (16 bytes) -- direct write
17 slots. Identical layout to Type B, but written directly by sub_12D6300 rather than through the sub_12D6100 helper. These correspond to option pairs where the boolean resolution requires checking PassDef+36 (has_overrides byte) and resolving via sub_1691920 inline. The 17 inline boolean slots are: 7, 11, 13, 49, 53, 55, 59, 61, 95, 103, 119, 127, 151, 159, 169, 177, 211.
struct BoolInlineOption { // 16 bytes, same layout as Type B
uint8_t value; // +0: 0 or 1
uint8_t pad[3]; // +1
int32_t option_index; // +4: high 32 bits of sub_12D6240 return
int32_t opt_level; // +8: from header
int32_t pass_id; // +12: resolved inline
};
Type D: Integer (16 bytes) -- direct write via sub_16D2BB0
6 slots. The integer value is parsed from the registry string by sub_16D2BB0 (string-to-int64). Layout is identical to boolean compact but the first 4 bytes store a full int32_t rather than a single byte.
struct IntegerOption { // 16 bytes
int32_t value; // +0: parsed integer
int32_t option_index; // +4: 1-based slot index
int32_t opt_level; // +8
int32_t pass_id; // +12
};
Type E: String Pointer (28 bytes) -- slot 181 only
Unique. Stores a raw char* plus length rather than a managed string. Likely a file path or regex pattern that requires direct C-string access.
struct StringPtrOption { // 28 bytes, slot 181 only
char* data; // +0: raw char pointer
uint64_t length; // +8: string length
int32_t option_index; // +16: 1-based slot index
int32_t opt_level; // +20
int32_t pass_id; // +24
};
Pair Organization Pattern
The 221 slots follow a predominantly paired layout. Slots 1--6 are six standalone STRING options (likely the global compilation parameters: ftz, prec-div, prec-sqrt, fmad, opt-level, sm-arch). Starting at slot 7, slots are organized in (EVEN, ODD) pairs:
- Even slot N: STRING option -- the pass's parameter value or name
- Odd slot N+1: BOOLEAN or INTEGER option -- the enable/disable toggle
Each "pass knob" thus gets a string parameter slot and a boolean gate. The pipeline assembler reads the boolean to decide whether to insert the pass, and passes the string value as the pass's configuration parameter.
Exceptions to the pair pattern:
| Region | Anomaly |
|---|---|
| Slots 160--162 | Three consecutive STRING slots with a single boolean at 163 |
| Slots 191--193 | Slot 191 STRING, then two consecutive booleans at 192--193 |
| Slot 181 | STRING_PTR type instead of normal STRING |
| Slots 196--207 | Alternating STRING + INTEGER instead of STRING + BOOL |
Helper Functions
sub_12D6170 -- PassOptionRegistry::lookupOption
Looks up an option by its 1-based slot index in the hash table at registry+120. Returns a pointer to an OptionNode or 0 if the option was not set from the command line:
// Signature: int64 sub_12D6170(void* registry, int option_index)
// Returns: OptionNode* or 0
//
// OptionNode layout:
// +40 int16 flags
// +48 char** value_array_ptr (array of string values)
// +56 int value_count
The hash table uses open addressing. The lookup computes hash(option_index) and probes linearly. When an option is not present in the registry (meaning the user did not supply a CLI override), the caller falls back to the hard-coded default in sub_12D6300.
sub_12D6240 -- PassOptionRegistry::getBoolOption
Resolves a boolean option with a default value. This is the critical function for all 100 boolean slots -- it performs a three-step resolution:
sub_12D6240(registry, option_index, default_string):
1. Call sub_12D6170(registry, option_index)
2. If found AND has value:
lowercase the string via sub_16D2060
result = (first_char == '1' || first_char == 't') // "1" or "true"
3. If not found OR no value:
result = (default_string[0] == '1') // "0" -> false, "1" -> true
Return: packed(bool_value:8, flags:32) in low 40 bits
The packing convention is significant: the boolean value occupies the low 8 bits and the flags occupy bits 8--39. Callers unpack with (result & 0xFF) for the boolean and (result >> 8) for the flags.
sub_1691920 -- PassDefTable::getPassDef
Resolves a 1-based pass index to its PassDef entry in a table with 64-byte stride:
// sub_1691920(table_ptr, pass_index):
// return table_ptr[0] + (pass_index - 1) * 64
//
// PassDef layout (64 bytes):
// +32 int pass_id
// +36 byte has_overrides
// +40 int16 override_index
The pass_id field is written into every option slot and later used by the pipeline assembler to map configuration back to the pass factory that should receive it.
sub_16D2BB0 -- parseInt
Parses a string to a 64-bit integer. Used for the 6 integer-typed option slots (9, 197, 203, 205, 207, 215).
Default Values
Most boolean slots default to 0 (disabled). 14 slots default to 1 (enabled) -- these represent passes that run by default and must be explicitly disabled:
Confidence note: Pass associations marked [MEDIUM] are inferred from pipeline guard cross-references (a4[offset]). Associations marked [LOW] are based solely on offset proximity or default-value patterns.
| Slot | Offset | Likely Pass | Confidence |
|---|---|---|---|
| 19 | 400 | Inliner (AlwaysInliner gate) | MEDIUM |
| 25 | 520 | NVIDIA-specific pass A | LOW |
| 93 | 1880 | ConstantMerge | HIGH |
| 95 | 1920 | NVVMIntrinsicLowering | HIGH |
| 117 | 2360 | NVVMUnreachableBlockElim | HIGH |
| 141 | 2840 | ADCE | HIGH |
| 143 | 2880 | LICM | HIGH |
| 151 | 3040 | CorrelatedValuePropagation | MEDIUM |
| 155 | 3120 | MemorySpaceOpt (second pass) | MEDIUM |
| 157 | 3160 | PrintModulePass (dump mode) | HIGH |
| 159 | 3200 | Optimization-level gating | MEDIUM |
| 165 | 3328 | Late-pipeline enable block | LOW |
| 211 | 4264 | (inline bool, late pass) | LOW |
| 219 | 4424 | (compact bool, late pass) | LOW |
Integer slot defaults:
| Slot | Offset | Default | Likely Meaning |
|---|---|---|---|
| 9 | 200 | 1 | Optimization threshold / iteration count |
| 197 | 3984 | 20 | Limit/threshold (e.g., unroll count) |
| 203 | 4104 | -1 | Thread count (sentinel for auto-detect via get_nprocs()) |
| 205 | 4144 | -1 | Thread count fallback |
| 207 | 4184 | -1 | Sentinel for unlimited/auto |
| 215 | 4344 | 0 | Disabled counter |
CLI Flag Routing
The path from a user-visible flag to an NVVMPassOptions slot traverses four stages:
nvcc -Xcicc -opt "-do-licm=0" ← user invocation
│
▼
sub_9624D0 (flag catalog, 75KB) ← parses -opt flags into opt_argv vector
│ pushes "-do-licm=0" into v327 (opt vector)
▼
PassOptionRegistry (hash table) ← opt-phase parser populates registry
│ key = slot_index, value = "0"
▼
sub_12D6300 (125KB initializer) ← flattens registry into 4512-byte struct
│ sub_12D6240(registry, LICM_SLOT, "1") → returns 0 (overridden)
│ writes opts[2880] = 0
▼
sub_12E54A0 / sub_12DE8F0 ← pipeline assembler reads opts[2880]
if (opts[2880]) AddPass(LICM); ← skipped because opts[2880] == 0
The -opt flag prefix is critical: it routes the argument to the optimizer phase vector rather than to the linker, LTO, or codegen phases. The flag catalog (sub_9624D0) recognizes several shorthand patterns:
| User Flag | Routes To | Effect |
|---|---|---|
| --emit-optix-ir | opt "-do-ip-msp=0", opt "-do-licm=0" | Disables IPMSP and LICM for OptiX |
| -Ofast-compile=max | opt "-fast-compile=max", opt "-memory-space-opt=0" | Disables MemorySpaceOpt |
| -memory-space-opt=0 | opt "-memory-space-opt=0" | Direct pass disable |
| -Xopt "-do-remat=0" | opt "-do-remat=0" | Direct pass-through to opt phase |
Pipeline Consumer: How Passes Read NVVMPassOptions
The pipeline assembler and its sub-pipeline builders receive the NVVMPassOptions struct as parameter a4 (in sub_12E54A0) or opts (in sub_12DE330/sub_12DE8F0). They read individual boolean slots by dereferencing a byte at a known offset and branching:
// Pattern 1: simple disable guard
if (!*(uint8_t*)(opts + 1760)) // opts[1760] = MemorySpaceOpt disable
AddPass(PM, sub_1C8E680(0), 1, 0); // insert MemorySpaceOpt
// Pattern 2: enable guard (inverted logic)
if (*(uint8_t*)(opts + 2880)) // opts[2880] = LICM enabled (default=1)
AddPass(PM, sub_195E880(0), 1, 0); // insert LICM
// Pattern 3: combined guard with opt-level gating
if (*(uint8_t*)(opts + 3200) && // opts[3200] = opt-level sufficient
!*(uint8_t*)(opts + 880)) // opts[880] = NVVMReflect not disabled
AddPass(PM, sub_1857160(), 1, 0); // insert NVVMReflect
// Pattern 4: integer parameter read
v12 = *(int32_t*)(opts + 200); // opts[200] = opt threshold (default=1)
// used to configure codegen dispatch in sub_12DFE00
The key insight is that the pipeline assembler never performs string comparison or hash-table lookup at pass-insertion time -- it reads pre-resolved values from the flat struct. This makes the ~150 pass-insertion decisions in sub_12E54A0 essentially free in terms of runtime cost.
Offset-to-Pass Mapping
The following table maps struct offsets (as seen in pipeline assembler guards opts[OFFSET]) to the passes they control. Offsets are byte offsets from the struct base. "Guard sense" indicates whether the pass runs when the byte is 0 (!opts[X] -- most common, where the option is a disable flag) or when it is nonzero (opts[X] -- the option is an enable flag).
| Offset | Slot | Guard Sense | Controlled Pass | Factory |
|---|---|---|---|---|
| 200 | 9 | value | Optimization threshold (integer, read by sub_12DFE00) | -- |
| 280 | 14-15 | !opts | DCE (DeadCodeElimination) | sub_18DEFF0 |
| 320 | 16-17 | !opts | TailCallElim / JumpThreading | sub_1833EB0 |
| 360 | 18-19 | !opts | NVVMLateOpt | sub_1C46000 |
| 400 | 20-21 | !opts | AlwaysInliner gate A | sub_1C4B6F0 |
| 440 | 22-23 | !opts | AlwaysInliner gate B | sub_1C4B6F0 |
| 480 | 24-25 | !opts | Inliner gate C | sub_1C4B6F0 |
| 520 | 26-27 | !opts | NVIDIA-specific pass A | sub_1AAC510 |
| 560 | 28-29 | !opts | NVIDIA-specific pass B | sub_1AAC510 |
| 600 | 30-31 | !opts | NVVMVerifier | sub_12D4560 |
| 680 | 34-35 | !opts | FunctionAttrs | sub_1841180 |
| 720 | 36-37 | !opts | SCCP | sub_1842BC0 |
| 760 | 38-39 | !opts | DSE (DeadStoreElimination) | sub_18F5480 |
| 880 | 44-45 | !opts | NVVMReflect | sub_1857160 |
| 920 | 46-47 | !opts | IPConstantPropagation | sub_185D600 |
| 960 | 48-49 | !opts | SimplifyCFG | sub_190BB10 |
| 1000 | 50-51 | !opts | InstCombine | sub_19401A0 |
| 1040 | 52-53 | !opts | Sink / SimplifyCFG (early) | sub_1869C50 |
| 1080 | 54-55 | !opts | PrintModulePass (dump IR) | sub_17060B0 |
| 1120 | 56-57 | !opts | NVVMPredicateOpt | sub_18A3430 |
| 1160 | 58-59 | !opts | LoopIndexSplit | sub_1952F90 |
| 1200 | 60-61 | !opts | SimplifyCFG (tier guard) | sub_190BB10 |
| 1240 | 62-63 | !opts | LICM | sub_195E880 |
| 1280 | 64-65 | !opts | Reassociate / Sinking | sub_1B7FDF0 |
| 1320 | 66-67 | !opts | ADCE (AggressiveDeadCodeElimination) | sub_1C76260 |
| 1360 | 68-69 | !opts | LoopUnroll | sub_19C1680 |
| 1400 | 70-71 | !opts | SROA | sub_1968390 |
| 1440 | 72-73 | !opts | EarlyCSE | sub_196A2B0 |
| 1480 | 74-75 | !opts | ADCE extra guard | sub_1C76260 |
| 1520 | 76-77 | !opts | LoopSimplify | sub_198DF00 |
| 1640 | 82-83 | !opts | NVVMWarpShuffle | sub_1C7F370 |
| 1680 | 84-85 | !opts | NVIDIA pass (early) | sub_19CE990 |
| 1760 | 88-89 | !opts | MemorySpaceOpt (primary) | sub_1C8E680 |
| 1840 | 92-93 | !opts | ADCE variant | sub_1C6FCA0 |
| 1960 | 98-99 | !opts | ConstantMerge / GlobalDCE | sub_184CD60 |
| 2000 | 100-101 | !opts | NVVMIntrinsicLowering | sub_1CB4E40 |
| 2040 | 102-103 | !opts | MemCpyOpt | sub_1B26330 |
| 2080 | 104-105 | !opts | BranchDist gate A | sub_1CB73C0 |
| 2120 | 106-107 | !opts | BranchDist gate B | sub_1CB73C0 |
| 2160 | 108-109 | !opts | NVVMPredicateOpt variant | sub_18A3090 |
| 2200 | 110-111 | !opts | GenericToNVVM | sub_1A02540 |
| 2240 | 112-113 | !opts | NVVMLowerAlloca gate A | sub_1CBC480 |
| 2280 | 114-115 | !opts | NVVMLowerAlloca gate B | sub_1CBC480 |
| 2320 | 116-117 | !opts | NVVMRematerialization | sub_1A13320 |
| 2360 | 118-119 | !opts | NVVMUnreachableBlockElim | sub_1CC3990 |
| 2400 | 120-121 | !opts | NVVMReduction | sub_1CC5E00 |
| 2440 | 122-123 | !opts | NVVMSinking2 | sub_1CC60B0 |
| 2560 | 128-129 | !opts | NVVMGenericAddrOpt | sub_1CC71E0 |
| 2600 | 130-131 | !opts | NVVMIRVerification | sub_1A223D0 |
| 2640 | 132-133 | !opts | LoopOpt / BarrierOpt | sub_18B1DE0 |
| 2680 | 134-135 | !opts | MemorySpaceOpt (second invocation) | sub_1C8E680 |
| 2720 | 136-137 | !opts | InstructionSimplify | sub_1A7A9F0 |
| 2760 | 138-139 | !opts | LoopUnswitch variant | sub_19B73C0 |
| 2840 | 141 | opts | ADCE (enabled by default, default=1) | sub_1C6FCA0 |
| 2880 | 143 | opts | LICM (enabled by default, default=1) | sub_195E880 |
| 2920 | 145 | value | LowerBarriers parameter | sub_1C98270 |
| 3000 | 150-151 | opts | Early pass guard | sub_18FD350 |
| 3040 | 151 | opts | CorrelatedValuePropagation (default=1) | sub_18EEA90 |
| 3080 | 153 | opts | NVIDIA-specific loop pass | sub_1922F90 |
| 3120 | 155 | opts | MemorySpaceOpt second-pass enable (default=1) | sub_1C8E680 |
| 3160 | 157 | opts | PrintModulePass enable (default=1) | sub_17060B0 |
| 3200 | 159 | opts | Optimization-level gate (default=1) | -- |
| 3328 | 165 | opts | Late-pipeline enable block (default=1) | multiple |
| 3488 | 174-175 | opts | NVVMBarrierAnalysis + LowerBarriers enable | sub_18E4A00 |
| 3648 | 181 | string | Language string ("ptx"/"mid") | path dispatch |
| 3704 | 185 | opts | Late optimization flag | sub_1C8A4D0 |
| 3904 | 193 | opts | Debug / verification mode | sub_12D3E60 |
| 3944 | 195 | opts | Basic block naming ("F%d_B%d") | sprintf |
| 3984 | 197 | value | Integer limit (default=20) | -- |
| 4064 | 201 | value | Concurrent compilation override | sub_12D4250 |
| 4104 | 203 | value | Thread count (default=-1, auto-detect) | sub_12E7E70 |
| 4144 | 205 | value | Thread count fallback (default=-1) | sub_12E7E70 |
| 4184 | 207 | value | Integer parameter (default=-1) | -- |
| 4224 | 209 | opts | Optimization enabled flag | tier dispatch |
| 4304 | 213 | opts | Device-code flag | Pipeline B |
| 4344 | 215 | value | Integer counter (default=0) | -- |
| 4384 | 217 | opts | Fast-compile bypass flag | Pipeline B dispatch |
| 4464 | 221 | !opts | Late CFG cleanup guard | sub_1654860 |
Known Option Names
Option names are stored in the PassOptionRegistry hash table, not in sub_12D6300 itself. The following names are extracted from binary string references in global constructors and pass factories:
Boolean Toggles (do-X / no-X)
| Name | Likely Slot Region | Default |
|---|---|---|
| do-ip-msp | MemorySpaceOpt area | enabled |
| do-clone-for-ip-msp | MemorySpaceOpt variant | -- |
| do-licm | offset 2880 (slot 143) | 1 (enabled) |
| do-remat | offset 2320 (slot 117) | enabled |
| do-cssa | CSSA pass area | -- |
| do-scev-cgp | SCEV-CGP area | -- |
| do-function-scev-cgp | function-level SCEV-CGP | -- |
| do-scev-cgp-aggresively [sic] | aggressive SCEV-CGP mode | -- |
| do-base-address-strength-reduce | BaseAddrSR area | -- |
| do-base-address-strength-reduce-chain | BaseAddrSR chain variant | -- |
| do-comdat-renaming | COMDAT pass | -- |
| do-counter-promotion | PGO counter promotion | -- |
| do-lsr-64-bit | 64-bit loop strength reduction | -- |
| do-sign-ext-expand | sign extension expansion | -- |
| do-sign-ext-simplify | sign extension simplification | -- |
Dump/Debug Toggles
| Name | Purpose |
|---|---|
| dump-ip-msp | Dump IR around MemorySpaceOpt |
| dump-ir-before-memory-space-opt | IR dump pre-MSP |
| dump-ir-after-memory-space-opt | IR dump post-MSP |
| dump-memory-space-warnings | MSP diagnostic warnings |
| dump-remat / dump-remat-add / dump-remat-iv / dump-remat-load | Rematerialization diagnostics |
| dump-branch-dist | Branch distribution diagnostics |
| dump-scev-cgp | SCEV-CGP diagnostics |
| dump-base-address-strength-reduce | BaseAddrSR diagnostics |
| dump-sink2 | Sinking2 diagnostics |
| dump-before-cssa | CSSA input dump |
| dump-phi-remove | PHI removal diagnostics |
| dump-normalize-gep | GEP normalization dump |
| dump-simplify-live-out | Live-out simplification dump |
| dump-process-restrict | Process-restrict dump |
| dump-process-builtin-assume | Builtin assume processing dump |
| dump-conv-dot / dump-conv-func / dump-conv-text | Convergence analysis dumps |
| dump-nvvmir | NVVM IR dump |
| dump-va | Value analysis dump |
Parametric Knobs
| Name | Default | Purpose |
|---|---|---|
| remat-for-occ | 120 | Occupancy target for rematerialization |
| remat-gep-cost | 6000 | GEP rematerialization cost threshold |
| remat-lli-factor | 10 | Long-latency instruction factor |
| remat-max-live-limit | 10 | Maximum live range limit for remat |
| remat-single-cost-limit | -- | Single-instruction remat cost limit |
| remat-loop-trip | -- | Loop trip count for remat decisions |
| remat-use-limit | -- | Use count limit for remat candidates |
| remat-maxreg-ceiling | -- | Register ceiling for remat |
| remat-move | -- | Remat move control |
| remat-load-param | -- | Parameter load remat control |
| remat-ignore-single-cost | -- | Ignore single-cost heuristic |
| branch-dist-block-limit | -1 | Max blocks for branch distribution (-1 = unlimited) |
| branch-dist-func-limit | -1 | Max functions for branch distribution |
| branch-dist-norm | 0 | Branch distribution normalization mode |
| scev-cgp-control | -- | SCEV-CGP mode selector |
| scev-cgp-norm | -- | SCEV-CGP normalization |
| scev-cgp-check-latency | -- | Latency check threshold |
| scev-cgp-cross-block-limit | -- | Cross-block limit |
| scev-cgp-idom-level-limit | -- | Immediate dominator level limit |
| scev-cgp-inst-limit | -- | Instruction count limit |
| scev-cgp-old-base | -- | Old base address mode |
| scev-cgp-tid-max-value | -- | Thread ID max value |
| base-address-strength-reduce-iv-limit | -- | IV limit for base addr SR |
| base-address-strength-reduce-max-iv | -- | Max IV count |
| cssa-coalesce | -- | CSSA coalescing mode |
| cssa-verbosity | -- | CSSA diagnostic verbosity |
| memory-space-opt-pass | -- | MSP pass variant selector |
| peephole-opt | -- | Peephole optimizer control |
| loop-index-split | -- | Loop index split control |
| va-use-scdg | -- | Value analysis SCDG mode |
| nvvm-peephole-optimizer | -- | NVVM peephole enable |
| nvvm-intr-range | -- | Intrinsic range analysis control |
Differences from Upstream LLVM
Upstream LLVM has nothing resembling this system. The closest analogue is the cl::opt<T> flag mechanism, but that scatters configuration across hundreds of global variables that each pass reads independently. The differences are architectural:
| Aspect | Upstream LLVM | cicc NVVMPassOptions |
|---|---|---|
| Storage | ~1,689 scattered cl::opt globals in BSS | Single 4,512-byte contiguous struct |
| Initialization | Global constructors register each flag | One 125KB function flattens all 221 slots |
| Access pattern | Each pass reads its own globals | Pipeline assembler reads all slots centrally |
| Copyability | Not designed for copying | Struct is trivially memcpy-able for Phase I/II |
| Thread safety | Global cl::opt requires careful coordination | Each thread gets its own struct copy |
| Override mechanism | cl::opt command-line parser | PassOptionRegistry hash table with fallback defaults |
| Pass gating | Pass decides internally whether to run | Pipeline assembler decides before constructing pass |
The thread-safety property is crucial for the two-phase concurrent compilation model. When Phase II runs per-function compilation in parallel threads, each thread receives a copy of the NVVMPassOptions struct. If NVIDIA used upstream cl::opt globals for pass configuration, they would need global locks or TLS for every option read during pass execution -- an unacceptable overhead for a GPU compiler that may process hundreds of kernels in a single translation unit.
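The copy-per-thread pattern can be sketched as follows. This is a minimal reconstruction: the slot layout and the function name are assumptions; only the 4,512-byte size and the memcpy semantics come from the analysis above.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical model: 221 variable-width option slots packed into one
// 4,512-byte trivially copyable block.
struct NVVMPassOptions {
    unsigned char slots[4512];
};

// Each Phase II worker thread receives its own copy, so option reads
// during pass execution need no locks and no TLS indirection.
NVVMPassOptions *clone_options_for_thread(const NVVMPassOptions *master) {
    NVVMPassOptions *copy =
        static_cast<NVVMPassOptions *>(std::malloc(sizeof *copy));
    if (copy)
        std::memcpy(copy, master, sizeof *copy);  // trivially memcpy-able
    return copy;
}
```

Because the copy is private, a worker mutating its options (e.g. a per-function override) never races with sibling threads.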
Interaction with Two-Phase Compilation
The NVVMPassOptions struct is allocated and populated before Phase I begins, in the orchestrator sub_12E7E70:
// sub_12E7E70, line ~128
void* opts = malloc(4512); // allocate NVVMPassOptions
sub_12D6300(opts, registry); // populate from CLI-parsed registry
// ... pass opts to sub_12E54A0 for Phase I ...
// ... pass same opts to sub_12E54A0 for Phase II ...
Both phases receive the same opts pointer. Individual passes within the pipeline assembler check qword_4FBB3B0 (the TLS phase counter) to skip themselves in the wrong phase -- but the NVVMPassOptions struct itself does not change between phases. This means a pass cannot be enabled in Phase I but disabled in Phase II through NVVMPassOptions alone; phase selection is handled by the separate TLS mechanism.
The second caller, sub_12F4060 (TargetMachine creation in the standalone path), performs an identical allocation and initialization sequence, confirming that every compilation path goes through the same NVVMPassOptions infrastructure.
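The phase-gating mechanism can be sketched like this. The counter stands in for qword_4FBB3B0; the module-level/per-function split is an illustrative assumption about how passes classify themselves.

```cpp
#include <cassert>

// Stand-in for the TLS phase counter qword_4FBB3B0.
static thread_local long tls_phase_counter = 0;

enum { PHASE_I = 1, PHASE_II = 2 };

// Illustrative gate: suppose module-level passes run only in Phase I and
// per-function passes only in Phase II. NVVMPassOptions is identical in
// both phases; only this TLS check differs.
bool pass_should_run(bool is_module_level_pass) {
    long phase = tls_phase_counter;
    return is_module_level_pass ? (phase == PHASE_I) : (phase == PHASE_II);
}
```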
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
NVVMPassOptions::init | sub_12D6300 | 125KB | Populate 221 slots from registry |
PassOptionRegistry::lookupOption | sub_12D6170 | ~200B | Hash-table lookup by slot index |
PassOptionRegistry::getBoolOption | sub_12D6240 | ~300B | Boolean resolution with default |
writeStringOption | sub_12D6090 | ~150B | Write 24-byte string slot |
writeBoolOption | sub_12D6100 | ~120B | Write 16-byte boolean slot |
PassDefTable::getPassDef | sub_1691920 | ~80B | 64-byte stride table lookup |
parseInt | sub_16D2BB0 | ~100B | String-to-int64 parser |
toLowercase | sub_16D2060 | ~80B | String lowercasing for bool parse |
Cross-References
- LLVM Optimizer -- pipeline assembler that consumes NVVMPassOptions
- Configuration Knobs -- all three knob systems (cl::opt, NVVMPassOptions, codegen)
- CLI Flags -- flag catalog and routing to opt phase vector
- Optimization Levels -- O-level encoding and fast-compile modes
- Concurrent Compilation -- Phase I/II threading model
- Entry Point & CLI -- wizard mode and -opt flag dispatching
- OptiX IR -- forces do-ip-msp=0 and do-licm=0
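In upstream-LLVM source terms, each such sequence corresponds to one static cl::opt definition. A minimal model of the record and the registration order follows; the record layout, field names, and registry are assumptions, while the atomic counter bump, the ~224-byte footprint, and the set-name-then-finalize order come from the decompilation.

```cpp
#include <atomic>
#include <cassert>
#include <cstring>

static std::atomic<long> g_option_count{0};   // stands in for *sub_C523C0()

struct OptRecord {                // models one ~224-byte (0xE0) cl::opt BSS slot
    char name[64];
    bool registered;
    unsigned char payload[159];   // value storage, parser hooks, flags...
};

void register_option(OptRecord &opt, const char *name) {
    g_option_count.fetch_add(1);                        // atomic option counter
    std::strncpy(opt.name, name, sizeof opt.name - 1);  // sub_C53080: set name
    opt.name[sizeof opt.name - 1] = '\0';
    opt.registered = true;                              // sub_C53130: finalize
}
```

Note the model struct is exactly 224 bytes (64 + 1 + 159, alignment 1), matching the observed BSS stride.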
Configuration Knobs
Three independent knob systems control compiler behavior: LLVM cl::opt flags (~1,496 unique), NVVMPassOptions (222 slots), and NVIDIA codegen knobs (~70).
| LLVM cl::opt | 1,496 unique flags across 353 constructor files |
| NVVMPassOptions | 222 slots, initialized by sub_12D6300 (125KB) |
| Codegen knobs | ~70, parsed by sub_1C20170 / sub_CD9990 from NVVM container |
| BSS storage | 0x4F7FEA0–0x4FA5xxx (cl::opt), a1+0–a1+4464 (PassOptions) |
| Dual PM | Same options registered for both Legacy PM (sub_C53080) and New PM (sub_16B8280) |
| NVIDIA-specific | 172 of 1,496 cl::opt flags (11.5%) are NVIDIA-added |
Knob System 1: LLVM cl::opt
Registration Pattern
Every cl::opt follows this initialization sequence in a global constructor:
// Legacy PM path
InterlockedExchangeAdd64(sub_C523C0(), 1); // atomic option counter
sub_C53080(&option, "option-name", strlen); // set name
sub_C53130(&option); // finalize registration
__cxa_atexit(destructor, &option, &dso_handle);
// New PM path (parallel registration)
InterlockedExchangeAdd64(&unk_4FA0230, 1);
sub_16B8280(&option, "option-name", strlen);
sub_16B88A0(&option);
__cxa_atexit(destructor, &option, &dso_handle);
Each cl::opt<T> occupies ~224 bytes (0xE0) in BSS. Top constructors by option count: ctor_600 (30), ctor_433 (25), ctor_472 (24), ctor_609 (22), ctor_392 (22).
Category 1: Scalar Optimization (InstCombine + FP)
Constructor: ctor_165_0 at 0x4D0500 (11,731 bytes). Registers 12 NVIDIA-specific flags plus 4 standard LLVM flags.
| Flag | Type | Default | BSS Addr | Purpose |
|---|---|---|---|---|
split-gep-chain | bool | false | 0x4F901A8 | Split GEP chains to independent GEPs for better address mode selection |
Disable-Add-to-Or | bool | true | — | Disable add-to-or transformation (NVIDIA blocks this LLVM combine) |
opt-use-fast-math | bool | false | — | Enable aggressive FP simplification (set by -unsafe-math / -fast-math) |
opt-use-prec-div | bool | true | — | Use precise division (set by -prec-div=1; cleared by -prec-div=0) |
opt-no-signed-zeros | bool | false | — | Ignore signed zero distinction (set by -no-signed-zeros) |
disable-fp-cast-opt | bool | false | — | Disable FP-to-int and int-to-FP cast optimizations |
reorder-sext-before-cnst-add | bool | false | — | sext(add(a,CI)) to add(sext(a),CI) rewrite; hidden flag |
disable-sink | bool | false | — | Disable instruction sinking in InstCombine |
partial-sink | bool | false | — | Enable partial sinking of instructions |
nvptx-rsqrt-approx-opt | bool | false | — | Enable reciprocal sqrt approximation optimization |
disable-rsqrt-opt | bool | false | — | Disable reciprocal sqrt optimization entirely |
check-vn | bool | false | — | Verify value numbers after transformations (debug) |
Standard LLVM flags in same constructor: expensive-combines (bool), instcombine-maxarray-size (int, default 1024), instcombine-visit (int), instcombine-lower-dbg-declare (bool).
Category 2: Inliner Heuristics
Constructor: ctor_186_0 at 0x4DBEC0 (14,109 bytes). Nine NVIDIA-specific flags governing the custom CGSCC inliner at sub_1864060.
| Flag | Type | Default | Purpose |
|---|---|---|---|
profuseinline | bool | false | Verbose inlining diagnostics (NVIDIA profuse framework, not PGO profuse) |
inline-total-budget | int | none | Global total budget across all callers; unset = unlimited |
nv-inline-all | bool | false | Force inline ALL function calls (used by OptiX ray tracing) |
inline-budget | int | 20000 | Per-caller inlining cost budget; -aggressive-inline sets to 40000 |
inline-adj-budget1 | int | none | Secondary adjusted per-caller budget |
inline-switchctrl | int | none | Tune heuristic for switch-containing callees |
inline-numswitchfunc | int | none | Threshold for switch-heavy function penalty |
inline-maxswitchcases | int | none | Max switch cases before inlining penalty kicks in |
disable-inlined-alloca-merging | bool | false | Disable post-inline alloca merging into single frame slot |
"none" means the knob is unset by default and the heuristic falls back to internal logic.
Category 3: GVN (Global Value Numbering)
Constructor: ctor_201 at 0x4E0990. Eleven knobs (8 NVIDIA-specific + 3 upstream).
| Flag | Type | Default | BSS Addr | Purpose |
|---|---|---|---|---|
profusegvn | bool | true | 0x4FAE7E0 | Verbose GVN diagnostics (unusually, defaults on) |
gvn-dom-cache | bool | true | 0x4FAE700 | Cache dominator tree nodes; cache size = 32 |
max-recurse-depth | int | 1000 | 0x4FAE620 | Max recursion during value numbering (safety valve for template-heavy code) |
enable-phi-remove | int | 2 | 0x4FAEC40 | PHI removal aggressiveness: 0=off, 1=trivial only, 2=post-leader substitution |
dump-phi-remove | int | 0 | 0x4FAEB60 | Dump PHI removal decisions (debug) |
no-split-stores-below | int | -1 | 0x4FAEA80 | Min store width for splitting (bits); -1 = no limit |
no-split-stores-above | int | -1 | 0x4FAE9A0 | Max store width for splitting (bits); -1 = no limit |
split-stores | bool | true | 0x4FAE8C0 | Master enable for NVIDIA store-splitting in GVN |
enable-pre | bool | true | 0x4FAEEE0 | Enable Partial Redundancy Elimination (upstream LLVM) |
enable-load-pre | bool | true | 0x4FAEE00 | Enable load PRE across edges (upstream LLVM) |
enable-split-backedge-in-load-pre | bool | false | 0x4FAED20 | Allow backedge splitting during load PRE (upstream LLVM) |
Store splitting uses a custom NVIDIA registrar (sub_190BE40) that takes a default-value pointer. Both limit knobs default to -1 = all sizes eligible.
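The width window implied by the two limit knobs can be sketched as a simple eligibility test. The semantics here are assumed from the knob names and the -1 defaults; the actual pass decides per-store inside GVN.

```cpp
#include <cassert>

// A store of `bits` width is eligible for splitting when it lies inside
// the [below, above] window, with -1 disabling that bound (the shipped
// defaults, i.e. all widths eligible).
bool store_splittable(bool split_stores, int bits, int below, int above) {
    if (!split_stores) return false;            // split-stores master enable
    if (below != -1 && bits < below) return false;
    if (above != -1 && bits > above) return false;
    return true;
}
```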
Category 4: Loop Strength Reduction
Constructor: ctor_214_0 at 0x4E4B00. Eleven NVIDIA-specific LSR flags (69% NVIDIA customization rate).
| Flag | Type | Default | Purpose |
|---|---|---|---|
disable-unknown-trip-lsr | bool | false | Disable LSR for loops with unknown trip count |
lsr-check-rp | bool | true [MEDIUM] | Check register pressure before applying LSR |
lsr-rp-limit | int | ~32-64 [LOW] | Skip LSR entirely when RP exceeds this limit (occupancy cliff) |
filter-bad-formula | bool | true [MEDIUM] | Filter out poor-quality LSR formulae early |
do-lsr-64-bit | bool | arch-dependent | Enable 64-bit loop strength reduction (false on sm_3x-5x, true on sm_70+) |
count-sxt-opt-for-reg-pressure | bool | true [MEDIUM] | Factor sign-extension elimination savings into RP analysis |
lsr-sxtopt | bool | true [MEDIUM] | Perform sign-extension elimination within LSR |
lsr-loop-level | int | 0 | Apply LSR only at specific loop nesting level (0 = all levels) |
lsr-skip-outer-loop | bool | false | Ignore outer-loop induction variables in LSR |
disable-lsr-for-sharedmem32-ptr | bool | false | Disable LSR for 32-bit shared memory pointers (GPU-specific) |
disable-lsr-complexity-discount | bool | false | Disable complexity estimation discount heuristic |
Standard LLVM LSR flags in same constructor: enable-lsr-phielim, lsr-insns-cost, lsr-exp-narrow, lsr-filter-same-scaled-reg, lsr-fix-iv-inc.
Category 5: IndVarSimplify
Constructor: ctor_203_0 at 0x4E1CD0 (7,007 bytes).
| Flag | Type | Default | Purpose |
|---|---|---|---|
Disable-unknown-trip-iv | bool | false | Disable IV substitution for unknown-trip-count loops |
iv-loop-level | int | none | Control which loop nesting levels get IV substitution |
Category 6: SimplifyCFG
Constructor: ctor_243_0 at 0x4ED0C0.
| Flag | Type | Default | Purpose |
|---|---|---|---|
disable-jump-threading | bool | false | Disable jump threading (for OCG experiments) |
fold-with-var-cond | bool | false | Fold branches with variance-based conditions |
Category 7: NVPTX Backend Math/Scheduling
Constructor: ctor_607 at 0x584B60 (13,700 bytes). Core numeric precision and FMA controls. Defaults are set by the CLI flag routing in sub_9624D0, not by the cl::opt constructors.
| Flag | Type | CLI Default | Purpose |
|---|---|---|---|
nvptx-sched4reg | bool | false | Schedule for register pressure (key NVPTX strategy) |
nvptx-fma-level | int | 1 | FMA contraction: 0=off, 1=on, 2=aggressive. CLI -fma=1 is default |
nvptx-prec-divf32 | int | 1 | F32 div precision: 0=approx, 1=full, 2=IEEE rnd+ftz, 3=IEEE no-ftz |
nvptx-prec-sqrtf32 | int | 1 | Sqrt precision: 0=approx, 1=rn. CLI -prec-sqrt=1 is default |
nvptx-approx-log2f32 | bool | false | Use lg2.approx for log2 (only set by -unsafe-math) |
nvptx-force-min-byval-param-align | bool | false | Force 4-byte minimum alignment for byval parameters |
nvptx-normalize-select | bool | false | Override shouldNormalizeToSelectSequence in TLI |
enable-bfi64 | bool | false | Enable 64-bit BFI (bit-field insert) instructions |
Note: These cl::opt knobs have no explicit default in their constructor (they init to 0/false). The effective defaults come from the CLI flag catalog: -fma=1 routes -nvptx-fma-level=1, -prec-div=1 routes -nvptx-prec-divf32=1, -prec-sqrt=1 routes -nvptx-prec-sqrtf32=1.
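The routing can be sketched as a small translation step. This reconstructs only the three documented routes plus the -unsafe-math effect on lg2.approx; the real sub_9624D0 handles the full flag catalog.

```cpp
#include <cassert>

struct BackendMathOpts {
    int  fma_level      = 0;   // nvptx-fma-level
    int  prec_divf32    = 0;   // nvptx-prec-divf32
    int  prec_sqrtf32   = 0;   // nvptx-prec-sqrtf32
    bool approx_log2f32 = false;
};

BackendMathOpts route_math_flags(int fma, int prec_div, int prec_sqrt,
                                 bool unsafe_math) {
    BackendMathOpts o;
    o.fma_level      = fma;         // -fma=N       -> -nvptx-fma-level=N
    o.prec_divf32    = prec_div;    // -prec-div=N  -> -nvptx-prec-divf32=N
    o.prec_sqrtf32   = prec_sqrt;   // -prec-sqrt=N -> -nvptx-prec-sqrtf32=N
    o.approx_log2f32 = unsafe_math; // only -unsafe-math enables lg2.approx
    return o;
}
```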
Category 8: NVPTX Backend Passes/Features
Constructor: ctor_609_0 at 0x585D30 (22 options total, largest NVPTX constructor).
| Flag | Type | Default | Purpose |
|---|---|---|---|
disable-nvptx-load-store-vectorizer | bool | false | Disable load/store vectorizer |
disable-nvptx-require-structured-cfg | bool | false | Turn off structured CFG requirement (transitional) |
nvptx-short-ptr | bool | false | 32-bit pointers for const/local/shared address spaces |
nvptx-enable-machine-sink | bool | false | Enable machine-level instruction sinking |
enable-new-nvvm-remat | bool | true | Enable new NVVM rematerialization engine |
nv-disable-remat | bool | false | Disable all rematerialization passes |
nv-disable-mem2reg | bool | false | Disable machine-IR mem2reg promotion |
nv-disable-scev-cgp | bool | true | Disable SCEV-based address mode optimization (on = disabled) |
nvptx-32-bit-smem | bool | false | Use 32-bit pointers for shared address space |
nvptx-exit-on-unreachable | bool | true | Lower unreachable as PTX exit instruction |
nvptx-early-byval-copy | bool | false | Create copy of byval function args in entry block |
enable-nvvm-peephole | bool | true | Enable NVVM peephole optimizer |
no-reg-target-nvptxremat | bool | false | Only run old remat on kernels without register targets |
lower-func-args | bool | true | Lower large aggregate function parameters to copies |
enable-sink | bool | true | Enable LLVM sinking pass |
disable-post-opt | bool | false | Disable IR optimizations in post-opt phase |
usedessa | int | 2 | deSSA method: 0=off, 1=basic, 2=full |
ldg | bool | true | Load-via-texture (ld.global.nc) constant transform |
Category 9: NVPTX Backend Extended
Constructor: ctor_610 at 0x5888A0 (7,400 bytes).
| Flag | Type | Default | Purpose |
|---|---|---|---|
unroll-assumed-size | int | 4 | Assumed element count for unknown-size local arrays during unroll analysis |
enable-loop-peeling | bool | false | Enable loop peeling transformation |
enable-256-bit-load-store | bool | false | Enable 256-bit (32-byte) vector load/store generation |
ias-param-always-point-to-global | bool | false | Assume function parameter pointers always point to global memory |
ias-strong-global-assumptions | bool | false | Stronger assumption: constant-buffer pointers resolve to globals |
ias-wmma-memory-space-opt | bool | false | Enable MemorySpaceOpt specialization for WMMA/tensor operations |
Category 10: Memory Space Optimization
Scattered across ctor_264, ctor_267_0, ctor_528, ctor_531_0. See MemorySpaceOpt and IPMSP for the full algorithm.
| Flag | Type | Default | Purpose |
|---|---|---|---|
mem-space-alg | int | 2 | Switch between MSO algorithm variants |
dump-ir-before-memory-space-opt | bool | false | Dump IR before MSO |
dump-ir-after-memory-space-opt | bool | false | Dump IR after MSO |
track-indir-load | bool | true | Track indirect loads during MSO dataflow |
track-int2ptr | bool | true | Track IntToPtr casts in MSO |
param-always-point-to-global | bool | true | Kernel parameter pointers always point to global memory |
devicefn-param-always-local | bool | false | Treat parameter space as local in device functions |
ignore-address-space-check | bool | false | Ignore address-space checks during branch distribution |
sink-into-texture | int | 3 | Sink loads into texture blocks: 0=off, 1=cross-block, 2=+intra, 3=+outside-only. See also Category 14 |
ldg | bool | true | Load Global Constant Transform (ld.global.nc) |
do-clone-for-ip-msp | int | -1 | Function cloning limit for IP-MSP (-1 = unlimited, 0 = disable) |
dump-ip-msp | bool | false | Dump interprocedural MSP info |
lower-read-only-devicefn-byval | bool | false | Handle byval attribute of args to read-only device functions |
reuse-lmem-very-long-live-range | int | — | Threshold for very-long live range in local memory reuse |
hoist-load-param | bool | false | Generate all ld.param in entry basic block |
sink-ld-param | bool | false | Sink one-use ld.param to use point |
process-alloca-always | bool | true | Treat alloca as definite local (AS 5) regardless of context |
wmma-memory-space-opt | bool | true | Enable memory space optimization for WMMA operations |
strong-global-assumptions | bool | true | Assume const buffer pointers always point to globals |
process-builtin-assume | bool | — | Process __builtin_assume(__is*(p)) assertions for space deduction |
Category 11: Rematerialization
Scattered across ctor_609_0, ctor_362, ctor_277_0, ctor_361_0, and others. See Rematerialization for full algorithm detail.
IR-Level Knobs (ctor_277_0 at 0x4F7BE0)
| Flag | Type | Default | Global | Purpose |
|---|---|---|---|---|
do-remat | int | 3 | dword_4FC05C0 | Master control. 0=off, 1=conservative, 2=normal, 3=full |
no-remat | string | (empty) | qword_4FC0440 | Comma-separated function exclusion list |
remat-iv | int | 4 | dword_4FBFB40 | IV demotion level. 0=off, 4=full |
remat-load | int | 1 | dword_4FBFA60 | Load rematerialization. 0=off, 1=on |
remat-add | int | 0 | dword_4FBF980 | Add/GEP factoring. 0=off |
remat-single-cost-limit | int | 6000 | dword_4FC0080 | Max cost per single live-in reduction |
remat-loop-trip | int | 20 | dword_4FBFFA0 | Default assumed loop trip count |
remat-gep-cost | int | 6000 | dword_4FBFEC0 | Max cost for GEP rematerialization |
remat-use-limit | int | 10 | dword_4FBFDE0 | Max number of uses for a candidate |
remat-max-live-limit | int | 10 | dword_4FBFD00 | Max live-in limit for rematerialization |
remat-maxreg-ceiling | int | 0 | dword_4FBF600 | Register ceiling (0 = uncapped) |
remat-for-occ | int | 120 | dword_4FBF8A0 | Occupancy-driven rematerialization target |
remat-lli-factor | int | 10 | dword_4FC0320 | Long-latency instruction cost factor |
remat-ignore-single-cost | bool | false | byte_4FBFC20 | Bypass per-value cost filter |
remat-move | bool | false | byte_4FC0400 | Remat move instructions |
simplify-live-out | int | 2 | dword_4FBF520 | NLO level. 0=off, 2=full |
dump-remat | int | 0 | dword_4FC0240 | Debug dump level (0-4+) |
dump-remat-iv | int | 0 | dword_4FC0160 | IV remat debug dump |
dump-remat-load | int | 0 | dword_4FBF720 | Load remat debug dump |
dump-remat-add | int | 0 | dword_4FBF640 | Add remat debug dump |
dump-simplify-live-out | bool | false | byte_4FBF400 | NLO debug dump |
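A hedged sketch of a candidate filter driven by the knobs above: the predicate shape is inferred from the knob names and defaults, and the real pass interleaves these checks with live-range analysis.

```cpp
#include <cassert>

struct RematKnobs {
    int  do_remat           = 3;     // 0=off .. 3=full
    int  single_cost_limit  = 6000;  // remat-single-cost-limit
    int  use_limit          = 10;    // remat-use-limit
    int  max_live_limit     = 10;    // remat-max-live-limit
    int  lli_factor         = 10;    // remat-lli-factor
    bool ignore_single_cost = false; // remat-ignore-single-cost
};

bool remat_candidate_ok(const RematKnobs &k, int cost, int uses,
                        int live_ins, bool long_latency) {
    if (k.do_remat == 0) return false;          // master off
    if (long_latency) cost *= k.lli_factor;     // penalize long-latency defs
    if (!k.ignore_single_cost && cost > k.single_cost_limit) return false;
    if (uses > k.use_limit) return false;       // too many uses to clone
    return live_ins <= k.max_live_limit;        // live-in budget
}
```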
Machine-Level Knobs (ctor_361_0 at 0x5108E0)
| Flag | Type | Default | Global | Purpose |
|---|---|---|---|---|
nv-remat-block | int | 14 | dword_4FD3820 | Bitmask controlling remat modes (bits 0-3) |
nv-remat-max-times | int | 10 | dword_4FD3740 | Max outer loop iterations |
nv-remat-block-single-cost | int | 10 | dword_4FD3660 | Max cost per single live value pull-in |
nv-remat-block-map-size-limit | int | 6 | dword_4FD3580 | Map size limit for single pull-in |
nv-remat-block-max-cost | int | 100 | dword_4FD3040 | Max total clone cost per live value reduction |
nv-remat-block-liveout-min-percentage | int | 70 | dword_4FD3120 | Min liveout % for special consideration |
nv-remat-block-loop-cost-factor | int | 20 | unk_4FD3400 | Loop cost multiplier |
nv-remat-default-max-reg | int | 70 | unk_4FD3320 | Default max register pressure target |
nv-remat-block-load-cost | int | 10 | unk_4FD2EC0 | Cost assigned to load instructions |
nv-remat-threshold-for-spec-reg | int | 20 | unk_4FD3860 | Threshold for special register remat |
nv-dump-remat-block | bool | false | byte_4FD2E80 | Debug dump toggle |
nv-remat-check-internal-live | bool | false | byte_4FD2DA0 | Check internal liveness during MaxLive |
max-reg-kind | int | 0 | qword_4FD2C20 | Kind of max register pressure info |
no-mi-remat | string | (empty) | qword_4FD2BE0 | Skip machine-level remat for named functions |
load-remat | bool | true | word_4FD32F0 | Enable load rematerialization |
vasp-fix1 | bool | false | word_4FD3210 | VASP fix for volatile/addsp |
General Remat Knobs (ctor_609_0, ctor_362, and others)
| Flag | Type | Default | Purpose |
|---|---|---|---|
nv-disable-remat | bool | false | Disable all remat passes |
enable-new-nvvm-remat | bool | true | Enable new NVVM remat engine (disables old) |
no-reg-target-nvptxremat | bool | false | Only old remat for kernels without register targets |
fp-remat | bool | false | Allow rematerializing floating-point instructions |
high-cost-remat | bool | false | Allow rematerializing high-cost instructions |
cost-threshold-remat | int | — | Cost threshold per remat action |
block-freq-cap-remat | int | — | Maximum raw block frequency value |
block-freq-norm-range-remat | int | — | Normalization range for block frequency in remat cost |
collect-candidate-scale-remat | int | — | Scaling ratio for high-RP candidate collection |
incremental-update-remat | bool | false | Incrementally update RP analysis after each remat |
verify-update-remat | bool | false | Debug: verify incremental update vs full analysis |
print-verify-remat | bool | false | Debug: print problematic RP on verification failure |
rp-remat | int | — | Debug: set a target register pressure number |
late-remat-update-threshold | int | — | Threshold for copy with many other copy uses |
remat-load-param | bool | false | Support rematerializing constant ld.param not in NVVM IR |
Category 12: SCEV-CGP (Address Mode Optimization)
Eleven NVIDIA-specific knobs for SCEV-based CodeGenPrepare. See CodeGenPrepare.
| Flag | Type | Default | Purpose |
|---|---|---|---|
do-scev-cgp | bool | false | Enable SCEV-based CodeGenPrepare |
do-scev-cgp-aggresively | bool | false | Aggressive SCEV-CGP mode [sic] |
do-function-scev-cgp | bool | false | Function-level SCEV-CGP |
nv-disable-scev-cgp | bool | true | Disable SCEV address mode optimization (master kill switch, on by default) |
scev-cgp-control | int | — | Control max transformations applied |
scev-cgp-cross-block-limit | int | — | Max common-base expressions from a single block |
scev-cgp-idom-level-limit | int | — | Max dominator tree levels to walk |
scev-cgp-inst-limit | int | — | Max instructions for a single parameter |
scev-cgp-old-base | bool | false | Force SCEV-CGP to create new base (vs reusing old) |
scev-cgp-tid-max-value | int | — | Max value of thread ID in SCEV expressions |
print-after-scev-cgp | bool | false | Print function after SCEV-CGP phase |
Category 13: Branch Distribution
Seven NVIDIA-specific flags (six tabulated below; ignore-address-space-check, listed under Category 10, also applies during branch distribution).
| Flag | Type | Default | Purpose |
|---|---|---|---|
branch-dist-block-limit | int | — | Max blocks to apply branch distribution |
branch-dist-func-limit | int | — | Max functions to apply branch distribution |
branch-dist-norm | int | — | Normalization control |
no-branch-dist | string | — | Comma-separated list of functions to skip |
disable-complex-branch-dist | bool | false | Disable complex branch distribution |
dump-branch-dist | bool | false | Dump branch distribution info |
Category 14: Sinking / Code Motion
Thirteen knobs across multiple constructors (eleven tabulated below; disable-sink and partial-sink appear under Category 1). See Sinking2 for the NVIDIA-custom texture-aware sinking pass.
| Flag | Type | Default | Purpose |
|---|---|---|---|
sink-into-texture | int | 3 | Texture sinking aggressiveness: 0=off, 1=cross-block, 2=+intra, 3=+outside-only |
sink-limit | int | 20 | Max instructions to sink per Sinking2 invocation (complexity limiter) |
dump-sink2 | bool | false | Debug dump for Sinking2 pass |
sink-check-sched | bool | true | Check scheduling effects of sinking (stock Sink) |
sink-single-only | bool | true | Only sink single-use instructions (stock Sink) |
enable-andcmp-sinking | bool | false | Sink and/cmp sequences into branches |
aggressive-no-sink | bool | false | Sink all generated instructions |
max-uses-for-sinking | int | — | Don't sink instructions with too many uses |
rp-aware-sink | bool | false | Consider register pressure impact when sinking |
instcombine-code-sinking | bool | false | Enable code sinking within InstCombine |
hoist-const-stores | bool | false | Hoist loop-invariant stores |
Category 15: Register Pressure / Allocation
NVIDIA-specific knobs plus LLVM greedy allocator knobs. See Register Allocation for the full algorithm.
NVIDIA RP Knobs
| Flag | Type | Default | Purpose |
|---|---|---|---|
maxreg | int | none | Maximum register count (--maxrregcount equivalent) |
register-usage-level | int | — | Register usage level control |
cta-reconfig-aware-mrpa | bool | false | CTA reconfiguration-aware machine RP analysis |
cta-reconfig-aware-rpa | bool | false | CTA reconfiguration-aware RP analysis |
pred-aware-mcse | bool | false | Predicate-aware MachineCSE |
rp-aware-mcse | bool | false | Register-pressure-aware MachineCSE |
verify-update-mcse | bool | false | Debug: verify incremental RP update in MachineCSE |
incremental-update-mcse | bool | true | Incrementally update register pressure analysis in MachineCSE |
print-verify | bool | false | Print problematic RP info if MCSE verification fails |
pred-target-adjust | int | 0 | Predicate register target adjustment (-10 to +10) |
donot-insert-dup-copies | bool | false | Skip duplicate copies to predecessor basic block |
nv-disable-mem2reg | bool | false | Disable machine-level mem2reg |
LLVM Greedy Allocator Knobs
| Flag | Type | Default | Purpose |
|---|---|---|---|
split-spill-mode | int | 1 | Spill mode: 0=default, 1=size, 2=speed |
lcr-max-depth | int | 5 | Last chance recoloring max recursion depth |
lcr-max-interf | int | 8 | Last chance recoloring max interferences |
exhaustive-register-search | bool | false | Bypass LCR depth/interference cutoffs |
enable-deferred-spilling | bool | false | Defer spill code to end of allocation |
grow-region-complexity-budget | int | 10000 | growRegion() edge budget for live range splitting |
split-threshold-for-reg-with-hint | int | 75 | Split threshold percentage for hinted registers |
Category 16: Restrict / Aliasing
Five NVIDIA-specific flags. See Alias Analysis.
| Flag | Type | Default | Purpose |
|---|---|---|---|
process-restrict | bool | false | Process __restrict__ keyword for alias analysis |
allow-restrict-in-struct | bool | false | Allow __restrict__ inside struct members |
apply-multi-level-restrict | bool | false | Apply restrict to all pointer levels |
dump-process-restrict | bool | false | Debug dump during restrict processing |
strict-aliasing | bool | false | Datatype-based strict aliasing |
Category 17: CSSA / deSSA
Four knobs. See CSSA.
| Flag | Type | Default | Purpose |
|---|---|---|---|
cssa-coalesce | int | — | Control PHI operand coalescing strategy |
cssa-verbosity | int | 0 | Verbosity level |
dump-before-cssa | bool | false | Dump specific PHI operands being coalesced |
usedessa | int | 2 | deSSA method: 0=off, 1=basic, 2=full |
Category 18: Loop / Unrolling
Eight NVIDIA-specific knobs (beyond the 20+ standard LLVM loop-unrolling flags).
| Flag | Type | Default | Purpose |
|---|---|---|---|
nv-disable-loop-unrolling | bool | false | Disable loop unrolling in all passes |
aggressive-runtime-unrolling | bool | false | OCG-style unrolling heuristics |
aggressive-runtime-unrolling-fixed-factor | int | — | Force fixed unroll factor |
aggressive-runtime-unrolling-max-factor | int | — | Maximum unroll factor |
aggressive-runtime-unrolling-max-filler-instructions-per-batch | int | — | Max filler instructions |
unroll-runtime-nv-expensive | bool | false | NVIDIA heuristics for expensive loops |
unroll-runtime-convergent | bool | false | Allow unrolling with convergent instructions |
track-trip-count-more | bool | false | Track loop trip count more aggressively |
Category 19: GEP / Address Strength Reduction
Eight NVIDIA-specific knobs. See Base Address Strength Reduction.
| Flag | Type | Default | Purpose |
|---|---|---|---|
normalize-gep | bool | false | Normalize 64-bit GEP subscripts |
dump-normalize-gep | bool | false | Debug dump for GEP normalization |
do-base-address-strength-reduce | int | 0 | Two levels: 1=unconditional, 2=with conditions |
dump-base-address-strength-reduce | bool | false | Debug dump |
do-lsr-64-bit | bool | false | Loop strength reduction for 64-bit (shared with LSR) |
do-sign-ext-expand | bool | false | Expand sign-extension during SCEV build |
balance-dot-chain | bool | false | Balance chain of dot operations |
special-reassociate-for-threadid | bool | false | Don't move back expressions containing thread ID |
Category 20: Aggregate / Byval Lowering
Ten knobs.
| Flag | Type | Default | Purpose |
|---|---|---|---|
aggressive-max-aggr-lower-size | int | — | Threshold size for lowering aggregates |
aggressive-lsv | bool | false | Merge smaller dtypes in aggregate before vectorization |
vect-split-aggr | bool | false | Split aggregates before vectorization |
lower-aggr-unrolled-stores-limit | int | — | Limit stores in unrolled aggregate lowering |
large-aggr-store-limit | int | — | Create loops for aggregate store exceeding limit |
lower-func-args | bool | true | Lower large aggregate function parameters |
lsa-opt | bool | false | Optimize copying of struct args to local memory |
skiploweraggcopysafechk | bool | false | Skip safety check in loweraggcopy |
memdep-cache-byval-loads | bool | true | Preprocess byval loads to reduce compile time |
ldstmemcpy-glue-max | int | — | Limit for gluing ld/st of memcpy |
Category 21: Normalization / Canonicalization
Four knobs.
| Flag | Type | Default | Purpose |
|---|---|---|---|
norm-fold-all | bool | false | Fold all regular instructions |
norm-preserve-order | bool | false | Preserve original instruction order |
norm-rename-all | bool | false | Rename all instructions |
norm-reorder-operands | bool | false | Sort/reorder operands in commutative operations |
Category 22: NVVM Infrastructure
Five knobs.
| Flag | Type | Default | Purpose |
|---|---|---|---|
nvvm-lower-printf | bool | false | Enable printf lowering |
nvvm-reflect-enable | bool | true | NVVM reflection (reads __CUDA_FTZ, __CUDA_PREC_DIV, etc.) |
nvvm-verify-show-info | bool | false | Info messages during NVVM verification |
enable-nvvm-peephole | bool | true | NVVM peephole optimizer |
nv-ocl | bool | false | Deprecated OpenCL compatibility flag |
Category 23: Compilation Control
Constructor: ctor_043_0 at 0x48D7F0 + ctor_028_0 at 0x489160.
| Flag | Type | Default | Purpose |
|---|---|---|---|
debug-compile | bool | false | Compile for debugging (set by -g) |
generate-line-info | bool | false | Emit line info even without -G |
nvptx-f32ftz | bool | false | Flush f32 subnormals to zero; hidden |
w | bool | false | Disable warnings; hidden |
Werror | bool | false | Treat warnings as errors; hidden |
Osize | bool | false | Optimize for code size; hidden |
Om | bool | false | Maximum optimization mode; hidden |
maxreg | int | none | Maximum register count (no limit if unset) |
nvptx-nan | bool | false | NaN handling control; hidden |
jump-table-density | int | 10 | Minimum density (%) for jump table lowering |
pass-control | int | -1 | Disable all optional passes after pass N; -1 = no limit |
disable-passno | list | empty | Disable pass(es) by number (comma-separated) |
sep-comp | bool | false | Separate compilation mode |
proffile | string | — | Filename for PGO profile information |
R | string | — | Resource constraint: name=<int> format |
lnk-disable-allopts | bool | false | Disable all linker optimization passes |
disable-peephole | bool | false | Disable peephole optimizer |
disable-early-taildup | bool | false | Disable pre-regalloc tail duplication |
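The jump-table-density knob (default 10) suggests a density test like the following when lowering a switch. The formula is an assumption from the knob's description: accept a jump table only when the case labels fill at least the given fraction of the label range.

```cpp
#include <cassert>

bool use_jump_table(long num_cases, long min_label, long max_label,
                    int min_density_pct /* jump-table-density, default 10 */) {
    long range = max_label - min_label + 1;
    if (range <= 0 || num_cases < 2) return false;   // degenerate switch
    // density (%) = num_cases / range * 100, compared without division
    return num_cases * 100 >= (long)min_density_pct * range;
}
```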
Category 24: Divergence / GPU Execution
Three flags.
| Flag | Type | Default | Purpose |
|---|---|---|---|
spec-exec-only-if-divergent-target | bool | false | Speculative execution only when target is divergent |
prefer-predicated-reduction-select | bool | false | Prefer predicated reduction over after-loop select |
openmp-opt-disable-barrier-elimination | bool | false | Disable OpenMP barrier elimination |
Category 25: MachinePipeliner (Swing Modulo Scheduling)
Eighteen LLVM-origin knobs for software pipelining. See Scheduling for the full algorithm.
| Flag | Type | Default | Global | Purpose |
|---|---|---|---|---|
| enable-pipeliner | bool | true | unk_503EE20 | Master switch for SMS |
| enable-pipeliner-opt-size | bool | false | qword_503ED40 | Enable SWP at -Os |
| pipeliner-max-mii | int | 27 | qword_503ECE8 | Maximum allowed MII |
| pipeliner-force-ii | int | 0 | qword_503EB80 | Force specific II (0 = auto) |
| pipeliner-max-stages | int | 3 | qword_503EB28 | Maximum pipeline stages |
| pipeliner-prune-deps | bool | true | qword_503E9C0 | Prune deps between unrelated Phi nodes |
| pipeliner-prune-loop-carried | bool | true | qword_503E8E0 | Prune loop-carried order deps |
| pipeliner-ignore-recmii | bool | false | qword_503E888 | Ignore RecMII; hidden |
| pipeliner-show-mask | bool | false | qword_503E720 | Debug: show scheduling mask |
| pipeliner-dbg-res | bool | false | qword_503E640 | Debug: resource usage |
| pipeliner-annotate-for-testing | bool | false | qword_503E5E8 | Annotate instead of codegen |
| pipeliner-experimental-cg | bool | false | qword_503E508 | Use peeling code generator |
| pipeliner-ii-search-range | int | 10 | qword_503E3A0 | Range to search for II |
| pipeliner-register-pressure | bool | false | qword_503E2C0 | Consider register pressure |
| pipeliner-register-pressure-margin | int | 5 | qword_503E1E0 | Margin % for reg pressure limit |
| pipeliner-mve-cg | bool | true | unk_503E100 | Use MVE code generator |
| pipeliner-enable-copytophi | bool | true | qword_503E020 | Enable CopyToPhi DAG Mutation |
| pipeliner-force-issue-width | int | 0 | qword_503DF40 | Force issue width (0 = auto) |
Category 26: LLVM Standard Inliner (Model B)
Seventeen LLVM-origin knobs from ctor_625_0 / ctor_715_0 at 0x58FAD0. These control the upstream InlineCostAnalysis::analyzeCall path; see Inliner Cost Model for why the NVIDIA custom model (Category 2) dominates in practice.
| Flag | Type | Default | Purpose |
|---|---|---|---|
| inline-threshold | int | 225 | Base inlining threshold |
| inlinedefault-threshold | int | 225 | Default when no hint/profile |
| inlinehint-threshold | int | 325 | Threshold for functions with the inlinehint attribute |
| inline-cold-callsite-threshold | int | 45 | Threshold for cold callsites |
| inlinecold-threshold | int | 45 | Threshold for functions with cold attribute |
| hot-callsite-threshold | int | 3000 | Threshold for hot callsites (PGO) |
| locally-hot-callsite-threshold | int | 525 | Threshold for locally hot callsites |
| inline-instr-cost | int | 5 | Cost per instruction |
| inline-call-penalty | int | 25 | Penalty per callsite in callee |
| inline-memaccess-cost | int | 0 | Cost per load/store |
| inline-savings-multiplier | int | 8 | Multiplier for cycle savings |
| inline-savings-profitable-multiplier | int | 4 | Multiplier for profitability check |
| inline-size-allowance | int | 100 | Max callee size inlined without savings proof |
| inline-cost-full | bool | false | Compute full cost even when over threshold |
| inline-enable-cost-benefit-analysis | bool | false | Enable cost-benefit analysis |
| inline-deferral | bool | — | Defer inlining in cold paths (PGO) |
| inline-remark-attribute | bool | false | Emit inline remarks |
Category 27: New PM CGSCC Inliner (Model C)
Two knobs for the New Pass Manager CGSCC inliner at 0x2613930. See Inliner Cost Model.
| Flag | Type | Default | Purpose |
|---|---|---|---|
| function-inline-cost-multiplier | int | — | Penalize recursive function inlining |
| enable-ml-inliner | enum | default | ML advisory mode: default, development, release |
Knob System 2: NVVMPassOptions
222 pass option slots initialized by sub_12D6300 (125KB). Each slot is accessed by integer index (1--221) and stored in a ~4,480-byte struct.
Access Functions
| Function | Purpose |
|---|---|
| sub_12D6170(base+120, index) | Fetch pass option descriptor by index |
| sub_1691920(base+8, index) | Fetch pass option value from table |
| sub_12D6090(a1+offset, ...) | Store string-typed option |
| sub_12D6100(a1+offset, ...) | Store integer-typed option |
| sub_12D6240(a1, index, "0") | Get option with default value |
See NVVMPassOptions for the complete 222-slot inventory.
Knob System 3: NVIDIA Codegen Knobs
Parsed from the NVVM container format by sub_1C20170 and sub_CD9990. See NVIDIA Custom Passes for the complete inventory.
Hidden / Obfuscated Flags
Obfuscated Flag (in ctor_043_0, decryption at ~0x48EE80)
A 4-byte CLI flag name computed via XOR-based obfuscation from unk_3F6F7C7:
v40 = v37 ^ (-109 * ((offset + 97) ^ 0x811C9DC5));
Stored at qword_4F857C0 with flag bits 0x87 | 0x38 = hidden + really-hidden. NVIDIA deliberately hides this option from static analysis using FNV-1a-like constants.
Environment Variable Backdoors
| Variable | Purpose | Location |
|---|---|---|
| NVVMCCWIZ | Wizard mode (value 553282) -- unlocks -v, -keep, -dryrun, -lgenfe, -opt, -llc, -lnk, -libnvvm | sub_8F9C90 |
| NVVM_IR_VER_CHK | Override IR version check (set to "0" to disable) | sub_12BFF60 |
| LLVM_OVERRIDE_PRODUCER | Override bitcode producer string (default "7.0.1") | ctor_154 at 0x4CE640 |
| MALLOC_CONF | jemalloc allocator tuning | sub_12FCDB0 |
| LIBNVVM_DISABLE_CONCURRENT_API | Force single-threaded NVVM compilation | ctor_104 at 0x4A5810 |
CLI Defaults Set by Flag Routing
These are effective defaults applied by the flag catalog parser (sub_9624D0), not by cl::opt constructors. When no user flag is specified, the parser injects these:
| CLI flag | Default value | Routed cl::opt |
|---|---|---|
| -arch=compute_<N> | compute_75 (SM 75, Turing) | target architecture |
| -opt=<N> | 3 (O3) | optimization level |
| -ftz=<N> | 0 (no flush-to-zero) | nvptx-f32ftz |
| -prec-sqrt=<N> | 1 (precise) | nvptx-prec-sqrtf32=1 |
| -prec-div=<N> | 1 (precise) | nvptx-prec-divf32=1 (CUDA) / =0 (CL) |
| -fma=<N> | 1 (enabled) | nvptx-fma-level=1 |
| -opt-fdiv=<N> | 0 (off) | optimizer fast-div control |
| -Ofast-compile=<level> | 0 (off) | fast-compile pipeline |
NVIDIA Modification Density
| Subsystem | NVIDIA Knobs | LLVM Knobs | Customization Rate |
|---|---|---|---|
| LSR | 11 | 5 | 69% |
| InstCombine | 12 | 4 | 75% |
| Inliner (NVIDIA custom) | 9 | 0 | 100% |
| Inliner (LLVM standard) | 0 | 17 | 0% |
| GVN | 8 | 3 | 73% |
| NVPTX Backend | 30+ | 0 | 100% |
| SimplifyCFG | 2 | 8+ | 20% |
| Memory Space Opt | 20 | 0 | 100% |
| Rematerialization (IR) | 21 | 0 | 100% |
| Rematerialization (MI) | 16 | 0 | 100% |
| Rematerialization (General) | 15 | 0 | 100% |
| SCEV-CGP | 11 | 0 | 100% |
| Register Pressure | 12 | 7 | 63% |
| Sinking / Code Motion | 5 | 6 | 45% |
| MachinePipeliner | 0 | 18 | 0% |
| Vectorizer | 0 | 18+ | 0% |
| SCEV | 0 | 10+ | 0% |
Cross-References
- NVVMPassOptions -- 222-slot pass option system
- CLI Flags -- complete flag-to-pipeline routing
- Environment Variables -- all verified env vars
- Optimization Levels -- O0/O1/O2/O3 and fast-compile pipelines
- Rematerialization -- multi-pass remat engine
- Memory Space Optimization -- address space resolution
- Sinking2 -- texture-aware sinking
- Register Allocation -- greedy RA with NVIDIA extensions
- Scheduling -- SMS and MRPA
- IPMSP -- memory space optimization engine
- Alias Analysis -- restrict propagation
- CodeGenPrepare -- SCEV-CGP pass
- Inliner Cost Model -- four parallel inliner models
- GVN -- GPU-specific value numbering
- LSR -- GPU-aware loop strength reduction
Environment Variables
cicc v13.0 checks 24 distinct environment variables across 36 files containing getenv() calls. Six are NVIDIA-specific (two obfuscated), nine come from the LLVM infrastructure (including the four temp-directory variants), six from the EDG frontend, and the remainder from the build system, memory allocator, and shared ptxas/nvptxcompiler infrastructure. Two of the NVIDIA variables have their names encrypted in the .rodata section using an XOR+ROT13 cipher to prevent discovery through string scanning.
String Deobfuscation Engine
The deobfuscation function sub_8F98A0 at 0x8F98A0 decrypts variable names and option strings from .rodata ciphertext. The same engine is also used for hidden CLI option names (see CLI Flags).
Algorithm: sub_8F98A0
// Reconstructed pseudocode — sub_8F98A0 (0x8F98A0)
// Inputs:
//   ciphertext — pointer to encrypted bytes in .rodata
//   base       — base address used as key seed (a2)
//   length     — number of bytes to decrypt
//   out        — caller-provided buffer (the real code decrypts onto its own stack)
// Output:
//   null-terminated plaintext written to out
void deobfuscate(const uint8_t* ciphertext, uintptr_t base, size_t length, char* out) {
    for (size_t i = 0; i < length; i++) {
        uint8_t raw = ciphertext[i];
        // Phase 1: XOR with a key derived from the byte position
        uint32_t key = (uint32_t)-109 * (((uint32_t)(i - base) + 97) ^ 0xC5);
        char ch = (char)(raw ^ (key & 0xFF));
        // Phase 2: ROT13 on alphabetic characters
        if (ch >= 'A' && ch <= 'Z')
            ch = ((ch - 'A' + 13) % 26) + 'A';
        else if (ch >= 'a' && ch <= 'z')
            ch = ((ch - 'a' + 13) % 26) + 'a';
        out[i] = ch;
    }
    out[length] = '\0';
}
Key constant: The multiplier -109 (signed, i.e. 0xFFFFFF93) and the XOR mask 0xC5 together form a position-dependent key stream. The ROT13 phase is applied after the XOR, meaning the plaintext must survive two transformations. This is a weak cipher by design -- it only needs to defeat strings(1) scanning, not serious cryptanalysis.
Obfuscated String Table in .rodata
All obfuscated strings live in a contiguous region near 0x3C23A7B--0x3C23AD6. Each entry is referenced by its end byte address (the deobfuscator walks backward):
| End address | Length | Decrypted plaintext | Purpose |
|---|---|---|---|
| byte_3C23AD6 | 14 | (option prefix) | CLI option prefix for -nvvm-version matching |
| byte_3C23AC3 | 11 | "nvvm-latest" | Option suffix; sets v253 = 1 (Path A) |
| byte_3C23AB4 | 6 | "nvvm70" | Option suffix; sets v253 = 0 (Path B) |
| byte_3C23AAD | 13 | (option name) | Option name for error message display |
| byte_3C23A9F | 15 | "NV_NVVM_VERSION" | Environment variable name for getenv() |
| byte_3C23A82 | 6 | "nvvm70" | Env var value comparison string |
| byte_3C23A7B | 11 | "nvvm-latest" | Env var value comparison string |
Additional encrypted copies exist at 0x42812C0 and 0x42812F0 for the two env var names (NV_NVVM_VERSION and LIBNVVM_NVVM_VERSION) used by sub_12B9F70.
Obfuscated CLI Flag (ctor_043)
A separate obfuscation instance at ~0x48EE80 in ctor_043 (0x48D7F0) decrypts a 4-byte hidden cl::opt name from data at unk_3F6F7C7. The algorithm variant uses FNV-1a-like constants:
v40 = v37 ^ (-109 * ((offset + 97) ^ 0x811C9DC5));
The 0x811C9DC5 constant is the FNV-1a 32-bit offset basis. The resulting 4-character option is registered with flag bits 0x87 | 0x38 = hidden + really-hidden, making it invisible even to --help-hidden. This is stored at qword_4F857C0.
NVIDIA-Specific Variables
NVVMCCWIZ
| Property | Value |
|---|---|
| Checked in | sub_8F9C90 (real main) at 0x8F9C90, specifically 0x8F9D36 |
| Expected value | "553282" (magic number = 0x87142) |
| Effect | Sets byte_4F6D280 = 1 -- unlocks developer/wizard mode |
Mechanism: The value is parsed via strtol(v, 0, 10) and compared against the integer 553282. Any other value is silently ignored.
What wizard mode does: When byte_4F6D280 = 1:
- -v actually enables verbose output (v259 = byte_4F6D280 instead of 0)
- -keep actually preserves intermediate files (v262 = byte_4F6D280)
- -dryrun enables verbose output as a side effect (v259 = byte_4F6D280)
- -lnk and -opt modes set v262 = byte_4F6D280 (keep temps)
Without wizard mode, -v and -keep are no-ops -- the flags are parsed but have no effect because they set their variables to byte_4F6D280 which is 0.
NVVM_IR_VER_CHK
| Property | Value |
|---|---|
| Checked in | sub_12BFF60 at 0x12BFF60 (NVVM IR version verifier, instance 1); sub_2259720 at 0x2259720 (instance 2) |
| Expected value | "0" to disable version checking |
| Effect | Controls NVVM IR bitcode version metadata validation |
Detailed mechanism (from sub_12BFF60, 9KB):
- Reads getenv("NVVM_IR_VER_CHK").
- If NULL or strtol(env, 0, 10) != 0: version checking is enabled (default).
- If set to "0": version checking is disabled (bypass).

When enabled, the function:
- Looks up "nvvmir.version" named metadata via sub_1632310(module, &name).
- Also checks "llvm.dbg.cu" metadata (debug compile unit presence).
- Iterates metadata operands, deduplicating via an open-addressing hash table:
  - Hash function: (value >> 9) ^ (value >> 4) & mask
  - Tombstone: 0xFFFFFFFFFFFFFFF0 (-16)
  - Empty: 0xFFFFFFFFFFFFFFF8 (-8)
- For each unique 2-element version tuple (major, minor):
  - Calls sub_12BDA30(modules, major, minor) for the IR compatibility check.
  - Special case: major==2, minor==0 always passes (sentinel for libdevice).
- For 4-element tuples (major, minor, debug_major, debug_minor):
  - Calls sub_12BD890(modules, debug_major, debug_minor) for the debug version check.
  - Special case: debug_major==3, debug_minor<=2 always passes.
The env var is checked multiple times per invocation: before IR version validation, before debug IR version validation, and at each version tuple comparison. Return code 3 indicates incompatible version.
Current expected versions: nvvmir.version = {2, minor<=0x62}, debug version = {3, minor<=2}.
LIBNVVM_DISABLE_CONCURRENT_API
| Property | Value |
|---|---|
| Checked in | ctor_104 at 0x4A5810 (global constructor) |
| Expected value | Any non-NULL value |
| Effect | Sets byte_4F92D70 = 1 -- disables thread-safe libnvvm API usage |
Safety valve for environments where concurrent libnvvm compilation causes issues. Any non-NULL value triggers single-threaded API behavior. See Concurrent Compilation.
NV_NVVM_VERSION (Obfuscated)
| Property | Value |
|---|---|
| Checked in | sub_12B9F70 at 0x12B9F70, sub_12BB580 at 0x12BB580, sub_8F9C90 at 0x8F9C90 |
| Encrypted at | 0x3C23A90 and 0x42812C0 (two copies, same ciphertext) |
| Decryption | XOR with (-109 * ((byte_offset - base + 97) ^ 0xC5)) then ROT13 |
| Expected values | "nvvm70" (suppresses check), "nvvm-latest" (forces latest mode) |
How it controls compilation path selection in sub_8F9C90:
The dispatch variable v253 starts at 2 (default). When v253 is still 2 at post-parse time (lines 1590--1692):
1. sub_8F98A0 decrypts the env var name from byte_3C23A9F[-15..0].
2. Calls getenv(decrypted_name).
3. Compares the result against two decrypted reference strings:
   - "nvvm70" (from byte_3C23A82): sets v253 = 0 (Path B -- NVVM/bitcode pipeline via sub_1262860 or sub_1265970)
   - "nvvm-latest" (from byte_3C23A7B): sets v253 = 1 (Path A -- PTX pipeline via sub_902D10 or sub_905EE0)
4. If neither matches: uses (arch > 99) as the tiebreaker, with further modulation by the -nvc and -optixir flags.
For multi-stage modes (v263 >= 3), the resolved path also determines which pipeline flag string is appended:
- Path A (v253 == 1): xmmword_3C23BC0 + "vm-latest" (25 bytes total, decodes to "-nvvm-version=nvvm-latest")
- Path B (v253 == 0): xmmword_3C23BC0 + "vm70" (20 bytes total, decodes to "-nvvm-version=nvvm70")
The variable name is encrypted in the binary's .rodata section because NVIDIA intended to keep this escape hatch undiscoverable through casual strings(1) scanning. It controls a fundamental compilation mode choice.
LIBNVVM_NVVM_VERSION (Obfuscated)
| Property | Value |
|---|---|
| Checked in | sub_12B9F70 at 0x12B9F70 |
| Encrypted at | 0x42812F0 |
| Expected values | Same as NV_NVVM_VERSION |
Functionally identical to NV_NVVM_VERSION. Both names are checked by the same function sub_12B9F70; this provides an alternative name for the same feature. Likely exists so that libnvvm API users can set LIBNVVM_NVVM_VERSION while standalone cicc users set NV_NVVM_VERSION.
LLVM_OVERRIDE_PRODUCER
| Property | Value |
|---|---|
| Checked in | ctor_036 at 0x48CC90, ctor_154 at 0x4CE640 |
| Expected value | Any string |
| Effect | Overrides the producer identification string in output bitcode metadata |
Dual constructor behavior:
- ctor_036 (0x48CC90): Reads LLVM_OVERRIDE_PRODUCER, falls back to "20.0.0" (the true LLVM version). Stored in qword_4F837E0.
- ctor_154 (0x4CE640): Reads LLVM_OVERRIDE_PRODUCER, falls back to "7.0.1" (the NVVM IR compatibility marker). Stored separately.
The bitcode writer (sub_1538EC0, 58KB) uses the ctor_154 value, producing "LLVM7.0.1" in the IDENTIFICATION_BLOCK. This means the output bitcode claims to be LLVM 7.0.1 format, even though cicc is built on LLVM 20.0.0 internally. Setting LLVM_OVERRIDE_PRODUCER overrides both constructors' values.
CAN_FINALIZE_DEBUG
| Property | Value |
|---|---|
| Checked in | sub_60F290 at 0x60F290, sub_4709E0 at 0x4709E0, sub_470DA0 at 0x470DA0 |
| Expected value | Controls debug finalization behavior |
| Effect | Gates debug information finalization passes |
Shared with ptxas and nvptxcompiler (same codebase origin). Controls whether debug information finalization passes execute. When unset, the default behavior applies. Three call sites confirmed.
LLVM Infrastructure Variables
AS_SECURE_LOG_FILE
Checked in ctor_720 at 0x5C0D60. Sets the secure log file path for the integrated assembler, registered as LLVM cl::opt "as-secure-log-file-name". Expected: a file path.
TMPDIR / TMP / TEMP / TEMPDIR
Checked in sub_16C5C30, sub_C843A0, and sub_721330. These are probed in priority order: TMPDIR first, then TMP, TEMP, TEMPDIR. The EDG frontend (sub_721330) only checks TMPDIR and falls back to "/tmp".
PATH
Checked in sub_16C5290, sub_16C7620, sub_C86E60. Standard PATH for findProgramByName lookups.
HOME
Checked in sub_C83840. Used by sys::path::home_directory with getpwuid_r() as fallback.
PWD
Checked in sub_16C56A0, sub_C82800. Used for fast current-directory resolution (faster than getcwd).
TERM
Checked in sub_7216D0 (EDG) and sub_16C6A40/sub_C86300 (LLVM). If TERM=="dumb", terminal colors are disabled. Otherwise, specific terminal type strings (ansi, xterm, screen, linux, cygwin, etc.) are matched by integer comparison to determine color capability.
EDG Frontend Variables
NOCOLOR
Checked in sub_67C750. Respects the no-color.org convention: if set to any value, all diagnostic coloring is disabled.
EDG_COLORS
Checked in sub_67C750. Custom color specification string for EDG diagnostics. Example: "error=01;31:warning=01;35:note=01;36:locus=01:quote=01".
GCC_COLORS
Checked in sub_67C750. Fallback if EDG_COLORS is not set. Default: "error=01;31:warning=01;35:note=01;36:locus=01:quote=01:range1=32". Provides GCC-compatible diagnostic coloring.
USR_INCLUDE
Checked in sub_720A60. Overrides the system include path (default: "/usr/include") for the EDG frontend.
EDG_BASE
Checked in sub_7239A0. Sets the EDG base directory for predefined configuration files. Stored in qword_4F07578.
EDG_MODULES_PATH
Checked in sub_723900. Adds an additional search path for C++ modules in the EDG frontend.
Build System / Parallelism
MAKEFLAGS
| Property | Value |
|---|---|
| Checked in | sub_1682BF0 at 0x1682BF0 |
| Effect | GNU Make jobserver integration for parallel compilation limiting |
Parses for --jobserver-auth= with either:
- A fifo: prefix: FIFO-based jobserver (modern GNU Make)
- An N,M pair: pipe file descriptor pair (classic GNU Make)
When detected, cicc integrates with the jobserver to limit concurrent compilation passes.
Memory Allocator
MALLOC_CONF
Checked in sub_12FCDB0 (jemalloc initialization, 131,600 bytes -- the largest function in its range). One of five configuration sources for the bundled jemalloc allocator. Expected: jemalloc config string such as "narenas:2,dirty_decay_ms:0".
Dynamic / Generic Access
Two mechanisms allow runtime access to arbitrary environment variables:
- --trace-env=VARNAME CLI flag (in sub_125FB30 and sub_900130): reads the named variable and injects its value into the compilation trace. This is a pass-through mechanism for build system integration.
- sub_C86120 (LLVM sys::Process::GetEnv wrapper): generic getenv helper called with dynamic name parameters by LLVM's option processing infrastructure.
Complete Inventory
| # | Env Var Name | Origin | Obfuscated | Category | Key Function |
|---|---|---|---|---|---|
| 1 | NVVMCCWIZ | NVIDIA | no | Developer mode | sub_8F9C90 |
| 2 | NVVM_IR_VER_CHK | NVIDIA | no | Version check gate | sub_12BFF60, sub_2259720 |
| 3 | LIBNVVM_DISABLE_CONCURRENT_API | NVIDIA | no | Thread safety | ctor_104 |
| 4 | NV_NVVM_VERSION | NVIDIA | yes | Version compat / path select | sub_12B9F70, sub_12BB580 |
| 5 | LIBNVVM_NVVM_VERSION | NVIDIA | yes | Version compat (alias) | sub_12B9F70 |
| 6 | LLVM_OVERRIDE_PRODUCER | LLVM/NVIDIA | no | Bitcode metadata | ctor_036, ctor_154 |
| 7 | CAN_FINALIZE_DEBUG | NVIDIA | no | Debug finalization | sub_60F290, sub_4709E0, sub_470DA0 |
| 8 | AS_SECURE_LOG_FILE | LLVM | no | Assembler logging | ctor_720 |
| 9 | TMPDIR | LLVM/EDG | no | Temp directory | sub_16C5C30, sub_C843A0, sub_721330 |
| 10 | TMP | LLVM | no | Temp directory (fallback) | sub_16C5C30, sub_C843A0 |
| 11 | TEMP | LLVM | no | Temp directory (fallback) | sub_16C5C30, sub_C843A0 |
| 12 | TEMPDIR | LLVM | no | Temp directory (fallback) | sub_16C5C30, sub_C843A0 |
| 13 | PATH | LLVM | no | Executable lookup | sub_16C5290, sub_16C7620, sub_C86E60 |
| 14 | HOME | LLVM | no | Home directory | sub_C83840 |
| 15 | PWD | LLVM | no | Working directory | sub_16C56A0, sub_C82800 |
| 16 | TERM | LLVM/EDG | no | Terminal type | sub_7216D0, sub_16C6A40, sub_C86300 |
| 17 | NOCOLOR | EDG | no | Color disable | sub_67C750 |
| 18 | EDG_COLORS | EDG | no | Color scheme | sub_67C750 |
| 19 | GCC_COLORS | EDG | no | Color scheme (fallback) | sub_67C750 |
| 20 | USR_INCLUDE | EDG | no | Include path | sub_720A60 |
| 21 | EDG_BASE | EDG | no | EDG base dir | sub_7239A0 |
| 22 | EDG_MODULES_PATH | EDG | no | Module search path | sub_723900 |
| 23 | MAKEFLAGS | Build | no | Jobserver | sub_1682BF0 |
| 24 | MALLOC_CONF | jemalloc | no | Allocator config | sub_12FCDB0 |
Decompiler Artifacts
Several getenv("bar") calls appear in ctor_106, ctor_107, ctor_376, ctor_614. These are not real environment variable checks. The pattern getenv("bar") == (char*)-1 is jemalloc's initialization probe testing whether getenv is intercepted by a sanitizer. The string "bar" is a dummy.
"getenv" as a string in ctor_133 (qword_4F9B700[502]) is a function name in a libc symbol table used by the EDG frontend for tracking known standard library functions.
"fegetenv" in sub_E42970 is a math library function name in a builtin table.
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| StringDeobfuscate | sub_8F98A0 | ~400B | XOR + ROT13 decryption engine |
| RealMain | sub_8F9C90 | 10,066B | Main entry; checks NVVMCCWIZ, dispatches via NV_NVVM_VERSION |
| NvvmVersionHelper | sub_12B9F70 | ~3KB | Reads NV_NVVM_VERSION / LIBNVVM_NVVM_VERSION; compares values |
| NvvmVersionHelper2 | sub_12BB580 | ~3KB | Second call site for NV_NVVM_VERSION |
| CheckIRVersion | sub_12BDA30 | ~1KB | IR major/minor compatibility check |
| CheckDebugVersion | sub_12BD890 | ~1KB | Debug IR major/minor compatibility check |
| NVVMIRVersionCheck | sub_12BFF60 | 9KB | Full NVVM IR version validator; reads NVVM_IR_VER_CHK |
| NVVMIRVersionCheck2 | sub_2259720 | 14KB | Second instance of version checker |
| JemallocInit | sub_12FCDB0 | 131,600B | jemalloc config parser; reads MALLOC_CONF |
| JobserverParser | sub_1682BF0 | ~2KB | MAKEFLAGS --jobserver-auth parser |
| GenericGetEnv | sub_C86120 | ~100B | LLVM sys::Process::GetEnv wrapper |
| EDGColorInit | sub_67C750 | ~2KB | NOCOLOR / EDG_COLORS / GCC_COLORS handler |
Cross-References
- CLI Flags -- flag-to-pipeline routing, -v/-keep wizard mode dependency
- Knobs -- internal configuration knobs (separate from env vars)
- Pipeline Entry -- sub_8F9C90 real main, v253 dispatch logic
- Concurrent Compilation -- LIBNVVM_DISABLE_CONCURRENT_API
- Libdevice Linking -- NVVM_IR_VER_CHK bypass for libdevice
- Debug Verify -- CAN_FINALIZE_DEBUG, NVVM_IR_VER_CHK interaction
- NVVM Container -- NvvmIRVersion/NvvmDebugVersion in container header