Code Generation
NVPTX backend: SelectionDAG lowering, instruction selection, register allocation, and machine-level passes. Address range 0x1700000–0x35EFFFF (~37 MB of code) -- the largest address range in the binary. This page is the hub for the entire code generation pipeline; each stage has a dedicated deep-dive page linked below.
| Stage | Deep-dive page |
|---|---|
| SelectionDAG pipeline | SelectionDAG & ISel — build, legalize, combine, select |
| Type legalization | Type Legalization — 348KB monolithic dispatch |
| ISel patterns | ISel Pattern Matching — three-level dispatch, 900KB |
| Register allocation | Register Allocation — pressure-driven greedy RA |
| Register classes | NVPTX Register Classes — nine classes, ID map |
| Scheduling | Instruction Scheduling — MRPA, pipeliner, post-RA |
| Machine passes | Machine-Level Passes — MRPA, remat, LDG, peephole |
| StructurizeCFG | StructurizeCFG — mandatory structured control flow |
| CodeGenPrepare | CodeGenPrepare & SCEV-CGP — IR-level backend prep |
| KnownBits | KnownBits & DemandedBits — fused analysis with GPU SR oracle |
| Tensor core codegen | MMA Code Generation — HMMA/IMMA/WGMMA/tcgen05 lowering pipeline |
| Tensor core builtins | Tensor / MMA Builtins — per-ID reference, validation rules |
| Atomics | Atomic Builtins — scope-aware atom lowering |
| Target infrastructure | NVPTX Target Infrastructure — TargetMachine, TTI, SubtargetFeatures |
| Live range calc | LiveRangeCalc — dual-bitvector liveness |
| Rematerialization | Rematerialization — IR-level + machine-level remat |
| InstrEmitter | InstrEmitter — DAG-to-MachineInstr conversion |
| DAG node layout | SelectionDAG Node Structure — 104-byte SDNode |
Architecture
The code generation pipeline runs after the LLVM optimizer and produces MachineIR that the PTX emission stage serializes to text. The pipeline follows upstream LLVM's SelectionDAG architecture with NVIDIA-specific passes inserted at key points.
LLVM IR
│
├─ CodeGenPrepare (IR-level backend prep)
│ sub_1D70000-1D7FFFF: sunkaddr, sunk_phi, block splitting
│
├─ SelectionDAG Build
│ sub_2065D30 (visit dispatcher)
│ sub_2056920 (major worker, 69KB)
│ sub_2077400 (NVVM tex/surf handle lowering) ★ NVIDIA
│ sub_2072590 (NVPTX argument passing, 38KB) ★ NVIDIA
│
├─ LegalizeTypes
│ sub_20019C0 (348KB main loop)
│ sub_201E5F0 (opcode dispatch, 81KB)
│ sub_201BB90 (expand integer, 75KB)
│
├─ LegalizeOp
│ sub_1FFB890 (169KB, type action dispatch)
│ sub_1FF6F70 (43KB, atomic target-specific lowering) ★ NVIDIA
│
├─ DAG Combining
│ sub_F681E0 (65KB, top-level orchestrator)
│ sub_F20C20 (64KB, visitNode main)
│
├─ Instruction Selection
│ sub_3090F90 (91KB, NVPTXDAGToDAGISel::Select) ★ NVIDIA
│ sub_33D4EF0 (complex addressing, calls sub_969240 399×)
│
├─ Instruction Scheduling
│ sub_355F610 (64KB, ScheduleDAGMILive post-RA)
│ sub_3563190 (58KB, MachinePipeliner)
│
├─ Register Allocation
│ sub_2F49070 (82KB, RAGreedy::selectOrSplit)
│ sub_2F2D9F0 (93KB, LiveRangeSplitter)
│
├─ Machine-Level Passes
│ MRPA, Block Remat, Mem2Reg, LDG, Peephole, etc.
│
└─ StructurizeCFG
sub_35CC920 (95KB, mandatory for PTX structured control flow)
Items marked ★ NVIDIA are NVIDIA-proprietary additions not present in upstream LLVM.
Stage Overview
CodeGenPrepare (detail) -- last IR-level pass before ISel. Sinks address computations, creates PHI nodes for sunk values, and splits critical edges. NVIDIA adds an optional SCEV-CGP extension.
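As a toy model of the sinking decision (function name, dict shapes, and kind strings are invented for this sketch, not the binary's symbols), the question reduces to: is this value an address computation whose memory use lives in a different block?

```python
# Toy model of CodeGenPrepare-style address sinking; illustrative only.
def sink_addr(defs: dict, uses: dict) -> set:
    """defs maps value -> defining block; uses maps value -> (using block, kind).
    Returns values worth duplicating ("sunkaddr") next to their memory use:
    address computations consumed by a load/store in a different block."""
    return {v for v, (use_block, kind) in uses.items()
            if kind in ("load", "store") and defs.get(v) != use_block}
```

Placing the duplicate next to the load lets instruction selection fold it into an addressing mode instead of keeping the value live across blocks.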
SelectionDAG Build (detail) -- converts LLVM IR into a target-independent DAG. NVPTX intercepts for .param-space argument passing and texture/surface handle lowering.
Type Legalization (detail) -- rewrites every illegal type into legal equivalents via promote, expand, soften, or split-vector actions.
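For the scalar-integer case, the promote/expand choice can be sketched as a small dispatch. This is a simplification under assumed legal widths; the real 348KB loop also handles FP softening and vector splitting:

```python
from enum import Enum

class Action(Enum):
    LEGAL = 0     # natively supported
    PROMOTE = 1   # widen small integers (e.g. i8 -> i16)
    EXPAND = 2    # break large integers into legal pieces (e.g. i128 -> 2 x i64)

LEGAL_INT_WIDTHS = {16, 32, 64}  # assumed legal scalar widths for this sketch

def pick_int_action(bits: int) -> Action:
    if bits in LEGAL_INT_WIDTHS:
        return Action.LEGAL
    if bits < min(LEGAL_INT_WIDTHS):
        return Action.PROMOTE   # i1/i8 widened to the nearest legal width
    return Action.EXPAND        # i128 split into i64 halves, recursively
```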
Operation Legalization -- processes nodes whose opcodes are illegal for the target. Atomic operations receive NVIDIA-specific scope-aware lowering (CTA/GPU/SYS) with per-SM feature gates.
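A minimal sketch of scope-aware lowering, assuming the PTX `atom.{scope}.{op}.{type}` form; the sm_60 cutoff and the fallback behavior are assumptions of this sketch, not values read out of the binary:

```python
def lower_atomic(op: str, scope: str, dtype: str, sm: int) -> str:
    """Map an atomic op + memory scope to a PTX mnemonic (illustrative)."""
    assert scope in ("cta", "gpu", "sys")
    if sm < 60:
        if scope == "sys":
            raise ValueError("system-scope atomics unsupported on this target")
        # Unscoped atom has device (gpu) scope, a safe strengthening of cta.
        return f"atom.{op}.{dtype}"
    return f"atom.{scope}.{op}.{dtype}"
```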
DAG Combining -- folds redundant operations, canonicalizes patterns, and reduces the DAG before instruction selection. The KnownBits analysis feeds into combining decisions.
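The feedback from KnownBits into combining can be illustrated with the standard transfer functions for AND/OR (a simplified sketch; the class and function names here are made up):

```python
class KnownBits:
    """Each bit is known-zero, known-one, or unknown; zero & one never overlap."""
    def __init__(self, zero: int = 0, one: int = 0):
        assert zero & one == 0
        self.zero, self.one = zero, one

def known_and(a: KnownBits, b: KnownBits) -> KnownBits:
    # A result bit is 0 if either input bit is known 0; 1 only if both are 1.
    return KnownBits(a.zero | b.zero, a.one & b.one)

def known_or(a: KnownBits, b: KnownBits) -> KnownBits:
    return KnownBits(a.zero & b.zero, a.one | b.one)
```

If the analysis proves a mask touches only already-known bits, the combiner can delete the AND outright.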
Instruction Selection (detail) -- matches DAG nodes against PTX instruction patterns via a three-level dispatch hierarchy. A compressed per-SM-variant legality table gates which opcodes exist on which GPU architecture.
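A hypothetical sketch of the dispatch idea: a pattern table keyed by (opcode, value type) whose entries carry a minimum-SM gate. The table contents below are invented for illustration:

```python
# (opcode, value type) -> (PTX mnemonic, minimum SM); entries are illustrative.
PATTERNS = {
    ("ADD", "i32"): ("add.s32", 20),
    ("ADD", "f16"): ("add.f16", 53),   # f16 arithmetic gated to sm_53+
    ("FMA", "f64"): ("fma.rn.f64", 20),
}

def select(opcode: str, vt: str, sm: int):
    entry = PATTERNS.get((opcode, vt))
    if entry is None:
        return None                    # no pattern: should have been legalized away
    mnemonic, min_sm = entry
    return mnemonic if sm >= min_sm else None  # legality table says no on this SM
```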
Instruction Scheduling (detail) -- post-RA scheduling plus an optional software pipeliner. NVIDIA's custom MRPA provides incremental register pressure tracking.
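The pressure-tracking idea reduces to maintaining a live set while walking a scheduling region and recording its peak. This is a simplistic model (the real MRPA is incremental and tracks pressure per register class):

```python
def max_pressure(instrs):
    """instrs: list of (defs, uses) tuples of register names, in program order.
    Returns the peak count of simultaneously live values."""
    last_use = {}
    for i, (_, uses) in enumerate(instrs):
        for r in uses:
            last_use[r] = i            # remember each value's final consumer
    live, peak = set(), 0
    for i, (defs, uses) in enumerate(instrs):
        live |= set(defs)
        peak = max(peak, len(live))
        live -= {r for r in uses if last_use.get(r) == i}   # kill at last use
    return peak
```

The peak returned here is exactly the quantity a pressure-driven scheduler tries to minimize when comparing candidate orderings.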
Register Allocation (detail) -- pressure-driven greedy allocator adapted for PTX's virtual register model. Works with nine typed register classes; live range splitting and rematerialization reduce spill pressure.
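The greedy discipline can be sketched as a max-priority queue over spill weights: heavier (more frequently used) live ranges pick registers first, and a range that finds no free register is spilled (the real allocator would try splitting and rematerialization first). Data shapes here are invented:

```python
import heapq

def allocate(ranges, num_regs):
    """ranges: list of (name, spill_weight, interfering_names). Assigns
    registers 0..num_regs-1 greedily by descending weight; no splitting."""
    conflicts = {name: set(inter) for name, _, inter in ranges}
    heap = [(-weight, name) for name, weight, _ in ranges]
    heapq.heapify(heap)
    assignment, spilled = {}, []
    while heap:
        _, name = heapq.heappop(heap)
        taken = {assignment[n] for n in conflicts[name] if n in assignment}
        free = [r for r in range(num_regs) if r not in taken]
        if free:
            assignment[name] = free[0]
        else:
            spilled.append(name)       # real RA would split/remat before spilling
    return assignment, spilled
```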
Machine-Level Passes (detail) -- NVIDIA-proprietary and stock LLVM passes that optimize register pressure, promote stack objects back to registers, and prepare clean PTX for ptxas.
StructurizeCFG (detail) -- mandatory pass that converts arbitrary CFGs into the structured form PTX requires, rejecting irreducible CFGs and EH funclets.
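The irreducibility rejection can be modeled with the classic T1/T2 reduction test: repeatedly drop self-loops and merge single-predecessor nodes; the CFG is reducible iff this collapses it to one node, and a structurizer can reject any graph where the process gets stuck. A sketch, with node naming and representation invented:

```python
def is_reducible(succ, entry):
    """succ: {node: set(successors)}; every node appears as a key.
    Returns True iff T1 (drop self-loops) + T2 (merge a node into its unique
    predecessor) collapse the CFG to a single node."""
    succ = {n: set(s) for n, s in succ.items()}
    changed = True
    while changed and len(succ) > 1:
        changed = False
        for n in succ:
            succ[n].discard(n)                      # T1: remove self-loops
        for n in list(succ):
            if n == entry:
                continue
            preds = [p for p in succ if n in succ[p]]
            if len(preds) == 1:                     # T2: unique predecessor
                p = preds[0]
                succ[p].discard(n)
                succ[p] |= succ.pop(n)
                changed = True
                break
    return len(succ) == 1
```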
Two-Stage Compilation: cicc + ptxas
CUDA compilation is a two-stage process. cicc (this binary) compiles CUDA/NVVM IR down to PTX assembly text -- a virtual ISA with unlimited registers and structured control flow. ptxas then compiles the PTX into SASS machine code for a specific SM target.

This split means that many of cicc's code generation decisions (register allocation, instruction scheduling, peephole optimization) are revisited by ptxas with full hardware knowledge. cicc's code generation pipeline therefore optimizes for two goals simultaneously: (1) reducing register pressure and producing clean PTX that gives ptxas maximum optimization freedom, and (2) performing target-aware lowering (type legalization, instruction selection, structured CFG) that ptxas cannot undo.

The practical consequence is that cicc's backend is pressure-driven rather than latency-driven -- scheduling for low register count matters more than scheduling for pipeline throughput, because ptxas will re-schedule for the hardware but cannot reduce register demand below what cicc emitted.
Cross-References
- NVPTX Subtarget & feature flags -- SM processor table, type legality offsets
- GPU target feature gates -- per-SM architecture feature matrix
- DAG node structure -- SDNode 104-byte layout, operand stride
- Pattern database -- ISel pattern table format
- NVPTX machine opcodes -- opcode reference
- Address spaces -- global, shared, local, param encoding
- PTX emission -- downstream consumer of machine-level output
- Register coalescing -- pre-RA copy elimination
- PrologEpilogInserter -- .localframe layout