Code Generation

NVPTX backend: SelectionDAG lowering, instruction selection, register allocation, and machine-level passes. Address range 0x1700000-0x35EFFFF (~37 MB of code) -- the largest address range in the binary. This page is the hub for the entire code generation pipeline; each stage has a dedicated deep-dive page linked below.

- SelectionDAG & ISel — build, legalize, combine, select
- Type Legalization — 348KB monolithic dispatch
- ISel Pattern Matching — three-level dispatch, 900KB
- Register Allocation — pressure-driven greedy RA
- NVPTX Register Classes — nine classes, ID map
- Instruction Scheduling — MRPA, pipeliner, post-RA
- Machine-Level Passes — MRPA, remat, LDG, peephole
- StructurizeCFG — mandatory structured control flow
- CodeGenPrepare & SCEV-CGP — IR-level backend prep
- KnownBits & DemandedBits — fused analysis with GPU SR oracle
- MMA Code Generation — HMMA/IMMA/WGMMA/tcgen05 lowering pipeline
- Tensor / MMA Builtins — per-ID reference, validation rules
- Atomic Builtins — scope-aware atom lowering
- NVPTX Target Infrastructure — TargetMachine, TTI, SubtargetFeatures
- LiveRangeCalc — dual-bitvector liveness
- Rematerialization — IR-level + machine-level remat
- InstrEmitter — DAG-to-MachineInstr conversion
- SelectionDAG Node Structure — 104-byte SDNode

Architecture

The code generation pipeline runs after the LLVM optimizer and produces MachineIR that the PTX emission stage serializes to text. The pipeline follows upstream LLVM's SelectionDAG architecture with NVIDIA-specific passes inserted at key points.

LLVM IR
  │
  ├─ CodeGenPrepare (IR-level backend prep)
  │    sub_1D70000-1D7FFFF: sunkaddr, sunk_phi, block splitting
  │
  ├─ SelectionDAG Build
  │    sub_2065D30 (visit dispatcher)
  │    sub_2056920 (major worker, 69KB)
  │    sub_2077400 (NVVM tex/surf handle lowering) ★ NVIDIA
  │    sub_2072590 (NVPTX argument passing, 38KB) ★ NVIDIA
  │
  ├─ LegalizeTypes
  │    sub_20019C0 (348KB main loop)
  │    sub_201E5F0 (opcode dispatch, 81KB)
  │    sub_201BB90 (expand integer, 75KB)
  │
  ├─ LegalizeOp
  │    sub_1FFB890 (169KB, type action dispatch)
  │    sub_1FF6F70 (43KB, atomic target-specific lowering) ★ NVIDIA
  │
  ├─ DAG Combining
  │    sub_F681E0 (65KB, top-level orchestrator)
  │    sub_F20C20 (64KB, visitNode main)
  │
  ├─ Instruction Selection
  │    sub_3090F90 (91KB, NVPTXDAGToDAGISel::Select) ★ NVIDIA
  │    sub_33D4EF0 (complex addressing, calls sub_969240 399×)
  │
  ├─ Instruction Scheduling
  │    sub_355F610 (64KB, ScheduleDAGMILive post-RA)
  │    sub_3563190 (58KB, MachinePipeliner)
  │
  ├─ Register Allocation
  │    sub_2F49070 (82KB, RAGreedy::selectOrSplit)
  │    sub_2F2D9F0 (93KB, LiveRangeSplitter)
  │
  ├─ Machine-Level Passes
  │    MRPA, Block Remat, Mem2Reg, LDG, Peephole, etc.
  │
  └─ StructurizeCFG
       sub_35CC920 (95KB, mandatory for PTX structured control flow)

Items marked ★ NVIDIA are NVIDIA-proprietary additions not present in upstream LLVM.

Stage Overview

CodeGenPrepare (detail) -- last IR-level pass before ISel. Sinks address computations, creates PHI nodes for sunk values, and splits critical edges. NVIDIA adds an optional SCEV-CGP extension.

SelectionDAG Build (detail) -- converts LLVM IR into a target-independent DAG. NVPTX intercepts for .param-space argument passing and texture/surface handle lowering.
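
The .param-space interception can be illustrated with a small sketch. This is a toy model, not the binary's implementation: the type table is abbreviated, and real cicc emits mangled parameter names and handles aggregates as byte arrays.

```python
# Toy model of .param-space argument lowering: map a kernel's parameter
# types to PTX-style .param declarations. Type names, sizes, and the
# ".param .bN" spelling are simplified from the PTX ISA for illustration.
PARAM_SIZES = {"i32": 4, "i64": 8, "f32": 4, "f64": 8, "ptr": 8}

def lower_params(kernel, arg_types):
    """Return one .param declaration per kernel argument, in order."""
    decls = []
    for i, ty in enumerate(arg_types):
        bits = PARAM_SIZES[ty] * 8
        decls.append(f".param .b{bits} {kernel}_param_{i}")
    return decls
```

For example, `lower_params("kern", ["ptr", "i32"])` yields a 64-bit pointer parameter followed by a 32-bit integer parameter, mirroring how each argument gets its own .param slot.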

Type Legalization (detail) -- rewrites every illegal type into legal equivalents via promote, expand, soften, or split-vector actions.
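
The four actions can be sketched as a tiny decision function. This models the idea only; the real 348KB dispatch covers floats (soften), many vector shapes, and per-target overrides, and the widths below are assumed NVPTX-like rather than read out of the binary.

```python
# Toy model of type-legalization actions for a target where i16/i32/i64
# are the legal scalar integer types. Purely illustrative.
LEGAL, PROMOTE, EXPAND, SPLIT_VECTOR = "legal", "promote", "expand", "split"

LEGAL_SCALAR_BITS = {16, 32, 64}  # assumed legal integer widths

def scalar_action(bits):
    if bits in LEGAL_SCALAR_BITS:
        return LEGAL
    if bits < min(LEGAL_SCALAR_BITS):
        return PROMOTE            # e.g. i1/i8 widen to i16
    return EXPAND                 # e.g. i128 becomes a pair of i64 operations

def vector_action(elem_bits, num_elems, max_vector_bits=128):
    if scalar_action(elem_bits) != LEGAL:
        return PROMOTE            # legalize the element type first
    if elem_bits * num_elems > max_vector_bits:
        return SPLIT_VECTOR       # halve the vector until it fits
    return LEGAL
```

The key property the real legalizer shares with this sketch: every type maps to exactly one action, and applying actions repeatedly always terminates in legal types.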

Operation Legalization -- processes nodes whose opcodes are illegal for the target. Atomic operations receive NVIDIA-specific scope-aware lowering (CTA/GPU/SYS) with per-SM feature gates.
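
A minimal sketch of scope-aware lowering, assuming a simplified mnemonic scheme: pick the scope qualifier for a PTX atom instruction and gate it on the SM version. The sm_60 threshold for system scope reflects when scoped atomics appeared in PTX, but the exact gating logic here is an assumption, not the binary's.

```python
# Sketch of scope-aware atomic lowering (CTA/GPU/SYS). Mnemonics follow
# the "atom.<scope>.<op>" shape of the PTX ISA; the feature gate below is
# illustrative of the per-SM checks described above.
SCOPE_SUFFIX = {"cta": ".cta", "gpu": ".gpu", "sys": ".sys"}

def lower_atomic(op, scope, sm):
    """Return the PTX mnemonic for an atomic op at the given scope,
    or raise if the target SM does not support that scope."""
    if scope == "sys" and sm < 60:
        raise ValueError("system-scope atomics require sm_60+")
    return f"atom{SCOPE_SUFFIX[scope]}.{op}"
```

So `lower_atomic("add.u32", "gpu", 70)` produces `atom.gpu.add.u32`, while requesting system scope on an older target is rejected rather than silently narrowed.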

DAG Combining -- folds redundant operations, canonicalizes patterns, and reduces the DAG before instruction selection. The KnownBits analysis feeds into combining decisions.
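
How KnownBits feeds a combine can be shown with a minimal model: track (known-zero, known-one) masks per value and fold an AND to a constant when every masked bit is known. This is the standard KnownBits idea, not the binary's fused implementation.

```python
# Minimal KnownBits-style facts and one combine that consumes them.
class KnownBits:
    def __init__(self, zero=0, one=0):
        assert zero & one == 0          # a bit cannot be both known-0 and known-1
        self.zero, self.one = zero, one

def known_and(a, b):
    # A result bit is 0 if it is 0 in either input; 1 only if 1 in both.
    return KnownBits(a.zero | b.zero, a.one & b.one)

def fold_and(lhs, mask):
    """Fold (lhs & mask) to a constant when all masked bits of lhs are known."""
    unknown = ~(lhs.zero | lhs.one)
    if mask & unknown == 0:
        return lhs.one & mask           # constant result
    return None                         # cannot fold; leave the node alone
```

The combiner's pattern is exactly this shape: query the analysis, and only rewrite when the facts prove the rewrite is a no-op at runtime.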

Instruction Selection (detail) -- matches DAG nodes against PTX instruction patterns via a three-level dispatch hierarchy. A compressed per-SM-variant legality table gates which opcodes exist on which GPU architecture.
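
A compressed legality table of this kind can be modeled as one bitmask per opcode with one bit per SM variant. The opcodes and thresholds below are illustrative (WGMMA from sm_90, tcgen05 from sm_100 match the public PTX feature timeline, but the table contents are not extracted from the binary).

```python
# Sketch of a per-SM-variant legality table: opcode -> bitmask of variants
# on which the opcode exists. Lookup is a single AND, as a compressed
# table permits.
SM_VARIANTS = ["sm_70", "sm_75", "sm_80", "sm_90", "sm_100"]
BIT = {sm: 1 << i for i, sm in enumerate(SM_VARIANTS)}

LEGALITY = {
    "HMMA":    sum(BIT[sm] for sm in SM_VARIANTS),   # everywhere in this toy table
    "WGMMA":   BIT["sm_90"] | BIT["sm_100"],         # Hopper onward
    "tcgen05": BIT["sm_100"],                        # Blackwell only
}

def is_legal(opcode, sm):
    return bool(LEGALITY.get(opcode, 0) & BIT[sm])
```

The selector consults such a gate before attempting a pattern, so an opcode simply does not match on architectures where it does not exist.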

Instruction Scheduling (detail) -- post-RA scheduling plus an optional software pipeliner. NVIDIA's custom MRPA provides incremental register pressure tracking.

Register Allocation (detail) -- pressure-driven greedy allocator adapted for PTX's virtual register model. Works with nine typed register classes; live range splitting and rematerialization reduce spill pressure.
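
The pressure the allocator reacts to is just the maximum number of simultaneously live intervals. A sketch of that computation, over half-open intervals, assuming a simplified linear slot index:

```python
# Register pressure as maximum overlap of live intervals [start, end).
# A sweep over interval endpoints; toy version of the pressure tracking
# that drives eviction and live-range-split decisions.
def max_pressure(intervals):
    events = []
    for start, end in intervals:
        events.append((start, 1))    # interval becomes live
        events.append((end, -1))     # interval dies (end is exclusive)
    events.sort()                    # at equal positions, deaths sort first
    live = peak = 0
    for _, delta in events:
        live += delta
        peak = max(peak, live)
    return peak
```

Splitting a long interval into two shorter ones can lower this peak, which is precisely why live range splitting reduces spill pressure.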

Machine-Level Passes (detail) -- NVIDIA-proprietary and stock LLVM passes that optimize register pressure, promote stack objects back to registers, and prepare clean PTX for ptxas.

StructurizeCFG (detail) -- mandatory pass that converts arbitrary CFGs into the structured form PTX requires, rejecting irreducible CFGs and EH funclets.
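
The irreducibility rejection can be sketched with the classic T1/T2 test: a CFG is reducible iff repeatedly removing self-loops (T1) and merging single-predecessor nodes into their predecessor (T2) collapses it to one node. This is the textbook check, offered as a model of what "rejecting irreducible CFGs" means, not the pass's actual algorithm.

```python
# T1/T2 reducibility check on an adjacency dict mapping every node to its
# successor set. Reducible CFGs collapse to a single node; irreducible
# ones (e.g. two blocks jumping into each other's middles) get stuck.
def is_reducible(succ, entry):
    succ = {n: set(s) for n, s in succ.items()}     # work on a copy
    changed = True
    while changed and len(succ) > 1:
        changed = False
        for n in succ:                               # T1: drop self-loops
            if n in succ[n]:
                succ[n].discard(n)
                changed = True
        for n in list(succ):                         # T2: fold single-pred nodes
            if n == entry or n not in succ:
                continue
            preds = [p for p in succ if n in succ[p]]
            if len(preds) == 1:
                p = preds[0]
                succ[p] |= succ.pop(n)               # predecessor absorbs n
                succ[p].discard(n)
                changed = True
    return len(succ) == 1
```

A diamond with a back edge collapses fully; the two-headed loop `A→{B,C}, B→C, C→B` never does, and a structurizer must reject (or pre-transform) such graphs before emitting PTX.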

Two-Stage Compilation: cicc + ptxas

CUDA compilation is a two-stage process. cicc (this binary) compiles CUDA/NVVM IR down to PTX assembly text -- a virtual ISA with unlimited registers and structured control flow. ptxas then compiles the PTX into SASS machine code for a specific SM target. This split means that many of cicc's code generation decisions (register allocation, instruction scheduling, peephole optimization) are revisited by ptxas with full hardware knowledge.

cicc's code generation pipeline therefore optimizes for two audiences simultaneously: (1) reducing register pressure and producing clean PTX that gives ptxas maximum optimization freedom, and (2) performing target-aware lowering (type legalization, instruction selection, structured CFG) that ptxas cannot undo. The practical consequence is that cicc's backend is pressure-driven rather than latency-driven -- scheduling for low register count matters more than scheduling for pipeline throughput, because ptxas will re-schedule for the hardware but cannot reduce register demand below what cicc emitted.
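
The pressure-driven trade-off can be made concrete with a toy example: the same four loads and three adds, scheduled latency-style (all loads up front) versus pressure-style (consume each pair before starting the next). The live-count model below is an illustration, not cicc's scheduler.

```python
# Count peak live temporaries for a schedule given as (defined_value,
# operand_values) tuples; a value dies at its last use.
def peak_live(schedule):
    last_use = {}
    for i, (_, ops) in enumerate(schedule):
        for v in ops:
            last_use[v] = i
    live, peak = set(), 0
    for i, (dst, ops) in enumerate(schedule):
        live.add(dst)
        peak = max(peak, len(live))
        for v in ops:
            if last_use.get(v) == i:
                live.discard(v)          # operand's last use: it dies here
    return peak

# Latency-friendly: issue all loads first, then the adds.
loads_first = [("a", ()), ("b", ()), ("c", ()), ("d", ()),
               ("t1", ("a", "b")), ("t2", ("c", "d")), ("r", ("t1", "t2"))]
# Pressure-friendly: consume each loaded pair before loading the next.
adds_early = [("a", ()), ("b", ()), ("t1", ("a", "b")),
              ("c", ()), ("d", ()), ("t2", ("c", "d")), ("r", ("t1", "t2"))]

assert peak_live(loads_first) == 5   # all four loads live at once, plus t1
assert peak_live(adds_early) == 4    # shorter live ranges, lower peak
```

ptxas can later re-interleave the loads for the hardware pipeline, but it cannot push the register demand below what the PTX schedule implies -- which is why cicc prefers the second ordering.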

Cross-References