Code Generation

NVPTX backend: SelectionDAG lowering, instruction selection, register allocation, and machine-level passes. Address range 0x1700000-0x35EFFFF (~37 MB of code) -- the largest address range in the binary. This page is the hub for the entire code generation pipeline; each stage has a dedicated deep-dive page linked below.

- SelectionDAG & ISel — build, legalize, combine, select
- Type Legalization — 348KB monolithic dispatch
- ISel Pattern Matching — three-level dispatch, 900KB
- Register Allocation — pressure-driven greedy RA
- NVPTX Register Classes — nine classes, ID map
- Instruction Scheduling — MRPA, pipeliner, post-RA
- Machine-Level Passes — MRPA, remat, LDG, peephole
- StructurizeCFG — mandatory structured control flow
- CodeGenPrepare & SCEV-CGP — IR-level backend prep
- KnownBits & DemandedBits — fused analysis with GPU SR oracle
- MMA Code Generation — HMMA/IMMA/WGMMA/tcgen05 lowering pipeline
- Tensor / MMA Builtins — per-ID reference, validation rules
- Atomic Builtins — scope-aware atom lowering
- NVPTX Target Infrastructure — TargetMachine, TTI, SubtargetFeatures
- LiveRangeCalc — dual-bitvector liveness
- Rematerialization — IR-level + machine-level remat
- InstrEmitter — DAG-to-MachineInstr conversion
- SelectionDAG Node Structure — 104-byte SDNode

Architecture

The code generation pipeline runs after the LLVM optimizer and produces MachineIR that the PTX emission stage serializes to text. The pipeline follows upstream LLVM's SelectionDAG architecture with NVIDIA-specific passes inserted at key points.

LLVM IR
  │
  ├─ CodeGenPrepare (IR-level backend prep)
  │    sub_1D70000-1D7FFFF: sunkaddr, sunk_phi, block splitting
  │
  ├─ SelectionDAG Build
  │    sub_2065D30 (visit dispatcher)
  │    sub_2056920 (major worker, 69KB)
  │    sub_2077400 (NVVM tex/surf handle lowering) ★ NVIDIA
  │    sub_2072590 (NVPTX argument passing, 38KB) ★ NVIDIA
  │
  ├─ LegalizeTypes
  │    sub_20019C0 (348KB main loop)
  │    sub_201E5F0 (opcode dispatch, 81KB)
  │    sub_201BB90 (expand integer, 75KB)
  │
  ├─ LegalizeOp
  │    sub_1FFB890 (169KB, type action dispatch)
  │    sub_1FF6F70 (43KB, atomic target-specific lowering) ★ NVIDIA
  │
  ├─ DAG Combining
  │    sub_F681E0 (65KB, top-level orchestrator)
  │    sub_F20C20 (64KB, visitNode main)
  │
  ├─ Instruction Selection
  │    sub_3090F90 (91KB, NVPTXDAGToDAGISel::Select) ★ NVIDIA
  │    sub_33D4EF0 (complex addressing, calls sub_969240 399×)
  │
  ├─ Instruction Scheduling
  │    sub_355F610 (64KB, ScheduleDAGMILive post-RA)
  │    sub_3563190 (58KB, MachinePipeliner)
  │
  ├─ Register Allocation
  │    sub_2F49070 (82KB, RAGreedy::selectOrSplit)
  │    sub_2F2D9F0 (93KB, LiveRangeSplitter)
  │
  ├─ Machine-Level Passes
  │    MRPA, Block Remat, Mem2Reg, LDG, Peephole, etc.
  │
  └─ StructurizeCFG
       sub_35CC920 (95KB, mandatory for PTX structured control flow)

Items marked ★ NVIDIA are NVIDIA-proprietary additions not present in upstream LLVM.

Stage Overview

CodeGenPrepare (detail) -- last IR-level pass before ISel. Sinks address computations, creates PHI nodes for sunk values, and splits critical edges. NVIDIA adds an optional SCEV-CGP extension.

SelectionDAG Build (detail) -- converts LLVM IR into a target-independent DAG. NVPTX intercepts for .param-space argument passing and texture/surface handle lowering.
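
The .param-space interception can be illustrated with a small sketch. This is a toy model, not the binary's implementation: the type table is abbreviated, and real cicc emits mangled parameter names and handles aggregates as byte arrays.

```python
# Toy model of .param-space argument lowering: map a kernel's parameter
# types to PTX-style .param declarations. Type names, sizes, and the
# ".param .bN" spelling are simplified from the PTX ISA for illustration.
PARAM_SIZES = {"i32": 4, "i64": 8, "f32": 4, "f64": 8, "ptr": 8}

def lower_params(kernel, arg_types):
    """Return one .param declaration per kernel argument, in order."""
    decls = []
    for i, ty in enumerate(arg_types):
        bits = PARAM_SIZES[ty] * 8
        decls.append(f".param .b{bits} {kernel}_param_{i}")
    return decls
```

For example, `lower_params("kern", ["ptr", "i32"])` yields a 64-bit pointer parameter followed by a 32-bit integer parameter, mirroring how each argument gets its own .param slot.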

Type Legalization (detail) -- rewrites every illegal type into legal equivalents via promote, expand, soften, or split-vector actions.
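
The four actions can be sketched as a tiny decision function. This models the idea only; the real 348KB dispatch covers floats (soften), many vector shapes, and per-target overrides, and the widths below are assumed NVPTX-like rather than read out of the binary.

```python
# Toy model of type-legalization actions for a target where i16/i32/i64
# are the legal scalar integer types. Purely illustrative.
LEGAL, PROMOTE, EXPAND, SPLIT_VECTOR = "legal", "promote", "expand", "split"

LEGAL_SCALAR_BITS = {16, 32, 64}  # assumed legal integer widths

def scalar_action(bits):
    if bits in LEGAL_SCALAR_BITS:
        return LEGAL
    if bits < min(LEGAL_SCALAR_BITS):
        return PROMOTE            # e.g. i1/i8 widen to i16
    return EXPAND                 # e.g. i128 becomes a pair of i64 operations

def vector_action(elem_bits, num_elems, max_vector_bits=128):
    if scalar_action(elem_bits) != LEGAL:
        return PROMOTE            # legalize the element type first
    if elem_bits * num_elems > max_vector_bits:
        return SPLIT_VECTOR       # halve the vector until it fits
    return LEGAL
```

The key property the real legalizer shares with this sketch: every type maps to exactly one action, and applying actions repeatedly always terminates in legal types.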

Operation Legalization -- processes nodes whose opcodes are illegal for the target. Atomic operations receive NVIDIA-specific scope-aware lowering (CTA/GPU/SYS) with per-SM feature gates.
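
A minimal sketch of scope-aware lowering, assuming a simplified mnemonic scheme: pick the scope qualifier for a PTX atom instruction and gate it on the SM version. The sm_60 threshold for system scope reflects when scoped atomics appeared in PTX, but the exact gating logic here is an assumption, not the binary's.

```python
# Sketch of scope-aware atomic lowering (CTA/GPU/SYS). Mnemonics follow
# the "atom.<scope>.<op>" shape of the PTX ISA; the feature gate below is
# illustrative of the per-SM checks described above.
SCOPE_SUFFIX = {"cta": ".cta", "gpu": ".gpu", "sys": ".sys"}

def lower_atomic(op, scope, sm):
    """Return the PTX mnemonic for an atomic op at the given scope,
    or raise if the target SM does not support that scope."""
    if scope == "sys" and sm < 60:
        raise ValueError("system-scope atomics require sm_60+")
    return f"atom{SCOPE_SUFFIX[scope]}.{op}"
```

So `lower_atomic("add.u32", "gpu", 70)` produces `atom.gpu.add.u32`, while requesting system scope on an older target is rejected rather than silently narrowed.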

DAG Combining -- folds redundant operations, canonicalizes patterns, and reduces the DAG before instruction selection. The KnownBits analysis feeds into combining decisions.
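
How KnownBits feeds a combine can be shown with a minimal model: track (known-zero, known-one) masks per value and fold an AND to a constant when every masked bit is known. This is the standard KnownBits idea, not the binary's fused implementation.

```python
# Minimal KnownBits-style facts and one combine that consumes them.
class KnownBits:
    def __init__(self, zero=0, one=0):
        assert zero & one == 0          # a bit cannot be both known-0 and known-1
        self.zero, self.one = zero, one

def known_and(a, b):
    # A result bit is 0 if it is 0 in either input; 1 only if 1 in both.
    return KnownBits(a.zero | b.zero, a.one & b.one)

def fold_and(lhs, mask):
    """Fold (lhs & mask) to a constant when all masked bits of lhs are known."""
    unknown = ~(lhs.zero | lhs.one)
    if mask & unknown == 0:
        return lhs.one & mask           # constant result
    return None                         # cannot fold; leave the node alone
```

The combiner's pattern is exactly this shape: query the analysis, and only rewrite when the facts prove the rewrite is a no-op at runtime.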

Instruction Selection (detail) -- matches DAG nodes against PTX instruction patterns via a three-level dispatch hierarchy. A compressed per-SM-variant legality table gates which opcodes exist on which GPU architecture.
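
A compressed legality table of this kind can be modeled as one bitmask per opcode with one bit per SM variant. The opcodes and thresholds below are illustrative (WGMMA from sm_90, tcgen05 from sm_100 match the public PTX feature timeline, but the table contents are not extracted from the binary).

```python
# Sketch of a per-SM-variant legality table: opcode -> bitmask of variants
# on which the opcode exists. Lookup is a single AND, as a compressed
# table permits.
SM_VARIANTS = ["sm_70", "sm_75", "sm_80", "sm_90", "sm_100"]
BIT = {sm: 1 << i for i, sm in enumerate(SM_VARIANTS)}

LEGALITY = {
    "HMMA":    sum(BIT[sm] for sm in SM_VARIANTS),   # everywhere in this toy table
    "WGMMA":   BIT["sm_90"] | BIT["sm_100"],         # Hopper onward
    "tcgen05": BIT["sm_100"],                        # Blackwell only
}

def is_legal(opcode, sm):
    return bool(LEGALITY.get(opcode, 0) & BIT[sm])
```

The selector consults such a gate before attempting a pattern, so an opcode simply does not match on architectures where it does not exist.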

Instruction Scheduling (detail) -- post-RA scheduling plus an optional software pipeliner. NVIDIA's custom MRPA provides incremental register pressure tracking.

Register Allocation (detail) -- pressure-driven greedy allocator adapted for PTX's virtual register model. Works with nine typed register classes; live range splitting and rematerialization reduce spill pressure.
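
The pressure the allocator reacts to is just the maximum number of simultaneously live intervals. A sketch of that computation, over half-open intervals, assuming a simplified linear slot index:

```python
# Register pressure as maximum overlap of live intervals [start, end).
# A sweep over interval endpoints; toy version of the pressure tracking
# that drives eviction and live-range-split decisions.
def max_pressure(intervals):
    events = []
    for start, end in intervals:
        events.append((start, 1))    # interval becomes live
        events.append((end, -1))     # interval dies (end is exclusive)
    events.sort()                    # at equal positions, deaths sort first
    live = peak = 0
    for _, delta in events:
        live += delta
        peak = max(peak, live)
    return peak
```

Splitting a long interval into two shorter ones can lower this peak, which is precisely why live range splitting reduces spill pressure.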

Machine-Level Passes (detail) -- NVIDIA-proprietary and stock LLVM passes that optimize register pressure, promote stack objects back to registers, and prepare clean PTX for ptxas.

StructurizeCFG (detail) -- mandatory pass that converts arbitrary CFGs into the structured form PTX requires, rejecting irreducible CFGs and EH funclets.
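
The irreducibility rejection can be sketched with the classic T1/T2 test: a CFG is reducible iff repeatedly removing self-loops (T1) and merging single-predecessor nodes into their predecessor (T2) collapses it to one node. This is the textbook check, offered as a model of what "rejecting irreducible CFGs" means, not the pass's actual algorithm.

```python
# T1/T2 reducibility check on an adjacency dict mapping every node to its
# successor set. Reducible CFGs collapse to a single node; irreducible
# ones (e.g. two blocks jumping into each other's middles) get stuck.
def is_reducible(succ, entry):
    succ = {n: set(s) for n, s in succ.items()}     # work on a copy
    changed = True
    while changed and len(succ) > 1:
        changed = False
        for n in succ:                               # T1: drop self-loops
            if n in succ[n]:
                succ[n].discard(n)
                changed = True
        for n in list(succ):                         # T2: fold single-pred nodes
            if n == entry or n not in succ:
                continue
            preds = [p for p in succ if n in succ[p]]
            if len(preds) == 1:
                p = preds[0]
                succ[p] |= succ.pop(n)               # predecessor absorbs n
                succ[p].discard(n)
                changed = True
    return len(succ) == 1
```

A diamond with a back edge collapses fully; the two-headed loop `A→{B,C}, B→C, C→B` never does, and a structurizer must reject (or pre-transform) such graphs before emitting PTX.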

Two-Stage Compilation: cicc + ptxas

CUDA compilation is a two-stage process. cicc (this binary) compiles CUDA/NVVM IR down to PTX assembly text -- a virtual ISA with unlimited registers and structured control flow. ptxas then compiles the PTX into SASS machine code for a specific SM target. This split means that many of cicc's code generation decisions (register allocation, instruction scheduling, peephole optimization) are revisited by ptxas with full hardware knowledge.

cicc's code generation pipeline therefore optimizes for two audiences simultaneously: (1) reducing register pressure and producing clean PTX that gives ptxas maximum optimization freedom, and (2) performing target-aware lowering (type legalization, instruction selection, structured CFG) that ptxas cannot undo. The practical consequence is that cicc's backend is pressure-driven rather than latency-driven -- scheduling for low register count matters more than scheduling for pipeline throughput, because ptxas will re-schedule for the hardware but cannot reduce register demand below what cicc emitted.
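
The pressure-driven trade-off can be made concrete with a toy example: the same four loads and three adds, scheduled latency-style (all loads up front) versus pressure-style (consume each pair before starting the next). The live-count model below is an illustration, not cicc's scheduler.

```python
# Count peak live temporaries for a schedule given as (defined_value,
# operand_values) tuples; a value dies at its last use.
def peak_live(schedule):
    last_use = {}
    for i, (_, ops) in enumerate(schedule):
        for v in ops:
            last_use[v] = i
    live, peak = set(), 0
    for i, (dst, ops) in enumerate(schedule):
        live.add(dst)
        peak = max(peak, len(live))
        for v in ops:
            if last_use.get(v) == i:
                live.discard(v)          # operand's last use: it dies here
    return peak

# Latency-friendly: issue all loads first, then the adds.
loads_first = [("a", ()), ("b", ()), ("c", ()), ("d", ()),
               ("t1", ("a", "b")), ("t2", ("c", "d")), ("r", ("t1", "t2"))]
# Pressure-friendly: consume each loaded pair before loading the next.
adds_early = [("a", ()), ("b", ()), ("t1", ("a", "b")),
              ("c", ()), ("d", ()), ("t2", ("c", "d")), ("r", ("t1", "t2"))]

assert peak_live(loads_first) == 5   # all four loads live at once, plus t1
assert peak_live(adds_early) == 4    # shorter live ranges, lower peak
```

ptxas can later re-interleave the loads for the hardware pipeline, but it cannot push the register demand below what the PTX schedule implies -- which is why cicc prefers the second ordering.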

Cross-References