Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

ISel Pattern Matching & Instruction Selection

Prerequisites: Familiarity with SelectionDAG, Type Legalization, and DAG Node Layout. Understanding of the Pattern Database structure and NVPTX opcodes is recommended.

NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.

The NVPTX instruction selector in cicc v13.0 translates legal SelectionDAG nodes into target MachineInstr opcodes through a three-level dispatch hierarchy totaling approximately 900KB of code. At the top sits NVPTXDAGToDAGISel::Select (sub_3090F90, 91KB), which builds a per-function cost table, manages a priority-queue-driven topological worklist, and calls the pattern matcher (sub_308FEE0) for every node. The pattern matcher fans out to a hand-written NVPTX-specific select switch (sub_347A8D0, 309KB) and a TableGen-generated SelectCode function (sub_348D3E0, 256KB). Surrounding this core are six NVPTX-specific sub-selectors covering memory operations, texture/surface fetches, complex addressing modes, vector patterns, and atomics. NVIDIA's key delta from upstream LLVM is (1) a compressed per-SM-variant legality table that gates which target opcodes exist on which GPU architecture, (2) a secondary 4-bit packed bitfield for fine-grained operand-class legality, and (3) the iteration budget that prevents the selector from looping indefinitely on pathological DAGs.

ISel driversub_3090F90 (91KB, 2,828 lines)
Pattern matcher entrysub_308FEE0
NVPTX Select switchsub_347A8D0 (309KB -- largest ISel function)
SelectCode (TableGen)sub_348D3E0 (256KB -- auto-generated)
Vector/SIMD patternssub_3475BB0 (89KB)
Memory operation patternssub_306D850 (77KB)
Complex addressing modessub_30811D0 (77KB)
Addressing mode helpersub_30783B0 (39KB)
Texture/surface ISelsub_306A930 (52KB)
Atomic loweringsub_3048C30 (86KB)
Constraint tableword_3F3E6C0 (see Pattern Database)
Compressed legality tableBase + 6414, 500-byte stride per SM variant
Secondary 4-bit bitfieldBase + 521536
Legalize action tableObject + 72760, 4-bit packed
Knob registrationctor_286 at 0x4FA0C0 (5KB)
Upstream LLVM sourcelib/CodeGen/SelectionDAG/SelectionDAGISel.cpp, lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp

ISel Driver: sub_3090F90

The top-level driver is not the pattern matcher itself; it is the orchestration loop that feeds nodes to the matcher in the right order and maintains shared state. It breaks into three phases.

Phase 1: Function Argument Cost Table

Before selecting any instructions, the driver builds a DenseMap-style hash table at this + 408 that maps function argument indices to their byte sizes. The hash table uses LLVM's standard integer-key hash function key * 37, open addressing with linear probing, and the tombstone sentinel -2. Growth triggers at 75% load factor (4 * (count + 1) >= 3 * capacity).

// Phase 1: build argument cost table
hash_table = this->arg_cost_map;  // at this + 408
for each argument A in function->args():
    byte_size = alignTo(getSizeInBits(A.type) / 8, A.alignment)
    key = A.index
    slot = (key * 37) & (capacity - 1)
    while hash_table[slot] is occupied and != key:
        slot = (slot + 1) & (capacity - 1)
    hash_table[slot] = { key, byte_size }
    if load_factor > 0.75: rehash()

The table layout:

FieldOffset from thisDescription
data+416Pointer to hash bucket array
count+424Number of live entries
tombstone_count+428Number of tombstone slots
capacity+432Total bucket count (power of 2)

If the function has a non-void return type, the driver also inserts the return value sizes into the same table, computing aligned_size = ((size + 7) >> 3 + (1 << align) - 1) >> align << align for each return element. The return-type attribute check uses attribute kind 81 (likely sret).

Phase 2: Return Value Processing

For non-void functions, the driver iterates each return value element via:

  • sub_A74710(attribute, 81) -- checks for sret attribute
  • sub_A748A0(index) -- gets return type at given index
  • sub_AE5020(dataLayout, type) -- computes ABI alignment
  • sub_9208B0(dataLayout, type) -- computes size in bits

Each return value's aligned byte size is inserted into the argument cost table, so the pattern matcher can look up the cost of materializing any function parameter or return value during instruction selection.

Phase 3: Topological Selection Loop

The main selection loop processes DAG nodes in topological order using a min-heap priority queue where priority equals topological order (lower number = earlier in the DAG, processed first). The iteration is bounded by an explicit budget.

// Phase 3: main ISel loop
sub_308B6F0(this);  // initialize worklist from DAG
budget = 4 * numInstructions * maxBlockSize
iteration = 0

while heap is not empty:
    node = heap.extractMin()         // sub_3089BD0: heap-sift-down
    sub_308FEE0(this, node, &tmp)    // pattern matcher dispatch

    if this->selectionChanged:       // byte at this + 400
        re-scan affected nodes

    iteration++
    if iteration > budget:
        break  // anti-infinite-loop guard

sub_308AB30(this)    // cleanup
sub_264E600(this)    // deallocate worklist
sub_308B100(this)    // destroy hash table

The min-heap stores (SDNode*, priority) pairs at 16-byte stride. The heap-sift-down operation (sub_3089BD0) maintains the heap invariant after extraction. The selectionChanged flag at this + 400 is set by the pattern matcher when it replaces a node, signaling the driver to re-examine downstream users.

The iteration budget formula 4 * numInstructions * maxBlockSize is an NVIDIA addition -- upstream LLVM's SelectionDAGISel does not have this guard. It prevents pathological DAGs (for example, from heavily-inlined device functions with thousands of parameters) from causing the selector to spin indefinitely when combine/legalize/select cycles interact.

Pattern Matcher Dispatch: sub_308FEE0

The pattern matcher is called once per SDNode. It reads the node's opcode at *(node + 24) and dispatches through a multi-level decision tree:

  1. Quick-reject filter. If the node is already selected (machine opcode bit set in flags), return immediately.
  2. NVPTX-specific hand-written patterns. Calls sub_347A8D0 for NVPTX custom opcodes (NVPTXISD range >= 499). This handles texture loads, MMA instructions, atomic operations, .param-space loads/stores, and other GPU-specific patterns.
  3. TableGen auto-generated matcher. Calls sub_348D3E0 (SelectCode) for standard ISD opcodes. This function is mechanically generated from the .td pattern files in the NVPTX backend and contains a massive switch table mapping DAG patterns to MachineInstr opcodes.
  4. Complex pattern matching. For load/store addressing modes, calls sub_30811D0 (77KB) and sub_30783B0 (39KB), which match base + offset, base + scaled_index, and address-space-qualified patterns.
  5. Fallback. If no pattern matches, the node is marked as "failed ISel" and the driver may retry after DAG combining.

NVPTX Select Switch: sub_347A8D0 (309KB)

This is the largest single ISel function, containing the hand-written pattern matching for all NVIDIA-specific DAG nodes. It calls sub_969240 263 times (SDNode accessor), is self-recursive 42 times, and dispatches to:

Sub-selectorSizeCoverage
sub_3447D7032KBSpecific pattern sub-dispatch
sub_3441190--Pattern helpers
sub_343FD60--Type-aware matching
sub_3475BB089KBVector/SIMD patterns (v2, v4 packed types)

The function switches on the SDNode opcode to handle:

  • Load/store with address spaces -- selects between ld.global, ld.shared, ld.local, ld.param, ld.const, and generic-space loads, each requiring different PTX instructions.
  • Texture/surface operations -- dispatches to sub_306A930 for tex, suld, sust instruction patterns.
  • MMA/WMMA/tensor ops -- selects the correct mma.sync, wmma.mma, wgmma variant based on operand types and SM architecture.
  • Atomic operations -- selects between atom.global.add, atom.shared.cas, red.global.add, etc., with scope qualifiers (.cta, .gpu, .sys).
  • Barrier/fence operations -- selects bar.sync, bar.warp.sync, membar.cta, membar.gl, membar.sys.

SelectCode (TableGen): sub_348D3E0 (256KB)

This auto-generated function implements the standard LLVM TableGen pattern matching algorithm. It is a giant switch-table compiled from the .td instruction pattern files in lib/Target/NVPTX/*.td. The function:

  • Calls sub_969240 45 times and sub_32889F0 38 times (opcode/type checkers).
  • Contains no string literals (purely mechanical code).
  • Works in tandem with sub_347A8D0: the hand-written selector handles NVPTX custom nodes first, and anything that falls through goes to SelectCode.

The auto-generated matcher encodes patterns as a sequence of opcode checks, type checks, and operand recursive matches. When a full pattern matches, it calls MorphNodeTo to convert the SDNode into a MachineSDNode with the target opcode and register operands.

Compressed Instruction Legality Table

NVIDIA's instruction selector uses a per-SM-variant legality table to determine whether a given target opcode is legal on the current GPU architecture. This table is checked during instruction selection to gate SM-specific instructions (for example, wgmma instructions are illegal on SM 70 but legal on SM 90+).

The table lives at a fixed offset from the base of the ISel object, accessed by sub_376DE90:

legality = *(uint8_t*)(base + 500 * arch_variant + opcode + 6414)
FieldEncoding
Base offset6414 bytes from object base
Row stride500 bytes per architecture variant
Index500 * arch_variant + opcode
Value 0Illegal -- this opcode does not exist on this SM
Value 1Custom -- requires custom lowering before emission
Value 2Legal -- can be emitted directly

The arch_variant value selects which row of the table to consult. Each row contains 500 entries, one per target opcode. The table is read-only after initialization and occupies approximately num_variants * 500 bytes in the .data section.

Secondary 4-bit Packed Bitfield

A second legality table at base + 521536 provides fine-grained operand-class legality using 4-bit packed nibbles:

byte_offset = (opcode_class >> 3) + 36 * arch_id - arch_id
nibble      = (*(uint8_t*)(base + 521536 + byte_offset) >> (4 * (opcode_class & 7))) & 0xF

The offset simplification 36 * arch_id - arch_id equals 35 * arch_id, giving a 35-byte stride per architecture variant. Each byte packs two 4-bit legality fields, and the low/high nibble is selected by bit 0 of opcode_class. The 4-bit values encode a richer set of actions than the primary table's 3-value encoding.

Legalize Action Table

The operation legalization subsystem (separate from the ISel legality table above) uses a 4-bit packed action table at object offset 72760 to determine how to legalize each (opcode, type) pair:

index  = type_bits + 15 * opcode + 18112
action = (*(uint32_t*)(object + 4 * index + 72760) >> (4 * (type & 7))) & 0xF
ActionValueBehavior
Legal0Node is natively supported
Promote1Widen to a larger legal type
Custom5Call NVPTXTargetLowering::LowerOperation via vtable slot 164
ExpandInteger9Split wide integers into halves
ExpandFloat13Emulate unsupported FP via libcalls
SplitVector14Decompose illegal vector into legal sub-vectors

This table is distinct from the type-legality table at TLI + 2422 (described in SelectionDAG), which uses a 259-byte stride and encodes the simpler 5-action set (Legal/Custom/Expand/LibCall/Promote). The table at +72760 is the operation-level action table used during the LegalizeOp phase, while the +2422 table is the type-level action table used during LegalizeTypes.

NVPTX-Specific Pattern Categories

Memory Operations: sub_306D850 (77KB)

Selects PTX load/store instructions with the correct address space qualifier, vector width, and volatility. The function handles the full matrix of {ld,st} x {.global,.shared,.local,.param,.const,.gen} x {.b8,.b16,.b32,.b64,.b128} x {.v1,.v2,.v4} x {.volatile,.relaxed,.acquire,.release} instruction variants. Address space is determined by querying the pointer operand's address space attribute through the DAG.

The memory pattern matching also covers:

  • Vector loads/stores -- ld.global.v2.b32, ld.global.v4.b32, and their 64-bit variants, selected based on the vector element count (1, 2, or 4).
  • Parameter loads -- ld.param.b32 and st.param.b32 for call ABI (see SelectionDAG: .param ABI).
  • Generic-space loads with addrspacecast -- when the address space is generic (AS 0), the selector checks whether the source can be proven to be in a specific space and emits a non-generic load if so.

Texture/Surface Instructions: sub_306A930 (52KB)

Selects tex, suld, and sust instructions from DAG nodes produced by the intrinsic lowering mega-switch. The selector dispatches through helper functions:

HelperPurpose
sub_2FE5F00Texture fetch type selection
sub_2FE5F30Surface read type selection
sub_2FE5F60Surface write type selection
sub_2FE69A0Texture sampler mode selection
sub_2FE6CC0Unified texture/surface dispatch

Texture instructions have complex operand requirements: sampler reference, texture reference, coordinate type (1D/2D/3D/cube), data type (f32/i32/f16), and optional LOD/gradient parameters. The selector maps each combination to a specific PTX tex.1d.v4.f32.f32 (or similar) opcode.

Complex Addressing Modes: sub_30811D0 (77KB)

Matches addressing patterns for load/store operands. NVPTX supports a limited set of addressing modes compared to x86:

  • Register + immediate offset -- [%r1 + 16], the most common PTX addressing mode.
  • Register -- [%r1], zero-offset variant.
  • Immediate -- [0x1000], absolute address (rare on GPU).
  • Register + register -- not directly supported in PTX; decomposed into add + register addressing.

The complex pattern matcher at sub_30811D0 calls seven helper functions (sub_307B990 through sub_307FEF0) to decompose DAG address expressions into base-register + offset pairs. When the offset is a constant that fits in the PTX immediate field, it folds into the instruction encoding. When the offset is too large or non-constant, it generates a separate add instruction and uses register addressing.

MMA / Tensor Core Instructions

Tensor core instruction selection is split across the intrinsic lowering stage (which generates NVPTXISD nodes from wmma.load, wmma.mma, mma.sync, wgmma intrinsics) and the ISel stage (which selects the specific PTX opcode). The ISel switch in sub_347A8D0 handles these by checking:

  1. SM architecture -- wmma requires SM 70+, mma.sync requires SM 75+, wgmma requires SM 90+.
  2. Matrix dimensions -- m16n16k16, m8n8k4, m16n8k8, etc.
  3. Data types -- f16, bf16, tf32, f64, i8, i4, b1, fp8 (SM 90+), fp4 (SM 100+).
  4. Accumulator type -- f16 or f32 for half-precision MMA.

The architecture check consults the compressed legality table to determine whether a given MMA variant is legal on the target SM.

Atomic Operations: sub_3048C30 (86KB)

Atomic instruction selection generates atom.{scope}.{op}.{type} instructions. The selector handles:

OperationPTXNVPTXISD opcodes
Compare-and-swapatom.cas462
Add (int)atom.add294--297
Min (signed)atom.min302--305
Max (signed)atom.max314--317
Exchangeatom.exch(via generic path)
AND/OR/XORatom.and / atom.or / atom.xor(via generic path)

The selector checks "vector atomics not supported on this architecture!" for vector-width atomics and gates them behind an SM version check (likely SM 90+). Scope qualifiers (.cta, .gpu, .sys) are determined from the memory ordering of the LLVM atomic instruction.

Vector / SIMD Patterns: sub_3475BB0 (89KB)

Handles vector-type instruction selection for NVPTX's limited vector support (v2 and v4 packed types). The function calls sub_969240 121 times and is self-recursive 28 times. It selects between:

  • Packed register operations -- add.v2.f32, mul.v2.f32 when the SM supports native vector operations.
  • Scalarized fallback -- decomposes vector operations into per-element scalar operations when the vector type is not natively supported.
  • mov.v2 / mov.v4 -- register-to-register vector moves for shuffles and extracts.

Knobs

The ISel subsystem registers its knobs at ctor_286 (0x4FA0C0, 5KB):

KnobTypeDescription
fast-isel-abortintAbort mode for FastISel failures (0=silent, 1=warn, 2=abort)
fast-isel-report-on-fallbackboolReport when FastISel falls back to SelectionDAG
use-mbpiboolUse Machine Branch Probability Info during ISel
dag-disable-combineboolDisable DAG combining entirely
pre-RA-schedenumPre-RA scheduler variant: "default", "list-burr", "source", "list-hybrid", "list-ilp"

Note that cicc does not use FastISel for GPU code generation. The fast-isel-* knobs exist because the upstream LLVM SelectionDAGISel framework registers them unconditionally, but the NVPTX backend always takes the full SelectionDAG path. The dag-disable-combine flag is the only ISel-phase knob that has a meaningful effect on NVPTX code generation; setting it skips the DAG combiner entirely, which produces worse code but can be useful for debugging.

Differences from Upstream LLVM

AspectUpstream LLVM 20.0NVIDIA cicc v13.0
Iteration budgetNo explicit budget; relies on DAG invariants to terminateBudget = 4 * numInstructions * maxBlockSize
Argument cost tableNot present in SelectionDAGISelHash table with key * 37 hash for argument byte sizes
Legality tableSimple isLegal() callback per targetCompressed 500-stride table + 4-bit packed secondary table
FastISelUsed for -O0 on most targetsNever used; always full SelectionDAG
ISel function sizeTypical NVPTX Select() is ~50KB upstream309KB hand-written + 256KB TableGen = 565KB total
Memory patternsStandard load/store5 address spaces, each with distinct PTX encoding
Texture/surfaceNot present in upstream NVPTX (handled by intrinsics only)52KB dedicated sub-selector for tex/suld/sust
Atomic patternsStandard expansion via AtomicExpandPass86KB custom selector with scope qualifiers and architecture gating

Function Map

FunctionAddressSizeRole
NVPTXDAGToDAGISel::Select -- ISel driversub_3090F9091KB--
Pattern matcher entry (dispatches to Select switch and SelectCode)sub_308FEE0----
NVPTX hand-written Select switchsub_347A8D0309KB--
TableGen-generated SelectCodesub_348D3E0256KB--
Vector/SIMD pattern selectionsub_3475BB089KB--
Memory operation patterns (ld/st with address spaces)sub_306D85077KB--
Complex addressing mode matchingsub_30811D077KB--
Addressing mode helper (base + offset extraction)sub_30783B039KB--
Texture/surface instruction selectionsub_306A93052KB--
Atomic operation selectionsub_3048C3086KB--
Sub-selector for specific NVPTX patternssub_3447D7032KB--
Pattern matching helperssub_347297036KB--
Operand matchingsub_343A2E049KB--
Compressed legality table lookupsub_376DE90----
Initialize topological worklistsub_308B6F0----
Min-heap sift-down (priority queue)sub_3089BD0----
ISel cleanupsub_308AB30----
Hash table destructionsub_308B100----

Cross-References