SelectionDAG & Instruction Selection
NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.
Upstream source: Target-independent DAG infrastructure: llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp, DAGCombiner.cpp, LegalizeDAG.cpp, LegalizeTypes.cpp, SelectionDAGBuilder.cpp, SelectionDAGISel.cpp. NVPTX target: llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp, NVPTXISelDAGToDAG.cpp, NVPTXInstrInfo.td (LLVM 20.0.0).
LLVM version note: The target-independent SelectionDAG infrastructure at 0xF05000--0xF70000 appears to be stock LLVM 20 with no detectable NVIDIA modifications. All NVIDIA customization lives in the NVPTX target range (0x3290000--0x35FFFFF) via virtual dispatch through NVPTXTargetLowering and NVPTXDAGToDAGISel. The intrinsic lowering switch covers IDs up to 14196 (0x3774), far exceeding upstream NVPTX, which covers approximately IDs 0--300.
CICC v13.0 contains a complete NVPTX SelectionDAG backend derived from LLVM 20.0.0, with substantial NVIDIA customizations for GPU-specific lowering, the PTX .param-space calling convention, tensor core intrinsic selection, and a 343KB intrinsic lowering mega-switch covering over 200 CUDA intrinsic IDs. The SelectionDAG pipeline converts LLVM IR into machine-level PTX instructions through four major phases: type legalization, operation legalization, DAG combining, and pattern-based instruction selection.
The NVPTX SelectionDAG backend spans roughly 4MB of code across two address ranges: 0xF05000--0xF70000 for the target-independent DAG infrastructure (combining, known-bits, node management) and 0x3290000--0x35FFFFF for the NVPTX-specific lowering, instruction selection, and register allocation. The infrastructure range is stock LLVM with no detectable NVIDIA modifications; all NVIDIA customization lives in the latter range via target hooks and virtual dispatch.
| Component | Function (size) |
|---|---|
| LowerOperation dispatcher | sub_32E3060 (111KB, 3,626 lines) |
| LowerCall (.param ABI) | sub_3040BF0 (88KB, 2,909 lines) |
| Intrinsic lowering switch | sub_33B0210 (343KB, 9,518 lines) |
| ISel::Select driver | sub_3090F90 (91KB, 2,828 lines) |
| LegalizeTypes | sub_20019C0 (348KB, 10,739 lines) |
| LegalizeOp dispatcher | sub_1FCE100 (91KB, ~100 opcodes) |
| LegalizeOp action dispatch | sub_1FFB890 (137KB, 967 cases) |
| DAG combiner visitor | sub_F20C20 (64KB) |
| DAG combiner orchestrator | sub_F681E0 (65KB) |
| DAGCombiner::combine (NVPTX) | sub_3425710 (142KB, "COVERED"/"INCLUDED" tracing) |
| PerformDAGCombine (NVPTX) | sub_33C0CA0 (62KB) |
| DAG combine: post-legalize | sub_32EC4F0 (92KB) |
| computeKnownBits (NVPTX) | sub_33D4EF0 (114KB, 3,286 lines) |
| Inline asm lowering | sub_2079C70 (83KB, 2,797 lines) |
| Inline asm constraints (NVPTX) | sub_338BA40 (79KB) |
| NVPTXTargetLowering init | sub_3056320 (45KB, constructor) |
| Type legalization setup | sub_3314670 (73KB, table population) |
| Upstream | lib/CodeGen/SelectionDAG/, lib/Target/NVPTX/NVPTXISelLowering.cpp |
Complexity
Let N = number of DAG nodes and E = number of edges (use-def relationships). The SelectionDAG pipeline runs eight sequential phases. SelectionDAGBuilder converts IR instructions to DAG nodes in O(I), where I is the LLVM IR instruction count. Each DAG Combiner pass is worklist-driven: O(N) nodes are visited, each matched against pattern rules in O(1) via opcode dispatch; ReplaceAllUsesWith is O(U) per node, where U is the number of uses. The three combiner passes total O(3 * N * U_avg).

Type legalization (sub_20019C0, 348KB) iterates until all types are legal: each iteration processes O(N) nodes, and convergence is guaranteed in O(T) iterations, where T is the maximum type-promotion depth (typically 2--3 for GPU types). Operation legalization (sub_1FFB890, 137KB) visits each node once: O(N). The action table lookup is O(1) via the 2D array at TLI + 259 * VT + opcode + 2422. ISel pattern matching (sub_3090F90, 91KB) visits each node once in topological order: O(N). Per-node matching is O(P), where P is the number of patterns for that opcode, but NVPTX patterns are organized into opcode-indexed tables, making this effectively O(1) for common opcodes. The DAG worklist uses ((addr >> 9) ^ (addr >> 4)) & (cap - 1) hashing for O(1) amortized membership tests.

Overall: O(I + 3 * N * U_avg + N * T + N), which simplifies to O(N * U_avg) in practice. The intrinsic lowering mega-switch (343KB, 200+ IDs) adds O(1) per intrinsic call via its jump table, not O(200).
Pipeline Position
The SelectionDAG phases execute in a fixed sequence after SelectionDAGBuilder (sub_2081F00) converts LLVM IR into an initial DAG:
- SelectionDAGBuilder -- IR-to-DAG lowering, visitor dispatch at sub_2065D30
- DAG Combiner (sub_F681E0 / sub_F20C20) -- initial algebraic simplification
- DAGTypeLegalizer (sub_20019C0) -- iterates to fixpoint until all types are legal; see Type Legalization
- DAG Combiner -- second pass after type legalization
- LegalizeDAG (sub_1FCE100 dispatcher, sub_1FFB890 action engine) -- legalizes operations on legal types
- DAG Combiner -- third pass after operation legalization
- NVPTXTargetLowering::PerformDAGCombine (sub_33C0CA0) -- NVPTX-specific post-legalize combines
- Instruction Selection (sub_3090F90) -- see ISel Patterns
Type Legalization
Type legalization (sub_20019C0) is the largest single function in the SelectionDAG pipeline at 348KB. Unlike upstream LLVM, which splits legalization across LegalizeIntegerTypes.cpp, LegalizeFloatTypes.cpp, and LegalizeVectorTypes.cpp, NVIDIA ships all type-legalization logic inlined into a single monolithic dispatch. This may be an LTO artifact or a deliberate choice for branch-prediction locality.
The master switch dispatches on approximately 50 ISD opcodes. Type legalization actions follow the standard LLVM model:
- Promote -- widen small types to register width (e.g., i8 to i32) via ANY_EXTEND/ZERO_EXTEND, perform the operation, then TRUNCATE the result.
- Expand -- split wide types into halves (e.g., i128 into two i64 values) using shift-and-OR sequences.
- Soften -- emulate unsupported FP types through integer libcall sequences.
- Scalarize/Split Vector -- decompose illegal vector types into scalar element operations.
The legality table lives inside NVPTXTargetLowering at offset +2422, organized as a 2D array indexed by 259 * VT + opcode. The 259-byte row stride accommodates LLVM's ~250 generic opcodes plus approximately 10 NVPTX target-specific opcodes. A secondary condition-code action table at offset +18112 uses 4-bit packed nibbles indexed by (VT_row + 15 * CC).
The SimpleVT type encoding appears as a recurring pattern throughout the function (at least 11 instances of the same bitwidth-to-VT mapping):
| SimpleVT | Type | SimpleVT | Type |
|---|---|---|---|
| 1 | i1 | 7 | i128 |
| 3 | i8 | 8 | f16 |
| 4 | i16 | 9 | f32 |
| 5 | i32 | 10 | f64 |
| 6 | i64 | 14--109 | vector types |
The vector type range 14--109 maps fixed-width (14--55) and scalable (56--109) vector MVTs to their scalar element types through a ~100-case switch block that appears six times in the function body. The definitive MVT::getSizeInBits() mapping (confirmed at sub_1FDDC20) is:
| MVT Range | Bits | Description |
|---|---|---|
| 0, 1 | 0 | Other, Glue |
| 2 | 1 | i1 |
| 3 | 8 | i8 |
| 4, 8 | 16 | i16, f16 |
| 5, 9 | 32 | i32, f32 |
| 6, 10 | 64 | i64, f64 |
| 7 | 128 | i128 |
| 11 | 80 | ppcf128 / x87 f80 |
| 14--23 | varies | 2-element vectors |
| 24--109 | varies | 3+ element vectors |
| 111--114 | 0 | token, metadata, untyped |
Type legalization workers fan out from several dispatch functions:
| Dispatcher | Role | Size | Cases |
|---|---|---|---|
| sub_201E5F0 | Promote/expand secondary dispatch | 81KB | 441 case labels, 6 switches |
| sub_201BB90 | ExpandIntegerResult | 75KB | 632 case labels |
| sub_2000100 | PromoteIntegerResult | 45KB | recursive self-calls |
| sub_2029C10 | SplitVectorResult | 5KB (dispatcher) | ~190 cases |
| sub_202E5A0 | SplitVectorOperand | 6KB (dispatcher) | ~157 cases |
| sub_2036110 | ScalarizeVectorResult | dispatch | "Do not know how to scalarize..." |
| sub_2035F80 | ScalarizeVectorOperand | dispatch | "Do not know how to scalarize..." |
For complete detail, see Type Legalization.
Operation Legalization
LegalizeOp Dispatcher: sub_1FCE100
The top-level operation legalizer (sub_1FCE100, 91KB) is a massive switch on SDNode::getOpcode() (read as *(uint16_t*)(node + 24)) that dispatches approximately 100 ISD opcodes to dedicated per-opcode handler functions. The switch covers all major categories:
| Opcode | ISD Name | Handler | Size |
|---|---|---|---|
| 0x02 | EntryToken | sub_1F823C0 | |
| 0x03--0x04 | TokenFactor | sub_1F73660 | |
| 0x32 | CopyFromReg | sub_1F78510 | |
| 0x33 | CopyToReg | sub_1F987D0 | |
| 0x34 | MERGE_VALUES | sub_1FC08F0 | |
| 0x35 | ADD | sub_1FA8F90 | 31KB |
| 0x36 | SUB | sub_1FAA420 | 26KB |
| 0x37 | MUL | sub_1FAB9E0 | |
| 0x38 | SDIV/UDIV | sub_1FABFF0 | |
| 0x39--0x3A | SREM/UREM | sub_1F99DA0 | |
| 0x3B | AND | sub_1FD2F20 | |
| 0x3C | OR | sub_1FD2A20 | |
| 0x40 | SHL | sub_1FA27D0 | |
| 0x41 | SRA | sub_1FA2510 | |
| 0x42 | SRL | sub_1F71080 | |
| 0x43 | ROTL | inline | builds opcode 65 target node |
| 0x44 | ROTR | sub_1FA2D60 | |
| 0x47 | CTLZ | sub_1FA7370 | |
| 0x49 | CTPOP | sub_1FA2A00 | |
| 0x4A | BSWAP | inline | 16-bit width check |
| 0x4B | BITREVERSE | inline | |
| 0x4C | SELECT | sub_1FAC480 | 78KB |
| 0x4D | SELECT_CC | sub_1FAE680 | 87KB |
| 0x4E | SETCC | sub_1FB04B0 | 26KB |
| 0x4F | VSELECT | sub_1FCC170 | |
| 0x63 | SIGN_EXTEND | sub_1F8D440 | 22KB |
| 0x65 | ZERO_EXTEND | sub_1F74E80 | |
| 0x68 | TRUNCATE | sub_1F912F0 | 77KB |
| 0x69 | FP_ROUND | sub_1F97850 | 27KB |
| 0x6A | FP_EXTEND | sub_1FC15C0 | 36KB |
| 0x6C | BITCAST | sub_1F94350 | 22KB |
| 0x6D | LOAD | inline | alignment+memtype checks |
| 0x70 | STORE | sub_1F766E0 | |
| 0x72--0x75 | ATOMIC_FENCE..LOAD | sub_1FAA010 | |
| 0x76 | ATOMIC_STORE | sub_1FBDC00 | 76KB |
| 0x77 | ATOMIC_LOAD_ADD | sub_1FB1F30 | 37KB |
| 0x78 | ATOMIC_LOAD_SUB | sub_1FBB600 | 44KB |
| 0x7A | ATOMIC_LOAD_AND | sub_1FB8710 | 47KB |
| 0x7B | ATOMIC_LOAD_OR | sub_1FBA730 | 24KB |
| 0x7C | ATOMIC_LOAD_XOR | sub_1FB6C10 | 39KB |
| 0x86 | INTRINSIC_WO_CHAIN | sub_1F9E480 | 47KB |
| 0x87 | INTRINSIC_W_CHAIN | sub_1F9D3D0 | 26KB |
| 0x88 | INTRINSIC_VOID | sub_1F9CFD0 | |
| 0x8E | BUILD_VECTOR | sub_1FA3B00 | 26KB |
| 0x8F | INSERT_VECTOR_ELT | sub_1FA4AC0 | 67KB |
| 0x90 | EXTRACT_VECTOR_ELT | sub_1FA0CA0 | 20KB |
| 0x91 | CONCAT_VECTORS | sub_1FB3BB0 | 65KB |
| 0x94 | EXTRACT_SUBVECTOR | sub_1FB5FC0 | 19KB |
| 0x9A | DYNAMIC_STACKALLOC | sub_1F8F600 | |
| 0x9E | BR_CC | sub_1F8B6C0 | |
Opcodes not listed (0x00--0x01, 0x05--0x31, 0x3D--0x3F, 0x46, 0x48, 0x51--0x62, etc.) return immediately with code 0 (legal, no transformation needed).
Action Dispatch Engine: sub_1FFB890
The operation legalization action engine (sub_1FFB890, 137KB) determines what to do for each DAG node based on the target's action table, then executes the chosen strategy. It reads the per-opcode action byte from NVPTXTargetLowering + 2422 using the formula *(uint8_t*)(TLI + 259 * VT + opcode + 2422):
| Action | Code | Behavior |
|---|---|---|
| Legal | 0 | Return immediately -- node is natively supported |
| Custom | 1 | Call NVPTXTargetLowering::LowerOperation (vtable slot #164, offset +1312); if NULL returned, fall through to expand |
| Expand | 2 | Try LegalizeTypes, then ExpandNode (sub_1FF6F70) as fallback |
| LibCall | 3 | Call ExpandNode directly for libcall substitution |
| Promote | 4 | Find a larger legal type and rebuild the node |
The function contains 967 case labels dispatching on opcode. When LowerOperation returns NULL (the custom lowering cannot handle the node), the framework falls through to the expansion path. When it returns a different node, ReplaceAllUsesWith (sub_1D44C70) splices the replacement into the DAG and marks the old node as dead (tombstone value -2 in the worklist hash set).
The promote path contains approximately 30 opcode-specific expansion strategies covering integer arithmetic, FP operations, vector operations, bitcasts, shifts, and NVPTX-specific operations. For FP promotion, the pattern is: FP_EXTEND both operands to the promoted type, apply the original operation, then FP_ROUND the result back.
Worklist management uses sub_1FF5010 with a DenseSet-like structure. The hash function for SDNode pointers follows the standard LLVM pattern: ((addr >> 9) ^ (addr >> 4)) & (capacity - 1).
Load/Store Legalization
The largest individual per-opcode handlers deal with memory operations:
| Handler | Opcode | Size | Behavior |
|---|---|---|---|
| sub_1FC2C30 | LOAD (complex) | 70KB | Extending loads, vector loads, memory type conversion |
| sub_1FC66B0 | Load/Store vectorization | 68KB | Offset-based coalescing with introsort (sub_1F6CA30) |
| sub_1FC9570 | STORE legalization | 60KB | Alignment checks, store splitting, scatter sequences |
The load/store vectorization helper sorts operands by memory offset to detect coalescing opportunities, then creates vector load/store sequences when contiguous accesses are found. This is important for NVPTX because PTX supports ld.v2/ld.v4/st.v2/st.v4 instructions that load/store 2 or 4 elements in a single transaction.
Atomic Legalization
All atomic operations (ATOMIC_STORE through ATOMIC_LOAD_XOR, opcodes 0x72--0x7C) follow a shared structural pattern:
- Check operation legality via sub_1D16620 (isAtomicStoreLegal / isOperationLegalOrCustom)
- If legal, emit the operation directly
- If custom, call NVPTXTargetLowering::LowerOperation for scope-aware NVPTX atomics
- Build atomic fence pairs around the operation when needed
- Lower to target-specific NVPTX atomic operations with CTA/GPU/SYS scope
The ATOMIC_LOAD_SUB handler at sub_1FBB600 converts subtraction to atom.add of the negated operand when the target lacks native atom.sub.
NVPTX Custom Lowering: sub_32E3060
The LowerOperation dispatcher (sub_32E3060, 111KB) handles NVPTX-specific ISD opcode lowering. This is the second-largest function in the 0x32XXXXX range. It operates through a multi-phase approach rather than a clean switch-on-opcode, with approximately 620 local variables and a 0x430-byte stack frame.
The dispatcher is reached via vtable slot #164 (offset +1312) of the NVPTXTargetLowering object whenever the operation legalizer encounters action code 1 (Custom).
Supported Opcodes
| Opcode | ISD Node | Lowering Strategy |
|---|---|---|
| 51 | UNDEF | Direct pass-through via getNode(UNDEF) |
| 156 | BUILD_VECTOR | Iterates operands, detects all-same, calls dedicated handler |
| 186 | VECTOR_SHUFFLE | Three-level approach by result count (1, 2, 3+) |
| 234 | EXTRACT_VECTOR_ELT | Three sub-paths: predicate check, direct sub-register, general extract |
Additionally, the function handles load/store lowering (sub_32D2680, 81KB companion), integer/FP operation legalization (sub_32983B0, 79KB), address space casts (sub_32C3760, 54KB), bitcast/conversion (sub_32C7250, 57KB), and conditional/select patterns (sub_32BE8D0, 54KB). These large helper functions are called from within sub_32E3060's dispatch logic.
BUILD_VECTOR Lowering
BUILD_VECTOR (opcode 156) lowering begins by iterating all operands to detect the all-same (splat) case. When all elements are the same value, the lowering produces a single scalar load followed by register-class-appropriate replication. When elements differ, it falls through to a per-element insert chain.
For NVPTX, BUILD_VECTOR is significant because PTX has no native vector construction instruction -- vectors are built by storing elements into .param space and reloading as a vector type, or through register-pair packing for 2-element vectors.
VECTOR_SHUFFLE Three-Level Lowering
Vector shuffle lowering (lines 2665--3055 of the decompilation) implements a three-level strategy based on the result element count:
Level 1 -- Single-result shuffle. When the shuffle produces a single element, the lowering extracts the source element directly via EXTRACT_VECTOR_ELT and wraps it in a BUILD_VECTOR if needed. This avoids any actual shuffle machinery.
Level 2 -- Two-result shuffle. The handler uses a two-phase identity/extract detection with BitVector tracking. Phase A scans the shuffle mask to identify which source elements map to which result positions. Phase B determines whether each result position is an identity (element already in the correct position in one of the source vectors) or requires extraction. Results that are identities are left in place; non-identity elements are extracted and inserted.
Level 3 -- General shuffle (3+ results). Falls back to a BUILD_VECTOR-based reconstruction. Each result element is individually extracted from the appropriate source vector using EXTRACT_VECTOR_ELT, then all elements are combined via BUILD_VECTOR. For certain mask patterns, pairwise shuffle via sub_32B2430 is attempted first as an optimization.
EXTRACT_VECTOR_ELT Three Sub-Paths
EXTRACT_VECTOR_ELT (opcode 234) lowering takes one of three paths based on the extraction context:
- Predicate extraction. When extracting from a vector of i1 (predicates), the lowering produces a bitwise test on the packed predicate register. This is NVPTX-specific: PTX stores predicate vectors packed into integer registers.
- Direct sub-register extraction. When the element index is a compile-time constant and the element aligns with a register boundary, the lowering generates a direct sub-register reference. This maps to PTX's mov.b32 or mov.b64 for extracting elements from packed register pairs.
- General extraction. For non-constant indices or non-aligned elements, the lowering stores the entire vector to local memory, computes the byte offset from the index, and loads the element back. This generates st.local + ld.local sequences, which is expensive but handles all cases.
Supporting NVPTX Lowering Functions
The custom lowering infrastructure at 0x3290000--0x32FFFFF consists of approximately 13 large functions totaling ~850KB:
| Function | Size | Role |
|---|---|---|
| sub_32E3060 | 111KB | Master LowerOperation dispatcher |
| sub_32A1EF0 | 109KB | Custom type promotion for NVPTX types |
| sub_32EC4F0 | 92KB | Post-legalize DAG combine |
| sub_32FE970 | 88KB | Vector operation splitting/scalarization |
| sub_32D2680 | 81KB | Load/store DAG lowering (address space, alignment) |
| sub_32983B0 | 79KB | Integer/FP operation legalization |
| sub_32B8A20 | 71KB | NVVM intrinsic lowering (tex/surf/special) |
| sub_32CBCB0 | 57KB | Extended type legalization |
| sub_32C7250 | 57KB | Bitcast/conversion lowering |
| sub_32A9030 | 55KB | Vector operation lowering |
| sub_32C3760 | 54KB | Address space cast / pointer lowering |
| sub_32BE8D0 | 54KB | Conditional/select lowering |
| sub_32B6540 | 50KB | Special register / intrinsic lowering |
Common helpers shared across all functions in this cluster:
| Range | Role |
|---|---|
| sub_325Fxxx | EVT/MVT type utilities |
| sub_326xxxx | DAG node creation (getNode variants) |
| sub_327xxxx | DAG memory node creation |
| sub_328xxxx | Target-specific node creation |
| sub_33Exxxx | NVPTX-specific node builders |
| sub_33Fxxxx | NVPTX instruction node helpers |
| sub_340xxxx | NVPTX constant/register node helpers |
| sub_341xxxx | NVPTX chain/glue node construction |
The .param-Space Calling Convention
PTX does not use registers for argument passing. Instead, all arguments flow through .param memory space, a compiler-managed address space specifically for call sites. LowerCall (sub_3040BF0, 88KB) implements this convention by emitting a structured sequence of NVPTXISD custom DAG nodes.
Call Sequence DAG Structure
CallSeqBegin(315, seq_id, 0)
DeclareScalarParam(506, align=4, idx=0, size=32) // scalar arg
DeclareParam(505, align=4, idx=1, size=N) // struct arg (byval)
StoreV1(571, ...) // 8 bytes at a time
StoreV2(572, ...) // or 2-element vector
DeclareRetScalarParam(508, 1, 32, 0) // return decl
CallProto(518, callee, ...)
CallStart(514, ...) // actual call
LoadRetParam(515, 1, 0, ...) // load return value
CallSeqEnd(517, ...)
CallSeqEnd_Outer(316, ...)
Each call increments a monotonic sequence counter at NVPTXTargetLowering + 537024 (offset 134256 * 4), used to match CallSeqBegin/CallSeqEnd pairs and generate unique .param variable names (e.g., __param_0, __param_1, etc.).
Scalar Widening Rules
Scalar arguments narrower than 32 bits are widened to 32 bits; values between 32 and 64 bits are widened to 64 bits. This matches the PTX ABI requirement that .param scalars have a minimum 32-bit size:
| Source Width | Widened To | PTX Type |
|---|---|---|
| i1 (1 bit) | i32 (32 bit) | .param .b32 |
| i8 (8 bit) | i32 (32 bit) | .param .b32 |
| i16 (16 bit) | i32 (32 bit) | .param .b32 |
| i32 (32 bit) | i32 (no change) | .param .b32 |
| i64 (64 bit) | i64 (no change) | .param .b64 |
| f16 (16 bit) | i32 (32 bit) | .param .b32 |
| f32 (32 bit) | f32 (no change) | .param .f32 |
| f64 (64 bit) | f64 (no change) | .param .f64 |
Vector Parameter Passing
Vector arguments use StoreV1/StoreV2/StoreV4 (opcodes 571--573) mapping to PTX st.param.b32, st.param.v2.b32, st.param.v4.b32 and their 64-bit variants. The element count determines the opcode:
| Opcode | Name | PTX | Description |
|---|---|---|---|
| 571 | StoreV1 | st.param.b32 / .b64 | Single element store |
| 572 | StoreV2 | st.param.v2.b32 / .v2.b64 | 2-element vector store |
| 573 | StoreV4 | st.param.v4.b32 / .v4.b64 | 4-element vector store |
For byval struct arguments, the lowering decomposes the aggregate into chunks that fit the largest available vector store. An 80-byte struct, for example, might be lowered as five StoreV4.b32 operations (5 x 4 x 4 = 80 bytes).
NVPTXISD DAG Node Opcodes
The complete set of NVPTXISD opcodes used in call lowering:
| Opcode | Name | Role |
|---|---|---|
| 315 | CallSeqBegin | Marks start of call parameter setup (maps to ISD opcode) |
| 316 | CallSeqEnd | Outer end-of-call marker (maps to ISD opcode) |
| 505 | DeclareParam | Declares a byval .param aggregate parameter |
| 506 | DeclareScalarParam | Declares a scalar .param parameter with width+alignment |
| 508 | DeclareRetScalarParam | Declares the return value .param parameter |
| 510 | CallDirect | Direct call with prototype |
| 511 | CallDirectNoProto | Direct call without prototype (old-style C) |
| 512 | CallIndirect | Indirect call (function pointer) with prototype |
| 513 | CallIndirectNoProto | Indirect call without prototype |
| 514 | CallStart | The actual call instruction |
| 515 | LoadRetParam | Loads return value from .param space |
| 517 | CallSeqEnd (inner) | Inner end-of-call marker |
| 518 | CallProto | Call prototype declaration (type signature) |
| 571--573 | StoreV1/V2/V4 | Stores to .param space |
Four Call Flavors
Call dispatch is selected by prototype availability and call directness:
| Opcode | Name | When Used |
|---|---|---|
| 510 | CallDirect | Direct call to a named function with a known prototype |
| 511 | CallDirectNoProto | Direct call without prototype (K&R C style, rare in CUDA) |
| 512 | CallIndirect | Function pointer call with known prototype |
| 513 | CallIndirectNoProto | Function pointer call without prototype |
In CUDA code, CallDirect (510) dominates because the vast majority of device function calls are direct with full prototypes. CallIndirect (512) appears when calling through __device__ function pointers. The no-prototype variants are legacy paths that may not be exercisable from CUDA C++ but are retained for C compatibility.
Libcall Generation
When the lowering needs to synthesize a library call (e.g., for __divdi3 software division), it attaches "nvptx-libcall-callee" metadata set to "true" on the callee. This metadata string was extracted from the binary at sub_3040BF0. The metadata tells later passes that the callee is a compiler-generated runtime helper rather than user code.
The primary helpers called from LowerCall:
| Helper | Role |
|---|---|
| sub_302F170 | Parameter marshaling setup |
| sub_3031480 | Argument type coercion |
| sub_3031850 | Scalar widening |
| sub_30351C0 | Struct decomposition for byval args |
| sub_303E700 | Return value handling |
DAG Combining
The DAG combiner runs three times during the SelectionDAG pipeline: once after initial DAG construction, once after type legalization, and once after operation legalization. The combiner consists of a target-independent framework and NVPTX-specific target hooks.
Target-Independent Combiner Framework
The combiner orchestrator (sub_F681E0, 65KB) manages the worklist-driven iteration over all DAG nodes:
function DAGCombine(dag):
    worklist = dag.allNodes()                 // linked list iteration
    visited = SmallPtrSet()
    while worklist not empty:
        node = worklist.pop()
        if visited.count(node): continue
        visited.insert(node)                  // sub_C8CA60 / sub_C8CC70
        result = visitNode(node)              // sub_F20C20
        if result != node:
            ReplaceAllUsesWith(node, result)  // sub_F162A0
            add users of result to worklist
            mark node dead
The worklist operates on the SDNode linked list. Nodes are processed via sub_C8CA60 (SmallPtrSet::count for visited check) and sub_C8CC70 (SmallPtrSet::insert with vector growth for worklist membership). The exclusion list at this + 64 (with count at this + 76) prevents certain nodes from being visited.
Global flag byte_4F8F8E8 enables verbose/debug tracing of the combining process.
Visitor: sub_F20C20
The per-node combine visitor (sub_F20C20, 64KB) implements six sequential optimization phases for each node:
Phase 1: Opcode-specific combine. Calls sub_100E380, the target-independent combine dispatcher, which switches on the node's opcode and applies algebraic simplifications (e.g., x + 0 -> x, x & -1 -> x, x * 1 -> x). For NVPTX, this also invokes the target-specific combine hook via vtable dispatch.
Phase 2: Known-bits narrowing. For nodes with constant operands, the combiner builds APInt masks and calls sub_11A3F30 (computeKnownBits / SimplifyDemandedBits) to narrow constants. When all high bits of a result are known-zero, the operation can be narrowed to a smaller type. Two global cl::opt flags gate this phase: qword_4F8B3C8 controls strict-FP known-bits combining, and qword_4F8B548 controls 2-operand reassociation.
Phase 3: Operand type-narrowing loop. For each operand, the combiner computes the legalized type, skips zero-constant operands, creates legalized replacements, and inserts SIGN_EXTEND/TRUNCATE cast nodes as needed. This handles the common case where an operation was originally on i64 but only uses the low 32 bits.
Phase 4: All-constant-operand fold. Detects when every operand is a ConstantSDNode (opcode 17) and calls sub_1028510 for full constant-fold evaluation. The constant check uses a 4x-unrolled loop for performance. The operand count is extracted via the 0x7FFFFFF mask from the packed SDNode header.
Phase 5: Division-by-constant strength reduction. Replaces division by power-of-two constants with shift+mask sequences via APInt shift/mask computation. Division by non-power-of-two constants uses the magic-number reciprocal multiplication technique: x / C becomes (x * M) >> shift where M is the multiplicative inverse.
Phase 6: Vector stride / reassociation patterns. Attempts associative FP decomposition via sub_F15980, with fast-math flag propagation when both sub-results are known non-negative. This handles patterns like (a + b) + c -> a + (b + c) when nsz and arcp flags permit.
ReplaceAllUsesWith: sub_F162A0
The combiner's RAUW implementation walks the use-list and hashes each user into a worklist map using the standard DenseMap infrastructure with LLVM-layer sentinels (-4096 / -8192). See Hash Table and Collection Infrastructure for the hash function and growth policy.
Supporting Combine Functions
| Function | Size | Role |
|---|---|---|
| sub_F0F270 | 25.5KB | Pattern matcher (STORE/BITCAST/CONSTANT) |
| sub_F24210 | 34.6KB | DAG simplification pass |
| sub_F2B940 | 29.8KB | Truncation/extension chain combines |
| sub_F29CA0 | 26.9KB | Node morphing / operand updating |
| sub_F27020 | 25KB | Specific operation combines |
| sub_F2D1B0 | 22.2KB | Comparison combines |
| sub_F2DD30 | 11.5KB | Shift combines |
| sub_F62E00 | 46.7KB | Address/memory operation combines |
| sub_F657D0 | 26.1KB | Vector operation combines |
| sub_F6C1B0 | 15.7KB | TokenFactor chain management |
SDNode Data Structure
The combiner manipulates SDNodes using these field offsets (reconstructed from access patterns throughout the combining code):
| Offset | Size | Field |
|---|---|---|
| -8 | 8 | Operand list pointer (when bit 6 of byte +7 is set) |
| 0 | 8 | First operand / use chain linked list |
| +4 | 4 | Packed: NumOperands (bits 0--26), Flags (bits 27--31) |
| +7 | 1 | Extra flags (bit 6 = has operand pointer at -8) |
| +8 | 8 | ValueType / MVT |
| +16 | 8 | Use chain (next user pointer, 0 if none) |
| +24 | 2 | Opcode (uint16_t) |
| +32 | 4 | Result type info |
| +36 | 4 | DebugLoc / location ID |
| +40 | 8 | Chain operand |
| +48 | 8 | Value pointer / type info |
| +72 | 4 | NumResults |
| +80 | 4 | Additional operand count / mask index |
Operand stride is 32 bytes. Access pattern: node - 32 * (node[+4] & 0x7FFFFFF) yields the first operand.
NVPTX Target-Specific Combines: sub_33C0CA0
NVPTXTargetLowering::PerformDAGCombine (sub_33C0CA0, 62KB) provides NVPTX-specific algebraic optimizations. This function is called from the target-independent combiner framework via vtable dispatch. It receives an SDNode and returns either NULL (no transformation) or a replacement node.
The function calls sub_2FE8D10 (13x), sub_2FE6CC0 (12x), sub_30070B0 (14x), and sub_2D56A50 (9x), with 27 calls into sub_B2D*/B2C* for debug value builders.
A secondary NVPTX DAG combine function at sub_32EC4F0 (92KB) handles post-legalize optimization, operating after the main legalization pass. It calls into the same shared DAG construction helpers (sub_2FE3480, sub_2FE6750, sub_325F5D0, sub_3262090).
The NVIDIA-side DAGCombiner at sub_3425710 (142KB) includes debug tracing with "COVERED: " and "INCLUDED: " prefix strings, confirming it was built with NVIDIA's internal debug infrastructure. This function calls sub_C8D5F0 (31x for type action checks), sub_2E79000 (14x for value type access), and sub_3423E80 (8x for combine helper dispatch).
NVPTX Address Spaces
Address space constants appear throughout the SelectionDAG lowering. See Address Spaces for the master table and SelectionDAG Address Space Encoding for the backend-specific secondary encoding used in .param passing conventions.
In LowerCall, pointer arguments undergo addrspacecast to generic (AS 0) via sub_33F2D30. The pointer size for AS 5 follows a power-of-two encoding: sizes 1, 2, 4, 8, 16, 32, 64, 128 bytes map to codes 2, 3, 4, 5, 6, 7, 8, 9.
Address space handling permeates the entire lowering infrastructure. Functions sub_33067C0 (74KB), sub_331F6A0 (62KB), sub_331C5B0 (60KB), and sub_33D4EF0 (114KB) all contain address-space-aware logic for NVPTX memory operations, global address lowering, argument handling, and complex pattern matching respectively.
Intrinsic Lowering
The intrinsic lowering mega-switch (sub_33B0210, 343KB) dispatches over 200 distinct NVPTX intrinsic IDs into DAG node construction. The switch covers intrinsic IDs 0--0x310 in the main body, with high-ID ranges for texture/surface operations extending to ID 14196 (0x3774). The function contains approximately 1,000 local variables and calls sub_338B750 (getValue helper) 195 times, sub_3406EB0 (getNode) 116 times, and sub_337DC20 (setValue) 100 times.
Key intrinsic categories:
| Category | ID Range | Handler | Count |
|---|---|---|---|
| Math ops (rounding modes) | 2, 10, 12, 20, 21, 63, ... | sub_33FA050 | ~20 |
| WMMA / MMA (tensor core) | 0xA4--0xA8, 0x194--0x1EC | sub_33A64B0 | 95 |
| Texture sampling | 0x5D--0x8D | sub_33A4350 | 50 |
| Surface read/write | 0x8E--0x90 | sub_33A3180 | 3 |
| Warp shuffle | 0xD4, 0xD5, 0xDF, 0xE0 | sub_33FAF80 | 4 |
| Vote intrinsics | 0xE1--0xE6 | sub_339CDA0 / sub_339E310 | 6 |
| Atomics | 0xEB--0xF8 | sub_3405C90 / sub_340AD50 | ~14 |
| cp.async / TMA | 0x175--0x17C | sub_33AD3D0 | ~8 |
| MMA sm90+ (Hopper wgmma) | 0x183--0x191 | sub_33AC8F0 | 15 |
| Texture/surface handle | 10578 | inline | nvvm_texsurf_handle |
The WMMA/MMA block is the largest single-handler group: 95 consecutive case labels (intrinsic IDs 404--492) all delegate to sub_33A64B0, covering wmma.load, wmma.store, wmma.mma, mma.sync (sm70+), mma.sp (sm80+), and mma.f64 (sm90+). The warp shuffle intrinsics map to specific NVPTXISD opcodes: __shfl_down_sync to 277, __shfl_up_sync to 275, __shfl_xor_sync to 278, and __shfl_sync to 276.
Math intrinsics encode explicit rounding modes via an inner opcode table. For example, ADD_RN (round-to-nearest) maps to opcode 252, ADD_RZ (round-toward-zero) to 249, ADD_RM (round-toward-minus-infinity) to 245, and ADD_RP (round-toward-plus-infinity) to 270.
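The inner opcode table reduces to a small switch per operation. A sketch for the ADD case, using the opcode values above; the enum names are illustrative, not the binary's:

```cpp
#include <cassert>

// Rounding modes as they appear in PTX suffixes (.rn/.rz/.rm/.rp).
enum RoundingMode { RN, RZ, RM, RP };

// Maps a rounding mode to the observed NVPTXISD opcode for ADD.
static int addOpcodeFor(RoundingMode rm) {
    switch (rm) {
    case RN: return 252; // round-to-nearest-even
    case RZ: return 249; // round-toward-zero
    case RM: return 245; // round-toward-minus-infinity
    case RP: return 270; // round-toward-plus-infinity
    }
    return -1; // unreachable for valid modes
}
```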
NVIDIA-specific intrinsic IDs include high-value entries: ID 10578 handles nvvm_texsurf_handle, IDs 8920/8937--8938 handle texture/surface operations. The overflow path at sub_33A1E80 handles intrinsic IDs that fall outside the main switch range.
NVPTX computeKnownBits
The NVPTX target provides a custom computeKnownBitsForTargetNode implementation (sub_33D4EF0, 114KB) that propagates bit-level information through 112 opcode cases in the SelectionDAG. This function calls sub_969240 (SDNode accessor) 399 times and itself recursively 99 times. It supports demanded-bits pruning via an APInt mask parameter and caps recursion at depth 6 (matching LLVM's default MaxRecursionDepth).
Notable NVPTX-specific known-bits behaviors:
- Memory operation type inference (opcode 0x12A): Propagates known bits through load operations based on extension mode (zero-extend, sign-extend, any-extend) encoded in the node flags byte at bits `[2:3]`. Handles `ld.global.u32` vs `ld.global.s32` vs `ld.global.b32` distinctions.
- Texture/surface fetch results (opcodes 0x152--0x161): Sets known bits in the range `[elementSize..width]` based on the result type, encoding the known bit-width of texture fetch results.
- Constant pool integration (opcode 0x175): Uses LLVM's `ConstantRange` class to derive known bits from constant pool values, chaining `fromKnownBits` through `intersect` to `toKnownBits`.
- Target fence at opcode 499 (`ISD::BUILTIN_OP_END`): All opcodes above 499 delegate to the `TargetLowering` virtual method; below that, the generic ISD switch handles everything.
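The depth-6 recursion cap described above can be sketched with a toy node type. This is an illustration of the mechanism only, assuming a simplified DAG with constant and AND nodes; the real implementation covers 112 opcode cases:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy SDNode stand-in: a constant, an AND of two operands, or unknown.
struct Node {
    enum Op { Const, And, Unknown } op;
    uint64_t value = 0;                // payload for Const
    std::vector<const Node*> operands; // operands for And
};

// Returns a mask of bits known to be zero in the node's value.
// Recursion is capped at depth 6, mirroring the MaxRecursionDepth limit.
static uint64_t knownZeroBits(const Node *n, unsigned depth = 0) {
    if (depth > 6)
        return 0; // give up: nothing known past the depth cap
    switch (n->op) {
    case Node::Const:
        return ~n->value; // every clear bit is known zero
    case Node::And:       // a bit is zero if it is zero in either operand
        return knownZeroBits(n->operands[0], depth + 1) |
               knownZeroBits(n->operands[1], depth + 1);
    default:
        return 0;
    }
}
```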
APInt values with width at most 64 bits use inline storage; wider values trigger heap allocation. The constant 0x40 (64) appears hundreds of times as the inline/heap branch condition.
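The inline-vs-heap branch can be sketched as follows. This is not LLVM's actual `APInt` layout, only a minimal illustration of the storage strategy the 0x40 threshold implies:

```cpp
#include <cassert>
#include <cstdint>

// Minimal sketch of the inline-vs-heap storage split described above:
// widths <= 64 bits (0x40) live in a single inline word, wider values
// allocate an array of 64-bit words on the heap.
struct TinyAPInt {
    unsigned bitWidth;
    union {
        uint64_t word;   // inline storage, bitWidth <= 64
        uint64_t *words; // heap storage, bitWidth > 64
    };
    explicit TinyAPInt(unsigned bits) : bitWidth(bits) {
        if (bits <= 0x40) {
            word = 0;
        } else {
            unsigned numWords = (bits + 63) / 64;
            words = new uint64_t[numWords]();
        }
    }
    bool isInline() const { return bitWidth <= 0x40; }
    ~TinyAPInt() {
        if (!isInline())
            delete[] words;
    }
};
```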
The target-independent known-bits infrastructure at 0xF50000--0xF60000 includes:
| Function | Size | Role |
|---|---|---|
| `sub_F5A610` | 36.7KB | computeKnownBits for generic ISD opcodes (depth limit at `a4 == 48`) |
| `sub_F5F040` | 52.4KB | Extended known-bits with recursive expansion limit: `(v74-1)*v77 > qword_4F8BF28` |
| `sub_F5CD10` | 26.6KB | DAG combine using known-bits results |
| `sub_F54050` | 17.8KB | Known-bits for multi-result nodes |
| `sub_F54F50` | 10.7KB | Known-bits for vector operations |
Global qword_4F8BF28 is a threshold that limits recursive known-bits expansion to prevent combinatorial blowup.
Inline Assembly Lowering
Inline assembly lowering spans two locations in the binary: the target-independent SelectionDAGBuilder::visitInlineAsm at sub_2079C70 (83KB) and the NVPTX-specific constraint handler at sub_338BA40 (79KB).
Target-Independent Framework: sub_2079C70
The inline assembly visitor (sub_2079C70, 83KB, 2,797 lines) lowers LLVM IR asm statements into ISD::INLINEASM (opcode 193) or ISD::INLINEASM_BR (opcode 51) DAG nodes. The function allocates an 8.4KB stack frame and processes operands in five phases:
1. Initialization. Parses the asm string and metadata. Looks up `"srcloc"` metadata on the asm instruction for error location reporting.
2. Constraint pre-processing. Each constraint string is parsed into a 248-byte record. Constraints are classified as: immediate (`'i'`, flag `0x20000`), memory (`'m'`, flag `0x30000`), or register (determined by target).
3. Tied operand resolution. Input operands tied to output operands (e.g., `"=r"` and `"0"`) are matched and validated for type compatibility. Diagnostic: `"inline asm not supported yet: don't know how to handle tied indirect register inputs"`.
4. Per-operand lowering. Each operand is lowered to an SDValue. Register operands go through `TargetLowering::getRegForInlineAsmConstraint()` (virtual dispatch). Diagnostics: `"couldn't allocate output register for constraint '"`, `"couldn't allocate input reg for constraint '"`.
5. DAG node finalization. All operands are assembled into an `INLINEASM` SDNode with chain and flag operands.
The function uses a 16-entry inline operand buffer (7,088 bytes on stack), reflecting the assumption that CUDA inline asm rarely exceeds 16 operands. Each operand working structure is 440 bytes. Overflow triggers heap reallocation via sub_205BBA0.
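The buffer strategy is a classic small-buffer optimization. A sketch under stated assumptions: the 440-byte record and 16-slot threshold are observed, but the `Operand` type and spill container here are illustrative stand-ins for the real reallocation path:

```cpp
#include <cassert>
#include <vector>

// 440-byte working record per inline-asm operand (contents elided).
struct Operand { char raw[440]; };

// Small-buffer optimization: 16 inline slots (16 * 440 = 7,040 bytes),
// spilling to the heap once a 17th operand appears.
class OperandBuffer {
    Operand inlineSlots[16];
    std::vector<Operand> heapSlots; // used only past 16 operands
    unsigned count = 0;
public:
    Operand &append() {
        if (count < 16)
            return inlineSlots[count++];
        heapSlots.emplace_back(); // heap reallocation path
        ++count;
        return heapSlots.back();
    }
    unsigned size() const { return count; }
    bool spilled() const { return count > 16; }
};
```

The 16-slot threshold matters because a heap spill on the hot path would cost an allocation per unusual asm statement, while the common CUDA case stays entirely on the stack.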
Diagnostic strings found in the binary:
| String | Condition |
|---|---|
"couldn't allocate output register for constraint '" | Register constraint unsatisfiable |
"couldn't allocate input reg for constraint '" | Input constraint unsatisfiable |
"Don't know how to handle indirect register inputs yet..." | Indirect tied operand |
"inline asm error: This value type register class is not natively supported!" | Unsupported type for register |
"invalid operand for inline asm constraint '" | Generic operand mismatch |
"Indirect operand for inline asm not a pointer!" | Non-pointer indirect operand |
NVPTX Constraint Handler: sub_338BA40
The NVPTX-specific inline asm constraint handler (sub_338BA40, 79KB) is part of the NVPTXTargetLowering class. It processes constraint strings specific to the NVPTX backend:
- Simplified constraint model. NVPTX recognizes single-character `'i'` (immediate) and `'m'` (memory) constraints through `sub_2043C80`, avoiding the complex multi-character constraint tables used by x86/ARM backends.
- Register class mapping. The function maps MVT values to NVPTX register classes using a 544-case switch (confirmed at `sub_204AFD0`, 60KB): MVTs 0x18--0x20 map to `Int32Regs`, 0x21--0x28 to `Int64Regs`, 0x29--0x30 to `Float32Regs`, 0x31--0x36 to `Float64Regs`, 0x37 to `Int128Regs`, and 0x56--0x64 to 2-element vector registers.
- Convergent flag handling (bit 5). Ensures barrier semantics are preserved for inline asm, checked via operand bundle attribute or function-level `convergent`.
- Scalar-to-vector conversion. The string `"non-trivial scalar-to-vector conversion"` indicates that the handler attempts to pack scalar inline-asm results into vector register classes when the output constraint specifies a vector type.
Additional support at sub_2046E60 emits ", possible invalid constraint for vector type" when a vector type is used with an incompatible constraint.
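Compressed to range checks, the MVT-to-class mapping above can be sketched as follows. The class names follow upstream NVPTX; the numeric ranges are from the decompiled 544-case switch, and the function name here is illustrative:

```cpp
#include <cassert>
#include <cstring>

// Maps an MVT enumerator value to an NVPTX register class name, per the
// ranges recovered from the decompiled switch. Returns nullptr for types
// with no inline-asm register class.
static const char *regClassForMVT(unsigned mvt) {
    if (mvt >= 0x18 && mvt <= 0x20) return "Int32Regs";
    if (mvt >= 0x21 && mvt <= 0x28) return "Int64Regs";
    if (mvt >= 0x29 && mvt <= 0x30) return "Float32Regs";
    if (mvt >= 0x31 && mvt <= 0x36) return "Float64Regs";
    if (mvt == 0x37)                return "Int128Regs";
    if (mvt >= 0x56 && mvt <= 0x64) return "V2Regs"; // 2-element vectors
    return nullptr; // unsupported type for inline asm
}
```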
ISel Pattern Matching Driver
The instruction selection driver (sub_3090F90) manages the top-level selection loop rather than performing pattern matching directly. It builds a cost table for function arguments using a hash table with hash function key * 37, processes the topological worklist using a min-heap priority queue, and calls the actual pattern matcher (sub_308FEE0) for each node.
The driver maintains an iteration budget of 4 * numInstructions * maxBlockSize to guard against infinite loops. When the budget is exceeded, selection terminates for the current function.
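The budget guard can be sketched as a bounded drain of the priority queue. This is an illustration of the mechanism only; the worklist element type and `selectNode` callback are stand-ins for the real node-selection machinery:

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <vector>

// Drains a min-heap worklist, selecting each node in order, but gives up
// once the 4 * numInstructions * maxBlockSize budget is exhausted.
// Returns the number of nodes actually selected.
static unsigned runSelectionLoop(
    std::priority_queue<int, std::vector<int>, std::greater<int>> &worklist,
    unsigned numInstructions, unsigned maxBlockSize,
    const std::function<void(int)> &selectNode) {
    unsigned budget = 4 * numInstructions * maxBlockSize;
    unsigned selected = 0;
    while (!worklist.empty()) {
        if (budget == 0)
            break; // guard against infinite select/combine loops
        --budget;
        selectNode(worklist.top());
        worklist.pop();
        ++selected;
    }
    return selected;
}
```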
For complete ISel detail, see ISel Pattern Matching & Instruction Selection.
NVPTXTargetLowering Initialization
The NVPTXTargetLowering constructor (sub_3056320, 45KB + sub_3314670, 73KB) populates the legalization action tables that drive all subsequent SelectionDAG processing. It calls sub_302E500, sub_302F030, sub_3030230, and sub_3034720 to register legal/custom/expand actions for each {ISD_opcode, MVT} pair.
Key aspects of the initialization:
- Subtarget-gated feature checks. Offsets `+2843`, `+2584`, and `+2498` in the subtarget object encode SM-version-dependent feature availability. These control which types and operations are marked Legal vs. Custom vs. Expand.
- Vector support. NVPTX has limited native vector support. Most vector operations are marked Custom or Expand, forcing them through the custom lowering at `sub_32E3060`.
- Atomic support. The string `"vector atomics not supported on this architecture!"` at `sub_3048C30` confirms SM-version-gated vector atomic support, likely SM 90+ (Hopper) or SM 100+ (Blackwell).
- Address space assertions. AS values (generic=0, global=1, shared=3, const=4, local=5) are encoded directly into the legalization tables, with different legal operation sets per address space.
What Upstream LLVM Gets Wrong for GPU
Upstream LLVM's SelectionDAG framework was designed for CPU ISAs where register classes overlap and share a unified physical register file. The NVPTX target breaks these assumptions at every level:
- Upstream assumes register classes interfere with each other. On x86, GR32 is a sub-register of GR64; allocating `eax` constrains `rax`. The interference graph, coalescing, and copy elimination infrastructure all assume overlapping classes. NVPTX has nine completely disjoint classes (`%r`, `%f`, `%fd`, `%p`, etc.) with zero cross-class interference. The DAG's register pressure tracking, copy coalescing hints, and class constraint propagation solve a problem that does not exist on this target.
- Upstream assumes function calls are cheap register shuffles. CPU calling conventions move arguments through registers (`rdi`, `rsi`, etc.) or a stack backed by L1 cache. NVPTX function calls go through the `.param` address space with explicit `DeclareParam`/`st.param`/`ld.param` sequences -- O(n) memory operations per argument. The `LowerCall` function in cicc is 88KB (vs. upstream's few KB) because it must handle four call flavors, monotonic `.param` naming, and `"nvptx-libcall-callee"` metadata for synthesized calls.
- Upstream assumes a small set of intrinsics. Upstream NVPTX intrinsic lowering covers approximately IDs 0--300. CICC's intrinsic mega-switch at `sub_33B0210` (343KB) handles IDs up to 14196, covering cp.async, TMA, WGMMA, and the full SM 90/100 tensor operation set. The upstream framework's assumption that intrinsic lowering is a small switch case is off by two orders of magnitude.
- Upstream assumes vector types are natively supported. CPU targets have native vector registers (XMM/YMM/ZMM, NEON Q-registers). NVPTX has no native vector registers -- most vector operations are marked Custom or Expand, forcing them through 111KB of custom lowering at `sub_32E3060`. The "legalize then select" pipeline spends most of its time decomposing vectors that never should have been formed.
- Upstream assumes known-bits propagation is a small target hook. Upstream NVPTX's `computeKnownBitsForTargetNode` handles fewer than 20 opcodes. CICC's version at `sub_33D4EF0` (114KB, 112 opcode cases) propagates bits through texture fetches, address space loads, and NVPTX-specific operations -- a 50x expansion that upstream's hook interface was never designed to support cleanly.
Differences from Upstream LLVM
The NVPTX SelectionDAG backend in cicc v13.0 diverges from upstream LLVM NVPTX in several structural and behavioral ways. This section catalogs the known differences.
Structural Divergences
Monolithic type legalizer. Upstream LLVM splits type legalization across four source files (LegalizeIntegerTypes.cpp, LegalizeFloatTypes.cpp, LegalizeVectorTypes.cpp, LegalizeTypes.cpp). In cicc, all four are collapsed into a single 348KB function (sub_20019C0), likely an LTO artifact. The behavioral result is identical, but the code layout makes the function nearly impossible to patch incrementally.
Dual-address ISel infrastructure. The NVPTX lowering code exists at two address ranges (0x32XXXXX and 0x33XXXXX), with functions at sub_32E3060 (LowerOperation) and sub_3377410 (secondary dispatch) forming a two-level dispatch. Upstream NVPTX uses a single LowerOperation method. The binary has a secondary overflow path for intrinsic IDs that fall outside the main switch range.
142KB NVPTX DAGCombiner. The function sub_3425710 includes "COVERED:" and "INCLUDED:" debug trace strings not present in any upstream LLVM release. This is NVIDIA internal instrumentation for tracking combine coverage during development.
Two inline asm subsystems. The target-independent visitInlineAsm at sub_2079C70 (83KB) and the NVPTX-specific constraint handler at sub_338BA40 (79KB) total 162KB. The upstream NVPTX inline asm support is approximately 200 lines of code. The cicc version is vastly more complex, likely handling NVIDIA-internal PTX inline asm patterns.
Behavioral Divergences
Calling convention. Upstream LLVM NVPTX uses a simplified LowerCall that handles only the standard .param space protocol. CICC's sub_3040BF0 (88KB) adds "nvptx-libcall-callee" metadata for synthesized libcalls, monotonic sequence counters for unique .param names, and four call flavors (with/without prototype x direct/indirect). The upstream has two flavors.
Intrinsic count. The cicc intrinsic lowering switch (sub_33B0210, 343KB) handles intrinsic IDs up to 14196 (0x3774), with dedicated handlers for cp.async/TMA and WGMMA instructions. Upstream LLVM's NVPTX intrinsic lowering covers approximately IDs 0--300. The extended range covers SM 90 (Hopper) and SM 100 (Blackwell) tensor operations.
Vector shuffle lowering. The three-level shuffle lowering (identity detection, BitVector tracking, BUILD_VECTOR fallback) is more sophisticated than upstream NVPTX, which typically scalarizes all shuffles unconditionally.
Atomic scope awareness. CICC's atomic lowering at sub_3048C30 (86KB) supports CTA/GPU/SYS scope atomics with SM-version gating. Upstream LLVM NVPTX handles basic atomics but lacks the full scope hierarchy.
Known-bits propagation. The NVPTX computeKnownBitsForTargetNode at sub_33D4EF0 (114KB, 112 opcode cases, 399 SDNode accesses, 99 recursive calls) is far more extensive than the upstream version, which typically handles fewer than 20 target-specific opcodes. The cicc version propagates bits through texture fetches, address space loads, and NVPTX-specific operations.
PerformDAGCombine depth. The NVPTX-specific combine at sub_33C0CA0 (62KB) plus the post-legalize combine at sub_32EC4F0 (92KB) total 154KB. Upstream NVPTXISelLowering::PerformDAGCombine is approximately 2KB.
Address space 101. CICC uses address space 101 as an alternative .param encoding (seen in sub_33067C0), which does not exist in upstream LLVM NVPTX. This may be an internal convention for distinguishing kernel .param from device-function .param.
Unchanged from Upstream
The following components appear to be stock LLVM with no NVIDIA modifications:
- SelectionDAG core infrastructure at `0xF05000--0xF70000` (combining, known-bits, node management)
- DAG node hashing with `((a3 >> 4) ^ (a3 >> 9)) & (capacity - 1)` at `sub_F4CEE0`
- Constrained FP intrinsic lowering at `sub_F47010` (36KB, `"round.tonearest"`, `"fpexcept.ignore"`)
- `ReplaceAllUsesWith` implementation at `sub_F162A0`
- All SDNode creation, deduplication, and lifecycle management
Function Map
| Role | Function | Size |
|---|---|---|
| `SelectionDAGLegalize::LegalizeOp` dispatcher (~100 opcodes) | `sub_1FCE100` | 91KB |
| SelectionDAGLegalize action dispatch (967 cases) | `sub_1FFB890` | 137KB |
| Legalization worklist management | `sub_1FF5010` | -- |
| `ExpandNode` fallback | `sub_1FF6F70` | -- |
| `DAGCombiner::visitNode` (6-phase per-node combine) | `sub_F20C20` | 64KB |
| `DAGCombiner::combine` orchestrator (worklist management) | `sub_F681E0` | 65KB |
| `ReplaceAllUsesWith` (hash: `((id >> 9) ^ (id >> 4))`) | `sub_F162A0` | -- |
| Combine pattern matcher (STORE/BITCAST/CONSTANT) | `sub_F0F270` | 25.5KB |
| Target-independent opcode-specific combine dispatcher | `sub_100E380` | -- |
| All-constant-operand fold evaluation | `sub_1028510` | -- |
| Vector stride / reassociation combine | `sub_F15980` | -- |
| Generic `computeKnownBits` | `sub_F5A610` | 36.7KB |
| Extended known-bits (recursive expansion limit) | `sub_F5F040` | 52.4KB |
| `SelectionDAG::getNode` / CSE hash table | `sub_F4CEE0` | 41.3KB |
| DAG node builder (operand/result setup) | `sub_F49030` | 38.2KB |
| Constrained FP intrinsic lowering | `sub_F47010` | 36.4KB |
| `NVPTXTargetLowering::LowerOperation` dispatcher | `sub_32E3060` | 111KB |
| `LowerOperation` secondary dispatch (overflow) | `sub_3377410` | 75KB |
| NVPTX custom type promotion | `sub_32A1EF0` | 109KB |
| NVPTX post-legalize DAG combine | `sub_32EC4F0` | 92KB |
| NVPTX vector operation splitting | `sub_32FE970` | 88KB |
| NVPTX load/store lowering | `sub_32D2680` | 81KB |
| NVPTX integer/FP legalization | `sub_32983B0` | 79KB |
| NVPTX intrinsic lowering (tex/surf) | `sub_32B8A20` | 71KB |
| NVPTX vector operation lowering | `sub_32A9030` | 55KB |
| NVPTX addrspacecast / pointer lowering | `sub_32C3760` | 54KB |
| NVPTX conditional/select lowering | `sub_32BE8D0` | 54KB |
| NVPTX special register lowering | `sub_32B6540` | 50KB |
| `NVPTXTargetLowering::PerformDAGCombine` | `sub_33C0CA0` | 62KB |
| NVPTX DAGCombiner with "COVERED"/"INCLUDED" tracing | `sub_3425710` | 142KB |
| `NVPTXTargetLowering::LowerCall` | `sub_3040BF0` | 88KB |
| NVPTX atomic operation lowering | `sub_3048C30` | 86KB |
| `NVPTXTargetLowering` constructor (action setup) | `sub_3056320` | 45KB |
| Type legalization table population | `sub_3314670` | 73KB |
| Intrinsic lowering mega-switch | `sub_33B0210` | 343KB |
| NVPTX `computeKnownBitsForTargetNode` | `sub_33D4EF0` | 114KB |
| NVPTX inline asm constraint handler | `sub_338BA40` | 79KB |
| `SelectionDAGBuilder::visitInlineAsm` | `sub_2079C70` | 83KB |
| NVPTX `visitNVVMTexSurf` handler | `sub_2077400` | 20KB |
| NVPTX argument passing / type coercion | `sub_2072590` | 38KB |
| `NVPTXDAGToDAGISel::Select` driver | `sub_3090F90` | 91KB |
| Address space / memory operation support | `sub_33067C0` | 74KB |
| Global address lowering | `sub_331F6A0` | 62KB |
| Formal arguments / return lowering | `sub_3349730` | 82KB |
| Call lowering (`visitCall` / `LowerCallTo`) | `sub_332FEA0` | 79KB |
Reimplementation Checklist
- NVPTXTargetLowering with legality tables. Populate the 2D action table at offset +2422 (259-byte row stride, indexed by `259 * VT + opcode`) with per-SM-version legal/custom/expand/promote actions for all ISD opcodes and NVPTX-specific opcodes. Include the condition-code action table at offset +18112 and the SM-gated type legality rules (f16 on SM 53+, v2f16 on SM 70+, bf16 on SM 80+).
- LowerOperation dispatcher (111KB equivalent). Implement the master `LowerOperation` switch dispatching ~3,626 lines of GPU-specific lowering for loads, stores, calls, atomics, vector operations, and address space casts, including the `.param`-space calling convention with DeclareParam/StoreV1-V4/LoadRetParam sequences.
- Intrinsic lowering mega-switch (343KB equivalent). Build the intrinsic lowering function covering 200+ CUDA intrinsic IDs (up to ID 14196/0x3774), organized as a jump table with per-intrinsic lowering handlers for tensor core, warp, surface/texture, and math intrinsics.
- PerformDAGCombine for NVPTX. Implement the NVPTX-specific DAG combines (62KB) that run after operation legalization, including load/store vectorization (offset-based coalescing with sorting for `ld.v2`/`ld.v4`/`st.v2`/`st.v4` detection), NVPTX-specific algebraic simplifications, and the "COVERED"/"INCLUDED" tracing infrastructure.
- ISel::Select pattern matching (91KB equivalent). Implement the top-down instruction selection driver that visits DAG nodes in topological order, matching against NVPTX-specific patterns via opcode-indexed tables, with special handling for tensor core instructions, inline assembly constraints, and multi-result nodes.
- computeKnownBits for NVPTX (114KB). Implement the NVPTX-specific known-bits analysis covering `ctaid`, `tid`, `ntid`, address space pointer width constraints, and GPU-specific intrinsic range information to enable downstream optimization.
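The action-table layout from the first checklist item can be sketched directly from the observed stride. The 259-byte row stride and the index formula are from the decompilation; the action codes, the 64-MVT bound, and the type names here are illustrative:

```cpp
#include <cassert>
#include <cstdint>

// Legalization action codes (illustrative values; the binary's encoding
// may differ).
enum Action : uint8_t { Legal = 0, Custom = 1, Expand = 2, Promote = 3 };

// 2D legalization-action table: one byte per {value type, ISD opcode} pair,
// laid out with a 259-byte row per value type and indexed 259 * VT + opcode.
struct ActionTable {
    static constexpr unsigned RowStride = 259; // one byte per ISD opcode
    uint8_t bytes[RowStride * 64] = {};        // 64 MVTs for this sketch

    Action get(unsigned vt, unsigned opcode) const {
        return static_cast<Action>(bytes[RowStride * vt + opcode]);
    }
    void set(unsigned vt, unsigned opcode, Action a) {
        bytes[RowStride * vt + opcode] = static_cast<uint8_t>(a);
    }
};
```

Zero-initializing the table makes Legal the default, so the constructor only needs to write the Custom/Expand/Promote exceptions, which matches how `setOperationAction`-style initialization works in LLVM.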
Cross-References
- Type Legalization -- detailed 348KB monolith documentation
- ISel Pattern Matching -- instruction selection patterns and matching
- Register Allocation -- follows ISel in the pipeline
- Address Spaces -- consolidated AS reference
- Register Classes -- NVPTX register class definitions
- NVPTX Opcodes -- MachineInstr opcode reference
- NVPTXTargetMachine -- target machine and TTI hooks
- Emission -- PTX emission from MachineInstrs
- Tensor Core Intrinsics -- WMMA/MMA intrinsic detail
- Surface/Texture Intrinsics -- tex/surf lowering