Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Type Legalization

Prerequisites: Familiarity with SelectionDAG, NVPTX register classes, and LLVM type system basics. Understanding of the compilation pipeline up to instruction selection is assumed.

NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.

Type legalization is the SelectionDAG phase that rewrites every DAG node whose result or operand type is illegal for the target into equivalent sequences of legal-type operations. In upstream LLVM this logic spans four source files (LegalizeTypes.cpp, LegalizeIntegerTypes.cpp, LegalizeFloatTypes.cpp, LegalizeVectorTypes.cpp) totaling roughly 16,000 lines. In CICC v13.0, NVIDIA ships all of it as a single 348KB monolithic function -- sub_20019C0 -- the largest function in the SelectionDAG address range and among the largest in the entire binary. Operation legalization follows in a separate 169KB function (sub_1FFB890), and vector split/scalarize dispatchers fan out into an additional 25+ worker functions.

The monolithic structure is either an LTO inlining artifact (all four upstream .cpp files collapsed by link-time optimization) or a deliberate choice for branch-prediction locality. The functional behavior is a faithful reproduction of upstream LLVM's DAGTypeLegalizer, but the legality tables, legal-type set, and vector legalization rules are heavily NVPTX-specific.

Type legalizer monolithsub_20019C0 (348KB, 10,739 lines)
Operation legalizersub_1FFB890 (169KB)
SplitVectorResultsub_2029C10 (dispatcher, 190 cases)
SplitVectorOperandsub_202E5A0 (dispatcher, 157 cases)
ScalarizeVectorResultsub_2036110
ScalarizeVectorOperandsub_2035F80
WidenVectorsub_2036AE0 (31KB, limited NVPTX usage)
ExpandIntegerResultsub_201BB90 (75KB, 632 case labels)
PromoteIntegerResultsub_2000100 (45KB)
PerformExpensiveCheckssub_2010FB0 (62KB, debug verifier)
NVPTXTargetLowering initsub_3314670 (73KB, table population)
UpstreamLegalizeTypes.cpp, LegalizeIntegerTypes.cpp, LegalizeFloatTypes.cpp, LegalizeVectorTypes.cpp

Pipeline Position

Type legalization runs as the first major SelectionDAG transformation after the initial DAG is built by SelectionDAGBuilder (sub_2081F00). The full sequence:

  1. SelectionDAGBuilder converts LLVM IR to an initial DAG with potentially illegal types
  2. DAG Combiner (sub_F20C20) runs initial combines
  3. DAGTypeLegalizer (sub_20019C0) iterates until all types are legal -- this page
  4. LegalizeDAG (sub_1FFB890) legalizes operations on now-legal types
  5. DAG Combiner runs again to clean up
  6. Instruction selection (sub_3090F90) pattern-matches the final legal DAG

The type legalizer iterates to a fixpoint: each pass may create new nodes with illegal types (e.g., splitting a vector creates two half-width vectors that may themselves be illegal), so the worklist loops until every node in the DAG has only legal result and operand types.

The legal type set is defined in the NVPTXTargetLowering constructor (sub_3314670, 73KB) which populates the action table at offset +2422. NVPTX has a narrow set of legal types dictated by the PTX register file:

Register ClassLegal MVTs
Int1Regs (%p)i1
Int16Regs (%rs)i16
Int32Regs (%r)i32
Int64Regs (%rd)i64
Float32Regs (%f)f32
Float64Regs (%fd)f64
Int16HalfRegs (%h)f16, bf16
Int32HalfRegs (%hh)v2f16, v2bf16, v2i16, v4i8
Int128Regs (%rq)i128 (SM 70+)

For the complete register class table (vtable addresses, PTX types, encoded IDs, copy opcodes) see Register Classes.

The critical constraint: Int32HalfRegs is the only vector register class. It holds exactly 32 bits of packed data. The only legal vector types are those that pack into 32 bits:

  • v2f16 -- two f16 values in one 32-bit register
  • v2bf16 -- two bf16 values (SM 80+)
  • v2i16 -- two i16 values in one 32-bit register
  • v4i8 -- four i8 values in one 32-bit register

Every other vector type (v4f32, v2f32, v8i32, v4f16, v2f64, etc.) is illegal and must be split, scalarized, or expanded during type legalization. There is no packed float32 SIMD on NVPTX -- this is a fundamental architectural constraint.

SM-Gated Type Legality

The legal type set changes with the SM version. The constructor at sub_3314670 queries subtarget features and conditionally marks types legal or illegal:

SM RangeLegal Types AddedLegalization Change
SM < 53(base: i1, i16, i32, i64, f32, f64)f16 ops promoted to f32; no legal vectors
SM 53--69Scalar f16v2f16 legal for ld/st but packed arithmetic is Custom/Expand
SM 70+v2f16 packed arithmetic, i128f16x2 PTX instructions (add.f16x2, mul.f16x2, fma.rn.f16x2)
SM 80+v2bf16bf16x2 PTX instructions
SM 100+e2m1x2 (FP4), e2m3x2 (FP6), e3m2x2 (FP6), ue8m0x2Additional packed narrow FP types for tensor core feeders

On SM 70+, v2f16 operations marked Legal or Custom in the action table map directly to packed PTX instructions, delivering 2x throughput versus scalarized f16. This is why CUDA __half2 operations are efficient: the type stays packed through the entire pipeline. In contrast, float4 is always fully scalarized to four independent f32 operations on every SM generation.

The Legality Table

Primary Action Table (offset +2422)

The core data structure is a 2D array inside NVPTXTargetLowering:

action = *(uint8_t *)(TLI + 259 * VT + opcode + 2422)

Where:

  • TLI = pointer to NVPTXTargetLowering object (loaded from this->TLI at a1[1])
  • VT = SimpleVT enum value (1--10 for scalar types, 14--109 for vector types)
  • opcode = ISD opcode (0--258), capped at 0x102 by a guard check
  • 259 = row stride (256 generic opcodes + 3 metadata bytes per VT row)

The action byte encodes:

ValueActionMeaning
0LegalNode is natively supported -- return immediately
1CustomCall NVPTXTargetLowering::LowerOperation (vtable slot #164, offset +1312)
2ExpandCall LegalizeTypes, then ExpandNode (sub_1FF6F70) as fallback
3LibCallCall ExpandNode directly for library-call substitution
4PromoteFind a larger legal type and rebuild the node at that type

The legality check uses (action & 0xFB) == 0 as the "legal" predicate. This means bit 2 is a don't-care -- a node with action byte 0x04 is still treated as legal in certain fast-path checks, which is the standard LLVM encoding where bit 2 flags "custom-but-legal" operations.

Type-Supported Flag Array (offset +120)

A second structure at TLI + 8*VT + 120 is a pointer array: non-null means the type VT is natively supported by the target. This provides a fast "is this type legal at all?" check before the per-opcode lookup.

Promotion Action Table (offset +2681)

A 1D table indexed by opcode only (no VT dimension):

action = *(uint8_t *)(TLI + opcode + 2681)

Used for four specific opcodes: BSWAP (43), CTLZ (44), CTTZ (45), and BITREVERSE (199). Also used for opcode 204 (CONCAT_VECTORS) when the operand type is zero. This table encodes whether these operations should be promoted regardless of operand type.

FSINCOS Action Table (offset +3976)

Another 1D table for FSINCOS (opcode 211):

action = *(uint8_t *)(TLI + opcode + 3976)

FSINCOS has unique legalization requirements because it produces two results (sin and cos simultaneously).

Condition Code Action Table (offset +18112)

A packed 4-bit nibble table for condition-code-dependent operations (FP_TO_SINT, FP_TO_UINT, SELECT_CC, BR_CC):

base   = (VT_id >> 3) + 15 * condcode_type + 18112
action = (*(uint32_t *)(TLI + base * 4 + 12) >> (4 * (VT_id & 7))) & 0xF

The 15-entry stride per condition code allows per-CC/per-VT legalization decisions. Each nibble stores a 4-bit action code, so two VT actions pack into one byte. This is the standard LLVM condition-code action encoding, but the table is populated with NVPTX-specific rules (e.g., PTX's limited set of comparison predicates determines which CCs are legal for which types).

SimpleVT Type Encoding

Types throughout the legalizer are encoded as a single byte, the SimpleVT enum:

SimpleVTTypeSimpleVTType
0extended/custom7i128
1i18f16
2i2 (rare)9f32
3i810f64
4i1614--55fixed-width vectors
5i3256--109scalable vectors
6i64

The bitwidth-to-SimpleVT conversion pattern appears as a recurring code fragment at least 11 times in sub_20019C0:

// Reconstructed from decompilation -- 11 instances in the function
if (bits == 32)       VT = 5;  // i32
else if (bits > 32) { VT = 6;  // i64 tentative
  if (bits != 64) { VT = 0;    // extended type
    if (bits == 128) VT = 7;   // i128
  }
} else {
  VT = 3;                      // i8 tentative
  if (bits != 8) VT = 4 * (bits == 16);  // i16 or 0
}

The vector type range 14--109 maps to scalar element types through a ~100-case switch block that also appears six times in the function body:

MVT RangeScalar ElementDescription
14--23i2 (VT 2)Fixed-width v2i2..v1024i2
24--32i8 (VT 3)Fixed-width v2i8..v256i8
33--40i16 (VT 4)Fixed-width v2i16..v64i16
41--48i32 (VT 5)Fixed-width v2i32..v64i32
49--54i64 (VT 6)Fixed-width v2i64..v32i64
55i128 (VT 7)Fixed-width v2i128
56--61i2 (VT 2)Scalable nxv2i2..nxv64i2
62--67i8 (VT 3)Scalable nxv2i8..nxv64i8
68--73i16 (VT 4)Scalable nxv2i16..nxv64i16
74--79i32 (VT 5)Scalable nxv2i32..nxv64i32
80--85i64 (VT 6)Scalable nxv2i64..nxv64i64
86--88f16 (VT 8)Scalable nxv2f16..nxv8f16
89--93f32 (VT 9)Scalable nxv2f32..nxv32f32
94--97f64 (VT 10)Scalable nxv2f64..nxv16f64
98--100f16 (VT 8)Fixed-width v2f16..v8f16 (additional)
101--105f32 (VT 9)Fixed-width v2f32..v32f32 (additional)
106--109f64 (VT 10)Fixed-width v2f64..v16f64 (additional)

This switch implements getVectorElementType() on the decompiled SimpleVT enum. Its six-fold repetition in the monolith accounts for a significant fraction of the function's 348KB size.

The Four Legalization Actions

Promote (Type Widening)

Promotion widens a narrow type to the nearest legal register width. The pattern is consistent across integer and FP promotion:

promoted_vt = TLI.getTypeToPromoteTo(opcode, VT)       // sub_1F40B60
extended    = DAG.getNode(ANY_EXTEND, DL, promoted_vt, input)   // opcode 143
result      = DAG.getNode(original_op, DL, promoted_vt, extended, ...)
truncated   = DAG.getNode(TRUNCATE, DL, original_vt, result)   // opcode 145

For integer promotion, ANY_EXTEND (opcode 143) or ZERO_EXTEND (opcode 144) widens the input depending on whether the high bits need defined values (unsigned operations use ZERO_EXTEND). For FP promotion, the pattern uses FP_EXTEND/FP_ROUND instead:

ext0 = DAG.getNode(FP_EXTEND, DL, promoted_vt, op0)
ext1 = DAG.getNode(FP_EXTEND, DL, promoted_vt, op1)
res  = DAG.getNode(FADD, DL, promoted_vt, ext0, ext1)
out  = DAG.getNode(FP_ROUND, DL, original_vt, res)

The promote path in sub_1FFB890 contains approximately 30 opcode-specific expansion strategies. The custom-promotion BST (red-black tree at TLI + 9257/9258) stores (opcode, VT) pairs that override the default promotion target. When no BST entry exists, a linear scan walks upward from the current VT until it finds a type where the action is not Custom (i.e., Legal or Expand).

Expand (Type Splitting)

Expansion splits a wide type into two halves and reassembles the result:

// i128 ADD expansion (simplified)
lo_a = DAG.getNode(EXTRACT_ELEMENT, DL, i64, a, 0)   // low half
hi_a = DAG.getNode(EXTRACT_ELEMENT, DL, i64, a, 1)   // high half
lo_b = DAG.getNode(EXTRACT_ELEMENT, DL, i64, b, 0)
hi_b = DAG.getNode(EXTRACT_ELEMENT, DL, i64, b, 1)
lo_r = DAG.getNode(ADD, DL, i64, lo_a, lo_b)
carry = ...  // carry detection via SETCC
hi_r = DAG.getNode(ADD, DL, i64, hi_a, hi_b)
hi_r = DAG.getNode(ADD, DL, i64, hi_r, carry)
result = DAG.getNode(BUILD_PAIR, DL, i128, lo_r, hi_r)

For CTLZ (case 53), expansion builds an all-ones mask, AND chain, and shift sequence. For SINT_TO_FP/UINT_TO_FP (cases 59/60), the helper sub_20B5C20 performs iterative two-way splitting: it finds the half-type, builds the pair, and recursively legalizes each half.

The ExpandIntegerResult handler at sub_201BB90 (75KB, 632 case labels) is itself a major function that dispatches expansion for specific opcodes including STORE (case 77), shifts (81--93), and atomics.

Soften (Float-to-Integer Emulation)

Softening converts unsupported FP operations to integer-based library call sequences. On NVPTX this primarily affects f128 (which has no hardware support on any SM) and f16 on SM < 53. The softened path at sub_2019DA0 (18KB) dispatches via the SoftenedFloats DenseMap.

The FADD/FMUL cases (74/75 in the main switch) compute twice the bit width, find the promoted FP type, and build SUB (opcode 54) / SRL (opcode 123) chains that implement the FP operation in integer arithmetic.

Scalarize and Split Vector

Vector legalization proceeds through recursive halving:

v8f32  -> split -> 2x v4f32
v4f32  -> split -> 2x v2f32
v2f32  -> scalarize -> 2x f32    (v2f32 is NOT legal on NVPTX)

v4f16  -> split -> 2x v2f16     (LEGAL on SM 70+ -- stops here)
v8f16  -> split -> 2x v4f16 -> 4x v2f16

v4i8   -> LEGAL (packed in Int32HalfRegs, no split needed)
v8i8   -> split -> 2x v4i8     (one split, then legal)

The splitting strategy follows LLVM's standard approach:

  1. Determine half type: v4f32 splits to v2f32 via EVT::getVectorVT(scalar_element, count/2) (sub_1F58CC0)
  2. Split operands: Look up the SplitVectors DenseMap to get {Lo, Hi} halves from the input's own legalization
  3. Apply operation: Lo_result = DAG.getNode(opcode, DL, half_type, Lo_op1, Lo_op2), and similarly for Hi
  4. Record result: Store {Lo_result, Hi_result} in the SplitVectors DenseMap via sub_20167D0

The critical observation for NVPTX: v2f32 is not legal (no 64-bit packed float register class), so v4f32 ends up fully scalarized to 4x f32. In contrast, v4f16 on SM 70+ splits to 2x v2f16 which is legal, enabling the f16x2 packed instruction path.

Master Opcode Dispatch (sub_20019C0)

The main body of sub_20019C0 is a switch on *(int16_t *)(node + 24) -- the ISD opcode of the current SDNode. Approximately 50 cases are handled:

CaseISD OpcodeAction
10LOADlegalizeLoad -- type-aware load splitting
11STOREIterative type demotion loop (see below)
20--21, 26Generic arithmeticPromote via sub_1D38BB0 (getConstant)
27EXTRACT_ELEMENTSplit + re-extract
29BUILD_PAIRPromote to i32
48BITCASTPromote or expand depending on isSimple()
49EXTRACT_SUBVECTORExtract + rebuild via TRUNCATE (opcode 145)
50INSERT_SUBVECTORLow/upper split via ANY_EXTEND (143) / ZERO_EXTEND_INREG (144)
51CONCAT_VECTORSIterate operands, copy each to result list
53CTLZ / CTPOPExpand via mask-then-shift (AND=120, ADD=52)
54ATOMIC_CMP_SWAPFull promote path: check legality table, fallback to libcall
55--56SIGN_EXTEND_INREG / SMINLegality check via TLI + 259*VT + opcode + 2422
57--58FP_TO_SINT / FP_TO_UINTChain of promote + expand nodes
59--60SINT_TO_FP / UINT_TO_FPIterative split via sub_20B5C20
70, 72FMINNUM / FMAXNUMBUILD_PAIR (opcode 0x89) reassembly
74--75FADD / FMULPromote to wider FP type
77FMAExtend operands, FMA at wider type, round back
105BUILD_VECTORDelegate to sub_1FEC5F0
106EXTRACT_VECTOR_ELTCheck vector element count, dispatch
108MGATHER / MSCATTERLoad/store with alignment fixup via sub_20BD400
110VSELECTElement-by-element type demotion loop
112--113SETCCLegality check with swapped-direction fallback
114--117VECREDUCE_*Opcode lookup in dword_42FEAE0, chain to VECREDUCE
122--124SHL / SRL / SRAIterative width expansion
125--126ROTL / ROTR4-way split: shift + mask + OR
136BR_CCUses CC action table at offset +18112
152ATOMIC_LOAD_*Delegate to sub_20B7F50 (atomic promote)
153ATOMIC_CMP_SWAP_WITH_SUCCESSFull CAS expansion with APInt mask
199--200INTRINSIC_W_CHAIN / INTRINSIC_WO_CHAINTLI+112 check, intrinsic lowering dispatch
211UNDEFReplicate zero-constant to fill operand count
243TOKEN_FACTORDuplicate single operand to all slots

Cases not listed fall through to LABEL_25 (node already legal or handled by a different legalization category).

Store Iterative Demotion (Case 11)

The STORE case contains an explicit type-walking loop that searches downward for a legal store type:

// Reconstructed from case 11, lines ~2077-2095
while ((vt_byte - 8) > 1) {          // while VT is not f16(8) or f32(9)
    --vt_byte;                        // try next smaller type
    if (TLI.getTypeAction(VT))        // sub_1D16180
        if (TLI.isOperationLegal(STORE, VT))
            break;                    // found a legal store type
}

This walks i64 -> i32 -> i16 -> i8 (or f64 -> f32 -> f16) until it finds a type the target can store natively, then emits a truncating store sequence via sub_1D3C080 (getTruncStore).

Atomic CAS Expansion (Cases 54, 153)

Atomic operations receive extensive legalization because PTX has limited atomic type support. The CAS expansion at case 153 (ATOMIC_CMP_SWAP_WITH_SUCCESS) builds APInt masks via sub_16A4EF0, constructs compare-and-swap loops, and handles the success flag as a separate result. The helper sub_20B7E10 decides whether to use a CAS loop or a direct atomic based on the target SM's capabilities.

Vector Legalization Workers

SplitVectorResult (sub_2029C10)

This thin dispatcher reads the opcode from *(uint16_t *)(node + 0x18), subtracts base 0x30 (48), and dispatches across 190 cases (opcodes 48--237) to SplitVecRes_XXX workers. Key handler categories:

HandlerCasesDescription
sub_20230C0FADD--FREM, SHL/SRA/SRL, int arithGeneric binary op split: split both inputs, apply op to each half
sub_2028A10CONCAT, INSERT_ELT, load/store variantsUnary/multi-input split with reassembly
sub_2025910Strict FP (cases 81--98)Strict FP split with exception chain propagation
sub_2023B70BUILD_VECTOR (case 104)Split BUILD_VECTOR into two half-width constructs
sub_2023F80CONCAT inner (case 107)Trivial: return two operands as Lo and Hi
sub_20293A0VECTOR_SHUFFLE (case 110, 10KB)Decompose shuffle into sub-shuffles on half-width vectors
sub_20251A0VSELECT, EXTRACT_ELTSplit condition mask along with operands
sub_2025380Extending loads (cases 149--151)Split load into two half-width loads

Four handlers in the 0x214xxxx range are NVPTX-specific split workers not present in upstream:

HandlerOpcodeNVPTX-Specific Behavior
sub_2146BB0CONCAT_VECTORSChecks VT range 0x0E--0x6D for packed-type dispatch
sub_2146C90SELECT_CC / BR_CC (2.7KB)Multi-operand split with per-operand type classification
sub_2147770FP_ROUND-likeNVPTX-specific FP rounding split
sub_2147AE0BITCASTNVPTX-specific bitcast split for packed registers

After a handler returns, the dispatcher stores the {Lo, Hi} result pair in the SplitVectors DenseMap via sub_20167D0 (hash = 37 * key, quadratic probing, rehash at 75% load).

Fatal error on unhandled opcode: "Do not know how to split the result of this operator!" via sub_16BD130.

SplitVectorOperand (sub_202E5A0)

Same dispatch pattern as SplitVectorResult but for operand-side legalization. Base opcode 0x65 (101), range 157 (opcodes 101--258). Notable inline handling for FP_EXTEND/FP_ROUND (cases 146--147, 152--153) that compares source and destination type sizes to choose the correct split strategy:

// Inline in SplitVectorOperand, cases 146-147
src_size = getSizeInBits(src_vt);    // sub_2021900
dst_size = getSizeInBits(dst_vt);
if (dst_size < src_size)
    SplitVecOp_VSELECT(...)          // sub_202D8A0 -- shrinking
else
    SplitVecOp_Generic(...)          // sub_202A670 -- standard split

After the handler, ReplaceAllUsesOfValueWith (sub_2013400) substitutes the old node with the split result.

Scalarize and Widen

ScalarizeVectorResult (sub_2036110) handles vector types that reduce to scalar. ScalarizeVectorOperand (sub_2035F80) has 80 cases starting from base opcode 106. These cover the final step when splitting has reduced a vector to width 1 or 2 elements, and those elements must become individual scalars.

WidenVector (sub_2036AE0, 31KB) sees limited use on NVPTX. Widening is only useful when the wider type is legal:

  • Widening v1f16 to v2f16 is useful (promotes to legal packed type)
  • Widening v3i8 to v4i8 is useful (promotes to legal packed type)
  • Widening v3f32 to v4f32 is not useful (v4f32 is still illegal)

The WidenVector path uses the MVT lookup table at word_4305480 to determine element counts and find the nearest wider legal vector type.

Operation Legalization (sub_1FFB890)

After type legalization, operation legalization processes each node through a per-opcode action lookup. The same primary action table is used:

action = *(uint8_t *)(TLI + 259 * VT + opcode + 2422)

The dispatch:

ActionCodePath
Legal0Return immediately
Custom1TLI->LowerOperation(node, DAG) via vtable slot #164 (offset +1312)
Expand2sub_20019C0 (LegalizeTypes), then sub_1FF6F70 (ExpandNode) as fallback
LibCall3sub_1FF6F70 (ExpandNode) directly
Promote4Find larger legal type, rebuild node
Special5+sub_1FF9780 (ExpandLoad) or sub_1FF5310 (LegalizeLoadOps) for load/store variants

When Custom lowering returns NULL, the framework falls through to expansion. When it returns a different node, ReplaceAllUsesWith splices the replacement into the DAG and marks the old node dead (tombstone value -2 in the worklist hash set).

The operation legalizer also contains an outer switch on the ISD opcode (v11 = *(uint16_t *)(node + 24)) for opcode-specific handling before the table lookup. Shift/rotate opcodes (81--98) are remapped to internal opcode numbers before the table lookup (e.g., case 81 maps to internal opcode 76, case 82 to 77). The opcode-specific dispatch covers approximately 30 opcode groups.

How CUDA Vector Types Get Legalized

Tracing common CUDA types through the full legalization pipeline:

float4 (v4f32) -- fully scalarized on every SM:

  1. SplitVectorResult: v4f32 -> 2x v2f32
  2. ScalarizeVectorResult: v2f32 -> 2x f32 (no packed f32 register class)
  3. Final: 4 independent f32 scalar operations
  4. PTX: 4 separate add.f32 / mul.f32 instructions

half2 (__half2 / v2f16) -- stays packed on SM 70+:

  1. Legal type, no splitting needed
  2. Final: single v2f16 packed operation
  3. PTX: add.f16x2, mul.f16x2, fma.rn.f16x2

__nv_bfloat162 (v2bf16) -- legal on SM 80+:

  1. Same as half2 but with bf16x2 PTX instructions

float2 (v2f32) -- scalarized, not packed:

  1. ScalarizeVectorResult: v2f32 -> 2x f32
  2. No 64-bit packed float register class exists

v4f16 on SM 70+:

  1. SplitVectorResult: v4f16 -> 2x v2f16 (legal -- stops here)
  2. Final: 2x f16x2 packed operations (2x throughput vs scalarized)

v4f16 on SM < 53:

  1. Split: v4f16 -> 2x v2f16
  2. Scalarize: each v2f16 -> 2x f16
  3. Promote: each f16 -> FP_EXTEND -> f32
  4. Final: 4x f32 operations with FP_EXTEND/FP_ROUND wrappers

double2 (v2f64):

  1. Scalarize: v2f64 -> 2x f64 (splitting would give v1f64 which is scalar)

Tensor core fragments bypass vector legalization entirely. WMMA/MMA intrinsics represent matrix fragments as individual scalar registers, not LLVM vector types. However, packed conversion types used with tensor cores (e4m3x2, e5m2x2, e2m1x2, etc.) do pass through legalization and map to Int32HalfRegs.

Verification Infrastructure

sub_2010FB0 (62KB) implements DAGTypeLegalizer::PerformExpensiveChecks, gated by the enable-legalize-types-checking flag (registered at ctor_341). It validates nine DenseMap categories that track the state of every legalized value:

MapContent
PromotedIntegersValues widened to a larger integer type
ExpandedIntegersValues split into two halves
SoftenedFloatsFP values converted to integer representation
PromotedFloatsFP values widened to a larger FP type
ExpandedFloatsFP values split into halves
ScalarizedVectorsVectors reduced to scalar elements
SplitVectorsVectors split into {Lo, Hi} pairs
WidenedVectorsVectors widened to a larger legal type
ReplacedValuesValues replaced by RAUW

Diagnostic strings on verification failure: "Processed value not in any map!", "Value in multiple maps!", "Value with legal type was transformed!".

DAG Node Builder Subroutines

Key subroutines called from the type legalizer for constructing replacement DAG nodes:

FunctionUpstream EquivalentNotes
sub_1D309E0DAG.getNode(opc, DL, VT, op)1-operand (TRUNCATE, ANY_EXTEND, etc.)
sub_1D332F0DAG.getNode(opc, DL, VT, op1, op2)2-operand
sub_1D3A900DAG.getNode(opc, DL, VT, op1, op2, op3)3-operand (FMA)
sub_1D38BB0DAG.getConstant(val, DL, VT)Integer constant creation
sub_1D38970DAG.getConstant(APInt)Wide constant / all-ones mask
sub_1D364E0DAG.getUNDEF(VT)Undefined value
sub_1D37440DAG.getSetCC(DL, VT, LHS, RHS, CC)Comparison node
sub_1D36A20DAG.getSelectCC(DL, VT, ..., CC)Select-on-comparison
sub_1D3BC50DAG.getExtLoad(opc, DL, VT, ...)Extending load
sub_1D3C080DAG.getTruncStore(...)Truncating store
sub_1D23890DAG.ReplaceAllUsesWith(old, new)RAUW for result replacement
sub_1FEB8F0MVT::getSizeInBits(SimpleVT)Bit width from SimpleVT
sub_1F58D40EVT::getSizeInBits()Bit width from extended VT
sub_1F58D30EVT::getVectorNumElements()Vector element count
sub_1F40B60TLI.getTypeToPromoteTo(opc, VT)Promotion target lookup
sub_1D16180TLI.getTypeAction(VT)Action for type
sub_1D16EF0TLI.getCondCodeAction(CC, VT)Condition code legality

Result Accumulation and Worklist

Results from each legalization step are accumulated into a SmallVector of {SDValue, SDValue} pairs (node pointer + result index). The vector grows via sub_16CD150 (SmallVector::grow()) when count exceeds capacity. After each pass, new nodes feed back into the worklist for iterative re-legalization until fixpoint -- all types are legal.

The worklist hash set uses open addressing with hash function ((id >> 9) ^ (id >> 4)) & (size - 1) and grows at 75% load factor. Dead nodes are marked with sentinel -2 (tombstone). The DenseMap instances used by the split/scalarize infrastructure use hash 37 * key with quadratic probing.

Differences from Upstream LLVM

AspectUpstream LLVM 20CICC v13.0
Source organization4 files, ~16,000 lines total1 monolithic function, 10,739 lines (348KB)
Vector legal typesTarget-dependent, often includes v4f32, v2f64Only v2f16, v2bf16, v2i16, v4i8 (32-bit packed)
v2f32Legal on most targets (x86, ARM)Illegal -- scalarized
Scalable vectorsActively used (AArch64 SVE)Encoded in tables but no SM target uses them
i128Expanded on most targetsLegal on SM 70+ (Int128Regs / .b128 / %rq)
NVPTX-specific split handlersN/A4 functions in 0x214xxxx range for packed-type dispatch
Custom-promotion BSTStandard red-black treeSame, at TLI offsets +9257/+9258
Type-supported flag arrayPointer array at known offsetAt TLI + 8*VT + 120
CC action table4-bit packed nibblesSame encoding, NVPTX-specific CC legal set

The monolithic structure means that code changes to any legalization category (integer promote, float soften, vector split) require recompilation of the entire 348KB function. In upstream LLVM, these are independent compilation units.

Configuration

KnobLocationDefaultDescription
enable-legalize-types-checkingctor_341falseEnables PerformExpensiveChecks debug verifier

No CICC-specific legalization knobs beyond the standard LLVM flag were found. The ptxas assembler has a related knob MercuryDisableLegalizationOfTexToURBound for texture-to-uniform-register legalization, but this operates at the assembler level, not in CICC.

Key Functions

FunctionAddressSizeRole
Type legalizer monolithsub_20019C0348KBDAGTypeLegalizer::run() master dispatch
PromoteIntegerResultsub_200010045KBInteger type promotion
PromoteFloatResultsub_2019DA018KBFloat type promotion / softening
ExpandFloatResultsub_201B41011KBFloat type expansion
ExpandIntegerResultsub_201BB9075KBInteger type expansion (632 case labels)
Promote+expand dispatchsub_201E5F081KBSecondary dispatch (441 case labels)
PerformExpensiveCheckssub_2010FB062KBDebug verifier for 9 DenseMap categories
SplitVectorResultsub_2029C105KBDispatcher for 190 opcode cases
SplitVectorOperandsub_202E5A06KBDispatcher for 157 opcode cases
SplitVecRes_BinOpsub_20230C0--Generic binary op split
SplitVecRes_VECTOR_SHUFFLEsub_20293A010KBShuffle decomposition
ScalarizeVectorResultsub_2036110--Vector-to-scalar reduction
ScalarizeVectorOperandsub_2035F80--Operand scalarization (80 cases)
WidenVectorsub_2036AE031KBVector widening (limited NVPTX use)
Operation legalizersub_1FFB890169KBLegalizeOp per-node action dispatch
ExpandNodesub_1FF6F7043KBFull node expansion fallback
ExpandLoadsub_1FF978055KBLoad legalization
LegalizeLoadOpssub_1FF531041KBStore splitting/coalescing
NVPTX split: CONCATsub_2146BB0219BNVPTX-specific CONCAT_VECTORS split
NVPTX split: SELECT_CCsub_2146C902.7KBNVPTX-specific SELECT_CC split
NVPTX split: FP_ROUNDsub_2147770--NVPTX-specific FP rounding split
NVPTX split: BITCASTsub_2147AE0--NVPTX-specific bitcast split
NVPTXTargetLowering initsub_331467073KBPopulates legality tables
FP conversion split helpersub_20B5C20--Iterative SINT_TO_FP/UINT_TO_FP
Atomic promote helpersub_20B7F50--ATOMIC_LOAD promotion
CAS expansion decisionsub_20B7E10--CAS loop vs direct atomic
Gather/scatter alignmentsub_20BD400--MGATHER/MSCATTER alignment fixup

Reimplementation Checklist

  1. NVPTX legal type model. Define the narrow set of legal types dictated by PTX register classes (i1, i16, i32, i64, f32, f64, f16, bf16, v2f16, v2bf16, v2i16, v4i8, i128), with SM-gated legality: f16 arithmetic on SM 53+, v2f16 packed ops on SM 70+, v2bf16 on SM 80+, FP4/FP6 packed types on SM 100+.
  2. Primary legality table population. Build the 2D action table at TLI + 259 * VT + opcode + 2422 with per-opcode-per-type action bytes (0=Legal, 1=Custom, 2=Expand, 3=LibCall, 4=Promote), plus the type-supported flag array at offset +120, the promotion action table at offset +2681, and the condition-code action table at offset +18112 with 4-bit packed nibbles.
  3. Four legalization actions. Implement Promote (widen via ANY_EXTEND/ZERO_EXTEND, operate, TRUNCATE), Expand (split via shift-and-OR for integers, libcall for floats), Soften (integer emulation of unsupported FP types), and Scalarize/Split-Vector (decompose illegal vectors into scalar or half-width vector operations).
  4. Iterative fixpoint loop. Run the type legalizer worklist until every node in the DAG has only legal result and operand types, since each pass may create new nodes with illegal types (e.g., splitting a vector creates half-width vectors that may themselves require further splitting).
  5. Vector legalization for NVPTX. Handle the critical constraint that Int32HalfRegs is the only vector class (32 bits total): scalarize all vectors wider than 32 bits (v4f32, v2f32, v8i32, etc.) while keeping v2f16/v2bf16/v2i16/v4i8 legal. Implement the SplitVectorResult/SplitVectorOperand/ScalarizeVector dispatchers with their 190+/157+/~100 case switches.
  6. SimpleVT type encoding. Implement the bitwidth-to-SimpleVT conversion (11 instances in NVIDIA's monolith) and the ~100-case vector-element-type switch (6 instances) mapping MVT ranges 14--109 to their scalar element types.

Cross-References