Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Register Coalescing

NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.

Register coalescing in CICC v13.0 eliminates redundant copy instructions by merging the live ranges of their source and destination virtual registers. NVPTX's unlimited virtual register model (PTX has no fixed physical register file) changes the purpose of coalescing compared to CPU targets: rather than reducing physical register pressure to avoid spills, the goal is strictly copy elimination -- fewer mov instructions in the emitted PTX, which in turn gives ptxas a cleaner input with fewer live-range constraints to resolve during its own physical allocation. CICC runs two coalescing passes in sequence: the standard LLVM RegisterCoalescer at sub_2F71140 (which handles generic COPY pseudo-instructions) and a separate NVPTX-specific coalescer rooted at sub_34AF4A0 (which handles NVPTX copy instruction families in the opcode 440--503 range that the generic pass does not recognize). This page documents both, with emphasis on the NVPTX-specific pass where the bulk of the proprietary logic resides.

Standard LLVM RegisterCoalescersub_2F71140 (80KB, 2,190 lines)
NVPTX coalescing driversub_34AF4A0 (67KB, 2,373 lines)
Per-instruction coalesce attemptsub_34AE060 (28KB)
Interference checksub_34AA450 (11.5KB)
Block-level coalescingsub_34BAAF0 (31.7KB)
Live-out / weight computationsub_34B7280 (22KB)
Interval tree (red-black BST)sub_34A0610 (14.7KB)
Range rebuild after mergesub_34A46B0 (13KB)
Opcode -> copy-type mappingsub_3494EA0 (12.7KB)
Operand type classification tablebyte_444C4A0 (16-byte entries)
Address range0x3494EA0 -- 0x34BF740
Pass parameters(pass_obj*, func_info*, MF*, copy_limit, coalesce_limit)
Pass orderingAfter TwoAddressInstruction, before greedy RA

Why Coalescing Matters on a Virtual-Register Target

On CPU targets, coalescing reduces register pressure by allowing two virtual registers to share one physical register, potentially preventing a spill. On NVPTX the motivation is different. PTX is a virtual ISA with typed, unlimited registers (%r0, %r1, ... for 32-bit integers; %f0, %f1, ... for 32-bit floats). The "physical" allocation is deferred entirely to ptxas, which maps virtual registers to the hardware register file at kernel launch time based on occupancy targets. CICC's coalescing therefore serves three purposes:

  1. Copy elimination. Every mov instruction that survives into emitted PTX is dead weight -- it costs an issue slot and extends the live range of both source and destination. Coalescing removes these by unifying src and dst into a single virtual register.

  2. Reduced register name count. Even though PTX registers are virtual, ptxas must solve a graph-coloring problem on them. Fewer distinct register names (after coalescing merges equivalents) give ptxas a smaller interference graph and faster compilation.

  3. Cleaner SSA destruction. PHI elimination during the transition from SSA form to machine code inserts copies at every PHI edge. Many of these are immediately coalesceable because the PHI operand's live range does not extend past the copy point. The coalescer cleans up the mechanical output of PHI lowering.

Copies that the coalescer processes arise from three sources: PHI elimination copies, ABI/calling-convention .param register copies for kernel call boundaries, and sub-register operations (EXTRACT_SUBREG, INSERT_SUBREG).

Standard LLVM RegisterCoalescer (sub_2F71140)

CICC includes the stock LLVM RegisterCoalescer at sub_2F71140, registered as pass "register-coalescer" with debug output markers "Before register coalescing" / "After register coalescing". This pass handles the generic COPY pseudo-instruction (LLVM's TargetOpcode::COPY) using the standard worklist-driven algorithm from upstream.

The key LLVM knobs that apply to this instance:

KnobDefaultEffect
join-liveintervalstrueMaster enable for copy coalescing
join-splitedgessubtargetCoalesce copies on split critical edges
join-globalcopiessubtargetCoalesce copies that span basic blocks
terminal-ruletrueApply the terminal rule (copies at block ends)
verify-coalescingfalseVerify MachineInstrs before and after coalescing
late-remat-update-threshold100Batch live-interval updates when a def has many copy uses
large-interval-size-threshold100Intervals with more valnos than this are "large"
large-interval-freq-threshold256Stop coalescing a large interval after this many joins

The standard pass operates on COPY pseudo-instructions only. It does not understand NVPTX-specific move instruction families (opcodes 440--503), which is why the NVPTX-specific pass exists.

NVPTX-Specific Coalescer (sub_34AF4A0)

The proprietary coalescer at sub_34AF4A0 runs after the standard RegisterCoalescer and targets NVPTX copy instruction families that the generic pass skips. It operates on the MachineFunction representation and accepts two limit parameters beyond the standard pass/MF arguments: copy_limit (maximum number of copy instructions to consider) and coalesce_limit (maximum number of successful merges before bailing out). These are compile-time budget controls that prevent quadratic behavior on large functions.

Opcode Classification

The function sub_3494EA0 contains a giant switch statement mapping NVPTX instruction opcodes (range 1--0x12) to copy families in the 440--503 opcode range. Each family represents a distinct copy semantic:

  • Opcodes 440--443: Type-preserving moves within a single register class (i32-to-i32, f32-to-f32, etc.). These map from internal opcodes 12, 13, 15 in the operand-type classification table.
  • Opcodes 444--503: Cross-class moves, paired/wide register moves (128-bit pairs for tensor core paths), and ABI-related .param copies.

The return value is an __m128i pair encoding both the copy semantics and the register class constraints, which subsequent stages use to decide whether coalescing is legal.

Operand-type classification happens via sub_34961A0, which reads operands and classifies them through a lookup table at byte_444C4A0. Each entry in this table is 16 bytes:

struct OperandTypeEntry {
    uint8_t type_code;        // +0: 12=i32, 13=i64, 15=f32, etc.
    uint8_t size_class;       // +1: size in register-width units
    uint8_t register_bank;    // +2: bank identifier
    uint8_t constraint_flags; // +3: bit 0x10 = participates in coalescing
    uint8_t reserved[12];     // +4: padding/future use
};

The constraint flag at offset +3 (mask 0x10) gates whether the operand participates in coalescing at all. Operands with this bit cleared are excluded from the worklist.

Register Class Constraints

Coalescing is constrained to same-class merges. The NVPTX register classes are completely disjoint -- an Int32Regs (%r) register cannot coalesce with a Float32Regs (%f) register even though both are 32 bits wide. This is a consequence of PTX's typed register model: .reg .b32 %r0 and .reg .f32 %f0 are distinct storage locations from ptxas's perspective. The complete register class table and coalescing constraint flags are in Register Classes. All eight primary classes are same-class-only; Int128Regs is excluded from the coalescing worklist entirely (constraint flag cleared).

Cross-class copies (e.g., bitcasting an i32 to f32) use distinct cross-class copy opcodes (see the copy opcode table) and are never eliminated by the coalescer -- they must survive as explicit instructions in PTX.

Sub-Register Handling

NVPTX has a flat register file with no sub-register structure in the CPU sense. There are no %eax/%ax/%al hierarchies. The exception is wide register pairs: 128-bit values used by tensor core operations are represented as pairs of 64-bit registers. sub_3497B40 handles paired-register decomposition, and when coalescing the low half of a pair, the high half inherits corresponding constraints. The coalesce candidate record (248 bytes) stores sub-operand arrays at offset +16 (4 entries of 32 bytes each, inline SBO) specifically for tracking these pair relationships.

Coalescing Algorithm

The NVPTX coalescer follows the standard LLVM pattern of worklist-driven interval joining but uses proprietary data structures throughout.

Phase 1: Initialization (lines 494--617)

Load TargetInstrInfo, TargetRegisterInfo, and TargetSubtargetInfo from the MachineFunction vtables. Initialize approximately 15 open-addressing hash maps, 2 min-heaps, 3 interval trees (red-black BSTs), and 2 linked lists. The stack frame is approximately 4.5KB. Walk all basic blocks, filter virtual-register operands via sub_2DADC00 (the isVirtualRegister check), and collect copy instructions into the worklist hash.

Phase 2: Block-Level Scanning (lines 618--857)

For each basic block, walk instructions and identify NVPTX copy instructions (opcode field at instruction offset +68 equals 14 or 15). For each copy:

  1. Validate source type via sub_B10CD0 (extract register class).
  2. Check physical register constraints (vestigial on NVPTX but present in the code).
  3. Build a coalesce pair via sub_34A70E0, creating a 248-byte candidate record.

Track live-through registers per block using bitvectors.

Phase 3: Interference Graph Construction (lines 858--998)

Build the interval tree via sub_2DACB60 and sub_C8CD80. Cross-compare forward and backward interval lists via sub_2E564A0. Flatten into indexed format via sub_2E507D0. The result is a set of live intervals indexed by register number, stored in a red-black BST where each node is 448 bytes (0x1C0).

Phase 4: Worklist-Driven Coalescing (lines 1040--2092)

This is the core loop. Candidates are extracted from a min-heap ordered by register number (lowest first -- a standard LLVM heuristic that processes defs before uses in reverse postorder).

function CoalesceWorklistDriven(heap, intervals, hash_map):
    while heap is not empty:
        candidate = heap.extract_min()
        src_interval = lookup(hash_map, candidate.src_key)
        dst_interval = lookup(hash_map, candidate.dst_key)

        // Same-class check
        if register_class(src_interval) != register_class(dst_interval):
            continue

        // Interference check
        if CheckInterference(src_interval, dst_interval) != 0:
            push candidate to secondary_heap
            continue

        // Pre-coalesce validation
        if not ValidateCopy(candidate):
            push candidate to secondary_heap
            continue

        // Execute the merge
        merged = MergeIntervals(src_interval, dst_interval)
        RewriteOperands(candidate.copy_instr, merged)
        UpdateHashMap(hash_map, merged)

        // Verify and rebuild
        VerifyMergedInterval(merged)
        RebuildRanges(merged)

    // Double-buffer swap: retry with secondary heap
    swap(heap, secondary_heap)
    if secondary_heap was non-empty:
        repeat from top

The double-buffer swap (lines 2073--2093) alternates between two heaps (v373 and v376). After exhausting one worklist, the pass swaps and retries -- implementing the LLVM-style "iterate until convergence" pattern where an earlier merge may resolve interference that blocked a later merge.

Phase 5: Code Patching (lines 2095--2144)

For each coalesced pair, rewrite instruction operands:

  1. sub_349D6E0 -- look up the merged interval's representative register.
  2. sub_349FA50 -- find the instruction position.
  3. sub_2E31040 -- patch the operand's register field.
  4. Fix linked-list pointers using the ptr & 0xFFFFFFFFFFFFFFF8 mask (the low 3 bits encode tags on MachineOperand pointers: 0 = normal, 3 = tied operand, 4 = implicit operand).

Phase 6: Cleanup (lines 2145--2371)

Destroy interval trees (sub_349E8A0), perform final range rebuild (sub_34A46B0), finalize coalescing metadata (sub_34A2530), commit merged intervals (sub_34AA090), and deallocate all hash maps, heaps, and trees (16+ free calls).

Interference Check (sub_34AA450)

The interference check is the critical decision point. Given two intervals (identified by their register keys), it determines whether merging them would create a conflict -- that is, whether both registers are simultaneously live at any program point.

function CheckInterference(interval_A, interval_B) -> {0 = safe, 1 = interfering}:
    for each instruction I in interval_A.instruction_vector:
        if I is in the "already-coalesced" set:
            continue
        reg_class = extract_register_class(I)
        dst_interval = lookup(reg_to_interval_hash, I.dst_reg)
        if dst_interval overlaps with interval_B:
            return 1  // interfering
    return 0  // safe to coalesce

The "already-coalesced" set is an open-addressing hash map (pointer keys, hash (key >> 9) ^ (key >> 4), sentinels -4096/-8192). The sentinel check at a3+8 (a flag byte) determines whether the set uses inline or heap storage (small-buffer optimization for sets under approximately 8 entries).

Since NVPTX has no physical register file, "interference" here means purely that two virtual register live ranges overlap at a program point. On CPU targets this would also involve physical register conflict checks, but on NVPTX that dimension is absent.

Priority and Weight System

The coalescing priority determines the order in which candidates are processed when the min-heap's register-number ordering produces ties.

Weight computation (sub_34B7280):

weight = instruction_count + spill_weight[offset+240] + use_count[offset+252]

The flag at offset+254 & 1 guards weight computation: if set, the interval was pre-weighted by an earlier pass and the coalescer uses the existing weight rather than recomputing.

Higher weight means higher coalescing priority. The overall ordering is:

  1. Primary key: register number (min-heap, lowest first).
  2. Secondary key: weight (higher breaks ties in favor of more-used registers).

Block frequency integration: The pass reads a boolean from TargetPassConfig (via sub_35DDE70 at *(_QWORD*)(pass[4]+256)+856) that controls whether block frequency data influences priority. When enabled, copies in hot blocks receive higher priority, biasing the coalescer toward eliminating copies on the critical execution path.

Data Structures

Hash Maps

All hash maps use the standard DenseMap open-addressing infrastructure described in Hash Table and Collection Infrastructure. Two sentinel variants appear in this pass:

VariantKey TypeSentinel pair
Integer-keyint32_t-1 / -2 (hash: key * 37)
Pointer-keyint64_t-4096 / -8192 (hash: (key >> 9) ^ (key >> 4))

Growth policy: next_power_of_2(2 * old_capacity - 1), minimum 64 entries.

Allocator: sub_C7D670(size, alignment=8) / sub_C7D6A0(ptr, size, alignment=8) -- CICC's aligned malloc/free wrappers.

Interval Tree (Red-Black BST)

Managed by sub_34A0610. Each node is 448 bytes (0x1C0):

OffsetSizeField
+024Tree links (left, right, parent pointers)
+328Interval key (register/slot encoding)
+648Instruction vector pointer
+724Instruction count
+19216Debug name (SBO: inline if len <= 15)
+2004Sub-operand count
+2244Instruction opcode
+2402Priority/weight (uint16)

Comparator: sub_34A0190 (compares interval start positions). Rebalancing: sub_34A0330. The tree maintains count (a2[5]) and cached min/max (a2[3]/a2[4]).

Coalesce Candidate Record (248 bytes)

Built by sub_349AB40 for each potential coalescing opportunity:

OffsetSizeField
+08Source interval key
+88Destination interval key
+16128Sub-operand array (SBO, 4 entries x 32 bytes)
+64112Type-constraint array (SBO, 2 entries x 56 bytes)
+19232Debug name (SBO string)
+2244Opcode classification (1--6: copy, subreg, extract, ...)
+2324Copy source register
+2402Priority (default: 1)

MachineOperand Pointer Encoding

Throughout the coalescing code, MachineOperand pointers use low-bit tagging (8-byte alignment guarantees 3 unused low bits):

Tag (ptr & 7)Meaning
0Normal operand
3Tied operand (requires special coalescing -- both operands must map to same register)
4Implicit operand (flag bit at operand offset +44, bit 3)

The code consistently masks with & 0xFFFFFFFFFFFFFFF8 before dereferencing and checks (ptr & 7) == 3 or (ptr & 4) != 0 for branching decisions.

CSSA Coalescing (PHI-Specific)

Separate from the two coalescing passes above, CICC includes a CSSA (Conventional SSA) coalescing stage controlled by the cssa-coalesce knob (constructor at ctor_705, address 0x5BD430). This pass operates at the SSA level rather than the machine level, coalescing PHI operands before PHI elimination to reduce the number of copies that PHI lowering generates. Associated knobs:

KnobEffect
cssa-coalesceEnable/disable PHI operand coalescing
cssa-verbosityVerbosity level for CSSA debug output
dump-before-cssaDump IR before CSSA coalescing
usedessaSelect deSSA method (alternative to CSSA)

Knobs and Thresholds Summary

KnobSourceDefaultEffect
join-liveintervalsLLVMtrueMaster enable for standard RegisterCoalescer
join-splitedgesLLVMsubtargetCoalesce on split critical edges
join-globalcopiesLLVMsubtargetCoalesce cross-block copies
terminal-ruleLLVMtrueTerminal rule for block-end copies
verify-coalescingLLVMfalsePre/post verification
late-remat-update-thresholdLLVM100Batch remat update threshold
large-interval-size-thresholdLLVM100Large interval valno threshold
large-interval-freq-thresholdLLVM256Large interval coalesce limit
twoaddr-rescheduleLLVM--Coalesce copies by rescheduling in TwoAddress
copy_limitNVPTXruntimeMax copies to consider in NVPTX pass
coalesce_limitNVPTXruntimeMax merges before bailout in NVPTX pass
cssa-coalesceNVPTX--PHI operand coalescing
cssa-verbosityNVPTX--CSSA debug verbosity
block frequency flagNVPTXconfigWeight copies by block hotness

The copy_limit and coalesce_limit parameters are passed into sub_34AF4A0 at call time (not static cl::opt knobs). Their values come from the pass pipeline configuration and serve as compile-time budget caps to avoid quadratic worst-case behavior on functions with thousands of copies.

Impact on ptxas

The quality of CICC's coalescing directly affects ptxas's register allocation phase:

  • Fewer virtual registers means a smaller interference graph for ptxas to color, reducing its compilation time.
  • Eliminated copies reduce instruction count, giving ptxas's scheduler more freedom and fewer false dependencies.
  • Preserved type invariants (no cross-class coalescing) ensure ptxas never encounters type-inconsistent register usage, which would require additional conversion instructions.
  • Wide register pair tracking ensures tensor core instruction patterns remain intact -- ptxas expects specific register pair relationships for mma and wmma instructions.

A pathological case is over-aggressive coalescing that creates very long live ranges spanning many basic blocks. On NVPTX this does not cause spills (there is no physical register file to spill from), but it can increase ptxas's reported register usage, reducing occupancy. The coalesce_limit parameter and the large-interval frequency threshold exist partly to avoid this scenario.

Function Map

FunctionAddressSizeRole
Main NVPTX coalescing driversub_34AF4A067KB--
Per-instruction coalesce attemptsub_34AE06028KB--
Pre-coalesce validation (opcode 14/15 check)sub_34AB5C016KB--
Post-coalesce update (rewrite def-use chains)sub_34AC81019KB--
Constrained-copy validation variantsub_34AD8B08.5KB--
Interference checksub_34AA45011.5KB--
Range rebuild (bitvector v90[12336])sub_34A46B013KB--
Interval equivalence verifysub_34A27707.3KB--
Interval tree insert/rebalance (RB-tree)sub_34A061014.7KB--
Register-to-interval hash lookupsub_34A39102.7KB--
Build worklist from BB operand scansub_34A3D105KB--
Build worklist from instruction iterationsub_34A41A04.8KB--
Block-level coalescing driversub_34BAAF031.7KB--
Live-out analysis + weight computationsub_34B728022KB--
Per-register interference buildsub_34B662017.7KB--
Operand-type classificationsub_34961A026.6KB--
Register-pair decompositionsub_3497B4016.5KB--
Opcode -> copy-type mapping (switch)sub_3494EA012.7KB--
Build coalesce candidate listsub_349AB4024.5KB--
Merged-interval representative lookupsub_349D6E0----
Instruction position lookup/creationsub_349FA507.1KB--
Interval tree destructor (variant A)sub_349E3304KB--
Interval tree destructor (variant B)sub_349E5004KB--
Interval tree destructor (variant C)sub_349E6D04KB--
Interval tree destructor (variant D)sub_349E8A04KB--
Interval info populate from instructionsub_349F1404.7KB--
Interval structure resetsub_349F7404KB--
Generic map cleanup (callback sub_349D600)sub_34A2010----
Finalize coalescing metadatasub_34A2530----
Commit merged intervalssub_34AA090----
Secondary coalesce commitsub_34A9A60----
Register info initializersub_35065A0----
Standard LLVM RegisterCoalescersub_2F7114080KB--
RegisterCoalescer::getPassNamesub_2F60C50----

Differences from Upstream LLVM

AspectUpstream LLVMCICC v13.0
Number of passesSingle RegisterCoalescer pass handling COPY pseudo-instructionsTwo passes in sequence: stock LLVM RegisterCoalescer (sub_2F71140) + NVPTX-specific coalescer (sub_34AF4A0)
Opcode coverageHandles only TargetOpcode::COPY (generic copy pseudo)NVPTX pass handles NVPTX copy instruction families in opcode range 440--503 that the generic pass does not recognize
Coalescing goalReduce physical register pressure to prevent spillsStrictly copy elimination (PTX has unlimited virtual registers); goal is fewer mov instructions in emitted PTX and smaller interference graphs for ptxas
Interference checkStandard LiveIntervals queryCustom interference check (sub_34AA450, 11.5 KB) with interval tree (red-black BST at sub_34A0610) for NVPTX register classes
Block-level coalescingPart of the unified worklistSeparate block-level coalescing pass (sub_34BAAF0, 31.7 KB) processes copies within each block before cross-block coalescing
Operand classificationGeneric operand handlingCustom operand type classification table (byte_444C4A0, 16-byte entries) maps NVPTX opcode families to copy semantics
Pass parametersStandard runOnMachineFunction with no limitsParameterized with explicit (copy_limit, coalesce_limit) bounds for compile-time control on large kernels

Cross-References

  • Register Allocation -- the greedy allocator that runs after coalescing; shares the register class table and interference hash pattern.
  • Instruction Scheduling -- scheduling runs after RA and benefits from reduced copy count; MRPA pressure tracking is affected by coalescing decisions.
  • LLVM Knobs -- full knob inventory including all coalescing-related flags.
  • Code Generation -- pipeline ordering showing where coalescing fits relative to other machine passes.