Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Register Allocation

Note: This page documents the embedded ptxas copy within nvlink v13.0.88. The standalone ptxas binary has its own comprehensive wiki -- see the ptxas Reverse Engineering Reference for the full compiler reference. For the standalone ptxas register allocator, see [ptxas Register Allocation overview](../../ptxas/regalloc/overview.html, algorithm, and spilling.

The register allocation subsystem in nvlink's embedded ptxas backend occupies approximately 400 KB of code across two primary address ranges: 0x189C000--0x18FC000 (core regalloc, ~120 functions) and 0x18FC000--0x191A000 (setmaxnreg/CTA-reconfig, ~55 functions). An additional ~120 verification functions at 0x196C000--0x1A00000 validate allocation correctness post-hoc. Together these form the largest single pass in the compiler backend -- the top-level driver alone (AllocateRegisters_main_driver at 0x18988D0) is 71 KB, and the per-instruction encoding function at 0x18AE2D0 is 155 KB, the largest function in the entire 1.7 MB backend core region.

Register allocation follows a graph-coloring model with iterative spilling, operating across three distinct register classes simultaneously: general-purpose R-registers (R0--R255), uniform UR-registers (UR0--UR63, SM75+), and predicate registers (P0--P6). The allocator supports two alternative spill targets -- local memory (lmem, the traditional spill slot) and shared memory (smem, an NVIDIA-proprietary optimization controlled by a ROT13-obfuscated knob). For Blackwell and later architectures (SM100+), the system integrates with the setmaxnreg CTA register reconfiguration infrastructure that enables dynamic register budget adjustment within a kernel.

Key Facts

PropertyValue
Main driversub_18988D0 (AllocateRegisters_main_driver) at 0x18988D0 (70,715 bytes / 2,408 lines)
Per-instruction encodersub_18AE2D0 at 0x18AE2D0 (155,321 bytes / 4,005 lines) -- largest function in region
Full pipelinesub_18E54B0 (AllocateRegisters_full_pipeline) at 0x18E54B0 (75,131 bytes / 2,738 lines)
Graph coloring coresub_189C3E0 at 0x189C3E0 (47,807 bytes / 1,734 lines, self-recursive)
Operand field encodersub_189F300 at 0x189F300 (37,728 bytes / 1,680 lines) -- 250-case switch encoding operand attributes into packed bitfields
No-spill passsub_18B3AE0 at 0x18B3AE0 (45,720 bytes / 1,525 lines)
SMEM spill driversub_18C8790 at 0x18C8790 (20,336 bytes / 764 lines)
Budget negotiationsub_18E3530 at 0x18E3530 (32,949 bytes / 1,250 lines, self-recursive)
setmaxnreg top-levelsub_1912A00 at 0x1912A00 (33,087 bytes / 1,091 lines)
Post-alloc verificationsub_19D6730 at 0x19D6730 (76,335 bytes / 2,728 lines)
Resource usage reportersub_140A6B0 at 0x140A6B0 (5,462 bytes / 183 lines)
Register classesR-regs (general), UR-regs (uniform), predicates (P)
SMEM spill knobranoyr_fzrz_fcvyyvat (ROT13 of enable_smem_spilling)
Total regalloc functions~120 core + ~55 setmaxnreg + ~120 verification = ~295

Pipeline Overview

The register allocation pipeline proceeds through eight stages. Each stage may iterate multiple times if spilling is required and the initial allocation fails.

  AllocateRegisters_main_driver (0x18988D0)
    |
    |  1. Classify register classes for each virtual register
    |     regalloc_classify_register_class (0x189B2D0)
    |
    |  2. Compute live ranges per basic block
    |     regalloc_compute_live_ranges (0x18A0DB0)
    |
    |  3. Build interference graph
    |     regalloc_build_interference_graph (0x189E600)
    |
    |  4. Graph coloring (self-recursive)
    |     regalloc_graph_coloring_core (0x189C3E0)
    |       -> operand field encoder (0x189F300)
    |       -> coalescing (0x189BE00)
    |       -> splitting (0x18A6860)
    |
    |  5. Iterative spill / re-allocate loop
    |     AllocateRegisters_iterative_spill (0x18DEA00)
    |       -> spill pass full (0x18DB260)
    |       -> SMEM spill driver (0x18C8790) [if eligible]
    |       -> spill-retry with budgets (0x18F51E0)
    |
    |  6. Per-instruction encoding with physical registers
    |     AllocateRegisters_per_instruction_encode (0x18AE2D0)
    |     AllocateRegisters_encode_all_instructions (0x18EF990)
    |
    |  7. Budget negotiation + setmaxnreg (Blackwell+)
    |     AllocateRegisters_negotiate_budget (0x18E3530)
    |     setmaxnreg_top_level_driver (0x1912A00)
    |
    |  8. Final reporting + verification
    |     AllocateRegisters_final_reporting (0x18F9A60)
    |     AllocateRegisters_full_verification (0x19D6730)
    v
  Allocated physical registers encoded into all instructions

Graph Coloring Core

The graph coloring allocator at sub_189C3E0 (48 KB, 1,734 lines) implements a Chaitin-Briggs-style interference-graph coloring algorithm. It is self-recursive -- the function calls itself during splitting and re-coloring attempts. The call chain:

  1. Live range computation (sub_18A0DB0, 13.5 KB) -- iterates basic blocks building per-register live intervals using an iterative dataflow analysis. Each virtual register receives a live range spanning from its definition point to its last use.

  2. Interference graph construction (sub_189E600, 11.5 KB) -- for each pair of simultaneously-live virtual registers, adds an interference edge. The interference graph is represented as a bitmatrix or adjacency list (not directly visible from decompilation, but the 4 vtable calls suggest an abstract graph interface).

  3. Coloring -- the recursive core assigns physical registers (colors) to virtual registers while respecting interference edges. When coloring fails, it selects a spill candidate based on the spill cost computation.

  4. Operand field encoding (sub_189F300, 38 KB) -- encodes per-operand properties into packed bitfields via a 250-case switch on operand attribute IDs (91--341). Each case sets specific bits in a packed int[3] descriptor. Despite the wiki name "spill cost computation" in the Key Facts table, this function is structurally an operand attribute encoder, not a cost computation. The actual spill weight computation lives in sub_18C5470 and sub_18C5B30 (see Spilling and No-Spill Regalloc).

  5. Coalescing (sub_189BE00, 7.5 KB) -- attempts to merge virtual registers connected by copy instructions into a single physical register, eliminating the copy. Pre-coloring coalescing (aggressive) and post-coloring coalescing (conservative) paths are both present.

  6. Live range splitting (sub_18A6860, 7.7 KB) -- splits long live ranges into smaller segments that may be independently colorable, reducing interference.

Reconstructed Pseudocode: Graph Coloring Core (sub_189C3E0)

The following pseudocode is reconstructed from the 1,734-line decompiled function. The function operates as a single-pass instruction-stream walker that simultaneously builds a register-to-physical-register map, attempts coalescing, and detects coloring failures. [Confidence: medium -- control flow is clear, but some field semantics are inferred from offsets.]

function graph_coloring_core(alloc_state) -> bool:
    func_ir      = alloc_state.func_ir
    config       = func_ir.regalloc_config         // at offset +1600
    func_data    = config.func_data                 // at offset +16

    // ---- Phase 0: Guard and knob reads ----
    config.coloring_attempted = false
    alloc_state.result_pair   = {1, 1}              // optimistic: both halves succeed
    if instruction_count(func_ir) <= 3:
        return false                                // trivial function, nothing to color

    knob_reader = func_ir.knob_table                // vtable at offset +1664
    // Read 5 configuration knobs via vtable dispatch (knob IDs 308--312):
    //   knob 309: enable_paired_allocation (writes config+32)
    //   knob 308: enable_split_allocation  (writes config+33)
    //   knob 311: max_coloring_iterations  (writes config+36, default from config)
    //   knob 310: max_split_depth          (writes config+40, default from config)
    //   knob 312: conservative_coalescing  (writes config+1081)
    enable_paired   = read_knob_bool(knob_reader, 309)
    enable_split    = read_knob_bool(knob_reader, 308)
    max_iterations  = read_knob_int(knob_reader, 311, default=config.max_iterations)
    max_split_depth = read_knob_int(knob_reader, 310, default=config.max_split_depth)
    conservative    = read_knob_bool(knob_reader, 312)

    // ---- Phase 1: Validate preconditions ----
    if enable_split or not enable_paired or func_data.size <= 0x4000:
        return false                                // bail: incompatible config or tiny function

    // ---- Phase 2: Initialize register map ----
    alloc_state.best_phys_reg = -1
    alloc_state.coloring_ok   = true
    reg_class_table           = config + 1112       // 16 entries, 4 bytes each (class descriptors)
    // Initialize all 16 register class slots to value 25632 (0x6420 = default "unassigned" marker)
    for slot in 0..15:
        reg_class_table[slot].packed_value = 0x6420

    // Allocate FNV-1a hash map for virtual-to-physical register mapping
    //   Initial capacity = 8 buckets
    //   Each bucket: linked list of {next_ptr, vreg_id, phys_reg, hash} entries
    alloc_state.reg_map          = arena_alloc(100)  // hash map header
    alloc_state.reg_map_aux      = arena_alloc(100)  // auxiliary map
    reg_map_buckets              = 8
    reg_map_entries              = 0
    reg_map_distinct             = 0

    // Locate opcode indices for key instruction types
    for idx in 0..num_opcodes:
        if opcode_table[idx].opcode == 25:           // ENTER/PROLOGUE
            alloc_state.enter_opcode_idx = idx
        elif opcode_table[idx].opcode == 232:         // EXIT/EPILOGUE
            alloc_state.exit_opcode_idx  = idx

    // ---- Phase 3: Instruction walk (main loop) ----
    // Walk the instruction linked list, processing each instruction by opcode class.
    // Track basic-block boundaries, register definitions/uses, and coalescing opportunities.

    pending_insert = NULL
    bb_count       = 0                               // basic block counter
    bb_use_count   = 0                               // use count in current BB
    outer_use_cnt  = 0                               // use count before first BB
    prev_bb_id     = -1
    last_vreg_def  = 0
    last_vreg_use  = 0
    enter_sym_ref  = NULL
    exit_sym_ref   = NULL

    for inst in instruction_stream(func_ir):
        opcode = inst.opcode & 0xFFFFCFFF           // mask out modifier bits

        switch opcode:
            case 53, 42:    // BASIC_BLOCK_START, LABEL
                ++alloc_state.bb_counter
                // Flush any pending instruction insertion
                if pending_insert:
                    delete_instruction(func_ir, pending_insert, /*mode=*/1)
                // Record BB linkage and set up per-BB state
                func_ir.current_bb      = inst.linked_ir_node
                func_ir.current_bb_line = inst.line_number
                // Attempt register conflict check at BB boundary
                if not alloc_state.has_first_half:
                    if alloc_state.has_second_half:
                        goto attempt_bb_boundary_check
                    // else: fall through to record BB
                elif alloc_state.has_second_half:
                    goto attempt_bb_boundary_check
                else:
                    // First-half only: emit encoding for first half
                    phys_reg_encode(alloc_state, alloc_state.first_half_id,
                                   last_vreg_use | 0x10000000)
                    alloc_state.first_half_id = 1
                alloc_state.coloring_ok = checked_value
                pending_insert = inst
                prev_bb_id = -1
                continue

            case 55:    // BASIC_BLOCK_END
                bb_count++
                is_first_bb = (bb_count == 1)
                alloc_state.coloring_ok = (outer_use_cnt == bb_use_count) or is_first_bb
                // Validate iteration bound
                if func_ir.loop_depth > 1 and not is_first_bb and prev_bb_id != inst.bb_id:
                    goto coloring_failed
                // Flush pending insertion
                if pending_insert:
                    pending_insert_next = inst
                    delete_instruction(func_ir, pending_insert, 1)
                else:
                    pending_insert = inst
                outer_use_cnt = 0
                continue

            case 97:    // FUNCTION_CALL
                // Record callee symbol reference for ABI-aware allocation
                callee_sym = sym_table[inst.operand[0] & 0xFFFFFF]
                if not enable_paired:
                    inst = inst.next; continue       // skip if paired alloc disabled

                // Emit enter/exit register encoding for callee
                line = inst.line_number
                func_ir.current_bb      = inst
                func_ir.current_bb_line = line
                phys0 = compute_phys_encoding(func_ir, 0)
                emit_register_ref(alloc_state, func_ir, 130, 11, SENTINEL, phys0)
                last_vreg_use = alloc_state.result
                enter_sym_ref = func_ir.current_bb_ref

                phys1 = compute_phys_encoding(func_ir, 0)
                emit_register_ref(alloc_state, func_ir, 130, 11, SENTINEL, phys1)
                last_vreg_def = alloc_state.result
                exit_sym_ref  = func_ir.current_bb_ref
                enable_paired = false                // consume the enter/exit pair once
                continue

            case 288:   // BRANCH / JUMP
                // Track branch destinations for BB connectivity
                if bb_count:
                    ++outer_use_cnt
                else:
                    ++bb_use_count
                // Check operand type to validate branch target is colorable
                target_operand = inst.operand[inst.num_operands - 5]
                target_type    = classify_operand(target_operand)
                if target_type != 5:                 // not a valid register operand
                    goto coloring_failed
                // ... (destination register validation) ...
                continue

            case 91:    // REGISTER USE/DEF (the hot path)
                // This is the core register assignment logic.
                // Extract the register ID and its register class.
                reg_class = (inst.operand[inst.num_operands - offset].packed >> 1) & 3
                if reg_class != 0:
                    goto process_reg_class

                dest_reg_id = inst.operand[0] & 0xFFFFFF
                src_encoded = inst.operand[1]

                // --- Check if operand is a named physical register (type tag == 1) ---
                if operand_type_tag(src_encoded) == 1:
                    definer = sym_lookup(func_ir, src_encoded & 0xFFFFFF).defining_inst
                    if definer and definer.opcode in {110, 139}:
                        // MOV or COPY instruction -- potential coalescing target
                        if definer.opcode == 139:    // COPY
                            goto try_assign_from_copy
                        // MOV: check if src can be reached from current context
                        if is_reachable(definer.operands + 27, func_ir):
                            goto try_assign_from_copy
                        // Fallback: extract base address encoding
                        base_tag = (definer.operand[2].packed >> 28) & 7
                        if base_tag in {2, 3}:
                            phys_reg = lookup_phys_for_vreg(func_ir, definer.operand[2] & 0xFFFFFF)
                            goto try_assign_from_copy
                    elif definer and definer.opcode == 274:
                        // Indirect -- check addressing mode (field at offset)
                        if ((definer.operand[2*definer.field20 + 19] >> 8) & 0xF) == 8:
                            goto try_assign_from_copy
                    // All other cases: no physical register hint available
                    phys_reg = -1

                try_assign_from_copy:
                    // Hash the dest_reg_id using FNV-1a to locate it in the register map
                    hash = FNV1a_32(dest_reg_id)     // 0x811C9DC5 ^ byte0 * 16777619 ^ byte1 ...
                    bucket = hash % reg_map_buckets
                    entry  = reg_map[bucket].find(dest_reg_id)

                    if entry exists:
                        if entry.phys_reg != phys_reg:
                            entry.phys_reg = -1      // conflict: invalidate coloring for this vreg
                    else:
                        // Insert new entry into hash map
                        entry = alloc_entry(dest_reg_id, phys_reg, hash)
                        reg_map[bucket].insert(entry)
                        reg_map_entries++
                        reg_map_distinct++

                        // Resize hash map if load factor exceeded
                        if reg_map_entries > reg_map_distinct
                           and reg_map_distinct > reg_map_buckets / 2:
                            new_capacity = 4 * reg_map_buckets
                            rehash(reg_map, new_capacity)  // rehash all entries

                // --- Attempt physical register allocation ---
                phys = allocate_physical_register(alloc_state, inst.operand_ptr,
                                                  bb_count % max_iterations)
                bb_id = inst.bb_id
                if prev_bb_id != -1 and prev_bb_id != bb_id:
                    goto coloring_failed

                offset_val = compute_register_offset(func_data, operand_ptr)
                combined   = offset_val + sign_extend_24(inst.operand[num_ops - offset - 2])

                // --- Coalescing attempt (conservative mode) ---
                if not conservative:
                    coalesced = try_coalesce(alloc_state, combined, inst, reg_class_table,
                                             inst.operand_ptr)
                    if coalesced:
                        goto record_assignment

                // --- Register class slot assignment ---
                dest_operand = inst.operand[0]
                if operand_type_tag(dest_operand) == 1 and (dest_operand & 0xFFFFFF) != 0x29:
                    // 0x29 = RZ (zero register) -- skip assignment for RZ
                    definer = sym_lookup(func_ir, dest_operand & 0xFFFFFF).defining_inst
                    if not definer or (definer.flags & 0x10) != 0:
                        goto coloring_failed         // undefined or special register

                // Check if combined offset maps to a standard register class slot (28..31 or 176..183)
                slot = combined / 4
                if slot in [28..31] or slot in [176..183]:
                    alloc_state.coloring_ok = false   // special register, cannot freely color
                    goto check_and_advance

                // Validate against existing assignment in class table
                is_same_phys = (phys == combined) and not is_precolored(alloc_state, inst.operand_ptr)
                existing = reg_class_table[combined >> 2]
                if is_same_phys and not existing:
                    // First assignment: record in class table
                    record_class_assignment(config, class_index, slot_index, combined)
                    // Also record in per-slot hash
                    mark_slot_used(reg_class_table, combined, phys)
                    config.has_class_assignment = true

                // Validate consistency: if already assigned, compare
                if not alloc_state.coloring_ok:
                    goto check_and_advance

                record_assignment:
                    delete_instruction(func_ir, inst, 1)  // consume the instruction
                    prev_bb_id = saved_bb_id

            case 183:   // SPECIAL / BARRIER
                // Check for barrier register operand type
                alloc_state.coloring_ok = false       // assume failure
                if (inst.opcode & 0x1000) == 0:
                    // Extract and classify the barrier operand
                    operand  = inst.operand[inst.num_operands - 5]
                    op_type  = classify_operand(operand)
                    if op_type == 6:                  // barrier type
                        alloc_state.coloring_ok = true
                    elif:
                        // Check secondary operand
                        sec_operand = inst.operand[inst.num_operands - offset - 5]
                        sec_type    = classify_operand(sec_operand)
                        alloc_state.coloring_ok = (sec_type == 7)
                continue

            default:
                // Unknown opcode -- preserve current coloring status
                continue

        // Advance to next instruction
        if not alloc_state.coloring_ok:
            goto coloring_failed

    // ---- Phase 4: Post-walk finalization ----
    // If we reach here, the entire instruction stream was walked successfully.
    if bb_count % max_iterations != 0:
        goto coloring_failed

    alloc_state.coloring_ok = true

    // --- Emit enter/exit register references if function calls were found ---
    if pending_insert and alloc_state.bb_counter == 0:
        // Single-BB function with ABI calls: emit register save/restore pair
        func_ir.current_bb      = pending_insert
        func_ir.current_bb_line = pending_insert.line_number
        if has_first_half and not has_second_half:
            phys = compute_phys_encoding(func_ir, alloc_state.first_half_id)
            emit_register_ref(alloc_state, func_ir, 130, 11, last_vreg_use, phys)
            delete_instruction(func_ir, enter_sym_ref, 1)
        elif has_first_half and has_second_half:
            phys = compute_phys_encoding(func_ir, alloc_state.second_half_id)
            if exit_sym_ref.bb_id == pending_insert.bb_id:
                emit_register_ref(alloc_state, func_ir, 130, 11, last_vreg_def, phys)
                delete_instruction(func_ir, exit_sym_ref, 1)
            else:
                emit_register_pair(alloc_state, func_ir, 151, 11, last_vreg_def, last_vreg_def, phys)

    // --- Emit enter/exit encoding for multi-BB functions ---
    if has_first_half:
        config.has_first_half_encoding  = true
        if not has_second_half:
            config.has_second_half_encoding = true
        // Emit SETMAXNREG-style register count adjustment if needed
        emit_regcount_instruction(alloc_state, func_ir, alloc_state.enter_opcode_idx,
                                  exit_sym_ref, ...)
        emit_branch_fixup(alloc_state, func_ir, 288, 11, ...)

    // --- Walk exception/unwind table to emit register encoding there too ---
    for unwind_entry in func_ir.unwind_table:
        if (unwind_entry.opcode & 0xFFFFFFFD) == 0xBC:  // UNWIND_ENTRY
            emit matching register encoding for unwind path
            mark func_ir.unwind_entry_type = 7

    // ---- Phase 5: Commit or rollback ----
    if alloc_state.coloring_ok:
        if pending_insert:
            delete_instruction(func_ir, pending_insert, 1)
        config.coloring_attempted = true
        config.iteration_count    = max(alloc_state.bb_counter, 1)
        return true
    else:
        coloring_failed:
            cleanup_partial_assignment(func_ir)  // sub_18B8F50
            return false

Key Data Structures in the Coloring Core

Register map (FNV-1a hash table). The coloring core uses a chained hash table with FNV-1a hashing to map virtual register IDs to their assigned physical registers. The FNV-1a constants are visible in the decompiled code:

ConstantValueRole
FNV offset basis0x811C9DC5Initial hash value
FNV prime16777619 (0x01000193)Per-byte multiply
Hash finalizer multiplier637696617Post-fold multiplier for bucket index

The hash map starts at 8 buckets and doubles (to 4 * current) when the load factor (total entries / distinct entries) exceeds the bucket count. Each entry is a 24-byte node: {next_ptr: *Node, vreg_id: u32, phys_reg: i32, hash: u32}.

Register class table. A 16-entry array (4 bytes per entry) at config + 1112 tracks which register class slot each virtual register maps to. The initial marker value 0x6420 (25632 decimal) means "unassigned." Each entry packs 4 nibble-sized fields encoding the register's class, bank, and pair status.

Instruction opcodes dispatched in the main loop:

OpcodeMeaningColoring action
42, 53LABEL / BB_STARTIncrement BB counter, flush pending
55BB_ENDRecord BB use counts, validate iteration
91REG_USE_DEFCore assignment: hash lookup, coalesce, assign
97FUNC_CALLRecord callee, emit enter/exit register pair
183BARRIER / SPECIALClassify barrier operand type
288BRANCH / JUMPValidate branch target register, track BB edges

Reconstructed Pseudocode: Coalescing (sub_189BE00)

The coalescing function attempts to merge two virtual registers connected by a MOV/COPY instruction into the same physical register. It is called from the coloring core's case-91 handler when a register definition comes from a copy.

function try_coalesce(alloc_state, combined_offset, inst, class_table, operand_ptr) -> bool:
    // Only applies to register-class offsets 100 and 104 (R-reg 64-bit pairs)
    if (combined_offset - 100) & ~4 != 0:           // not 100 or 104
        if combined_offset == 928:                   // special: forced-spill marker
            alloc_state.coloring_ok = false
            return true
        return false

    // Extract source operand from the instruction
    src_operand = operand_ptr[0]
    slot_ptr    = &class_table[combined_offset >> 2]
    existing    = *slot_ptr

    if operand_type_tag(src_operand) != 1:           // not a register reference
        goto no_coalesce
    if (src_operand >> 63) & 1:                      // high bit set = precolored
        goto no_coalesce

    // Look up the source register's defining instruction
    definer = sym_lookup(func_ir, src_operand & 0xFFFFFF).defining_inst
    if not definer:
        goto no_coalesce

    // Check if definer is a simple MOV (opcode 201) with compatible operands
    if definer.opcode != 201:
        goto no_coalesce

    src2 = definer.operand[1]
    if operand_type_tag(src2) != 1 or (src2 >> 63) & 1:
        goto no_coalesce

    // Check that the MOV's second source is an immediate 0x29 (RZ/zero register)
    src3 = definer.operand[2]
    if operand_type_tag(src3) != 1 or (src3 >> 63) & 1 or (src3 & 0xFFFFFF) != 0x29:
        goto no_coalesce

    // Walk through to the original definition
    original_def = sym_lookup(func_ir, src2 & 0xFFFFFF).defining_inst
    if not original_def:
        goto no_coalesce
    if original_def.opcode == 130:                   // COPY -- follow one more level
        original_def = sym_lookup(func_ir, original_def.operand[1] & 0xFFFFFF).defining_inst
        if not original_def:
            goto no_coalesce

    // Must be opcode 10 (REG_DEF / register creation)
    if original_def.opcode != 10:
        goto no_coalesce

    // Compute the encoding for the coalesced register
    phys_encoding = compute_phys_encoding(func_ir, operand)
    // ... (validate encoding, check for conflicts in class_table, emit new mapping)
    return success

no_coalesce:
    return false

Reconstructed Pseudocode: Interference Graph Construction (sub_189E600)

The interference graph builder emits IR instructions that represent interference edges. Rather than building a traditional adjacency matrix, ptxas uses an instruction-based representation where interference is encoded as synthetic IR nodes.

function build_interference_graph(alloc_state):
    func_ir   = alloc_state.func_ir
    config    = func_ir.regalloc_config
    func_data = config.func_data

    if func_ir.state != 3:                           // not in regalloc phase
        return

    // Check preconditions: coloring must have been attempted or split mode active
    if func_data.flags[1097] < 0                     // signed check: high bit set = enabled
       and (config.coloring_attempted
            or (config.enable_split and config.enable_paired)):

        // Locate the BARRIER opcode index (opcode 27) in the opcode table
        barrier_opcode_idx = -1
        for idx in 0..num_opcodes:
            if opcode_table[idx].opcode == 27:
                barrier_opcode_idx = idx

        // Emit a 16-byte signature into the interference graph structure
        // (constant loaded from xmmword_24170C0..xmmword_24170F0)
        signature = load_128bit_constant(INTERFERENCE_GRAPH_MAGIC)
        func_data.interference_signature = signature

        // Create anchor instruction for the interference subgraph
        anchor = emit_instruction(func_ir, 18 /*ANCHOR*/, 48 /*flags*/, barrier_opcode_idx)

        // Emit initial interference edge from anchor
        phys0 = compute_phys_encoding(func_ir, 4 * signature_index - 2)
        phys1 = compute_phys_encoding(func_ir, 2)
        emit_ternary(func_ir, 110 /*INTERFERENCE_EDGE*/, 12, SENTINEL, anchor, phys1, phys0)

        // Emit secondary interference markers
        emit_address_computation(func_ir, 16, 0, func_data.field_792, ...)
        emit_complex_edge(func_ir, 183 /*BARRIER*/, 14, ...)

        // Walk existing instructions and add interference edges for BARRIER nodes
        for inst in instruction_stream_from(func_ir.current_bb):
            opcode = inst.opcode & 0xFFFFCFFF
            if opcode == 8:                          // INTERFERENCE_CANDIDATE
                // Extract operands based on addressing mode
                operand = inst.operand[inst.num_operands - modifier_offset - 5]
                type    = classify_operand(operand)
                if type in {48, 49}:                 // valid interference-eligible types
                    if inst has indirect flag:
                        // Handle indirect addressing: extract base+offset
                        emit complex interference edge with address computation
                    // Emit direct interference
                    emit_instruction(func_ir, 8 /*update*/, inst.field19, ...)
                    emit_complex_interference(func_ir, 274 /*BARRIER_PAIR*/, 14, ...)
                    // Emit register-to-register interference
                    phys = compute_phys_encoding(func_ir, 3841)
                    emit_ternary(func_ir, 21 /*INTERFERENCE_LINK*/, 11,
                                 inst.operand[0] & 0xFFFFFF,
                                 anchor | 0x10000000, phys, inst_ref)
                    delete_instruction(func_ir, inst, 1)

            elif opcode == 183:                      // BARRIER instruction
                // Extract barrier source and destination operands
                src = inst.operand[inst.num_operands - modifier_offset - 5]
                src_type = classify_operand(src)
                if src_type == 6:                    // valid barrier operand
                    // Similar interference edge emission as above
                    ...

Reconstructed Pseudocode: Spill Cost Computation (sub_18C5470, sub_18C5B30)

The actual spill cost computation lives in two functions, not in sub_189F300 (which is an operand encoder). These functions evaluate each virtual register as a spill candidate.

function compute_spill_weights_per_block(func_ir, block, reg_class, budget, target) -> void:
    // sub_18C5470: 5 parameters, iterates a block's live set
    config    = func_ir.regalloc_config
    func_data = config.func_data

    // Build the live-in and live-out sets for this block
    live_in  = compute_live_in(func_ir, block)
    live_out = compute_live_out(func_ir, block)

    // For each virtual register live across this block:
    for vreg in (live_in UNION live_out):
        weight = 0
        // Count definitions in this block
        for inst in block.instructions:
            if inst defines vreg:
                weight += DEF_WEIGHT                 // base cost of a spill store
            if inst uses vreg:
                weight += USE_WEIGHT                 // base cost of a spill load

        // Multiply by loop nesting depth
        // Inner loops contribute multiplicatively: weight *= 10^depth (approx)
        loop_depth = block.loop_nesting_depth
        weight *= LOOP_WEIGHT_MULTIPLIER ^ loop_depth

        // Check rematerialization eligibility
        if vreg.defining_inst is rematerializable:
            weight *= REMAT_DISCOUNT                 // typically 0.1x -- very cheap to recompute

        // Record in the per-register spill weight table
        spill_weight_table[vreg] = weight


function compute_spill_benefit(func_ir, block, candidate_set) -> vreg:
    // sub_18C5B30: returns the best spill candidate from candidate_set
    best_vreg   = NULL
    best_ratio  = +infinity                          // lower is better (cost / benefit)

    for vreg in candidate_set:
        cost    = spill_weight_table[vreg]           // from compute_spill_weights_per_block
        benefit = vreg.live_range_length             // how many interference edges it removes
        ratio   = cost / benefit

        // Prefer registers with:
        //   1. Long live ranges (high benefit)
        //   2. Low use frequency (low cost)
        //   3. Outside inner loops (low cost)
        //   4. Rematerializable (low cost)
        if ratio < best_ratio:
            best_ratio = ratio
            best_vreg  = vreg

    return best_vreg

Reconstructed Pseudocode: Iterative Spill-Retry Loop

The outermost allocation loop ties together coloring and spilling. This is the high-level Chaitin-Briggs flow as implemented by the interaction between AllocateRegisters_main_driver (0x18988D0), the graph coloring core, and the iterative spill functions.

function allocate_registers_main(func_ir, target_reg_count):
    // ---- Stage 1: Classify and prepare ----
    classify_register_classes(func_ir)               // sub_189B2D0
    compute_live_ranges(func_ir)                     // sub_18A0DB0

    // ---- Stage 2: Attempt no-spill allocation ----
    success = allocate_nospill(func_ir, target_reg_count)  // sub_18B3AE0
    if success:
        goto encode_and_verify

    // ---- Stage 3: Chaitin-Briggs iterative loop ----
    for iteration in 1..MAX_ITERATIONS:
        // Build interference graph
        build_interference_graph(func_ir)            // sub_189E600

        // Attempt graph coloring
        colored = graph_coloring_core(func_ir)       // sub_189C3E0

        if colored:
            // Coloring succeeded: check register count fits budget
            if actual_reg_count <= target_reg_count:
                goto encode_and_verify
            else:
                // Over budget: select spill candidates and retry
                compute_spill_weights(func_ir)       // sub_18C5470 per block
                victim = select_spill_victim(func_ir) // sub_18C5B30
                insert_spill_code(func_ir, victim)   // sub_18AD450 (STL at defs)
                insert_refill_code(func_ir, victim)  // sub_18BC670 (LDL at uses)
                assign_spill_slot(func_ir, victim)   // sub_18A6E40
                // Loop continues: rebuild interference graph with new spill code
        else:
            // Coloring failed: too many high-degree nodes
            // Select the highest-degree uncolorable node as spill candidate
            compute_spill_weights(func_ir)
            victim = select_spill_victim(func_ir)
            insert_spill_code(func_ir, victim)
            insert_refill_code(func_ir, victim)
            assign_spill_slot(func_ir, victim)
            // Loop continues

    // Spill optimization: clean up redundant spill/refill pairs
    optimize_spill_code(func_ir)                     // sub_18CCB10
    coalesce_spill_stores(func_ir)                   // sub_18CD130

    encode_and_verify:
        // ---- Stage 4: Physical register encoding ----
        encode_all_instructions(func_ir)             // sub_18EF990
        per_instruction_encode(func_ir)              // sub_18AE2D0

        // ---- Stage 5: Verification ----
        full_verification(func_ir)                   // sub_19D6730

        // ---- Stage 6: Report resource usage ----
        report_resource_usage(func_ir)               // sub_140A6B0

Spilling and No-Spill Regalloc

The allocator supports two distinct modes: a "no-spill" attempt that tries to fit everything into the register budget without any spill code, and a full spilling mode that iteratively inserts spill/refill instructions until allocation succeeds.

No-Spill Pass

AllocateRegisters_nospill_pass at sub_18B3AE0 (46 KB, 1,525 lines) attempts allocation without introducing any spill code. String references include "Smem spilling..." and "Register allocation failed...", indicating that this pass reports its success or failure. The no-spill result is reported via regalloc_nospill_report at sub_18F9330 (9.3 KB), which emits "NOSPILL REGALLOC: attemp" (sic -- the truncated string is verbatim from the binary).

The no-spill pass is attempted first as an optimistic strategy. If it succeeds, the function uses the minimum possible register count with zero spill overhead. If it fails, the full spilling pipeline takes over.

Full Spilling Pipeline

The full spilling pipeline involves several interconnected functions:

FunctionAddressSizeRole
AllocateRegisters_spill_pass_full0x18DB26055 KBFull allocation with spill insertion
AllocateRegisters_iterative_spill0x18DEA0050 KBIterative spill/re-allocate loop (self-recursive)
AllocateRegisters_spill_iteration_loop0x18EB9D048 KBInner spill iteration, 44 callees
AllocateRegisters_full_with_spill_retry0x18F51E084 KBFull allocation with retry on failure
regalloc_spilling_pass_driver0x18C748025 KBPer-function spill pass coordination
regalloc_spilling_pass_per_block0x18C69D017 KBPer-basic-block spill insertion
regalloc_spilling_regalloc_driver0x18DD9D026 KBReports "-CLASS SPILLING REGALLOC ("

The spilling flow:

  1. Candidate selection -- regalloc_compute_spill_weights_per_block (0x18C5470) and regalloc_compute_spill_benefit (0x18C5B30) evaluate each virtual register as a spill candidate. Registers with long live ranges, low use frequency, and placement outside inner loops are preferred.

  2. Spill code generation -- regalloc_insert_spill_code (0x18AD450) inserts STL (store-local) instructions at each definition of the spilled register, and regalloc_emit_spill_load (0x18BC670) inserts LDL (load-local) instructions before each use. Spill stores and loads target local memory (stack frame).

  3. Spill slot assignment -- regalloc_compute_spill_slot (0x18A6E40) assigns frame offsets for spilled registers. regalloc_compute_spill_offset (0x18ADE70) computes the final byte offset within the stack frame.

  4. Iteration -- if the allocation still fails after spilling (the register budget is exceeded even with spill code), the loop tries again with additional candidates. The iterative nature is reflected in the self-recursive call in AllocateRegisters_iterative_spill.

  5. Spill optimization -- regalloc_optimize_spill_code (0x18CCB10, 9 KB) and regalloc_coalesce_spill_stores (0x18CD130) clean up redundant spill/refill sequences after the iterative loop converges.

On failure, the allocator emits: "Register allocation failed with register count of '%d'...".

Rematerialization

regalloc_handle_rematerialization at 0x18BE000 (4.9 KB) identifies instructions that can be cheaply recomputed rather than spilled. Constant loads, address computations, and other low-cost operations are marked as rematerializable -- the allocator regenerates them at each use site instead of inserting a spill/refill pair. This is verified post-hoc by the rematerialization verification pass (see Post-Allocation Verification).

SMEM Spilling

NVIDIA's shared-memory spilling is a proprietary optimization that uses on-chip shared memory (smem) as a spill target instead of local memory (which is backed by the L1/L2 cache hierarchy and ultimately DRAM). Shared memory has deterministic latency (typically 20-30 cycles vs 200+ cycles for an L2 miss), making it significantly faster for spill/refill patterns with high access frequency.

The ROT13 Knob

The feature is controlled by the knob enable_smem_spilling, which is stored in the binary as the ROT13-encoded string ranoyr_fzrz_fcvyyvat. This obfuscation is consistent with NVIDIA's practice of encoding internal knob names via ROT13 to discourage end-user tampering (see the ROT13 decoder at sub_1A40AC0). The knob is read via the standard knob-value reader functions sub_166B370/sub_166B340.

SMEM Spill Infrastructure

The SMEM spilling subsystem spans approximately 15 functions:

FunctionAddressSizeRole
regalloc_smem_spilling_driver0x18C879020 KBTop-level SMEM spill driver
regalloc_smem_spill_eligibility0x18D1FF010 KBChecks ABI eligibility
regalloc_smem_spill_driver_per_function0x18D2AA015 KBPer-function SMEM spill
regalloc_smem_spill_transform0x18D3C109 KBTransforms spill code for SMEM
regalloc_smem_compute_offsets0x18C9C805 KBComputes SMEM offsets
regalloc_smem_compute_slot_size0x18D36904 KBSlot size calculation
regalloc_smem_allocate_slot0x18D9CD04 KBSlot allocation
regalloc_smem_free_slot0x18DA3803.5 KBSlot deallocation
regalloc_smem_emit_load0x18DABC04 KBEmits SMEM load (refill)
regalloc_perform_smem_spill0x18CA9F013 KBPerforms individual spill
regalloc_smem_insert_barriers0x18D54508 KBInserts memory barriers
regalloc_smem_fixup_addressing0x18D79A06 KBFixes up SMEM addresses
regalloc_smem_handle_aliasing0x18D92208 KBAlias analysis for SMEM
regalloc_smem_compute_base_address0x18D8F504 KBBase address computation
regalloc_smem_validate0x18D7FD04 KBValidates SMEM spill

ABI Restriction

The eligibility checker at 0x18D1FF0 enforces a critical constraint: SMEM spilling is not permitted when the function uses ABI calls (device function calls with standard calling conventions). The diagnostic string is explicit:

"Smem spilling should not be enabled when functions use abi."

This restriction exists because SMEM spilling reserves a portion of shared memory for spill slots. If a called function also uses shared memory (either for its own spill slots or for user-declared __shared__ variables), the spill region could overlap with the callee's SMEM usage. Since the ABI does not standardize shared memory layout across call boundaries, SMEM spilling is disabled for any function that makes non-inlined device calls.

The reserved SMEM region is referenced by the symbol __nv_reservedSMEM_offset_0_alias, which appears in several resource-reporting and codegen functions.

SMEM Spill Flow

When SMEM spilling is eligible:

  1. regalloc_smem_compute_base_address -- computes the base address within shared memory for spill slots, after all user-declared __shared__ variables
  2. regalloc_smem_compute_slot_size -- determines the size of each spill slot (typically 4 bytes per 32-bit register)
  3. regalloc_smem_allocate_slot -- assigns SMEM offsets to spilled registers
  4. regalloc_smem_spill_transform -- rewrites STL/LDL spill instructions to STS/LDS (shared-memory store/load)
  5. regalloc_smem_insert_barriers -- inserts memory barriers (LDGDEPBAR or similar) to ensure SMEM writes are visible before subsequent reads
  6. regalloc_smem_fixup_addressing -- adjusts addressing modes for SMEM spill slots

ABI Call Register Pressure Analysis

When a function contains device calls (ABI calls), the register allocator must account for callee-clobbered registers and the caller-save/callee-save contract. The ABI pressure analysis subsystem reports per-register-class pressure at each call site.

Key Functions

FunctionAddressSizeRole
regalloc_ABI_call_pressure_report0x18D5EC015 KBReports "-CLASS ABI CALL PRESSURE for func"
AllocateRegisters_pressure_analysis0x18CE9F029 KBPer-function pressure computation
regalloc_report_ABI_pressure_per_class0x18E0C7012 KBPer-class pressure breakdown
regalloc_compute_ABI_pressure0x18E15D08 KBComputes actual pressure values
regalloc_handle_function_call_pressure0x18E50807 KBPressure at call sites
regalloc_adjust_for_ABI_call0x18E4E004 KBAdjusts budget for ABI calls
regalloc_check_ABI_constraints0x18A97004 KBValidates ABI register constraints

Pressure Reporting Format

The ABI pressure report uses the format string:

"-CLASS ABI CALL PRESSURE for func" ... " at line " ... " regs\n"

This is emitted per register class (R-class, UR-class, predicate-class) for each function that contains ABI calls. The report identifies the call site with the highest register pressure, which is the bottleneck for register budget allocation.

The caller-save convention on NVIDIA GPUs requires the caller to save all live registers across a call. The more live registers at a call site, the more spill code is needed, so the ABI pressure metric directly predicts spill cost at call boundaries.

CTA Reconfiguration (setmaxnreg) for Blackwell+

Starting with Blackwell (SM100), NVIDIA introduced the setmaxnreg instruction that allows a kernel to dynamically adjust its register budget at runtime. This enables a programming pattern where different phases of a kernel use different register counts, trading register capacity for thread occupancy within a single kernel launch.

Concept

In traditional CUDA compilation, each kernel has a fixed register count determined at compile time. The hardware allocates ceil(regs/granularity) * granularity registers per thread, which determines the maximum number of concurrent CTAs (thread blocks) per SM. A kernel with 128 registers per thread can run fewer CTAs than one with 32 registers.

setmaxnreg allows a kernel to:

  • Allocate registers: setmaxnreg.inc or setmaxnreg.alloc increases the register budget, blocking until the hardware can provide the requested registers
  • Deallocate registers: setmaxnreg.dec or setmaxnreg.dealloc/setmaxnreg.release releases registers back to the hardware, potentially allowing more CTAs to launch

setmaxnreg Infrastructure

The setmaxnreg subsystem spans 0x18FB000--0x191A000 (~55 functions):

FunctionAddressSizeRole
setmaxnreg_top_level_driver0x1912A0033 KBTop-level driver
CTA_reconfig_master_driver0x190E08024 KBCTA reconfig orchestration
setmaxnreg_enforcement_driver0x18FC09022 KBEnforces register constraints
setmaxnreg_full_transform0x19149A019 KBFull transformation pass
CTA_reconfig_insert_setmaxnreg_instructions0x190FDA018 KBInstruction insertion
CTA_reconfig_full_analysis0x190C99017 KBFull analysis pass
setmaxnreg_driver_per_module0x1904F0016 KBPer-module driver
CTA_reconfig_insert_alloc_dealloc0x19082A014 KBInsert alloc/dealloc pairs
CTA_reconfig_validate_pragmas0x1906DE07 KBPragma validation
setmaxnreg_emit_all_warnings0x190650014 KBWarning emission

Pragma Validation

The CTA reconfig system validates pragma annotations placed in PTX source code. CTA_reconfig_validate_pragmas (0x1906DE0) checks for:

  • "Conflicting CTA Reconfig pragmas..." -- multiple incompatible pragmas on the same function
  • "Found an 'alloc' pragma after 'dealloc'" -- incorrect ordering
  • "Found a 'dealloc' pragma after 'alloc'" -- incorrect ordering

Pragmas must follow a strict alloc-before-dealloc ordering within a function's control flow.

Register Budget Computation

setmaxnreg_compute_register_budget_A and _B (0x18FB430, 0x18FBA60, each 9 KB) compute the allowed register range for each code region. The _B variant emits the diagnostic:

"setmaxnreg ignored; unable to determine register count at entry"

when the entry register count cannot be statically determined (e.g., due to indirect calls or complex control flow).

Instruction Generation

Several functions emit the actual SASS SETMAXNREG instructions:

  • setmaxnreg_emit_alloc_instruction (0x18FF510) -- emits setmaxnreg.alloc/.inc
  • setmaxnreg_emit_dealloc_instruction (0x18FFA80) -- emits setmaxnreg.dealloc/.release/.dec
  • setmaxnreg_emit_reconfig_code (0x1902670) -- emits surrounding reconfig glue

The register count in setmaxnreg.dec instructions is validated:

"setmaxnreg.dec has register count..."
"setmaxreg.dealloc/release has register count..."

Minimum Register Requirements

setmaxnreg_check_minimum_requirements (0x18FE630) enforces a floor on the register count:

"setmaxnreg ignored to maintain minimum register requirements"

Certain register counts are too low for the function to execute correctly (e.g., if the function needs at least N registers for its innermost loop). The compiler silently ignores the setmaxnreg pragma rather than generating incorrect code.

Debugging Interaction

setmaxnreg_emit_all_warnings (0x1906500) includes:

"setmaxnreg ignored to allow debugging"

When device-debugging is enabled (-G flag), setmaxnreg is disabled because debug information requires consistent register layouts across the entire function.

Compatibility with Extern Calls

setmaxnreg_handle_extern_calls (0x1902BD0, 10 KB) handles the interaction between dynamic register budgets and external function calls. When a function calls an external symbol whose register requirements are unknown at compile time, the compiler must ensure the register budget at the call site is sufficient for the callee's worst-case requirements.

Multi-Class Register Allocation

The NVIDIA GPU register file is divided into three distinct classes, each allocated independently but subject to a shared per-thread register budget:

R-Registers (General-Purpose)

  • Range: R0--R255 (SM75+: up to 255 general registers per thread)
  • Width: 32 bits each; 64-bit values use even-odd pairs (e.g., R4:R5)
  • Usage: ALU operands, memory addresses, intermediate values
  • Allocation: The primary target of graph coloring; most of the 120 regalloc functions handle R-regs
  • Special registers: R255 = RZ (zero register, hardcoded)
  • Pair constraints: regalloc_handle_register_pairs (0x18C3300) and register_pair_allocator (0x1ABBCC0) enforce even-register alignment for 64-bit pairs
  • Bank conflicts: register_bank_conflict_resolver (0x1ABA8E0) resolves bank conflicts -- the register file is organized into banks (configurable, typically 4 or 8), and simultaneous reads from the same bank cause stalls

UR-Registers (Uniform)

  • Range: UR0--UR63 (SM75+ Turing and later)
  • Width: 32 bits each
  • Usage: Values that are uniform across all threads in a warp (e.g., loop bounds, base addresses). Reading a UR-reg does not consume an R-reg read port
  • Allocation: uniform_register_allocator (0x1AB93C0, 13 KB) handles UR-reg allocation separately from R-regs
  • Initialization: regalloc_init_uniform_state (0x18B63E0) and regalloc_handle_uniform_registers (0x18B5A90)
  • Verification: verify_uniform_register_usage (0x19CC6D0) checks that uniform registers were not used when disallowed, emitting "Uniform registers were disallowed, but the compiler required..."

Predicate Registers (P)

  • Range: P0--P6 (7 predicate registers per thread)
  • Width: 1 bit each (boolean)
  • Usage: Conditional execution (predicated instructions), branch conditions
  • Allocation: regalloc_handle_predicate_registers (0x18B67B0, 6 KB)
  • Special: P7 = PT (always-true predicate, hardcoded); encoding value 31 = PT
  • Spilling: Predicate spilling uses the P2R (predicate-to-register) and R2P (register-to-predicate) instructions, packing multiple predicates into a single R-register. The verification pass at 0x19DE150 checks "Failed to establish match for P2R-R2P pattern..."

Multi-Class Driver

regalloc_multi_class_driver at 0x18D69F0 (15 KB, 10 vtable calls, 16 callees) orchestrates allocation across all three classes. The driver iterates: allocate R-regs, allocate UR-regs, allocate predicates, check combined budget, and retry if the combined allocation exceeds the target register count.

Register Budget Negotiation

The budget negotiation system determines the final register count per function, balancing multiple competing constraints:

  1. --maxrregcount -- user-specified maximum register count (CLI option)
  2. __launch_bounds__ -- PTX-level annotation specifying (maxThreadsPerBlock, minBlocksPerMultiprocessor)
  3. setmaxnreg pragmas -- Blackwell+ dynamic register budget
  4. ABI requirements -- minimum registers needed for calling convention
  5. Occupancy targets -- higher register counts reduce occupancy
  6. Spill cost -- lower register counts increase spill overhead

AllocateRegisters_negotiate_budget at 0x18E3530 (33 KB, self-recursive) runs the negotiation loop:

  1. Start with the target register count (from --maxrregcount or occupancy heuristic)
  2. Attempt allocation with that budget via AllocateRegisters_with_target_count (0x18E1BF0, 46 KB)
  3. If allocation fails, increase the budget and retry
  4. If allocation succeeds but exceeds launch-bounds constraints, reduce the budget and retry with more aggressive spilling
  5. Iterate until a feasible allocation is found or the maximum iteration count is reached

The function regalloc_compute_register_budget at 0x18C28E0 (6 KB) computes the initial budget. regalloc_check_budget_feasibility (0x18D4F90) validates that a proposed budget is achievable. regalloc_attempt_with_budget (0x18D4350) makes a single allocation attempt with a fixed budget.

Launch Bounds Compatibility

Launch bounds set a hard floor on occupancy, which translates to a hard ceiling on register count. If __launch_bounds__(256, 2) specifies at least 2 blocks per SM with 256 threads each, and the SM has 65,536 registers, the maximum per-thread register count is 65536 / (256 * 2) = 128. The budget negotiation system respects this ceiling, preferring to spill rather than violate the launch bounds.

Per-Instruction Encoding

After allocation, every instruction in the function must be rewritten to replace virtual registers with physical register numbers. Two massive functions handle this:

  • AllocateRegisters_per_instruction_encode (0x18AE2D0, 155 KB, 4,005 lines, 61 callees, 18 vtable calls) -- processes each instruction individually, substituting virtual-to-physical register mappings into operand fields. Handles special cases: tied operands, implicit definitions, constant operands, register pairs, and CTA-reconfig pragmas.

  • AllocateRegisters_encode_all_instructions (0x18EF990, 108 KB, 3,724 lines, 78 callees) -- the second-largest function in the region. Iterates all functions and all basic blocks, calling the per-instruction encoder. Handles error recovery: on "Internal compiler error.", attempts fallback encoding. References setmaxnreg and no-spill diagnostics.

Supporting functions:

FunctionAddressSizeRole
regalloc_assign_physical_register0x18A65E03 KBAssigns a physical register
regalloc_apply_register_assignment0x18BF0E05 KBApplies assignment to instruction
regalloc_compute_operand_encoding0x18A199030 KBComputes operand encoding (self-recursive)
regalloc_build_register_map0x18A7C4025 KBVirtual-to-physical register map
regalloc_fixup_phi_registers0x18C0CD06 KBFixes up phi-node register assignments
regalloc_resolve_phi_copies0x18C395016 KBResolves phi copies to register moves
regalloc_handle_tied_operands0x18AB9C07.5 KBHandles operands tied to same physical register
regalloc_handle_partial_writes0x18C1A905 KBHandles partial-width writes

Post-Allocation Verification

The post-allocation verification system at 0x19D0000--0x1A00000 is one of the most thorough compiler verification subsystems discovered in the binary. It spans approximately 120 functions and validates that register allocation did not introduce correctness bugs.

Top-Level Verification

FunctionAddressSizeEvidence
AllocateRegisters_full_verification0x19D673076 KB"TOTAL MISMATCH", "POTENTIAL PROBLEM", "BENIGN (explainable)"
AllocateRegisters_post_verify_driver0x19E12A049 KB"TOTAL MISMATCH %d MISMATCH ON OLD %d\n"
verify_regalloc_spill_correctness0x19D34D049 KBSpill/refill pattern verification
verify_rematerialization_correctness0x19D52F029 KB"REMATERIALIZATION PROBLEM..."
verify_post_regalloc_reaching_defs0x19D1D4035 KBReaching-definition analysis
verify_register_allocation_correctness0x198A35044 KBTop-level allocation validation

Verification Strategy

The verifier works by computing reaching definitions before and after register allocation, then comparing them. Any mismatch indicates a potential bug:

  1. Reaching-definition snapshot -- before register allocation, the verifier records which definitions reach each use point.
  2. Post-allocation comparison -- after allocation, it recomputes reaching definitions on the physical-register code and compares against the pre-allocation snapshot.
  3. Mismatch classification:
    • "BENIGN (explainable)" -- the mismatch is explained by a known transformation (spill/refill, rematerialization, P2R/R2P predicate packing)
    • "POTENTIAL PROBLEM" -- the mismatch cannot be explained and may indicate a compiler bug
    • "TOTAL MISMATCH" -- complete failure to match any definitions

Specific Pattern Checks

The verifier checks several specific correctness patterns:

  • Bit-spill-refill (0x19DDC90) -- validates that spill stores are correctly paired with refill loads: "Failed to establish match for bit-spill-refill pattern..."
  • P2R-R2P (0x19DE150) -- validates predicate packing/unpacking correctness: "Failed to establish match for P2R-R2P pattern..."
  • Upper-bit clobbering (0x19DE510) -- checks that 32-bit writes to a register do not inadvertently clobber the upper 32 bits when the register is used as part of a 64-bit pair: "Some instruction(s) are destroying the base of..."
  • Rematerialization (0x19DBE60) -- verifies that rematerialized values match the original computation: "REMATERIALIZATION PROBLEM. New Instruction..."

Debug Knob

The verification subsystem is activated by the knob -knob DUMPIR=AllocateRegisters. When enabled, detailed diagnostics are emitted:

"Please use -knob DUMPIR=AllocateRegisters for debugging"
"This def [%d] represents uninitialized value..."

The memcheck knob enables additional checking granularity. The "Found %d potentially uninitialized register(s) in function %s" diagnostic at 0x1984920 reports uninitialized-register usage detected during verification.

Resource Usage Reporting

After register allocation and codegen complete, the resource usage reporter at sub_140A6B0 (5.5 KB) emits a comprehensive summary of the function's resource consumption. This is the output visible when compiling with --resource-usage or verbose mode.

Reported Resources

ResourceFormat StringDescription
Registers"Used %d registers"Physical R-registers allocated
Barriers", used %d barriers"Hardware barriers consumed
Global memory"%lld bytes gmem"Global memory accessed
Constant memory", %lld bytes cmem[%d]"Per-bank constant memory (18 banks, indices 0--17)
Shared memory", %lld bytes smem"Shared memory allocated (including spill if SMEM spilling active)
Local memory", %lld bytes lmem"Stack frame size (spill slots + local variables)
Stack size", %d bytes cumulative stack size"Cumulative stack including callees (verbose mode)
Textures", %d textures"Texture references bound
Samplers", %d samplers"Sampler objects bound
Surfaces", %d surfaces"Surface references bound
Compile time"Compile time = %.3f ms"Per-function compilation time

The resource counts are retrieved via dedicated accessor functions:

  • sub_43CAA0 -- register count
  • sub_43CBC0 -- barrier count
  • sub_43CD80 -- stack size
  • sub_43C680 -- shared memory size
  • sub_43C780 -- local memory size

Extended Metrics Header

The metrics emission system at 0x19A1B30 (36 KB) writes a richer statistics header as PTX comments. This is the # N instructions, M R-regs comment visible in PTX output:

# %d instructions, %d R-regs
# [inst=%d] [texInst=%d] [tepid=%d] [rregs=%d] [urregs=%d] [_lat2inst=%.1f]
# [est latency = %d] [LSpillB=%d] [LRefillB=%d]...
# [est adu=%d] [est alu=%d] [est cbu=%d] [est fma2x=%d]...
# [issue thru=%f] [adu thru=%f] [alu thru=%f]...
# [Occupancy = %f]
# [FP16 inst=%d]

The metrics include:

  • Instruction count and register count
  • Texture instruction count, estimated latency
  • Spill bytes (LSpillB) and refill bytes (LRefillB)
  • Per-unit instruction estimates: adu, alu, cbu, fma2x, fma, half, transcendental, ipa, lsu, redux, schedDisp, tex, ttu, udp
  • Tensor core instruction counts: imma, hmma, dmma, bmma
  • Throughput estimates per functional unit
  • Occupancy estimate
  • FP16 instruction count
  • Loop analysis statistics
  • SharedMem allocation throughput

REGALLOC GUIDANCE Output

The final reporting pass at 0x18F9A60 (AllocateRegisters_final_reporting, 30 KB) emits structured guidance data:

REGALLOC GUIDANCE:
ALLOCATION: ...

This includes the final register count, whether spilling occurred, the spill strategy used (local vs SMEM), and any setmaxnreg interactions. The companion SCHEDULING GUIDANCE: output (at 0x19C1A70) provides scheduling-related metrics.

Function Address Summary

The complete register allocation subsystem, listed by pipeline stage:

Core Regalloc (0x189C000--0x18FC000)

AddressSizeFunction
0x18988D071 KBAllocateRegisters_main_driver
0x189B2D04 KBregalloc_classify_register_class
0x189B6B03 KBregalloc_allocate_physical_register
0x189BB303 KBregalloc_check_register_conflict
0x189BE007.5 KBregalloc_try_coalesce
0x189C3E048 KBregalloc_graph_coloring_core (self-recursive)
0x189E60011.5 KBregalloc_build_interference_graph
0x189F30038 KBregalloc_operand_field_encoder -- 250-case operand attribute packer
0x18A0DB013.5 KBregalloc_compute_live_ranges
0x18A199030 KBregalloc_compute_operand_encoding (self-recursive)
0x18A65E03 KBregalloc_assign_physical_register
0x18A68608 KBregalloc_handle_live_range_split
0x18A7C4025 KBregalloc_build_register_map
0x18AE2D0155 KBAllocateRegisters_per_instruction_encode
0x18B3AE046 KBAllocateRegisters_nospill_pass
0x18C28E06 KBregalloc_compute_register_budget
0x18C748025 KBregalloc_spilling_pass_driver
0x18C879020 KBregalloc_smem_spilling_driver
0x18CE9F029 KBAllocateRegisters_pressure_analysis
0x18D5EC015 KBregalloc_ABI_call_pressure_report
0x18D69F015 KBregalloc_multi_class_driver
0x18DB26055 KBAllocateRegisters_spill_pass_full
0x18DD9D026 KBregalloc_spilling_regalloc_driver
0x18DEA0050 KBAllocateRegisters_iterative_spill (self-recursive)
0x18E1BF046 KBAllocateRegisters_with_target_count
0x18E353033 KBAllocateRegisters_negotiate_budget (self-recursive)
0x18E54B075 KBAllocateRegisters_full_pipeline
0x18EB9D048 KBAllocateRegisters_spill_iteration_loop
0x18EF990108 KBAllocateRegisters_encode_all_instructions
0x18F51E084 KBAllocateRegisters_full_with_spill_retry
0x18F9A6030 KBAllocateRegisters_final_reporting

setmaxnreg / CTA Reconfig (0x18FC000--0x191A000)

AddressSizeFunction
0x18FB4309 KBsetmaxnreg_compute_register_budget_A
0x18FBA609 KBsetmaxnreg_compute_register_budget_B
0x18FC09022 KBsetmaxnreg_enforcement_driver
0x18FD92011 KBsetmaxnreg_check_compatibility
0x1904F0016 KBsetmaxnreg_driver_per_module
0x190650014 KBsetmaxnreg_emit_all_warnings
0x1906DE07 KBCTA_reconfig_validate_pragmas
0x19082A014 KBCTA_reconfig_insert_alloc_dealloc
0x190C99017 KBCTA_reconfig_full_analysis
0x190E08024 KBCTA_reconfig_master_driver
0x1912A0033 KBsetmaxnreg_top_level_driver
0x19149A019 KBsetmaxnreg_full_transform

Second-Pass Regalloc (0x1AAA960--0x1AD5B00, from p1.19)

AddressSizeFunction
0x1AAA960118 KBregister_allocation_pass (ABI-aware, "max_abi_regs")
0x1AB341015 KBregister_spill_manager
0x1AB59F021 KBregister_interference_graph_builder
0x1AB679014 KBregister_coalescing_pass
0x1AB7E5019 KBregister_coloring_pass
0x1AB93C013 KBuniform_register_allocator
0x1ABC36023 KBregister_liveness_analysis

Verification (0x19D0000--0x1A00000)

AddressSizeFunction
0x198492030 KBverify_uninitialized_registers_per_function
0x198A35044 KBverify_register_allocation_correctness
0x19D1D4035 KBverify_post_regalloc_reaching_defs
0x19D34D049 KBverify_regalloc_spill_correctness
0x19D52F029 KBverify_rematerialization_correctness
0x19D673076 KBAllocateRegisters_full_verification
0x19E12A049 KBAllocateRegisters_post_verify_driver
0x19FA01022 KBverify_full_pass_driver
0x19FC1105 KBverify_post_regalloc_driver

Cross-References

Sibling Wikis