Allocator Architecture

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

The ptxas register allocator is a fat-point greedy allocator, not a graph-coloring allocator. There is no interference graph, no Chaitin-Briggs simplify-select-spill loop, and no graph coloring in the main allocation path. Instead, the allocator maintains per-physical-register pressure histograms (512-DWORD arrays) and greedily assigns each virtual register to the physical slot with the lowest interference count. This design trades theoretical optimality for speed on the very large register files of NVIDIA GPUs (up to 255 GPRs per thread).

A secondary live-range-based infrastructure (~80 functions at 0x994000--0x9A1000) supports coalescing, splitting, and pre-coloring but feeds results into the fat-point allocator rather than replacing it.

| Item | Value |
|---|---|
| Entry point | sub_9721C0 (1086 lines) |
| Per-class driver | sub_971A90 (355 lines) -- NOSPILL then SPILL retry |
| Core allocator | sub_957160 (1658 lines) -- fat-point coloring engine |
| Assignment | sub_94FDD0 (155 lines) -- write physical reg, propagate aliases |
| Spill guidance | sub_96D940 (2983 lines) -- per-class priority queues |
| Spill codegen | sub_94F150 (561 lines) -- emit spill/reload instructions |
| Pre-coloring | sub_991790 (2677 lines) -- full-function pre-assignment |
| Address range | 0x8FE000 -- 0x9D3000 (~860 KB, ~950 functions) |
| Knobs | 87 OCG knobs (RegAlloc* / RegTgt* / RegUsageLevel, indices 613--699) |

Pipeline Position

The register allocator runs in the late pipeline, after all optimization passes and instruction scheduling preparation, but before final SASS encoding:

... optimization passes ...
  Late Legalization / Expansion
  AdvancedPhaseAllocReg gate         <-- pipeline entry guard
  HoistInvariants                    <-- sub_8FFDE0 (optional)
  ConvertMemoryToRegisterOrUniform   <-- sub_910840
  Pre-coloring                       <-- sub_991790
  Instruction lowering               <-- sub_98F430 / sub_98B160
  Register allocation entry          <-- sub_9721C0
    Per-class allocation x 6         <-- sub_971A90 for classes 1..6
      Core fat-point allocator       <-- sub_957160
  Post-allocation fixup
  Instruction scheduling
  SASS encoding

Register Classes

The allocator processes 7 register classes. Class 0 (unified) is skipped in the normal per-class loop; it is used for cross-class constraint propagation. Classes 1--6 are allocated independently in order:

| ID | Name | Width | HW Limit | Description |
|---|---|---|---|---|
| 0 | -- | -- | -- | Unified / cross-class (skipped in main loop) |
| 1 | R | 32-bit | 255 | General-purpose registers (R0--R254) |
| 2 | R (alt) | 32-bit | 255 | GPR variant (RZ sentinel, stat collector alternate) |
| 3 | UR | 32-bit | 63 | Uniform general-purpose registers (UR0--UR62) |
| 4 | UR (ext) | 32-bit | 63 | Uniform GPR variant (extended uniform) |
| 5 | P / UP | 1-bit | 7 | Predicate registers (P0--P6, UP0--UP6) |
| 6 | Tensor/Acc | 32-bit | varies | Tensor/accumulator registers (MMA/WGMMA) |

Barrier registers (B, UB) have reg_type = 9, which is above the allocator's <= 6 cutoff, so they are handled by a separate mechanism.

Special registers that are always skipped during allocation:

  • Indices 41--44: PT, P0--P3 (architectural predicates)
  • Index 39: special register

The class ID is the reg_type value at vreg+64. The allocator distribution loop in sub_9721C0 reads this field directly and uses it as the bucket index.

Pair modes (vreg+48, bits 20--21): 0 = single, 1 = lo-half of pair, 3 = double-width (consumes two physical slots).

Entry Point: sub_9721C0

The top-level register allocation driver (1086 lines). Called once per function after the AdvancedPhaseAllocReg pipeline gate.

function regalloc_entry(alloc_state, compilation_ctx):
    // 1. Rebuild liveness
    rebuild_basic_blocks(compilation_ctx, 1)          // sub_781F80
    compute_liveness(compilation_ctx, 1)              // sub_A10160

    // 2. Initialize per-class register file state (classes 1..6)
    for class_id in 1..6:
        vtable[896](alloc_state, class_id)            // init register file state

    // 3. Sort instructions by priority
    sort_instructions_by_priority(alloc_state)        // sub_9375C0

    // 4. Distribute vregs into per-class linked lists
    for each vreg in function:
        class = vreg.register_class
        append(class_lists[class], vreg)

    debug("\nREGALLOC GUIDANCE:\n")

    // 5. Allocate each class independently
    for class_id in 1..6:
        alloc_with_spill_retry(                       // sub_971A90
            alloc_state, compilation_ctx, class_id)

    // 6. Post-allocation fixup
    fix_load_opcode_187(alloc_state)
    fix_call_saved_registers(alloc_state)

    // 7. Handle OptixIR mode (ctx+896 == 4 or 5)
    if is_optix_ir(compilation_ctx):
        record_register_counts(compilation_ctx)

The entry point calls sub_789280 when a pre-allocation fixup bit (flag bit 2) is set, handles live-through-call register counting at lines 343--352, and sets up rematerialization lists at alloc_state[161..175].
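
Step 4 of the pseudocode above (bucketing vregs by class) can be sketched in Python. This is a minimal illustration, not the decompiled logic: the `VReg` class and its field names are hypothetical stand-ins, and applying the `<= 6` cutoff inside the distribution loop is an assumption based on the note about barrier registers.

```python
from collections import defaultdict
from dataclasses import dataclass

MAX_ALLOCATABLE_CLASS = 6  # reg_type values above this (e.g. 9 = barrier) are skipped


@dataclass
class VReg:
    """Hypothetical stand-in for the vreg object."""
    name: str
    reg_type: int  # models the field at vreg+64


def distribute_by_class(vregs):
    """Bucket vregs by reg_type, dropping types above the allocator cutoff."""
    class_lists = defaultdict(list)
    for v in vregs:
        if v.reg_type <= MAX_ALLOCATABLE_CLASS:
            class_lists[v.reg_type].append(v)
    return class_lists


vregs = [VReg("a", 1), VReg("b", 5), VReg("c", 9), VReg("d", 1)]
buckets = distribute_by_class(vregs)
```

Here `buckets[1]` holds the two GPR vregs in their original order, and the barrier-typed vreg is left for whatever separate mechanism handles it.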

Per-Class Driver: sub_971A90

The outer retry loop (355 lines) that wraps the core allocator with a two-phase strategy:

Phase 1 -- NOSPILL: Attempt allocation without allowing spills. Debug string: "-CLASS NOSPILL REGALLOC: attemp " (note the typo -- present in the binary).

Phase 2 -- SPILL: If NOSPILL fails, invoke spill guidance (sub_96D940) and retry with spilling enabled.

function alloc_with_spill_retry(alloc_state, ctx, class_id):
    no_retarget = query_knob(638)                     // RegAllocNoRetargetPrefs (bool)
    num_trials  = query_knob(639)                     // RegAllocNumNonSpillTrials (int)

    // Phase 1: NOSPILL
    pre_allocation_pass(alloc_state)                  // sub_94A020
    secondary_driver(alloc_state, ctx)                // sub_95DC10
    result = fatpoint_allocate(alloc_state, ctx, NOSPILL)  // sub_957160
    record_best_result(alloc_state, result)            // sub_93D070

    if result == SUCCESS:
        return

    // Phase 2: SPILL retry loop
    for attempt in 1..num_trials:
        guidance = compute_spill_guidance(ctx, attempt)    // sub_96D940
        result = fatpoint_allocate(alloc_state, ctx, SPILL)
        record_best_result(alloc_state, result)

        if result == SUCCESS:
            break

    if result == FAILURE:
        final_fallback(alloc_state)                   // sub_936FD0

    post_allocation_finalize(alloc_state)             // sub_9714E0

For SMEM spilling (modes 3/6 when ctx+896 == 5), the driver activates sub_939BD0 (spill setup) followed by sub_94F150 (spill codegen) before entering the retry loop.
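
The control flow of the two-phase strategy can be sketched as a small runnable Python function. The `allocate(mode, attempt)` callback is a hypothetical stand-in for the core allocator, and the sketch leaves out spill guidance, best-result recording, and finalization; it only shows the order in which the phases are tried.

```python
SUCCESS, FAILURE = "success", "failure"


def alloc_with_spill_retry(allocate, num_trials):
    """Try once without spilling, then up to num_trials times with spilling.

    `allocate(mode, attempt)` is a hypothetical callback that returns
    SUCCESS or FAILURE.
    """
    if allocate("NOSPILL", 0) == SUCCESS:
        return ("NOSPILL", 0)
    for attempt in range(1, num_trials + 1):
        if allocate("SPILL", attempt) == SUCCESS:
            return ("SPILL", attempt)
    return ("FALLBACK", num_trials)


def toy_allocate(mode, attempt):
    """Toy allocator that only succeeds on the second spill attempt."""
    return SUCCESS if (mode == "SPILL" and attempt >= 2) else FAILURE


outcome = alloc_with_spill_retry(toy_allocate, num_trials=3)
```

With the toy allocator, `outcome` reports that the SPILL phase succeeded on attempt 2; an allocator that always fails ends in the fallback branch.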

Core Fat-Point Allocator: sub_957160

The central allocation function (1658 lines). This is where physical registers are actually chosen.

Data Structures

Two 2056-byte arrays (512 DWORDs + 2-DWORD sentinel each):

| Array | Role |
|---|---|
| Primary (v12) | Per-physical-register interference count |
| Secondary (v225) | Per-physical-register secondary cost (tie-breaking) |

Both arrays are zeroed with SSE2 vectorized loops at the start of each allocation round.

Algorithm

function fatpoint_allocate(alloc_state, ctx, mode):
    maxRegs = alloc_state.hw_limit + 7               // from alloc+756
    if mode == CSSA_PAIRED (6):  maxRegs *= 2
    if mode == CSSA (3):         maxRegs *= 4

    primary[512]   = {0}                              // SSE2 memset
    secondary[512] = {0}

    threshold = query_knob(684)                       // RegAllocThresholdForDiscardConflicts, default 50

    for each vreg in alloc_state.register_list:       // linked list at +744
        // Populate interference bitmaps for this vreg
        build_interference_bitmaps(vreg, primary, secondary)   // sub_957020

        // Scan for minimum-pressure physical register
        best_slot = -1
        best_cost = MAX_INT
        for slot in 0..maxRegs:
            if primary[slot] > threshold:
                continue                              // too congested
            cost = primary[slot]
            if cost < best_cost:
                best_cost = cost
                best_slot = slot
            elif cost == best_cost:
                // tie-break on secondary bitmap
                if secondary[slot] < secondary[best_slot]:
                    best_slot = slot

        if best_slot == -1:
            emit_error("Register allocation failed with register count of '%d'")
            return FAILURE

        // Assign physical register
        assign_register(alloc_state, ctx, mode,       // sub_94FDD0
                        vreg, best_slot)

    return alloc_state.register_count + 1

The interference threshold (RegAllocThresholdForDiscardConflicts, knob 684, default 50) is the key heuristic parameter. Slots with interference above this value are discarded (skipped entirely), forcing the allocator toward less-contested register slots even if they are not globally minimal.
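
The slot scan from the pseudocode above can be written as a short runnable Python function. This is a sketch under assumptions: the arrays are plain lists rather than 512-DWORD buffers, and the scan range is treated as half-open (`0 <= slot < max_regs`), which the pseudocode does not specify.

```python
def pick_slot(primary, secondary, max_regs, threshold=50):
    """Return the slot with the lowest primary count, or -1 if none qualifies.

    Slots whose primary count exceeds `threshold` are skipped; ties on the
    primary count are broken by the lower secondary cost.
    """
    best_slot, best_cost = -1, None
    for slot in range(max_regs):
        cost = primary[slot]
        if cost > threshold:
            continue  # too congested, discard this slot
        if best_cost is None or cost < best_cost:
            best_slot, best_cost = slot, cost
        elif cost == best_cost and secondary[slot] < secondary[best_slot]:
            best_slot = slot
    return best_slot


primary = [3, 1, 1, 60]
secondary = [0, 5, 2, 0]
chosen = pick_slot(primary, secondary, max_regs=4)
```

In this example slots 1 and 2 tie on the primary count, so the lower secondary cost decides in favor of slot 2, and slot 3 is never considered because its count is above the threshold.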

Register Assignment: sub_94FDD0

The assignment function (155 lines) writes the physical register and propagates through alias chains:

function assign_register(alloc, ctx, mode, vreg, regclass_info, slot, cost):
    max_regs = regclass_info.max_regs                 // at +16

    if slot >= max_regs and not vreg.is_spilled():    // flag 0x4000
        vreg.set_needs_spill()                        // flag 0x40000
        return

    if vreg.needs_spill():                            // flag 0x40000
        setup_spill_allocator(alloc)                  // sub_939BD0
        generate_spill_code(alloc, vreg)              // sub_94F150
        return

    // Non-spill path: commit assignment
    consumption = compute_consumption(vreg)            // sub_939CE0
    update_peak_usage(alloc, consumption)
    vreg.physical_register = slot

    // Check for pre-allocated candidate
    apply_preallocated_candidate(alloc, vreg)         // sub_950100

    // Propagate through alias chain
    alias = vreg.alias_parent                         // vreg+36
    while alias != NULL:
        alias.physical_register = slot
        alias = alias.alias_parent

Register consumption computation (sub_939CE0, 23 lines) accounts for paired registers: it returns assignment + (1 << (pair_mode == 3)) - 1, effectively consuming two slots for double-width registers.
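
The consumption formula is easy to check numerically. The sketch below restates it in Python; the function and parameter names are illustrative, not taken from the binary.

```python
def compute_consumption(assignment, pair_mode):
    """Highest physical slot index touched by a vreg placed at `assignment`."""
    width = 1 << (pair_mode == 3)  # 2 slots for double-width, otherwise 1
    return assignment + width - 1


single = compute_consumption(10, 0)
double = compute_consumption(10, 3)
```

A single-width vreg at slot 10 ends at slot 10, while a double-width vreg at the same slot ends at slot 11.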

Constraint System

The fat-point interference builder (sub_926A30, 4005 lines) processes 15+ constraint types extracted from instruction operand descriptors. Each operand encodes: bits 28--30 = operand type, bits 0--23 = register index.

| Type | Name | Description |
|---|---|---|
| 0 | Point interference | Single-instruction conflict at a specific program point |
| 1 | Register operand | Standard read/write interference |
| 2 | Immediate operand | No register interference generated |
| 3 | Paired register | Double-width; bit 23 distinguishes hi/lo half |
| 4 | Exclude-one | Specific physical register excluded from assignment |
| 5 | Exclude-all-but | Only one physical register permitted |
| 6 | Below-point | Interference active below the current program point |
| 7 | Range | Interference over an interval of program points |
| 8 | Phi-related | CSSA phi instruction (opcode 195) constraint |
| 9 | Barrier | Barrier register class constraint |
| 10--15 | Extended | Additional constraint variants |
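
The operand encoding stated above (bits 28--30 for the operand type, bits 0--23 for the register index) can be decoded with two mask operations. The sketch assumes a 32-bit descriptor word. A 3-bit field can only hold values 0--7, so how the higher-numbered constraint types in the table are encoded is not covered here.

```python
def decode_operand(word):
    """Split a 32-bit operand descriptor into (operand_type, register_index)."""
    operand_type = (word >> 28) & 0x7   # bits 28--30
    register_index = word & 0xFFFFFF    # bits 0--23
    return operand_type, register_index


example = decode_operand((1 << 28) | 42)
```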

The builder uses FNV-1a hashing (seed 0x811C9DC5, prime 16777619) for hash-table lookups into the pre-allocation candidate table. It contains SSE2-vectorized inner loops for bulk interference weight accumulation and dispatches through 7+ vtable entries for OCG knob queries.
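
For reference, 32-bit FNV-1a with the seed and prime quoted above looks like this in Python. Which bytes the builder feeds into the hash is not documented here; hashing a 4-byte little-endian key is an assumption made only for the example.

```python
FNV_OFFSET_BASIS = 0x811C9DC5
FNV_PRIME = 16777619


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR each byte in, then multiply by the prime."""
    h = FNV_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFF
    return h


# Hypothetical key: a register index serialized as 4 little-endian bytes.
key_hash = fnv1a_32((42).to_bytes(4, "little"))
```

Hashing the empty input returns the seed unchanged, which is a quick way to recognize this constant in decompiled code.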

Spilling Overview

Spilling triggers when the fat-point allocator cannot find a physical register within the budget. The subsystem has three components:

Spill guidance (sub_96D940, 2983 lines): Computes which registers to spill and in what order. Builds a 7-element guidance array (one per register class), each backed by an 11112-byte working structure containing 128-element bitmask arrays. Constructs priority queues of spill candidates using bitvector-based live range analysis. The function contains 7 near-identical code blocks (one per class), likely unrolled from a template.

Spill codegen (sub_94F150, 561 lines): Emits actual spill/reload instructions. Allocates a per-register spill info array (12 bytes per entry, initialized to {0, -1, -1}). Default spill cost is 15.0, reduced to 3.0 for certain architecture modes. Handles loop nesting via block frequency callbacks (vtable offset +8) and provides special handling for uniform registers (bit 0x200 in flags).

Spill memory targets:

| Target | Description |
|---|---|
| LMEM (local memory) | Default spill destination. Per-thread private memory. |
| SMEM (shared memory) | Alternative spill destination. Faster but shared across CTA. Assertion: "Smem spilling should not be enabled when functions use abi." |

Spill setup (sub_939BD0, 65 lines) selects configuration based on RegAllocEstimatedLoopIterations (knob 623) and the cost threshold at alloc+776:

| Condition | Bucket size | Alignment | Max size |
|---|---|---|---|
| Cost threshold == 0 | 8 | 4 | 1 MB |
| Cost threshold != 0 | 16 | 16 | 1 MB |

See Spilling for the full spill subsystem analysis.

Pre-Allocation and Mem-to-Reg

Two important pre-passes run before the main allocator:

ConvertMemoryToRegisterOrUniform

Entry: sub_910840 (327 lines). Promotes stack variables to registers or uniform registers. Gated by sub_8F3EA0 (eligibility check) and NumOptPhasesBudget (knob 487, budget type).

sub_910840 (entry, string: "ConvertMemoryToRegisterOrUniform")
  sub_905B50 (1046 lines)  build promotion candidates
  sub_911030 (2408 lines)  detailed analysis engine (def-use chains, dominance)
  sub_90FBA0 (653 lines)   execute promotion, insert phi nodes
  sub_914B40 (1737 lines)  post-promotion rewrite / phi-resolution

Pre-Allocation Pass

Entry: sub_94A020 (331 lines). Assigns physical registers to high-priority operands before the main allocator runs. Gated by RegAllocMacForce (knob 628, bool), RegAllocMacVregAllocOrder (knob 629, int), and RegAllocCoalescing (knob 618, bool).

For allocation modes 3, 5, or 6: iterates basic blocks calling sub_9499E0 (per-block scanner) and sub_93ECB0 (per-operand pre-assigner). Priority levels from RegAllocPrefMacOperands (knob 646): 1 = read operands, 2 = write operands, 3 = both.

Uses an opcode eligibility bitmask table (shift-based membership test on opcode - 22) to filter which instructions are candidates for pre-assignment.
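
A shift-based membership test of this kind can be sketched as follows. The mask contents and the 64-bit width are hypothetical; only the `opcode - 22` bias comes from the text above.

```python
OPCODE_BASE = 22
MASK_WIDTH = 64  # assumed width of the mask word

# Hypothetical mask marking opcodes 22, 25 and 30 as eligible. The real
# table contents are not documented here.
ELIGIBLE_MASK = (1 << 0) | (1 << 3) | (1 << 8)


def is_eligible(opcode):
    """Test bit (opcode - OPCODE_BASE) of the eligibility mask."""
    bit = opcode - OPCODE_BASE
    if bit < 0 or bit >= MASK_WIDTH:
        return False  # outside the range the mask can describe
    return (ELIGIBLE_MASK >> bit) & 1 == 1
```

The range check matters: without it, opcodes below the base would produce a negative shift count.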

Live Range Infrastructure

An interval-based live range system at 0x994000--0x9A1000 (~80 functions) supports auxiliary operations. This is not the main allocator but feeds results into it:

| Subsystem | Range | Count | Key Functions |
|---|---|---|---|
| Live range primitives | 0x994000--0x996000 | ~25 | Constructor, interval queries, weight, color get/set |
| Interference graph | 0x996000--0x99A000 | ~18 | Node/edge construction, adjacency, degree, coloring |
| Range operations | 0x99C000--0x9A1000 | ~35 | Merge, split, interference add/remove, copy detection |
| Register coalescing | sub_9B1200 | 1 | Copy elimination pass (800 lines) |
| Live range splitting | sub_9AEF60 | 1 | Interference graph update (900 lines, self-recursive) |
| Range merge engine | sub_9AD220 | 1 | Coalescing with cost heuristics (700 lines) |
| Range construction | sub_9A5170 | 1 | Build ranges from def-use chains (750 lines) |

Allocator State Object Layout

Full reconstruction from the constructor sub_947150 (1088 lines), cross-referenced with the core allocator, per-class driver, entry point, and spill subsystem. The object is at least 1748 bytes (last initialized field at +1744). The constructor is called once per function before the allocation pipeline runs.

Header and Compilation Context (+0 -- +24)

| Offset | Size | Type | Init | Field |
|---|---|---|---|---|
| +0 | 8 | ptr | &off_21E1648 | Vtable pointer (strategy dispatch, 40+ virtual methods) |
| +8 | 8 | ptr | arg | Compilation context (parent object) |
| +16 | 8 | ptr | off_21DBEF8 | Secondary vtable (allocation sub-strategy) |
| +24 | 8 | ptr | ctx->func | Function object pointer (from ctx+16) |

Pre-Allocation Candidate Tables (+32 -- +443)

Arena-allocated hash tables for pre-assigned registers. Each table is a 3-QWORD header {base, size, capacity} plus an arena node (24 bytes, allocated from the function memory pool with an incrementing class tag).

| Offset | Size | Type | Init | Field |
|---|---|---|---|---|
| +32 | 8 | ptr | 0 | Pre-alloc candidate list A head |
| +40 | 8 | ptr | 0 | Pre-alloc candidate list B head |
| +48 | 4 | DWORD | 0 | Pre-alloc candidate count A |
| +56 -- +208 | 160 | -- | 0 | Per-class registration slots (6 x {ptr, ptr, DWORD} = 24B each) |
| +216 | 8 | ptr | 0 | Registration slots tail |
| +224 | 8 | ptr | alloc(24) | Exclusion set arena node (class tag = 1) |
| +232 | 8 | ptr | alloc(24) | Pre-alloc hash table A arena node (class tag = 2) |
| +240 | 8 | ptr | 0 | Pre-alloc hash table A: base pointer |
| +248 | 8 | ptr | 0 | Pre-alloc hash table A: count |
| +256 | 8 | ptr | 0 | Pre-alloc hash table A: capacity |
| +272 | 8 | ptr | alloc(24) | Pre-alloc hash table B arena node |
| +280 | 24 | -- | 0 | Pre-alloc hash table B: {base, count, capacity} |
| +312 | 8 | ptr | alloc(24) | Pre-alloc hash table C arena node |
| +320 | 24 | -- | 0 | Pre-alloc hash table C: {base, count, capacity} |
| +352 | 8 | ptr | alloc(24) | Exclusion set hash table arena node (class tag = 3) |
| +360 | 8 | ptr | 0 | Exclusion set: base pointer |
| +368 | 8 | ptr | 0 | Exclusion set: count |
| +376 | 8 | ptr | 0 | Exclusion set: capacity |
| +384 | 4 | DWORD | 0 | Exclusion set: element count |
| +392 | 8 | ptr | =+352 | Exclusion alias A (points to same node) |
| +400 | 24 | -- | 0 | Exclusion secondary: {base, count, capacity} |
| +424 | 4 | DWORD | 0 | Exclusion secondary: element count |
| +432 | 8 | ptr | =+352 | Exclusion alias B |
| +440 | 1 | BYTE | 0 | MAC force pre-alloc flag (RegAllocMacForce, knob 628) |
| +441 | 1 | BYTE | 0 | Coalescing enable flag (RegAllocCoalescing, knob 618) |
| +442 | 1 | BYTE | 0 | MAC vreg alloc order (RegAllocMacVregAllocOrder, knob 629) |
| +443 | 1 | BYTE | 0 | Per-class mode flag (set by vtable+296 callback) |

Per-Class Bitvector Sets (+448 -- +695)

An array of 6 bitvector set entries (one per allocatable register class, classes 1--6). Each entry is 40 bytes: a linked-list header {head, data, tail, count} (32 bytes) plus an arena node pointer (8 bytes). The arena nodes carry incrementing class tags (4, 6, 8, 10, 12, 14). The constructor loop starts at +456 and increments by 40 until +656.

| Offset | Size | Type | Init | Field |
|---|---|---|---|---|
| +448 | 8 | QWORD | 0 -> 6 | Bitvector set count (incremented in init loop) |
| +456 | 240 | array | -- | 6 x BitvectorSet (40B each): classes 1--6 |
| +696 | 24 | -- | 0 | Remat candidate list: {base, data, tail} |
| +720 | 4 | DWORD | 0 | Remat candidate list: count |
| +728 | 8 | ptr | alloc(24) | Remat candidate arena node (class tag = 2) |

Core Allocation State (+736 -- +872)

| Offset | Size | Type | Init | Field |
|---|---|---|---|---|
| +736 | 8 | ptr | 0 | Register linked list: secondary head |
| +744 | 8 | ptr | 0 | Register linked list head (main walk list for sub_957160) |
| +752 | 1 | BYTE | 0 | Register list initialized flag |
| +756 | 4 | DWORD | -1 | Hardware register limit (max physical regs, per-class) |
| +760 | 4 | DWORD | -1 | Secondary HW limit |
| +764 | 4 | DWORD | -1 | Pre-alloc constraint count |
| +776 | 8 | double | -1.0 | Spill cost threshold |
| +788 | 4 | DWORD | -1 | Best allocation result (reset to 0 per allocation round) |
| +792 | 1 | BYTE | 0 | Allocation-in-progress flag |
| +800 | 1 | BYTE | 0 | Retry-active flag |
| +808 | 4 | DWORD | (dynamic) | Live range interference state |
| +816 | 8 | ptr | (dynamic) | Live range secondary structure (4-byte DWORD array at +816) |
| +824 | 1 | BYTE | 0 | Pre-coloring done flag |
| +832 | 8 | ptr | 0 -> dyn | Per-function spill info array pointer |
| +840 | 8 | ptr | 0 -> dyn | Per-function spill info arena node |
| +848 | 8 | ptr | 0 | Spill info secondary |
| +856 | 8 | ptr | 0 | Spill info tertiary |
| +864 | 1 | BYTE | 0 | Bank conflict awareness flag |
| +865 | 1 | BYTE | 0 | Spill-already-triggered flag |
| +872 | 8 | ptr | 0 | Debug / trace output state |

Per-Class Register File Descriptors (+880 -- +1103)

An array of 7 register class descriptors (one per class 0--6), each 32 bytes. Indexed as alloc + 880 + 32 * class_id. The per-class driver (sub_971A90) accesses max_regs as a1[32 * class_id + 884] and base_offset as a1[32 * class_id + 880].

RegClassDesc (32 bytes):

| Offset | Size | Type | Init | Field |
|---|---|---|---|---|
| +0 | 4 | DWORD | 0 | Base register offset (first physical reg in class) |
| +4 | 4 | DWORD | -1 | Max regs / HW limit (set by vtable[896] init callback) |
| +8 | 4 | DWORD | 0 | Current allocation count |
| +12 | 1 | BYTE | 0 | Class active flag |
| +13 | 1 | BYTE | 0 | Class overflow flag |
| +14 | 1 | BYTE | 0 | Class spill flag |
| +15 | 1 | -- | -- | Padding |
| +16 | 4 | DWORD | 148 | Phase ID begin (148 = unset sentinel) |
| +20 | 4 | DWORD | 148 | Phase ID end (148 = unset sentinel) |
| +24 | 8 | QWORD | -1 | Class auxiliary link |
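
The RegClassDesc layout above can be cross-checked with Python's `struct` module. This is a sketch assuming little-endian packing with no implicit padding; the field names in `FIELDS` are illustrative labels, not identifiers from the binary.

```python
import struct

# 3 DWORDs, 3 flag bytes, 1 pad byte, 2 DWORDs, 1 QWORD (little-endian).
REG_CLASS_DESC = struct.Struct("<iiiBBBxiiq")

FIELDS = ("base_offset", "max_regs", "alloc_count", "active", "overflow",
          "spill", "phase_begin", "phase_end", "aux_link")

# Initial values as listed in the Init column.
initial = REG_CLASS_DESC.pack(0, -1, 0, 0, 0, 0, 148, 148, -1)
decoded = dict(zip(FIELDS, REG_CLASS_DESC.unpack(initial)))
```

The format string adds up to 32 bytes, matching the stated descriptor size, and reading a DWORD at byte 4 of `initial` yields the -1 max-regs sentinel.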

Concrete addresses:

| Class | Offset Range | Description |
|---|---|---|
| 0 (unified) | +880 -- +911 | Cross-class (skipped in main loop) |
| 1 (R) | +912 -- +943 | GPR 32-bit |
| 2 (R alt) | +944 -- +975 | GPR variant |
| 3 (UR) | +976 -- +1007 | Uniform GPR |
| 4 (UR ext) | +1008 -- +1039 | Uniform GPR variant |
| 5 (P/UP) | +1040 -- +1071 | Predicate registers |
| 6 (Tensor) | +1072 -- +1103 | Tensor / accumulator |
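
The concrete ranges follow directly from the indexing rule `alloc + 880 + 32 * class_id`. A small Python sketch (helper names are illustrative) reproduces them:

```python
DESC_BASE = 880
DESC_SIZE = 32


def class_desc_range(class_id):
    """Inclusive byte range of the descriptor for `class_id` (0..6)."""
    if not 0 <= class_id <= 6:
        raise ValueError("class_id must be in 0..6")
    start = DESC_BASE + DESC_SIZE * class_id
    return start, start + DESC_SIZE - 1


def max_regs_offset(class_id):
    """Offset of the max_regs DWORD (+4 within the descriptor)."""
    return class_desc_range(class_id)[0] + 4


ranges = {c: class_desc_range(c) for c in range(7)}
```

For class 1 this gives the range +912 to +943, with max_regs at +916, which agrees with the `32 * class_id + 884` access pattern described for sub_971A90.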

Extended Class Metadata (+1096 -- +1127)

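| Offset | Size | Type | Init | Field |
|---|---|---|---|---|
| +1096 | 8 | QWORD | -1 | Class 6 extended auxiliary link |
| +1104 | 8 | ptr | 0 | Extended class info: pointer A |
| +1112 | 8 | ptr | 0 | Extended class info: pointer B |
| +1120 | 4 | DWORD | 0 | Extended class info: count |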

Per-Class Rematerialization Lists (+1128 -- +1271)

Six rematerialization candidate lists (one per allocatable class), each 24 bytes {ptr base, ptr data, DWORD count}. Initialized to zero. Populated before the allocation loop in sub_9721C0 for classes that support rematerialization.

| Class | Offset Range |
|---|---|
| 1 | +1128 -- +1151 |
| 2 | +1152 -- +1175 |
| 3 | +1176 -- +1199 |
| 4 | +1200 -- +1223 |
| 5 | +1224 -- +1247 |
| 6 | +1248 -- +1271 |

Coalescing / Live Range Lists (+1272 -- +1432)

Self-referential circular linked lists used for register coalescing and live range splitting. Each list has a sentinel structure where prev and next point into the list body.

| Offset | Size | Type | Init | Field |
|---|---|---|---|---|
| +1272 | 8 | ptr | arg2 | Back-pointer to compilation context |
| +1280 | 8 | ptr | 0 | Coalesce list A: sentinel head |
| +1288 | 8 | ptr | self+1296 | Coalesce list A: prev (self-referential) |
| +1296 | 8 | ptr | self+1280 | Coalesce list A: next (circular) |
| +1304 | 8 | ptr | 0 | Coalesce list A: data |
| +1312 | 4 | DWORD | (checked) | Coalesce list A: count (bit 0 = non-empty flag) |
| +1320 | 8 | ptr | self+1296 | Coalesce list A: end marker |
| +1328 | 4 | DWORD | 2 | Coalesce list A: type tag |
| +1336 | 8 | ptr | alloc(24) | Coalesce list A: arena node |
| +1344 | 8 | ptr | 0 | Coalesce list B: sentinel head |
| +1352 | 8 | ptr | self+1360 | Coalesce list B: prev |
| +1360 | 8 | ptr | self+1344 | Coalesce list B: next |
| +1368 | 8 | ptr | 0 | Coalesce list B: data (bit 2 checked as ABI flag) |
| +1376 | 8 | ptr | self+1344 | Coalesce list B: tail |
| +1384 | 8 | ptr | self+1360 | Coalesce list B: end marker |
| +1392 | 4 | DWORD | 2 | Coalesce list B: type tag |
| +1400 | 8 | ptr | alloc(24) | Coalesce list B: arena node |
| +1408 | 8 | ptr | alloc(24) | Interference graph arena node (bit 1 = call-saved mode) |
| +1416 | 8 | ptr | 0 | Interference graph: base |
| +1424 | 8 | ptr | 0 | Interference graph: data (bit 7 checked in sub_97EC60) |
| +1432 | 8 | ptr | 0 | Interference graph: capacity |

Debug / Rematerialization Infrastructure (+1440 -- +1496)

| Offset | Size | Type | Init | Field |
|---|---|---|---|---|
| +1440 | 8 | -- | (tree) | Remat exclusion set (tree root, queried via sub_99C5B0) |
| +1448 | 1 | BYTE | 0 | Remat exclusion: active flag (checked in sub_962840, sub_94E620) |
| +1452 | 4 | DWORD | 0 | Remat exclusion: instruction threshold |
| +1464 | 16 | OWORD | 0 | Remat exclusion: data block B |
| +1472 | 8 | ptr | 0 | Remat candidate: linked list (freed in sub_99D190) |
| +1480 | 16 | -- | 0 | Remat candidate list (iterated by sub_94BDF0) |
| +1488 | 4 | DWORD | 0 | Remat candidate: count (checked in sub_99C690) |
| +1496 | 8 | ptr | 0 | Remat candidate: root pointer |

Spill / Retry Control Block (+1504 -- +1594)

The core state for the NOSPILL / SPILL retry loop. Zeroed at allocation start, populated by the per-class driver (sub_971A90), read/written by the fat-point allocator (sub_957160).

| Offset | Size | Type | Init | Field |
|---|---|---|---|---|
| +1504 | 4 | DWORD | 0 | Allocation mode (0=normal, 3=CSSA, 5=SMEM, 6=paired) |
| +1508 | 4 | DWORD | 0 | Spill attempt counter |
| +1512 | 4 | DWORD | 0 -> 44 | Spill instruction count (knob 635, default 44) |
| +1516 | 4 | DWORD | -1 | Budget lower bound |
| +1520 | 4 | DWORD | -1 | Budget lower bound secondary (part of 128-bit at +1516) |
| +1524 | 4 | DWORD | -1 | Register budget (from per-class desc max_regs) |
| +1528 | 4 | DWORD | (dynamic) | Peak register usage (copied from +1532 per round) |
| +1532 | 16 | __m128i | (global) | Strategy parameters (loaded from xmmword_21E17F0) |
| +1540 | 4 | DWORD | 0 | Secondary budget limit (knob 633) |
| +1544 | 4 | DWORD | 0 | Tertiary budget limit (knob 632) |
| +1548 | 4 | float | 4.0 | Spill cost multiplier (knob 680) |
| +1552 | 4 | DWORD | -1 | Rollback sentinel |
| +1556 | 4 | DWORD | -1 | Max regs aligned: (budget + 4) & ~3 |
| +1560 | 4 | DWORD | -1 | Best result sentinel |
| +1564 | 4 | DWORD | 0 | Current max assignment (zeroed per allocation round) |
| +1568 | 8 | double | 0.0 | Total spill cost accumulator (zeroed per round) |
| +1576 | 4 | DWORD | 0 | Spill event counter (zeroed per round) |
| +1580 | 4 | DWORD | (dynamic) | Effective budget: max(budget, SMEM_min) |
| +1584 | 4 | DWORD | (dynamic) | Adjusted budget (from vtable+256 callback) |
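
The alignment expression listed for +1556, `(budget + 4) & ~3`, can be evaluated for a few sample budgets. The helper name below is illustrative.

```python
def aligned_max_regs(budget):
    """Apply (budget + 4) & ~3 as listed for the field at +1556."""
    return (budget + 4) & ~3


samples = {b: aligned_max_regs(b) for b in (0, 1, 3, 4, 63, 252, 255)}
```

As written, the expression always yields a multiple of 4 that is strictly greater than the budget, so a budget of 252 and a budget of 255 both map to 256.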

Mode Flags (+1588 -- +1594)

Knob-derived boolean flags controlling allocation strategy. When the function has more than one basic block (sub_7DDB50 > 1), flags +1588, +1589, +1590 are all forced to 1.

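| Offset | Size | Type | Init | Knob | Field |
|---|---|---|---|---|---|
| +1588 | 1 | BYTE | 0 | 682 | Epoch-aware allocation mode |
| +1589 | 1 | BYTE | 0 | 683 | Paired-register allocation mode |
| +1590 | 1 | BYTE | 0 | 619 | SMEM spill enable |
| +1591 | 1 | BYTE | 0 | 627 | Bank-aware allocation |
| +1592 | 1 | BYTE | 0 | -- | Spill status / has-spilled flag |
| +1593 | 1 | BYTE | 1 | 636 | Precolor reuse (default enabled) |
| +1594 | 1 | BYTE | 1 | 649 | ABI compatibility (default enabled; cleared for small kernels) |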

Budget Pressure Model (+1600 -- +1744)

Occupancy-aware register budget interpolation. Computes a dynamic register budget based on thread occupancy, using knob-derived coefficients and a linear interpolation model. The slope at +1736 is (coeffB - coeffC) / (maxOccupancy - minOccupancy), enabling the allocator to trade register count for occupancy.

| Offset | Size | Type | Init | Field |
|---|---|---|---|---|
| +1600 | 8 | ptr | ctx[2]->+208 | Function object pair pointer |
| +1608 | 8 | ptr | 0 | Budget model: auxiliary pointer |
| +1616 | 8 | QWORD | 0xFFFFFFFF | Budget model: occupancy upper bound |
| +1624 | 4 | DWORD | 119 / knob | Max threads per block (default 119) |
| +1628 | 4 | DWORD | 160 / knob | Pressure threshold (default 160) |
| +1632 | 8 | double | 0.2 | Interpolation coefficient A (knob-overridable) |
| +1640 | 8 | double | 1.0 | Interpolation coefficient B (knob-overridable) |
| +1648 | 8 | double | 0.3 | Interpolation coefficient C (knob-overridable) |
| +1656 | 8 | double | (computed) | Total threads as double |
| +1664 | 8 | double | = coeff A | Interpolation point [0] |
| +1672 | 8 | double | (computed) | Interpolation point [1]: max_threads as double |
| +1680 | 8 | double | = coeff A | Interpolation point [2] |
| +1688 | 8 | double | (computed) | Interpolation point [3]: threshold as double |
| +1696 | 8 | double | = coeff A | Interpolation point [4] |
| +1704 | 8 | double | (computed) | Interpolation point [5]: 255 minus vtable result |
| +1712 | 8 | double | = coeff B | Interpolation point [6] |
| +1720 | 8 | double | (computed) | Linear model: x_min (thread count) |
| +1728 | 8 | double | = coeff C | Linear model: y_min |
| +1736 | 8 | double | (computed) | Linear model: slope |
| +1744 | 8 | ptr | 0 | Budget model: tail sentinel |
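
The slope formula for the linear model can be illustrated numerically. This sketch uses the default coefficients from the table (B = 1.0, C = 0.3); the occupancy bounds are made-up inputs, and evaluating the line in point-slope form from (x_min, y_min) is an assumption about how the three stored values are combined.

```python
def budget_slope(coeff_b, coeff_c, max_occupancy, min_occupancy):
    """Slope as described: (coeffB - coeffC) / (maxOccupancy - minOccupancy)."""
    return (coeff_b - coeff_c) / (max_occupancy - min_occupancy)


def evaluate(x, x_min, y_min, slope):
    """Assumed point-slope form: y = y_min + slope * (x - x_min)."""
    return y_min + slope * (x - x_min)


# Occupancy bounds 8 and 64 are illustrative values only.
slope = budget_slope(1.0, 0.3, max_occupancy=64, min_occupancy=8)
y_at_min = evaluate(8, x_min=8, y_min=0.3, slope=slope)
y_at_max = evaluate(64, x_min=8, y_min=0.3, slope=slope)
```

Under these assumptions the model runs from coefficient C at the low end of the occupancy range to coefficient B at the high end.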

Virtual Register Object Layout

| Offset | Size | Field |
|---|---|---|
| +0 | 8 | Next pointer (linked list) |
| +12 | 4 | Register class index |
| +20 | 1 | Flags byte (bit 0x20 = live) |
| +36 | 8 | Alias chain (coalesced parent) |
| +40 | 4 | Spill cost (float, accumulated) |
| +48 | 8 | Flags qword (see below) |
| +64 | 4 | Register type (1=GPR, 3=pred, 9=barrier) |
| +68 | 4 | Physical assignment (-1 = unassigned) |
| +72 | 1 | Size byte (0 = scalar) |
| +76 | 4 | Secondary spill cost (float) |
| +80 | 4 | Spill flag (0 = not spilled, 1 = spilled) |
| +104 | 8 | Use chain head |
| +112 | 8 | Def chain |
| +128 | 8 | Next in linked-register chain |
| +144 | 8 | Constraint list |

Flag bits at +48:

| Bit | Mask | Meaning |
|---|---|---|
| 9 | 0x200 | Pre-assigned / fixed register |
| 10 | 0x400 | Coalesced source |
| 11 | 0x800 | Coalesced target |
| 14 | 0x4000 | Spill marker |
| 18 | 0x40000 | Needs-spill flag |
| 20--21 | -- | Pair mode (0=single, 1=lo-half, 3=double-width) |
| 22 | 0x400000 | Constrained to architecture limit |
| 23 | 0x800000 | Hi-half of pair |
| 27 | 0x8000000 | Special handling flag |
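
The flag table above translates into a small decoder. The dictionary keys are illustrative labels for the listed masks; only the mask values and the pair-mode bit positions come from the table.

```python
FLAG_BITS = {
    "pre_assigned":     0x200,
    "coalesced_source": 0x400,
    "coalesced_target": 0x800,
    "spill_marker":     0x4000,
    "needs_spill":      0x40000,
    "arch_limit":       0x400000,
    "hi_half":          0x800000,
    "special":          0x8000000,
}
PAIR_MODES = {0: "single", 1: "lo-half", 3: "double-width"}


def decode_flags(flags):
    """Return (sorted names of set flags, pair-mode label) for a +48 qword."""
    set_flags = sorted(name for name, mask in FLAG_BITS.items() if flags & mask)
    pair_mode = (flags >> 20) & 0x3  # bits 20--21
    return set_flags, PAIR_MODES.get(pair_mode, "unknown")


decoded = decode_flags(0x40000 | 0x200 | (3 << 20))
```

For the sample value, the decoder reports the pre-assigned and needs-spill flags together with double-width pair mode.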

Key Knobs

87 OCG knobs (indices 613--699) control register allocation heuristics. The complete catalog with sub-category grouping is in Knobs System -- Register Allocation Knobs. The most important ones:

| Knob | Name | Type | Role |
|---|---|---|---|
| 381 | (not yet decoded) | -- | HoistInvariants policy: 0=always, 1=inner loops, 3=never |
| 487 | NumOptPhasesBudget | BDGT | Budget counter that gates ConvertMemoryToRegisterOrUniform |
| 618 | RegAllocCoalescing | bool | Enables register coalescing in the allocator |
| 623 | RegAllocEstimatedLoopIterations | STR | Loop iteration estimate hint for spill cost weighting |
| 628 | RegAllocMacForce | bool | Forces MAC-level pre-allocation path |
| 629 | RegAllocMacVregAllocOrder | INT | Vreg processing order during MAC allocation |
| 638 | RegAllocNoRetargetPrefs | bool | Disables retarget-preference optimization |
| 639 | RegAllocNumNonSpillTrials | INT | Non-spill allocation trials before allowing spills |
| 646 | RegAllocPrefMacOperands | INT | MAC operand preference level (1=read, 2=write, 3=both) |
| 684 | RegAllocThresholdForDiscardConflicts | INT | Interference discard threshold. Default 50 |
| 934 | UseNewLoopInvariantRoutineForHoisting | bool | Selects new LICM routine for HoistInvariants pre-pass |

Function Map

| Address | Lines | Role |
|---|---|---|
| sub_8FFDE0 | 119 | HoistInvariants entry |
| sub_905B50 | 1046 | Mem-to-reg candidate builder |
| sub_910840 | 327 | ConvertMemoryToRegisterOrUniform entry |
| sub_911030 | 2408 | Mem-to-reg analysis engine |
| sub_914B40 | 1737 | Post-promotion rewrite |
| sub_926A30 | 4005 | Fat-point interference builder |
| sub_947150 | 1088 | Allocator state constructor (initializes 1748-byte object) |
| sub_939BD0 | 65 | Spill allocator setup |
| sub_939CE0 | 23 | Register consumption counter |
| sub_93D070 | 155 | Best result recorder |
| sub_93ECB0 | 194 | Pre-assign registers |
| sub_93FBE0 | 940 | Spill slot assignment |
| sub_94A020 | 331 | Pre-allocation pass |
| sub_94E620 | 617 | Spill cost accumulator |
| sub_94F150 | 561 | Spill code generation |
| sub_94FDD0 | 155 | Register assignment + alias propagation |
| sub_950100 | 205 | Pre-allocated candidate applier |
| sub_957160 | 1658 | Core fat-point allocator |
| sub_9539C0 | 1873 | Shared-memory spill allocator |
| sub_95A350 | 1390 | Cost / benefit evaluator |
| sub_95BC90 | 1250 | Allocation retry / refinement |
| sub_95DC10 | 2738 | Multi-class ABI-aware driver |
| sub_9680F0 | 3722 | Per-instruction assignment core loop |
| sub_96D940 | 2983 | Spill guidance (7-class priority queues) |
| sub_971A90 | 355 | NOSPILL / SPILL retry driver |
| sub_9721C0 | 1086 | Register allocation entry point |
| sub_991790 | 2677 | Pre-coloring pass |
| sub_9A5170 | 750 | Live range construction |
| sub_9AD220 | 700 | Live range merge / coalescing engine |
| sub_9AEF60 | 900 | Live range splitting |
| sub_9B1200 | 800 | Register coalescing / copy elimination |