
MachineBlockPlacement for GPU

NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.

MachineBlockPlacement decides the physical ordering of basic blocks in a MachineFunction. On CPU, it is primarily an I-cache optimization. On GPU, block ordering has deeper consequences: PTX is a structured ISA where every taken branch stalls the SM instruction fetch pipeline, warp divergence must reconverge at post-dominators, and instruction cache capacity is measured in tens of kilobytes per SM partition. cicc carries two separate instances of this pass -- a stock LLVM copy for internal use and an NVPTX-pipeline copy at sub_3521FF0 that participates in GPU-specific analysis. The NVPTX instance queries a divergence flag on the MachineFunction to decide whether tail duplication is profitable, and adds an alternative layout proposal path (sub_34BEDF0 / sub_34C7080) that is absent from upstream LLVM.

Key Facts

Property                Value
Entry point             sub_3521FF0 (82 KB decompiled, 2435 lines)
Pass name               "Branch Probability Basic Block Placement"
Pass ID                 "block-placement"
Registration (NVPTX)    sub_350FE30 (pass), sub_350FEE0 (stats)
Registration (generic)  sub_1DE8060 (pass), sub_1DE8500 (stats)
Stats pass ID           "block-placement-stats", callback sub_3517680
Knob constructor        ctor_671_0 at 0x5A0470
Required analyses       MachineBlockFrequencyInfo, MachineBranchProbabilityInfo, MachinePostDominatorTree, MachineLoopInfo, TargetPassConfig

Why Block Placement Matters on GPU

Three properties of GPU execution make block ordering non-trivial.

Instruction fetch pipeline. GPU SMs fetch instructions sequentially. A taken branch introduces a fetch bubble -- the warp scheduler cannot issue from the new target until the instruction cache services the request. Every fall-through edge is free; every taken branch costs at least one cycle of fetch latency. The misfetch-cost (default 1) and jump-inst-cost (default 1) knobs model this cost. Maximizing fall-through sequences directly reduces warp stall cycles at branch points.

Instruction cache pressure. GPU instruction caches are small (typically 32-128 KB per SM partition). Code duplication through tail-dup increases I-cache working set. The tail-dup-placement-penalty (default 2%) penalizes code copies that improve fall-through at the expense of I-cache pressure. The ext-TSP model, when enabled, explicitly optimizes for I-cache utilization by modeling forward/backward reference distances.

Warp divergence. When a branch is divergent (different lanes take different paths), all paths must execute serially, and the warp reconverges at the post-dominator. Block ordering cannot eliminate the divergence cost, but it determines which side of the branch falls through vs. takes a jump. The divergence flag at MF+8+688 bit 0 gates whether tail duplication is even attempted: duplicating a tail block that sits below a divergent branch wastes code size because divergent warps execute both paths regardless of which one falls through.
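The fetch-cost terms above can be folded into a simple layout score. The sketch below is a hypothetical model (the edge list, position array, and function name are all invented for illustration), assuming a taken branch costs misfetch-cost + jump-inst-cost scaled by the edge's execution frequency, while a fall-through edge is free:

```c
#include <stdint.h>

/* Hypothetical sketch of the fall-through cost model implied by the
 * misfetch-cost and jump-inst-cost knobs (both default 1). An edge that
 * is the layout fall-through costs nothing; every other edge is a taken
 * branch charged per execution. Names are illustrative, not from the
 * binary. */
typedef struct {
    int src, dst;     /* block indices */
    uint64_t freq;    /* profile-derived execution count */
} Edge;

uint64_t layout_cost(const Edge *edges, int n_edges,
                     const int *layout_pos, /* block index -> position */
                     uint64_t misfetch_cost, uint64_t jump_inst_cost) {
    uint64_t cost = 0;
    for (int i = 0; i < n_edges; i++) {
        /* fall-through iff dst is placed immediately after src */
        if (layout_pos[edges[i].dst] != layout_pos[edges[i].src] + 1)
            cost += (misfetch_cost + jump_inst_cost) * edges[i].freq;
    }
    return cost;
}
```

Minimizing this sum over candidate orderings is exactly the "maximize fall-through" objective the chain builder pursues greedily.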

Pass Object Layout

The pass object at a1 is populated during runOnMachineFunction:

Offset  Type              Content
+488    ptr               Loop chain working data (cleared by sub_35142F0)
+520    MachineFunction*  Current function being processed
+528    ptr               MachineBlockFrequencyInfo* (adjusted +169 from raw analysis pointer)
+536    ptr               MachineBranchProbabilityInfo* (40-byte struct at +200)
+544    ptr               MachinePostDominatorTree* (+200)
+552    u64               Working state (cleared to 0)
+560    ptr               TargetInstrInfo* (nullptr if default vtable)
+568    ptr               TargetRegisterInfo* (nullptr if default vtable)
+576    ptr               TailDuplicator* (from unk_50209DC analysis, +200)
+584    ptr               MachineLoopInfo*
+592    ptr               TargetPassConfig*
+600    inline            Chain-builder state (initialized by sub_2FD5DC0)
+776    u64               Profile-derived hot threshold
+784    i32               Tail-dup threshold (2 or 4)
+788    bool              Profile count was explicitly provided
+792    ptr               Bump allocator base (for chain node allocation)
+800    u64               Bump allocator capacity
+872    u64               Bump allocator total allocation counter
+888    struct            Chain-map (BB-to-chain DenseMap, queried via sub_3515040)

Chain nodes are 64 bytes each, allocated from the bump allocator:

struct ChainNode {          // 64 bytes
    MachineBasicBlock** bb_array;   // +0:  pointer to BB array (initially +16)
    uint32_t count;                 // +8:  number of BBs in chain
    uint32_t capacity;              // +12: capacity (initial: 1)
    MachineBasicBlock* inline_bb;   // +16: inline storage for single-BB chain
    uint8_t  padding[24];           // +24: space for up to 3 more inline BBs
    void*    chain_map;             // +48: pointer to parent chain-map
    uint64_t flags;                 // +56: chain flags
};
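Assuming the growth behavior described for sub_C7D6A0, the inline-to-heap transition can be sketched like this (field layout is simplified, and the doubling policy plus all names are illustrative, not lifted from the binary):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative sketch of a chain node whose BB array starts in inline
 * storage and moves to the heap once it outgrows the four inline slots.
 * A MachineBasicBlock* is modeled as void*. */
typedef struct ChainNode {
    void   **bb_array;      /* points at inline_bbs until a grow occurs */
    uint32_t count;
    uint32_t capacity;
    void    *inline_bbs[4]; /* +16..+47: inline storage for up to 4 BBs */
} ChainNode;

void chain_init(ChainNode *c, void *first_bb) {
    c->bb_array = c->inline_bbs;
    c->count = 1;
    c->capacity = 1;
    c->inline_bbs[0] = first_bb;
}

void chain_append(ChainNode *c, void *bb) {
    if (c->count == c->capacity) {
        uint32_t new_cap = c->capacity * 2;
        if (new_cap <= 4 && c->bb_array == c->inline_bbs) {
            c->capacity = new_cap;          /* still fits inline */
        } else {
            /* grow to (or within) the heap, copying existing entries */
            void **heap = malloc(new_cap * sizeof(void *));
            memcpy(heap, c->bb_array, c->count * sizeof(void *));
            if (c->bb_array != c->inline_bbs) free(c->bb_array);
            c->bb_array = heap;
            c->capacity = new_cap;
        }
    }
    c->bb_array[c->count++] = bb;
}
```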

Algorithm Overview

The entry point sub_3521FF0 dispatches to one of two layout algorithms: the standard chain-based placement, or the ext-TSP layout when explicitly enabled. The overall flow:

runOnMachineFunction(MF):
    if MF.empty(): return 0

    // Fetch analyses
    MBFI  = getAnalysis<MachineBlockFrequencyInfo>()
    MBPI  = getAnalysis<MachineBranchProbabilityInfo>()
    MPDT  = getAnalysis<MachinePostDominatorTree>()
    MLI   = getAnalysis<MachineLoopInfo>()
    TPC   = getAnalysis<TargetPassConfig>()
    TII   = MF.getSubtarget().getInstrInfo()
    TRI   = MF.getSubtarget().getRegisterInfo()

    // Compute tail-dup threshold
    threshold = computeTailDupThreshold(optLevel, TII)

    // Decide layout algorithm
    if enable-ext-tsp-block-placement AND MF.size() fits:
        applyExtTsp(MF)
    else:
        buildChains(MF)           // sub_3521900
        tailDupPlacement(MF)      // sub_35185B0 (if enabled + not divergent)
        tryAlternativeLayout(MF)  // sub_34BEDF0 + sub_34C7080 (NVIDIA addition)

    // Post-placement
    optimizeBranches()            // flip branches for fall-through
    alignBlocks()                 // sub_3516980
    cleanup()
    return 1

Chain-Based Placement (Standard Path)

sub_3521900 (buildChains) is the workhorse. It operates in four steps.

Step 1 -- Initial Chain Construction

For every BB in the MachineFunction (iterated via the doubly-linked intrusive list from MF+328 to sentinel MF+320), the builder:

  1. Allocates a 64-byte chain node from the bump allocator at pass+792. The node is initialized with count=1, capacity=1, the inline BB pointer set to the current BB, and the chain-map pointer set to pass+888.
  2. Inserts the BB-to-chain mapping into the chain-map via sub_3515040 (DenseMap insert with pointer hash ((ptr >> 9) ^ (ptr >> 4)) & (bucket_count - 1)).
  3. Attempts to extend the chain forward: calls TII->analyzeBranch() (vtable+344) on the current BB. If analyzable and a fall-through successor exists, calls sub_2E32580 to verify the successor is valid for chaining (not already claimed by a different chain, not a landing pad, not the function entry if it would create a cycle). If valid, the successor is appended to the chain's BB array (growing from inline storage to heap allocation via sub_C7D6A0 when needed), and the walk continues from the successor.

The result is a set of maximal fall-through chains -- each chain represents a sequence of BBs where every transition is a fall-through edge according to analyzeBranch.
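The Step 1 walk can be modeled on a toy CFG where each block reports at most one analyzable fall-through successor. This is a deliberately simplified sketch (the real pass also rejects landing pads, the entry block, and already-claimed successors via sub_2E32580; names are invented):

```c
/* Greedily partition blocks into maximal fall-through chains.
 * fallthru[b] is block b's fall-through successor, or -1 if none.
 * chain_of[b] receives the chain id each block ends up in. */
int build_fallthrough_chains(const int *fallthru, int n,
                             int *chain_of /* out: block -> chain id */) {
    for (int i = 0; i < n; i++) chain_of[i] = -1;
    int n_chains = 0;
    for (int b = 0; b < n; b++) {
        if (chain_of[b] != -1) continue;       /* already claimed */
        int id = n_chains++;
        /* extend forward until no successor or successor claimed */
        for (int cur = b; cur != -1 && chain_of[cur] == -1;
             cur = fallthru[cur])
            chain_of[cur] = id;
    }
    return n_chains;
}
```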

Step 2 -- Loop Chain Merging

Read the MachineLoopInfo structure at pass+584. Iterate loops from innermost outward. For each loop, call sub_351EBB0 (buildLoopChains), which:

  1. Identifies all chains that contain BBs belonging to the loop.
  2. Merges these chains into a single loop chain, ordering them to maximize fall-through within the loop body.
  3. Applies loop rotation via sub_351C710 (rotateLoop) to place the exiting block at the bottom, making the back-edge a fall-through and the exit a taken branch (or the reverse, whichever minimizes cost according to the profile data).

Cold blocks within the loop (where loop_freq / block_freq > loop-to-cold-block-ratio) are ejected from the loop chain and will be placed at the function's end during the commit step.

Step 3 -- Global Successor Ordering

Call sub_35157A0 (selectBestSuccessor) for each BB to find the globally best successor chain ordering. The selection considers:

  • Edge probability from sub_2E441D0 (getEdgeProbability)
  • Whether the successor is already the fall-through (free) or would require a taken branch (cost = misfetch-cost + jump-inst-cost)
  • Whether chaining the successor would break an existing profitable chain connection

Then sub_351D700 (buildChainForBlock) performs a greedy walk from the function entry, building the top-level chain by repeatedly selecting the best unchained successor and appending it.

Step 4 -- Commit

Walk the final chain's BB array and splice each BB into position using intrusive-list pointer swaps on the MachineFunction's BB list (pointer updates at BB+0 and BB+8 -- the prev/next pointers of the doubly-linked list).
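The splice itself is ordinary intrusive-list surgery. A minimal sketch, with prev/next standing in for the pointers at BB+0 and BB+8 (the helper name is invented):

```c
/* Intrusive doubly-linked list node, modeling the prev/next pointers
 * at BB+0 and BB+8. The MachineFunction's BB list is a ring with a
 * sentinel, so no null checks are needed. */
typedef struct Node {
    struct Node *prev, *next;
} Node;

/* Move `bb` so it sits immediately before `pos` (assumes bb != pos). */
void list_move_before(Node *bb, Node *pos) {
    /* unlink bb from its current position */
    bb->prev->next = bb->next;
    bb->next->prev = bb->prev;
    /* relink bb before pos */
    bb->prev = pos->prev;
    bb->next = pos;
    pos->prev->next = bb;
    pos->prev = bb;
}
```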

Ext-TSP Layout (Optional Path)

When enable-ext-tsp-block-placement is true (default: false), the pass uses the Extended Travelling Salesman Problem formulation from LLVM's CodeLayout.h. This is a profile-guided model that explicitly optimizes I-cache utilization by penalizing backward references and rewarding fall-through edges.

The ext-TSP path builds a BB index hash-map using LLVM's DenseMap pattern (hash: (ptr >> 9) ^ (ptr >> 4), 75% load factor), computes block frequencies and edge weights, then runs three solver functions:

Function     Role
sub_29BAF70  calcExtTspScore() -- score the original layout
sub_29BAC40  calcExtTspScore() -- score the alternative layout
sub_29BB2B0  computeExtTspLayout() -- reorder chains by ext-TSP objective

The pass compares original vs. reordered cost and commits the better ordering via sub_3519A10 (applyBlockOrder). Additional ext-TSP tuning knobs (registered in ctor_492 at 0x5545a0):

Knob                                          Description
ext-tsp-forward-weight-cond / uncond          Weight for conditional/unconditional forward jumps
ext-tsp-backward-weight-cond / uncond         Weight for conditional/unconditional backward jumps
ext-tsp-fallthrough-weight-cond / uncond      Weight for fall-through edges
ext-tsp-forward-distance / backward-distance  Distance thresholds for cache modeling
ext-tsp-max-chain-size                        Maximum chain size for ext-TSP merging
ext-tsp-chain-split-threshold                 Threshold for splitting chains
ext-tsp-max-merge-density-ratio               Density ratio cap for chain merges
ext-tsp-apply-without-profile                 Run ext-TSP even without PGO data
cdsort-cache-entries / cache-size             CDSort cache model parameters
cdsort-max-chain-size                         CDSort chain size limit
cdsort-distance-power / frequency-scale       CDSort cost model tuning
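The shape of the ext-TSP objective can be illustrated per edge: a fall-through earns full credit, a nearby forward or backward jump earns partial credit, and a distant jump earns none. The weights and distances below are illustrative stand-ins for the knobs above, not values extracted from the binary, and the real solver additionally scales credit by distance:

```c
#include <stdint.h>

/* Hedged sketch of the per-edge ext-TSP score contribution. */
typedef struct {
    double fallthrough, forward, backward;  /* per-kind weights */
    long   fwd_dist, bwd_dist;              /* locality thresholds */
} ExtTspWeights;

double ext_tsp_edge_score(long src_addr, long src_size, long dst_addr,
                          uint64_t count, const ExtTspWeights *w) {
    long branch_end = src_addr + src_size;  /* address after the branch */
    if (dst_addr == branch_end)
        return w->fallthrough * (double)count;
    if (dst_addr > branch_end && dst_addr - branch_end < w->fwd_dist)
        return w->forward * (double)count;
    if (dst_addr < branch_end && branch_end - dst_addr < w->bwd_dist)
        return w->backward * (double)count;
    return 0.0;   /* too far: no I-cache locality credit */
}
```

The solver then searches for a chain ordering maximizing the sum of these contributions, which is why maximizing the score simultaneously maximizes fall-throughs and keeps hot jump targets close.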

NVIDIA-Specific Modifications

Divergence-Gated Tail Duplication

The most significant GPU-specific behavior is the divergence check before tail duplication. Before the tailDupPlacement step in the algorithm overview above, the pass reads MF+8+688 bit 0 -- a flag set by earlier divergence analysis passes indicating the function contains warp-divergent branches. When this bit is set, sub_35185B0 (tailDupPlacement) is skipped entirely.

The rationale: tail duplication creates an additional copy of a basic block to convert a diamond-shaped CFG into a straight-line fall-through. On CPU, this eliminates a taken branch on the hot path. On GPU with divergent branches, both sides of the diamond execute regardless (the warp mask simply toggles), so duplicating the tail block doubles code size for zero fall-through benefit. The divergence flag is a conservative gate -- it disables tail-dup for the entire function, not per-branch.
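The gate reduces to a two-input predicate (the function name below is invented; the flag word is the one read from MF+8+688):

```c
#include <stdint.h>
#include <stdbool.h>

/* Tail duplication runs only when the tail-dup-placement knob is on
 * and bit 0 of the function's divergence flag word is clear. */
bool tail_dup_enabled(uint64_t divergence_flags, bool tail_dup_knob) {
    bool function_divergent = (divergence_flags & 1) != 0;
    return tail_dup_knob && !function_divergent;
}
```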

Alternative Layout Proposal Algorithm

When the standard chain-based path is selected (not ext-TSP), and the function has more than 3 basic blocks with profile data and is not marked divergent, the pass runs a complete alternative layout evaluation through a pipeline absent from upstream LLVM. This is one of cicc's most significant code-layout additions.

Activation Gate

if (byte_503C568 is set AND MF.size() > 3):
    evaluator = sub_34BEDF0(state, profile_flag, MBFI, TII, MBPI)
    changed   = sub_34C7080(evaluator, MF, chain_data, ...)
    if changed:
        commit(evaluator_layout)

The gate variable byte_503C568 corresponds to the branch-fold-placement knob (default true). When branch-fold-placement is active and the function has enough basic blocks to justify the extra analysis cost, the alternative path fires.

State Object Initialization -- sub_34BEDF0 (321 bytes)

sub_34BEDF0 is a constructor that initializes a 0x100-byte evaluator state object. It takes six arguments: (rdi=state, rsi=profile_available, rdx=?, rcx=MBFI*, r8=TII*, r9=MBPI*). The initialization zeroes the majority of the structure and sets up internal storage pointers:

struct LayoutEvaluatorState {      // 0x100 bytes, initialized by sub_34BEDF0
    void*    bb_array_ptr;         // +0x00: BB ordering array (initially null)
    uint64_t bb_array_size;        // +0x08: count
    uint64_t bb_array_cap;         // +0x10: capacity
    uint64_t iteration_count;      // +0x18: cleared to 0
    void*    inline_storage_ptr;   // +0x20: points to +0x38 (inline array)
    uint64_t initial_capacity;     // +0x28: set to 2
    uint32_t current_count;        // +0x30: set to 0
    uint8_t  is_fresh;             // +0x34: set to 1 (first-run flag)
    uint8_t  padding[3];           // +0x35
    uint8_t  inline_array[72];     // +0x38: inline storage for small chains
    uint8_t  profile_available;    // +0x80: bit 0 = profile flag
    uint8_t  force_mode;           // +0x81: set from qword_503AD08 knob
    uint8_t  divergence_aware;     // +0x82: set from dl argument
    uint8_t  needs_reconverge;     // +0x83: cleared to 0, set during evaluation
    uint32_t bb_limit;             // +0x84: from stack argument (BB count cap)
    // +0x88..+0xA8: five qword slots, all zeroed
    void*    bb_ptr_array;         // +0xB0: points to +0xC8 (inline)
    uint64_t bb_ptr_array_pad;     // +0xB8: cleared
    uint64_t bb_ptr_array_cap;     // +0xC0: set to 8
    uint8_t  bb_ptr_inline[24];    // +0xC8: inline BB pointer storage
    uint64_t total_cost;           // +0xD8: cleared
    uint32_t cost_flags;           // +0xE0: cleared
    void*    mbfi_ptr;             // +0xE8: MachineBlockFrequencyInfo*
    void*    tii_ptr;              // +0xF0: TargetInstrInfo*
    void*    mbpi_ptr;             // +0xF8: MachineBranchProbabilityInfo*
};

The force_mode field at offset +0x81 is set based on the global qword_503AD08. When this global equals 0, the force mode takes the profile_available argument. When it equals 1, force mode is unconditionally set to 1 (always evaluate). Any other value causes a straight return (skip evaluation). This provides a three-way override: 0=auto, 1=always, other=never.
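In code form, the three-way override behaves like this (function and constant names are invented for readability):

```c
#include <stdint.h>
#include <stdbool.h>

/* Three-way force_mode resolution as described for qword_503AD08:
 * 0 = follow the profile flag, 1 = always evaluate, anything else =
 * skip evaluation entirely. */
enum { EVAL_SKIP = -1 };

int resolve_force_mode(uint64_t knob, bool profile_available) {
    if (knob == 0) return profile_available ? 1 : 0;  /* auto */
    if (knob == 1) return 1;                          /* always */
    return EVAL_SKIP;                                 /* never */
}
```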

Dispatch Wrapper -- sub_34C7080 (17 bytes)

sub_34C7080 is a thin guard:

// sub_34C7080(rdi=evaluator, rsi=MF, rdx=chain_data, rcx=..., r8=..., r9=changed_flag)
if (rdx == NULL) return 0;     // no chain data -> nothing to evaluate
return sub_34C6AF0(rdi, rsi, rdx, rcx, r8, (bool)r9);

The NULL check on rdx (the chain-data pointer) provides a fast exit when the chain builder produced no intermediate state worth re-evaluating.

Core Layout Evaluator -- sub_34C6AF0 (1419 bytes)

sub_34C6AF0 is the real body of the alternative layout evaluator. It operates on the evaluator state object (from sub_34BEDF0) and the MachineFunction, performing a complete re-evaluation of the chain-based layout against a different cost model. The algorithm proceeds in six steps:

Step 1 -- Iteration counter and hash table reset. Increment the iteration count at state+0x18. If the hash table at state+0x20 is not fresh (byte at state+0x34 is 0), compute a minimum table size as max(32, 4 * (capacity - count)), and if the current table is undersized, fill it with 0xFF sentinels via memset. This hash table tracks which BBs have been visited during the current evaluation pass.

Step 2 -- State initialization from MachineFunction. Clear the running cost accumulator at state+0x2C..+0x30. Read the first BB from the MachineFunction's chain data. Store the chain data pointer, the iteration limit from state+0x84, and the analysis pointers (MBPI at state+0x98, TII at state+0xA0) into the evaluator's working slots.

Read the subtarget field at offset +0x220 (via MF->getSubtarget()) and subtract 0x2A (decimal 42). This produces an SM-generation index (sm_70=0, sm_75=1, sm_80=2, ..., sm_90=6, sm_100=16 under this encoding). The index selects which cost-table row is used for the fetch-penalty model.

Step 3 -- Divergence-aware block scanning. For the first BB in the chain, check bit 2 of the flags at BB[0]+0x158. If set, dispatch to TII->vtable+0x210 (which is compared against sub_2FF52D0 -- the default stub). If the target overrides this vtable slot, call the override with the MachineFunction to determine if the block needs special handling. When the default is in use, set state+0x83 (needs_reconverge) to 1 unconditionally. This appears to be an NVPTX check for whether the block is in a reconvergence region where layout ordering has correctness implications, not just performance.

Step 4 -- Main evaluation loop. Call sub_34BA1B0 to snapshot the current chain state into a temporary structure on the stack. Then enter the main loop:

while (true):
    status = sub_34C4890(state, MF)   // advance to next BB in evaluation order
    changed_bit = (state->profile_available XOR 1) OR status
    if changed_bit == 0:
        // Reached the evaluation boundary without changes
        if (sm_index <= 1):           // sm_70 or sm_75
            check qword_503AA68 knob  // additional gate for older archs
            if set: call sub_34C0690(state, loop) for each loop in MF
        if state->divergence_aware:
            call sub_34C56D0(state, loop) for each loop in MF
        break if no further changes

    // A change was proposed
    call sub_34C2D70(state, MF)       // apply the proposed reordering step
    accumulate changed flags

    if (sm_index <= 1):               // sm_70/sm_75
        check qword_503AA68 knob
        if set: call sub_34C0690 for each loop
    if state->divergence_aware:
        call sub_34C56D0 for each loop

    if changed_this_iteration:
        continue loop

sub_34C4890 advances through the MachineFunction's basic blocks in frequency-priority order, proposing a reordering when a higher-frequency successor is not the current fall-through. sub_34C2D70 performs the actual chain manipulation to implement the proposed swap.

Step 5 -- Loop-level re-evaluation. The calls to sub_34C56D0 (5137 bytes, called from sub_34C6AF0 via the loop-iteration path at 0x34C6E90) perform loop-level cost re-evaluation. This function:

  • Walks the MachineFunction's loop tree (from MF+0x148, the MachineLoopInfo block list)
  • For each loop, evaluates whether the proposed layout improves or degrades the loop body's fall-through density
  • Calls sub_34C0EE0 for block-level cost queries
  • Calls sub_34BE7F0 for chain adjacency analysis
  • Queries sub_2E88AF0 (divergence analysis) and sub_2E88FE0 for convergence properties
  • Uses sub_2FDC710/sub_2FDC700 for target-specific cost overrides via the TII vtable
  • Calls sub_3509790 for reconvergence point identification

sub_34C0690 (called on the sm_70/sm_75 path gated by qword_503AA68) is a lighter variant that omits the divergence-aware sub-evaluations, appropriate for older SM architectures where divergence reconvergence is handled differently.

Step 6 -- Final cost comparison and bitvector scan. After the evaluation loop terminates, build a bitvector tracking which BBs changed position. The bitvector uses 64-bit words with word index = bb_index >> 6 and bit position = bb_index & 63. Walk the MachineFunction's loop tree blocks (MF+0x148 linked list):

  • For each block in the loop, walk the instruction list starting at BB+0x20
  • For each instruction, mask the opcode with 0xFFFFFF and compute opcode * 5 as a stride
  • If the instruction byte at offset 0 is 0x08 (a branch instruction), set the corresponding bit in the bitvector

Then scan the bitvector against the evaluator's proposed ordering to detect any BB that would need to move. If at least one BB is displaced, set the return flag.
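The word/bit indexing used by this bitvector is the standard 64-bit scheme (word = idx >> 6, bit = idx & 63); a minimal sketch with invented helper names:

```c
#include <stdint.h>
#include <stdbool.h>

/* 64-bit-word bitvector as in Step 6: the caller supplies a zeroed
 * array of ceil(n_bits / 64) words. */
void bv_set(uint64_t *words, unsigned idx) {
    words[idx >> 6] |= (uint64_t)1 << (idx & 63);
}

bool bv_test(const uint64_t *words, unsigned idx) {
    return (words[idx >> 6] >> (idx & 63)) & 1;
}
```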

On the final cost-comparison path (at 0x34C6FD3), the evaluator reads TII->vtable+0x5D8 and compares against sub_2FDC810. If the target overrides this slot, the override is called to provide a final accept/reject decision. Otherwise, a default threshold of 3 is used: the proposed layout is accepted only if the cost reduction exceeds the acceptance threshold. The stat-based knobs at dword_503AAC8 and qword_503AB48 provide tuning for the threshold lookup via the sub_C52410/sub_C959E0 statistics infrastructure.

Shared Infrastructure with Register Allocation

A surprising discovery: sub_34BEDF0 and sub_34C7080/sub_34C6AF0 are also called from sub_34ED530 (RegAllocGreedy, 91KB) via sub_34F1190. The register allocator uses the same layout evaluator to assess whether a spill-induced block split would degrade code layout quality. This sharing means the cost model is consistent between register allocation decisions and post-RA block placement, preventing the two passes from working at cross purposes. The evaluator state is separate per invocation (stack-allocated), so there is no state leakage between the two callers.

SM-Generation-Dependent Behavior

The SM index computation ((MF->getSubtarget()+0x220) - 0x2A) creates generation-dependent behavior:

SM Generation     Index  Loop Evaluator                Divergence Sub-Eval
sm_70 (Volta)     0      sub_34C0690 if qword_503AA68  Only if divergence flag
sm_75 (Turing)    1      sub_34C0690 if qword_503AA68  Only if divergence flag
sm_80+ (Ampere+)  2+     Skipped (only sub_34C56D0)    Always if divergence flag

This split reflects the hardware difference: Volta and Turing use a stack-based reconvergence mechanism that benefits from the lighter sub_34C0690 analysis, while Ampere and later use the uniform warp scheduler where the more thorough sub_34C56D0 evaluation is worthwhile.

Dual Pass Registration

The binary contains two complete instances of MachineBlockPlacement:

Instance               Registration            Purpose
sub_350FE30 (NVPTX)    NVPTX backend pipeline  GPU-specific analysis results, divergence-aware
sub_1DE8060 (generic)  Default LLVM pipeline   Standard pass for any non-GPU path

Having a separate NVPTX instance allows NVIDIA to control pass ordering independently. The NVPTX version is inserted at a specific point in the backend pipeline where divergence analysis results are available.

Target Tail-Dup Threshold Override

The tail-dup threshold (how many instructions a tail block can have before duplication is rejected) is determined by a multi-level decision:

default_threshold = 2                               // tail-dup-placement-threshold
aggressive_threshold = 4                            // tail-dup-placement-aggressive-threshold

if TII->getTailDupThreshold(optLevel) overrides:    // vtable+1488
    threshold = TII_override                        // NVPTX can take full control
elif optLevel > 2 (-O3):
    threshold = aggressive_threshold                // 4
else:
    threshold = default_threshold                   // 2

The default stub at sub_2FDC800 returns 2 * ((optLevel > 2) + 1), i.e., 2 at -O2 and 4 at -O3. If NVPTX's TargetInstrInfo overrides this (the pass explicitly checks whether the vtable slot points to sub_2FDC800), the override takes full control. This allows the NVPTX backend to set a different tail-dup aggressiveness based on SM generation or kernel properties.
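The stub's arithmetic is easy to check directly (this mirrors the formula described above; the function name is invented):

```c
/* Default threshold per the described sub_2FDC800 behavior:
 * 2 * ((optLevel > 2) + 1), i.e. 2 below -O3 and 4 at -O3. */
int default_tail_dup_threshold(int opt_level) {
    return 2 * ((opt_level > 2) + 1);
}
```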

Loop Rotation and Header Placement

Loop rotation (sub_351C710, called from buildLoopChains) determines whether the loop header is placed at the top or bottom of the loop chain. The goal is to place the exiting block at the bottom so the back-edge is a fall-through and the exit is a taken branch (or vice versa, whichever is more profitable).

Two rotation strategies exist:

Basic rotation (default): Place the exiting block last. Skip rotation if the header already has a viable fall-through from outside the loop, unless the exit edge frequency exceeds the fall-through frequency. This avoids introducing an unnecessary branch at loop entry.

Profile-guided rotation (precise-rotation-cost): Enumerate all possible rotations, compute fall-through cost for each (missed fall-through from loop entry, missed fall-throughs at exit points, missed back-edge fall-through), and select the rotation with minimum total cost. Controlled by two knobs:

  • precise-rotation-cost (default false): enable profile-guided rotation cost model
  • force-precise-rotation-cost (default false): force it even without good profile data

For GPU kernels where loops are the dominant compute pattern, correct loop rotation determines whether the loop body executes as a straight fall-through sequence or requires a taken back-edge branch every iteration. Since the misfetch-cost is low (default 1), the benefit is modest per iteration but accumulates over millions of iterations typical in GPU compute.
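The enumerate-and-score idea behind precise-rotation-cost can be shown on a toy model that charges the frequency of each edge that fails to become a fall-through. The three cost terms follow the description above, but the adjacency rules are deliberately simplified and all names are invented:

```c
#include <stdint.h>

/* Toy rotation cost model. Rotation r lays out blocks[r..n-1] followed
 * by blocks[0..r-1]; blocks[0] is the header, blocks[n-1] the latch.
 * For any r != 0 the header is placed immediately after the latch, so
 * the back-edge becomes a fall-through at the cost of a taken branch
 * at loop entry. */
typedef struct {
    uint64_t entry_freq;    /* preheader -> header */
    uint64_t backedge_freq; /* latch -> header */
    uint64_t exit_freq;     /* exiting block -> loop exit */
    int      exiting;       /* index of the exiting block in the chain */
} LoopProfile;

int best_rotation(const LoopProfile *lp, int n) {
    uint64_t best_cost = UINT64_MAX;
    int best = 0;
    for (int r = 0; r < n; r++) {
        uint64_t cost = 0;
        if (r != 0)
            cost += lp->entry_freq;    /* entry must jump into the loop */
        else
            cost += lp->backedge_freq; /* back-edge is a taken branch */
        /* last layout slot holds blocks[(r + n - 1) % n] */
        if ((r + n - 1) % n != lp->exiting)
            cost += lp->exit_freq;     /* exit is a taken branch */
        if (cost < best_cost) { best_cost = cost; best = r; }
    }
    return best;
}
```

With a hot back-edge the model rotates the header out of first position (back-edge falls through every iteration); with a hot entry edge it leaves the header on top, matching the basic-rotation heuristic above.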

Hot/Cold Splitting

cicc does not perform function-level hot/cold splitting. This is expected: GPU kernels are designed for all threads in a warp to execute the same path. There is no equivalent of a CPU "cold" exception handler that should be placed far from hot code. The loop-to-cold-block-ratio knob (default 5) does enable outlining individual cold blocks from loop chains -- moving them to the end of the function -- but this is intra-function block reordering, not function splitting.

The knob force-loop-cold-block (default false) forces cold block outlining from loops regardless of the frequency ratio. When loop_freq / block_freq > loop-to-cold-block-ratio, the block is moved out of the loop chain to reduce the loop body's I-cache footprint.
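The ejection test itself is a one-liner; written here with a multiply rather than a division to avoid integer truncation (names are illustrative):

```c
#include <stdint.h>
#include <stdbool.h>

/* Cold-block test as described: eject a block from its loop chain when
 * loop_freq / block_freq > loop-to-cold-block-ratio, or unconditionally
 * when force-loop-cold-block is set. */
bool is_cold_loop_block(uint64_t loop_freq, uint64_t block_freq,
                        uint64_t ratio, bool force) {
    if (force) return true;
    return loop_freq > ratio * block_freq;
}
```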

Post-Placement Passes

After layout is committed, two post-processing steps run:

Branch optimization. Walk the final BB ordering. For each analyzable branch with profile info, check whether reversing the branch direction would improve fall-through. Call TII->reverseBranchCondition() (vtable+880) to flip the condition, then update the branch targets via vtable+360/368. This is controlled by sub_2EE6AD0 which checks profitability by comparing edge costs with sub_2E441D0 (getEdgeProbability).

Block alignment (sub_3516980). Walk each BB and set alignment based on block frequency, loop depth, and whether the block is a fall-through target. Controlled by:

  • align-all-blocks (default 0): force log2 alignment on every block
  • align-all-nofallthru-blocks (default 0): force alignment on blocks without fall-through predecessors
  • max-bytes-for-alignment (default 0): cap padding bytes

On GPU, block alignment is generally not useful -- PTX does not expose alignment constraints on basic blocks, and the hardware instruction fetch unit does not benefit from aligned block boundaries the way a CPU I-cache line does.

Configuration Knobs

All knobs are LLVM-standard with stock defaults. The NVIDIA delta is behavioral, not configurational.

Knob                                     Type  Default  Effect
disable-block-placement                  bool  false    Disable the pass entirely
enable-block-placement-stats             bool  false    Collect placement statistics
tail-dup-placement                       bool  true     Enable tail duplication during placement
tail-dup-placement-threshold             int   2        Max instructions for tail-dup candidate
tail-dup-placement-aggressive-threshold  int   4        Aggressive threshold at -O3
tail-dup-placement-penalty               int   2        I-cache pressure penalty (percent)
tail-dup-profile-percent-threshold       int   50       Min hot-count percentage for profile-guided tail-dup
triangle-chain-count                     int   2        Consecutive triangles before triangle heuristic activates
branch-fold-placement                    bool  true     Fold branches during placement
misfetch-cost                            int   1        Taken-branch fetch penalty
jump-inst-cost                           int   1        Cost of a jump instruction
block-placement-exit-block-bias          int   0        Frequency percentage for loop exit replacement
loop-to-cold-block-ratio                 int   5        Ratio threshold for cold block outlining
force-loop-cold-block                    bool  false    Force outlining cold blocks from loops
precise-rotation-cost                    bool  false    Profile-guided loop rotation cost
force-precise-rotation-cost              bool  false    Force precise rotation cost
align-all-blocks                         int   0        Force block alignment (log2)
align-all-nofallthru-blocks              int   0        Force alignment on non-fall-through blocks
max-bytes-for-alignment                  int   0        Max padding for alignment
enable-ext-tsp-block-placement           bool  false    Enable ext-TSP layout algorithm
ext-tsp-block-placement-max-blocks       int   -1       Max BB count for ext-TSP (unlimited)
apply-ext-tsp-for-size                   bool  false    Use ext-TSP for code size optimization
renumber-blocks-before-view              bool  false    Renumber BBs before dot-graph output

DenseMap Implementation Pattern

The pass uses LLVM's DenseMap for BB-to-chain and BB-to-index lookups. The open-addressing hash-map pattern appears 20+ times in the decompiled code:

// Hash function for pointer keys
size_t hash = ((ptr >> 9) ^ (ptr >> 4)) & (bucket_count - 1);

// Probing: linear with increment counter
// Empty sentinel:   0xFFFFFFFFFFFFF000 (-4096)
// Deleted sentinel: 0xFFFFFFFFFFFFE000 (-8192)
// Rehash trigger:   4 * (count + 1) >= 3 * bucket_count  (75% load)
// Rehash function:  sub_2E3E470(map, new_capacity)
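Rendered as executable code, the probe sequence looks like this (helper names are invented; a real DenseMap grows at the 75% load factor, which guarantees the empty slot this loop relies on):

```c
#include <stdint.h>

/* Sentinels matching the decompiled pattern. */
#define EMPTY_KEY   ((uint64_t)-4096)
#define DELETED_KEY ((uint64_t)-8192)

/* Pointer hash masked to a power-of-two bucket count. */
static uint64_t ptr_hash(uint64_t p, uint64_t buckets) {
    return ((p >> 9) ^ (p >> 4)) & (buckets - 1);
}

/* Return the slot holding `key`, or the first slot where it could be
 * inserted (preferring a tombstone). Probing is linear with an
 * incrementing step, as in LLVM's DenseMap. Requires at least one
 * EMPTY slot in the table. */
long probe(const uint64_t *keys, uint64_t buckets, uint64_t key) {
    uint64_t idx = ptr_hash(key, buckets);
    long first_tombstone = -1;
    for (uint64_t step = 1; ; idx = (idx + step++) & (buckets - 1)) {
        if (keys[idx] == key) return (long)idx;           /* found */
        if (keys[idx] == EMPTY_KEY)
            return first_tombstone >= 0 ? first_tombstone : (long)idx;
        if (keys[idx] == DELETED_KEY && first_tombstone < 0)
            first_tombstone = (long)idx;
    }
}
```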

GPU-Specific Placement Considerations

Why This Pass Matters More on GPU Than on CPU

On CPU, MachineBlockPlacement is primarily an I-cache optimization -- placing hot blocks contiguously reduces cache misses. On GPU, the stakes are higher for three reasons:

  1. No branch prediction. GPU SMs do not speculate. Every taken branch is a guaranteed fetch stall. The ratio of taken branches to fall-throughs directly translates to warp scheduler utilization. Optimal block placement can eliminate 10-30% of fetch bubbles in branch-heavy kernels.

  2. Instruction cache is tiny and shared. A single SM partition has 32-128 KB of instruction cache shared across all active warps. Code duplication (tail-dup, loop unrolling) competes with warp occupancy for this shared resource. The tail-dup-placement-penalty (2%) is conservative -- on kernels with high warp counts, even small code size increases can cause I-cache thrashing.

  3. Reconvergence is layout-sensitive. On architectures before Ampere (sm_70, sm_75), the stack-based reconvergence mechanism depends on the post-dominator being reachable from both sides of a divergent branch. Block placement that separates a post-dominator from its divergent predecessors can increase the live warp state, consuming scarce convergence stack entries. The alternative layout evaluator's sub_34C0690 path specifically addresses this by evaluating reconvergence distance.

Structured Control Flow Constraint

Unlike CPU backends where block placement has complete freedom, the NVPTX backend runs StructurizeCFG before MachineBlockPlacement. This means:

  • All irreducible control flow has already been eliminated
  • Structured regions (loops, if-then-else diamonds) are contiguous in the CFG
  • Block placement cannot violate structured region boundaries without re-structurizing

This constraint actually simplifies placement in some cases (fewer valid orderings to consider) but eliminates certain profitable reorderings that would be legal on CPU (e.g., outlining a cold exception handler to a distant location that breaks region contiguity).

Interaction with PTX Emission

The final block ordering directly determines which branches in the PTX output are bra instructions (taken) vs. fall-throughs (implicit). The AsmPrinter (see AsmPrinter) emits bra only for non-fall-through edges. Since ptxas performs its own block scheduling on the PTX input, the cicc block ordering serves as a strong hint rather than a final answer -- but ptxas generally respects the input ordering for blocks within the same structured region.

Function Map

Function              Address      Role
runOnMachineFunction  sub_3521FF0  Entry point, 82 KB
buildChains           sub_3521900  Initial chain construction
tailDupPlacement      sub_35185B0  Tail-dup-aware chain merging
applyBlockOrder       sub_3519A10  Commit final BB ordering to MF
alignBlocks           sub_3516980  Post-placement alignment
buildLoopChains       sub_351EBB0  Loop-aware chain merging
buildChainForBlock    sub_351D700  Greedy successor chain walk
selectBestSuccessor   sub_35157A0  Pick best fall-through successor
chainLookup           sub_3515040  DenseMap BB-to-chain lookup
rotateLoop            sub_351C710  Loop rotation heuristic
mergeTails            sub_351A710  Chain tail merge logic
lowerChain            sub_35161F0  Final lowering of chain to BB list
(helper)              sub_3515CB0  Chain cost model evaluation
(helper)              sub_3515280  Chain building iteration
(helper)              sub_3516000  Chain length query
(NVIDIA addition)     sub_34BEDF0  Layout evaluator state constructor (321 bytes)
(NVIDIA addition)     sub_34C7080  Layout evaluator dispatch wrapper (17 bytes, guards sub_34C6AF0)
(NVIDIA addition)     sub_34C6AF0  Core layout evaluator body (1419 bytes, SM-aware)
(NVIDIA addition)     sub_34C4890  Frequency-priority BB advancement
(NVIDIA addition)     sub_34C2D70  Chain swap application
(NVIDIA addition)     sub_34C56D0  Loop-level cost re-evaluation (5137 bytes, divergence-aware)
(NVIDIA addition)     sub_34C0690  Lightweight loop evaluator (sm_70/sm_75 path)
(NVIDIA addition)     sub_34BA1B0  Chain state snapshot
(NVIDIA addition)     sub_34C0EE0  Block-level cost query
(NVIDIA addition)     sub_34BE7F0  Chain adjacency analysis
(NVPTX)               sub_350FE30  Pass registration
(NVPTX)               sub_350FEE0  Stats pass registration
(generic)             sub_1DE8060  Generic LLVM pass registration
(generic)             sub_1DE8500  Generic LLVM stats registration
cleanup               sub_3511770  Chain-map teardown
cleanup               sub_35142F0  Loop chain data teardown
cleanup               sub_3510940  Bump allocator teardown
calcExtTspScore       sub_29BAF70  Ext-TSP score (original layout)
calcExtTspScore       sub_29BAC40  Ext-TSP score (alternative layout)
computeExtTspLayout   sub_29BB2B0  Ext-TSP chain reordering solver
(helper)              sub_2EE6520  Ext-TSP enable decision
(helper)              sub_2EE6AD0  Branch redirect profitability check
getEdgeProbability    sub_2E441D0  Edge probability query
(default stub)        sub_2FDC800  Default getTailDupThreshold implementation
(default stub)        sub_2FF52D0  Default reconvergence-region query
(default stub)        sub_2FDC810  Default layout-accept threshold query

Differences from Upstream LLVM

Pass instances
  Upstream LLVM: Single MachineBlockPlacement per pipeline
  CICC v13.0:    Two instances: stock LLVM copy + NVPTX-pipeline copy at sub_3521FF0

Divergence awareness
  Upstream LLVM: No divergence concept; layout optimizes for I-cache locality
  CICC v13.0:    Queries warp divergence flag on MachineFunction; divergent branches affect tail duplication profitability

Alternative layout proposal
  Upstream LLVM: Absent; single layout path only
  CICC v13.0:    Additional proposal path (sub_34BEDF0 / sub_34C7080) evaluates alternative orderings with SM-aware cost

Tail duplication threshold
  Upstream LLVM: TailDupPlacementThreshold (default 2)
  CICC v13.0:    GPU-specific threshold via vtable query (sub_2FDC800); controlled by reconvergence-region analysis

Loop cost evaluation
  Upstream LLVM: Frequency-weighted chain cost
  CICC v13.0:    Divergence-aware loop cost re-evaluation (sub_34C56D0, 5137 bytes) considers warp reconvergence overhead

Ext-TSP scoring
  Upstream LLVM: Standard profile-guided layout scoring
  CICC v13.0:    Same Ext-TSP solver but gated by NVPTX-specific enable decision (sub_2EE6520)

Structured CFG constraint
  Upstream LLVM: No structured CFG requirement (targets like x86 have arbitrary CFG)
  CICC v13.0:    Must preserve structured regions from StructurizeCFG; contiguous structured blocks cannot be interleaved

Cross-References

  • StructurizeCFG -- runs before block placement; produces the structured CFG that constrains which block orderings are legal. Structured regions must remain contiguous.
  • BranchFolding -- runs after placement; performs tail merging and branch folding on the committed layout. See sub_2F336B0.
  • Instruction Scheduling -- block ordering affects scheduling windows. Post-placement scheduling operates within the committed layout.
  • Register Allocation -- register pressure is affected by block ordering through live range extent.
  • AsmPrinter -- emits PTX from the final block ordering, generating bra instructions for taken branches and fall-through for sequential blocks.