
JumpThreading

NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.

LLVM version note: Based on LLVM 20.0.0 JumpThreading.cpp. Evidence: DFA JumpThreading variant (dfa-jump-threading) present as a separate pass matches LLVM 14+; early-exit-heuristic knob matches LLVM 16+. Core algorithm is unmodified; NVIDIA changes are configuration-level (adjusted thresholds, three pipeline positions, OCG disable flag).

CICC v13.0 ships LLVM's JumpThreadingPass at sub_2DC4260 (12,932 bytes, address range 0x2DC4260--0x2DC74E4). The pass duplicates basic blocks so that predecessors whose branch conditions can be statically resolved jump directly to the correct successor, eliminating a conditional branch from the critical path. On a GPU, this directly reduces warp divergence: a branch that was previously data-dependent becomes unconditional along each incoming edge, so the warp scheduler never needs to serialize the two paths.

The pass is fundamentally at odds with PTX's requirement for reducible control flow. Block duplication can create multi-entry loops (irreducible cycles) when the duplicated block is a loop header or when the threading target sits inside a loop whose header is not the threading source. CICC addresses this through three layered mechanisms -- loop header protection, conservative duplication thresholds, and a late-pipeline StructurizeCFG safety net -- that collectively keep the CFG reducible without sacrificing the pass's optimization value.

| Property | Value |
|---|---|
| Pass name (pipeline parser) | "jump-threading" |
| Pass class | llvm::JumpThreadingPass |
| Entry function | sub_2DC4260 |
| Binary size | 12,932 bytes |
| Stack frame | 0x748 (1,864) bytes |
| Block duplication helper | sub_2DC22F0 (2,797 bytes) |
| CFG finalization | sub_2DC30A0 (1,094 bytes) |
| Single-instruction threading | sub_2DC37C0 (2,288 bytes) |
| Select unfolding | sub_2DC40B0 (420 bytes) |
| Pipeline positions | Three invocations: ~position 234, ~278, and a late tier-3 position (~239) |
| NVVMPassOptions disable | offset +320 |
| Upstream LLVM source | lib/Transforms/Scalar/JumpThreading.cpp |

Why JumpThreading Matters on GPU

Consider a CUDA kernel containing:

if (threadIdx.x < threshold)
    val = computeA();
else
    val = computeB();

if (val > 0)
    result = pathX(val);
else
    result = pathY(val);

The second branch depends on val, which is a PHI of computeA() and computeB(). If JumpThreading can determine that computeA() always returns a positive value, it duplicates the second if block and wires the computeA predecessor directly to pathX. Threads that took the first branch path never execute the second conditional at all.

On a CPU this saves a branch misprediction. On a GPU the payoff is larger: eliminating the second branch prevents a second point of warp divergence. If both branches would diverge on different thread subsets, removing one cuts the total serialization overhead in half. The threads that took computeA proceed straight to pathX without waiting for the computeB threads to rejoin.

Knob Inventory

Six cl::opt globals control the pass, registered in ctor_456 at 0x544220:

| Knob | Default | Global | Description |
|---|---|---|---|
| jump-threading-threshold | 6 | qword_4FFDBA0 | Max instructions in a block eligible for duplication |
| jump-threading-implication-search-threshold | 3 | qword_4FFDAC0 | Max predecessors to search for condition implications |
| jump-threading-phi-threshold | 76 (0x4C) | qword_4FFD9E0 | Max PHI nodes in a block eligible for duplication |
| jump-threading-across-loop-headers | false | qword_4FFD900 | Allow threading across loop headers (testing only) |
| jump-threading-disable-select-unfolding | false | qword_4FFDC80 | Disable unfolding select instructions into branches |
| print-lvi-after-jump-threading | false | -- | Debug: print LazyValueInfo cache after pass completes |

All three values match upstream LLVM defaults: the block-size threshold of 6, the implication search depth of 3, and the PHI threshold of 76 (76 has been the upstream default since the knob was introduced, so this is not an NVIDIA retune). The implication search depth is deliberately conservative, limiting the compile-time cost of predecessor chain analysis in the typically short basic-block chains of GPU code.

Two Disable Flags

CICC registers two independent cl::opt flags that suppress jump threading behavior. They live in different subsystems and control different things:

| Flag | Registration | Subsystem | Effect |
|---|---|---|---|
| "disable-JumpThreadingPass" | ctor_637 @ 0x5934A7 | JumpThreading pass itself | Disables the standalone JumpThreadingPass invocations in the pipeline |
| "disable-jump-threading" | ctor_073 @ 0x49A91E (also ctor_243 @ 0x4ED0C0) | SimplifyCFG | Disables the jump-threading logic within SimplifyCFG -- the per-block branch-through-PHI threading that SimplifyCFG performs as part of its CFG simplification |

The "disable-jump-threading" flag carries the annotation "Disable jump threading for OCG experiments", where OCG is NVIDIA's Optimizing Code Generation research infrastructure. This is a SimplifyCFG option, not a JumpThreadingPass option -- SimplifyCFG has its own internal implementation of branch threading through PHI nodes that is separate from the standalone pass. NVIDIA engineers can disable either or both independently.

The "fold-with-var-cond" flag is registered alongside "disable-jump-threading" in the same SimplifyCFG constructor group, controlling a related NVIDIA-specific extension for folding branches with variance conditions.

Interaction with StructurizeCFG

The fundamental tension: JumpThreading duplicates blocks to bypass conditionals, which can transform a reducible loop into an irreducible cycle. PTX requires all loops to be natural (single-entry, reducible). An irreducible CFG causes StructurizeCFG to emit "UnsupportedIrreducibleCFG" and bail out, leaving the function in a state that ptxas will likely reject.

CICC addresses this through three layered mechanisms:

1. Loop Header Protection via LoopInfo

The jump-threading-across-loop-headers flag defaults to false. Before threading any block, the pass queries LoopInfo through a red-black tree lookup at 0x2DC4781 using dword_501D5A8 as the analysis key. If the target block is a loop header (the LoopInfo query returns a non-null loop containing the block as its header), the pass skips it entirely.

A parallel DominatorTree query at 0x2DC4839 (using dword_501D4C8) verifies loop membership and nesting depth. If the block is found within a loop, a threshold override is loaded from qword_501D628, replacing the standard duplication threshold with a loop-specific one. A second override from qword_501D548 applies to blocks found via the DominatorTree-based lookup.

This double check -- LoopInfo for header identification, DominatorTree for membership -- prevents the most common source of irreducibility: duplicating a loop header creates a second entry into the loop body.

2. Conservative Duplication Thresholds

The three thresholds (6 instructions, 3 predecessors, 76 PHIs) restrict duplication to small, simple blocks where the CFG outcome is highly predictable and the duplication cost is bounded. The thresholds act as independent, conjunctive gates: a block must satisfy all three limits simultaneously. Even a 6-instruction block with 4 predecessors would exceed the implication search depth and be rejected, while a 5-instruction block with 100 PHIs would exceed the PHI threshold.
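The conjunctive gating can be sketched as a small model (illustrative only; the limit names mirror the knob defaults above, not CICC's actual code):

```python
# Model (assumed, not CICC's code) of the conjunctive threshold gates:
# a candidate block must pass every limit; failing any one rejects it.
LIMITS = {"instrs": 6, "preds_searched": 3, "phis": 76}

def passes_gates(instrs, preds, phis):
    return (instrs <= LIMITS["instrs"]
            and preds <= LIMITS["preds_searched"]
            and phis <= LIMITS["phis"])

print(passes_gates(6, 3, 76))   # True: boundary candidate is accepted
print(passes_gates(6, 4, 1))    # False: predecessor gate fails
print(passes_gates(5, 2, 100))  # False: PHI gate fails
```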

3. StructurizeCFG Safety Net

StructurizeCFG (sub_35CC920) runs late in the pipeline, after all IR-level scalar and loop transforms. Its irreducibility detector (sub_35CA2C0) checks every back-edge: if the target does not dominate the source, the loop has multiple entries and is irreducible. If JumpThreading or any other pass creates an irreducible cycle that slipped past the loop header protection, StructurizeCFG will catch it.
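The back-edge test can be modeled on a toy CFG. This is a sketch under the assumption stated above -- a retreating edge whose target does not dominate its source signals a multi-entry cycle -- and the function names are illustrative, not CICC's:

```python
# Toy model of the irreducibility detector's criterion: check every
# retreating edge; if its target does not dominate its source, the
# cycle has more than one entry and the CFG is irreducible.

def dominators(succ, entry):
    """Classic iterative dominator computation over a small dict CFG."""
    nodes = set(succ)
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    preds = {n: [p for p in nodes if n in succ[p]] for n in nodes}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            if preds[n]:
                new = {n} | set.intersection(*(dom[p] for p in preds[n]))
            else:
                new = {n}
            if new != dom[n]:
                dom[n], changed = new, True
    return dom

def retreating_edges(succ, entry):
    """DFS edges that point back to a node still on the DFS stack."""
    seen, onstack, edges = set(), set(), []
    def dfs(u):
        seen.add(u)
        onstack.add(u)
        for v in succ[u]:
            if v in onstack:
                edges.append((u, v))
            elif v not in seen:
                dfs(v)
        onstack.discard(u)
    dfs(entry)
    return edges

def is_irreducible(succ, entry):
    dom = dominators(succ, entry)
    return any(v not in dom[u] for u, v in retreating_edges(succ, entry))

reducible = {'entry': ['h'], 'h': ['b'], 'b': ['h', 'exit'], 'exit': []}
two_entry = {'entry': ['a', 'b'], 'a': ['b'], 'b': ['a']}
print(is_irreducible(reducible, 'entry'))  # False: natural loop
print(is_irreducible(two_entry, 'entry'))  # True: cycle entered twice
```

The `two_entry` graph is exactly the shape block duplication can produce: a cycle between `a` and `b` that can be entered from `entry` through either node, so neither dominates the other.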

This is defense-in-depth: the threading constraints prevent most irreducible cases, and structurization catches the rest. The design deliberately tolerates a small number of "false acceptances" at the JumpThreading level because the cost of occasionally running StructurizeCFG's rejection path is far lower than the cost of being too conservative and missing profitable threading opportunities.

Cost Model

The pass enforces a multi-level cost model that bounds total code growth per function.

Global Budget

At 0x2DC4887, the pass initializes a global instruction budget:

mov ebx, 200h    ; 512 instructions total budget

Each block duplication charges the duplicated block's instruction count against this budget. The budget is tracked in var_460 and checked before each duplication. Once exhausted, no further threading occurs in that invocation regardless of how profitable individual candidates might be.
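A minimal model of the budget accounting (illustrative; the stop-on-exhaustion behavior follows the description above):

```python
# Illustrative model of the per-invocation code-growth budget.
# Each duplication charges the block's instruction count; once the
# budget cannot cover a candidate, threading stops for this invocation.
BUDGET = 0x200  # 512 instructions

def run_invocation(candidate_sizes, budget=BUDGET):
    threaded = 0
    for size in candidate_sizes:
        if size > budget:   # exhausted: no further threading
            break
        budget -= size
        threaded += 1
    return threaded, budget

print(run_invocation([6, 6, 4]))   # (3, 496): all three candidates fit
print(run_invocation([300, 300]))  # (1, 212): second exceeds the budget
```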

Per-Predecessor Cost Division

When threading involves multiple predecessors, the per-predecessor cost is the block instruction count divided by the number of predecessors being threaded, with ceiling rounding:

cost_per_pred = block_instr_count / num_predecessors
; ceiling via: sbb eax, -1 (adds 1 if remainder was nonzero)

This division at 0x2DC4D78--0x2DC4D8E means a 6-instruction block being threaded for 3 predecessors costs only 2 instructions per predecessor against the global budget. The logic recognizes that multi-predecessor threading amortizes the code growth across more eliminated branches.
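The ceiling division can be written out directly (a sketch; the rounding matches the `sbb`-based idiom in the listing above, which bumps the quotient when the remainder is nonzero):

```python
# Per-predecessor cost: block instruction count divided by the number
# of threaded predecessors, rounded up. This mirrors the assembly's
# divide-then-adjust sequence described in the text.
def cost_per_pred(block_instr_count, num_predecessors):
    q, r = divmod(block_instr_count, num_predecessors)
    return q + (1 if r else 0)

print(cost_per_pred(6, 3))  # 2: the example from the text
print(cost_per_pred(7, 3))  # 3: nonzero remainder rounds up
```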

Special Cases

  • Single-instruction blocks (checked at 0x2DC4D94): Always eligible, regardless of budget. A block containing only a terminator instruction costs nothing to duplicate.
  • Empty blocks (checked at 0x2DC4D70): Skipped entirely.
  • Blocks with <=1 effective instructions (0x2DC4BF1): The comparison cmp edx, 1; jbe gates a fast path where the pass bypasses the full cost analysis.

LazyValueInfo Integration

The pass accepts a LazyValueInfo pointer as its third parameter (rdx). When non-null (checked at 0x2DC42BD), LVI provides range-based condition evaluation that enables threading even when the branch condition is not a simple constant comparison.

LVI State

The LVI cache occupies approximately 600 bytes (0x258) of local state:

| Field | Offset | Purpose |
|---|---|---|
| Cache structure | var_2F0 through var_98 | LVI range cache local state |
| Valid flag | var_C0 | Set to 1 when LVI is initialized |
| Cached ranges | var_B0 | SmallVector-like structure |
| Initial capacity | var_A8 | 8 entries |

Range-Based Threading

For ICMP_NE conditions (opcode 0xBA = 186), the pass calls sub_11F3070 (LVI::getPredicateAt) with the ICmp operand and a comparison predicate of 2, followed by sub_DFABC0 (evaluateConditionOnEdge) to resolve the branch direction along a specific incoming edge.

For alternate opcode paths (opcode 0x165 = 357), the pass uses sub_988330 (getConstantOnEdge) instead, which returns a concrete constant value if LVI can prove the condition evaluates to a known value along that edge.

The virtual dispatch at 0x2DC67D6 (call qword ptr [rax+78h]) invokes LVI::getPredicateOnEdge. If the vtable matches sub_920130 (the default implementation), a fallback path calls sub_AC4810 (isImpliedCondition) with predicate 0x27 (39), and if that also fails, sub_AA93C0 (SimplifyICmpInst).

Cleanup

On exit, if LVI was used, three cleanup calls occur:

  • sub_FFCE90 -- LVI::eraseBlock (invalidation)
  • sub_FFD870 -- LVI::clear
  • sub_FFBC40 -- LVI::releaseMemory

Main Algorithm

Outer Loop

The pass iterates over the function's basic block list via a linked-list traversal (BB->next chain at [BB+8]):

run(result_ptr, function, lvi_ptr, tli, ...):
    if lvi_ptr != null:
        initialize_lvi_cache(lvi_ptr)

    budget = 512
    changed = false

    loop:
        current_bb = function.entry_block    // sub_B2BEC0
        end = function + 0x48               // end sentinel

        while current_bb != end:
            if try_thread_block(current_bb, budget):
                changed = true
            current_bb = current_bb.next     // [current_bb + 8]

        if changed:
            changed = false
            goto loop    // restart: threading may expose new opportunities

    cleanup_lvi()
    return results

The restart-on-change behavior means threading is iterative: eliminating one branch can expose a new statically-determinable branch downstream.

Per-Block Classification

For each basic block, the pass examines the terminator instruction:

  1. Opcode check (0x2DC443E): The instruction opcode byte is compared against 0x55 (85), which is LLVM's BranchInst opcode. Only conditional branches are considered.

  2. Metadata check (0x2DC4449--0x2DC446E): Two calls to sub_A73ED0 check for metadata kinds 0x17 (23, "prof" branch weights) and 0x04 (debug). Then sub_B49560 (hasMetadataOtherThanDebugLoc) is called on the branch instruction.

  3. Condition extraction (0x2DC45F8--0x2DC4636): sub_981210 (getBranchCondition) returns a success flag and a condition code. Two condition codes are handled:

    • 0x165 (357): likely CmpInst::ICMP_EQ or a switch opcode
    • 0x0BA (186): likely CmpInst::ICMP_NE

    Other condition codes cause the block to be skipped.

  4. Operand analysis (0x2DC465F--0x2DC467C): The operand count is extracted (AND with 0x7FFFFFF mask -- the use-count field in LLVM's Value layout). If the branch condition is an ICmp with a constant operand (type byte 0x11 = 17 = ConstantInt), threading is potentially profitable.

Condition-Specific Threading Paths

The pass contains four specialized threading strategies:

Constant-value threading (0x2DC66B7): When a predecessor can determine the branch outcome via a constant PHI incoming value, the simplest path. Creates a direct unconditional branch.

Single-instruction threading (sub_2DC37C0, 2,288 bytes): For blocks containing exactly one instruction (the terminator), called at 0x2DC6704. Creates a direct branch bypass.

Switch threading (0x2DC6A76--0x2DC6B0C): When the terminator is a SwitchInst (opcode byte 0x37 = 55), calls sub_2DC40B0 (tryToUnfoldSelect). This checks for SelectInst (opcode 0x52 = 82) and unfolds the select into explicit branches that can be individually threaded.

Implication-based threading (0x2DC6E71--0x2DC6EB3): For ICmpInst variants (opcode 0x28 = 40), the pass checks whether the predicate implies the branch condition via sub_B532B0, creates the threaded edge via sub_B52EF0, and wires the new block via sub_92B530.

All-Ones Constant Detection

Four sites (0x2DC71B0, 0x2DC71CA, 0x2DC7380, 0x2DC74DA) check for all-ones constants as PHI incoming values:

or rax, -1          ; create all-ones mask
shr rax, cl         ; cl = 64 - bitwidth, shift to match width
cmp [rdx+18h], rax  ; compare against actual constant value
setz al             ; true if constant is all-ones

For an i1 type, all-ones means true. This handles the common pattern where a PHI incoming value from one predecessor is the constant true (all bits set), allowing the pass to resolve the branch direction for that predecessor.
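The mask construction can be mirrored in a few lines (an illustrative model of the listing above, not the binary's code):

```python
# `or rax, -1` sets all 64 bits; `shr rax, cl` with cl = 64 - bitwidth
# leaves exactly the low `bitwidth` bits set -- the value of a
# ConstantInt with all bits set at that width (e.g. i1 true, i8 -1).
def all_ones_for_width(bitwidth):
    assert 1 <= bitwidth <= 64   # the pass bails out on wider integers
    return 0xFFFFFFFFFFFFFFFF >> (64 - bitwidth)

def is_all_ones(value, bitwidth):
    return value == all_ones_for_width(bitwidth)

print(all_ones_for_width(1))       # 1: i1 true
print(all_ones_for_width(8))       # 255: i8 with all bits set
print(is_all_ones(-1 & 0xFF, 8))   # True: i8 -1 is all-ones
```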

PHI Operand Iteration

Two nearly identical loops at 0x2DC7206--0x2DC726E and 0x2DC7456--0x2DC74CD iterate PHI operands to determine if all incoming values from relevant predecessors resolve to the same constant:

for pred_idx in range(phi.num_operands):    // var_668
    incoming = phi.getIncomingValueForBlock(pred)  // sub_AD69F0
    type_tag = incoming.type_byte

    if type_tag == 0x0D:     // ConstantInt::getTrue()
        continue
    if type_tag == 0x11:     // ConstantInt with bitwidth check
        if bitwidth <= 64:
            if value == all_ones_for_width:
                continue     // resolves to true
        else:
            skip             // wide integers, bail out

    // If any incoming value is non-constant, threading is unprofitable
    bail_out()

If every relevant predecessor provides the same constant value, the branch direction is fully determined and threading proceeds.

Created Block Names

When threading occurs, the pass creates new basic blocks with diagnostic names:

| Name | String address | Purpose |
|---|---|---|
| "endblock" | 0x42E9094 | Terminal block of the threaded path; created via sub_F36990 (SplitBlockAndInsertIfThen) |
| "phi.res" | 0x42E90C0 | PHI resolution node for merged values; created via sub_D5C860 (PHINode::Create) |
| "res_block" | 0x42E909D | Result block for the threaded path; allocated as a 0x50-byte BasicBlock via sub_22077B0 |
| "loadbb" | 0x42E90B9 | Load basic block for load-bearing threading; created in a loop at 0x2DC4F05--0x2DC4FFB |
| "phi.src1" | 0x42E90A7 | First PHI source block |
| "phi.src2" | 0x42E90B0 | Second PHI source block |

The "loadbb" blocks are created in a dynamic loop for multi-way threading, where each iteration allocates a 0x50-byte (sizeof(BasicBlock)) object and wires it into the CFG via sub_AA4D50 (BasicBlock::insertInto).

Block Duplication Engine: sub_2DC22F0

The 2,797-byte helper performs actual block cloning. Parameters:

| Register | Role |
|---|---|
| rdi | Duplication context structure (at var_490) |
| rsi | Source block's value table |
| rdx | Destination hash table |
| rcx | PHI operand map |
| r8d | Instruction count for the source block |

The cloning process:

  1. Clone each instruction from the source block
  2. Insert cloned instructions into use-def chains (0x2DC59A1--0x2DC59E7: linked-list surgery on LLVM's Value use-list)
  3. Update PHI operands to reference the new predecessor (0x2DC5E1E onward)
  4. Update branch targets in the predecessor blocks

CFG Finalization: sub_2DC30A0

The 1,094-byte helper, called at 0x2DC5015 and 0x2DC6408 after threading completes for a block, performs:

  • Successor edge updates
  • Dead block elimination for blocks made unreachable by the threading
  • DominatorTree updates if available (via sub_FFB3D0, DominatorTree::changeImmediateDominator)

Pipeline Positions

JumpThreading appears three times in the CICC pipeline, at different stages with different surrounding context:

| Position | Pipeline context | Parameter | Purpose |
|---|---|---|---|
| ~234 | After ADCE, within the main function simplification loop | sub_198DF00(-1) | First opportunity: thread branches exposed by dead code elimination |
| ~278 | After NVVMPeephole2 and optionally GVN, in the NVIDIA-specific tier-2 sequence | sub_198DF00(-1) | Second opportunity: thread branches exposed by value numbering and peephole |
| Late tier-3 | Within the ADCE/MemCpyOpt/DSE sequence | sub_198DF00(t) | Final opportunity: catch any remaining threadable branches before StructurizeCFG |

The sub_198DF00 function is the combined CorrelatedValuePropagation/JumpThreading registration wrapper. The -1 parameter likely selects the default mode; the t parameter in the third position may be an optimization-level-dependent configuration.

All three positions are conditional on NVVMPassOptions offset +320 not being set to disable. Each invocation resets the 512-instruction global budget, so the total code growth across all three invocations can reach up to 1,536 instructions per function.

DFA JumpThreading

A separate DFA-based JumpThreading variant exists at sub_276AF50, registered as "dfa-jump-threading" (llvm::DFAJumpThreadingPass). This pass is controlled by:

| Knob | Registration | Description |
|---|---|---|
| enable-dfa-jump-thread | ctor_445 @ 0x53F5C0 | Enable/disable the DFA variant |
| dfa-jump-view-cfg-before | ctor_445 | Debug: dump CFG before DFA threading |
| dfa-early-exit-heuristic | ctor_445 | Early-exit heuristic for compile time |

DFA JumpThreading handles state-machine patterns (switch statements in loops with predictable transitions between cases) that the standard JumpThreading cannot resolve. It is a separate pass with its own pipeline registration and does not share the budget or thresholds of the standard JumpThreading pass.
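The pattern DFA threading targets can be sketched as follows. Python stands in for the C switch-in-a-loop shape; this illustrates the pattern the pass recognizes, not the pass's implementation:

```python
# A loop whose dispatch variable follows compile-time-constant
# transitions: each "case" always sets the next state to a known
# value, so the dispatch can be threaded into direct jumps between
# the case bodies instead of re-testing `state` every iteration.
def run_machine(steps):
    state, trace = 0, []
    for _ in range(steps):
        if state == 0:      # case 0: always transitions to 1
            trace.append('a'); state = 1
        elif state == 1:    # case 1: always transitions to 2
            trace.append('b'); state = 2
        else:               # case 2: always transitions to 0
            trace.append('c'); state = 0
    return ''.join(trace)

print(run_machine(6))  # 'abcabc': transitions are fully predictable
```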

Before/After IR Example

Consider a kernel with a two-branch diamond:

Before JumpThreading:

entry:
  %cond1 = icmp sgt i32 %x, 0
  br i1 %cond1, label %positive, label %negative

positive:
  %a = call i32 @computeA()
  br label %merge

negative:
  %b = call i32 @computeB()
  br label %merge

merge:
  %val = phi i32 [ %a, %positive ], [ %b, %negative ]
  %cond2 = icmp eq i32 %val, 42
  br i1 %cond2, label %match, label %nomatch

match:
  ...
nomatch:
  ...

If LVI can prove that computeA() always returns 42 (e.g., it is a known constant), JumpThreading duplicates the merge block for the %positive predecessor:

After JumpThreading:

entry:
  %cond1 = icmp sgt i32 %x, 0
  br i1 %cond1, label %positive, label %negative

positive:
  %a = call i32 @computeA()
  br label %match              ; threaded: skip %merge entirely

negative:
  %b = call i32 @computeB()
  br label %merge

merge:                          ; now has only one predecessor
  %val = phi i32 [ %b, %negative ]
  %cond2 = icmp eq i32 %val, 42
  br i1 %cond2, label %match, label %nomatch

match:
  ...
nomatch:
  ...

The %positive path no longer passes through merge. The second branch is eliminated for threads that took the first path.

Differences from Upstream LLVM

| Aspect | CICC v13.0 | Upstream LLVM 20 |
|---|---|---|
| PHI threshold default | 76 | 76 (same upstream default since the knob was introduced) |
| disable-jump-threading in SimplifyCFG | Present, annotated for OCG experiments | Present (standard LLVM flag) |
| Annotation | "Disable jump threading for OCG experiments" | No OCG reference |
| Pipeline invocations | Three positions, combined with CVP via sub_198DF00 | Typically two (early and late in the function simplification pipeline) |
| NVVMPassOptions disable | Offset +320 | N/A |
| Loop header override thresholds | qword_501D628, qword_501D548 | Standard LoopInfo check only |
| fold-with-var-cond | NVIDIA-specific SimplifyCFG companion flag | Not present |

The core algorithm is unmodified from upstream. NVIDIA's changes are configuration-level: adjusted thresholds, additional pipeline positions, the OCG disable flag, and integration with the NVVMPassOptions system.

Function Map

| Function | Address | Size |
|---|---|---|
| JumpThreadingPass::run (main pass body) | sub_2DC4260 | 12,932 bytes |
| Block cloning engine (duplicateBlock) | sub_2DC22F0 | 2,797 bytes |
| CFG finalization after threading | sub_2DC30A0 | 1,094 bytes |
| Single-instruction threading | sub_2DC37C0 | 2,288 bytes |
| tryToUnfoldSelect | sub_2DC40B0 | 420 bytes |
| SmallVector append/copy for instruction map | sub_2DC1F40 | 349 bytes |
| LVI::getPredicateAt | sub_11F3070 | -- |
| evaluateConditionOnEdge | sub_DFABC0 | -- |
| getConstantOnEdge | sub_988330 | -- |
| isImpliedCondition | sub_AC4810 | -- |
| SimplifyICmpInst | sub_AA93C0 | -- |
| getBranchCondition | sub_981210 | -- |
| BranchInst::getCondition | sub_B43CB0 | -- |
| BranchInst::Create (conditional) | sub_B4C9A0 | -- |
| BranchInst::Create (unconditional) | sub_B4C8F0 | -- |
| PHINode::addIncoming | sub_B99FD0 | -- |
| PHINode::Create | sub_D5C860 | -- |
| SplitBlockAndInsertIfThen | sub_F36990 | -- |
| BasicBlock::getContext | sub_BD5C60 | -- |
| operator new(0x50) (allocate BasicBlock) | sub_22077B0 | -- |
| BasicBlock::insertInto | sub_AA4D50 | -- |
| Value::replaceAllUsesWith | sub_BD84D0 | -- |
| Instruction::eraseFromParent | sub_B43D60 | -- |
| DominatorTree::changeImmediateDominator | sub_FFB3D0 | -- |
| PHINode::getIncomingValueForBlock | sub_AD69F0 | -- |
| LoopInfo pass lookup | sub_C959E0 | -- |
| Predicate implies branch check | sub_B532B0 | -- |
| ConstantExpr::getICmp or create threaded edge | sub_B52EF0 | -- |
| CloneBasicBlock or wire new block | sub_92B530 | -- |
| CloneBasicBlock (alternate path) | sub_929DE0 | -- |

Cross-References

  • StructurizeCFG -- the late-pipeline safety net that catches irreducible CFG created by threading or other passes
  • Scalar Passes Hub -- hub page linking SROA, EarlyCSE, and JumpThreading with GPU-context summaries
  • GVN -- runs between JumpThreading invocations in the tier-2 sequence; can expose new threadable branches
  • Pipeline & Ordering -- tier-dependent scheduling of all three invocations
  • Knobs -- master knob inventory including all six JumpThreading knobs