
Branch Distribution (Dead Synchronization Elimination)

Despite its name, the branch-dist pass does not distribute or restructure branches. It is a GPU-specific dead synchronization elimination pass that removes __syncthreads() barriers and fence intrinsics when no actual memory hazard exists across the barrier boundary. In CUDA kernels, programmers often insert barriers conservatively to guarantee correctness, but many of these barriers protect code regions that have no conflicting read/write patterns on shared or global memory. Removing them eliminates warp serialization points and reduces the latency cost of unnecessary thread coordination.

The pass works by classifying every instruction in the function as a shared/global memory read, a write, or neither. It then propagates this information through the control flow graph using a standard dataflow fixed-point iteration. For each synchronization instruction, it examines the memory access patterns above and below the barrier; if no read-after-write, write-after-read, or write-after-write hazard exists, the barrier is dead and is deleted. Because removing one barrier may expose others as redundant, the entire analysis restarts after each deletion until no more dead barriers remain.

Pipeline Position

| Field | Value |
|---|---|
| Pass name | branch-dist |
| Pass type | FunctionPass (NVIDIA-custom, not in upstream LLVM) |
| Registration | New PM #377, line 2217 in sub_2342890 |
| Runtime positions | Tier 1/2/3 #78, #82 (NVVMBranchDist via sub_1CB73C0, gated by !opts[2080] && !opts[2120]); see Pipeline |
| Core function | sub_1C47810 (2357 lines) |
| Pass wrapper | sub_1C49D10 (179 lines) |
| Knob constructor | ctor_525_0 at 0x563730 (493 lines) |
| Global enable flag | byte_4FBB6C0 (initialized to 0 in ctor_261) |

The pass runs during the NVIDIA IR optimization pipeline. The global enable flag at byte_4FBB6C0 is set by the pipeline setup when appropriate for the current optimization level.

IR Before/After Example

The pass removes __syncthreads() barriers that protect no actual shared/global memory hazard.

Before (conservative barrier placement):

```llvm
define void @kernel(ptr addrspace(3) %smem) {
entry:
  %x = add i32 %tid, 1                ; pure register computation
  %y = mul i32 %x, 42                 ; pure register computation
  call void @llvm.nvvm.barrier0()     ; __syncthreads() -- no shared/global R/W above
  %z = add i32 %y, %x                 ; pure register computation
  ret void
}
```

After (dead barrier removed):

```llvm
define void @kernel(ptr addrspace(3) %smem) {
entry:
  %x = add i32 %tid, 1
  %y = mul i32 %x, 42
  ; barrier removed: no shared/global reads or writes above or below
  %z = add i32 %y, %x
  ret void
}
```

When the dataflow analysis determines that neither side of the barrier accesses shared or global memory, the barrier is dead and removed. The pass restarts after each removal since deleting one barrier may expose another as redundant.

Algorithm

Phase 1: Instruction Classification (sub_1C46330)

The classifier (sub_1C45690, 117 lines) examines each instruction's opcode byte at offset +16 and determines whether it reads or writes shared/global memory:

| Opcode | Char | Meaning | Action |
|---|---|---|---|
| 0x36 | '6' | Load | Check address space; mark as read if shared/global |
| 0x37 | '7' | Store | Check address space; mark as write |
| 0x3A | ':' | Memory op | Check address space |
| 0x3B | ';' | Memory op | Check address space |
| 0x4E | 'N' (78) | Call | Complex analysis: filter sync intrinsics, check callee attributes |

The classifier is invoked twice per basic block:

  • Forward scan (a3=1): iterates from the last instruction backward to the first sync instruction. Everything after the sync is classified as "above" the barrier.
  • Backward scan (a3=0): iterates from the first instruction forward to the first sync instruction. Everything before the sync is classified as "below" the barrier.

This produces four boolean flags per block, stored in red-black tree maps: reads_above, writes_above, reads_below, writes_below.
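The per-block scan can be modeled with a small sketch. The tuple-based toy IR, the TRACKED set, and the function name are illustrative assumptions (an assumed simplification of sub_1C46330 / sub_1C45690), and "above"/"below" here simply mean before/after the sync in program order:

```python
# Toy model of Phase 1: compute the four per-block booleans relative to
# the first sync instruction. Instructions are (kind, space) pairs, where
# space is "shared", "global", "local", or None for register-only ops.
SYNC = ("sync", None)
TRACKED = {"shared", "global"}   # local/private never justifies a barrier

def classify_block(insts):
    flags = dict(reads_above=False, writes_above=False,
                 reads_below=False, writes_below=False)
    if SYNC not in insts:
        return flags             # no barrier in this block
    pos = insts.index(SYNC)
    for kind, space in insts[:pos]:        # region before the sync
        flags["reads_above"]  |= kind == "load"  and space in TRACKED
        flags["writes_above"] |= kind == "store" and space in TRACKED
    for kind, space in insts[pos + 1:]:    # region after the sync
        flags["reads_below"]  |= kind == "load"  and space in TRACKED
        flags["writes_below"] |= kind == "store" and space in TRACKED
    return flags
```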

Phase 2: CFG Propagation (sub_1C46620)

A classic dataflow fixed-point iteration propagates memory access information through successor edges. For each basic block, the read/write flags from its successors' "below" maps are OR-combined into the current block's "above" maps. The iteration repeats until no flags change (convergence). This ensures that a barrier's necessity accounts for memory accesses reachable through any control flow path, not just the local block.

The branch-dist-norm knob modifies the dataflow meet operator: the default (0) uses OR-propagation (conservative), while a non-zero value likely switches to AND-normalization (more aggressive, requiring all paths to access memory before considering a sync necessary).
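A minimal model of the fixed-point iteration, assuming per-block flag dictionaries and a successor map. For brevity it propagates only the "below" flags backward along CFG edges with the default OR meet operator, whereas the decompiled pass maintains separate above/below maps:

```python
# Simplified model of Phase 2 (sub_1C46620): OR successors' flags into
# each block until a fixed point is reached, so a barrier's necessity
# reflects accesses reachable on any path, not just the local block.
def propagate(flags, succs):
    changed = True
    while changed:               # iterate to convergence
        changed = False
        for blk, ss in succs.items():
            for s in ss:
                for kind in ("reads", "writes"):
                    # Default mode (branch-dist-norm == 0): OR-propagation.
                    if flags[s][kind + "_below"] and not flags[blk][kind + "_below"]:
                        flags[blk][kind + "_below"] = True
                        changed = True
    return flags
```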

Phase 3: Dead Sync Identification and Removal

After propagation, the main function (sub_1C47810) iterates over all blocks and instructions. For each synchronization intrinsic, it looks up the four per-instruction flags:

```
ra = inst_read_above[I]    wa = inst_write_above[I]
rb = inst_read_below[I]    wb = inst_write_below[I]
```

A sync is dead (removable) when any of these conditions holds:

| Condition | Meaning |
|---|---|
| !ra && !wa | Nothing above the barrier accesses shared/global memory |
| !rb && !wb | Nothing below the barrier accesses shared/global memory |
| !ra && !wb | No read-after-write or write-after-write hazard |
| !wa && !rb | No write-after-read or write-after-write hazard |

When a sync is removed, the pass calls sub_15F20C0 to delete it from the IR, then restarts the entire algorithm (goto LABEL_2). This restart is necessary because removing one barrier may cause another to become dead.
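The removal test and the restart loop can be sketched over a flat list of shared/global memory operations. This is a toy model of a single straight-line region, not the decompiled data layout, and "above"/"below" here mean before/after in list order:

```python
# Toy model of Phase 3: apply the four deadness conditions from the
# table above, deleting one sync at a time and restarting the analysis
# (mirrors the goto LABEL_2 restart in sub_1C47810).
def is_dead(ra, wa, rb, wb):
    return (not ra and not wa) or (not rb and not wb) \
        or (not ra and not wb) or (not wa and not rb)

def eliminate_dead_syncs(insts):
    restart = True
    while restart:
        restart = False
        for i, inst in enumerate(insts):
            if inst != "sync":
                continue
            above, below = insts[:i], insts[i + 1:]
            ra, wa = "load" in above, "store" in above
            rb, wb = "load" in below, "store" in below
            if is_dead(ra, wa, rb, wb):
                del insts[i]
                restart = True   # re-analyze from scratch
                break
    return insts
```

Removing the first dead sync in the second test case exposes the second one, which the restart then catches.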

Special Cases

Barrier variants that carry data -- __syncthreads_count, __syncthreads_and, __syncthreads_or (intrinsic IDs 3734--3736) -- are explicitly excluded from removal. Their return values encode lane participation information, so they cannot be elided even when no memory hazard exists.

Address Space Filtering

The pass only considers memory accesses to shared and global address spaces as relevant for synchronization. The address space check in sub_1C45690:

  • Address space IDs <= 0x1FF (511) or in the 0x300 range: considered local/private -- do not require synchronization.
  • Address space IDs > 511 and not in the 0x3xx range: considered shared/global -- these are the accesses that justify keeping a barrier.

This distinction is critical: local memory is per-thread and never visible to other threads in the warp, so barriers protecting only local accesses are always dead.
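The filter reduces to a single predicate. This is an assumed reconstruction from the ranges listed above; the function name is ours:

```python
# Assumed reconstruction of the address-space filter in sub_1C45690:
# only accesses outside the local/private ranges can justify a barrier.
def requires_sync(addrspace_id):
    if addrspace_id <= 0x1FF:              # <= 511: local/private
        return False
    if 0x300 <= addrspace_id <= 0x3FF:     # 0x3xx range: local/private
        return False
    return True                            # shared/global
```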

Intrinsic Classification

Two predicates classify synchronization-related intrinsics:

sub_1C301F0 (is-sync-intrinsic): Returns true for intrinsic IDs representing barrier operations:

| ID | Likely Mapping |
|---|---|
| 34 | llvm.nvvm.barrier0 (basic __syncthreads) |
| 3718--3720 | barrier.sync / bar.warp.sync variants |
| 3731--3736 | __syncthreads_count/and/or, bar.arrive |

sub_1C30240 (is-fence-intrinsic): Returns true for IDs 4046 and 4242, which are memory fence/membar intrinsics. These are excluded from the sync test -- they impose memory ordering but are not full barriers that can be elided by this pass.
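Both predicates reduce to ID-set membership. The ID values come from the decompiled tables above; the helper names and the combined removal-candidate check are ours:

```python
# Sketch of the two predicates plus the special-case exclusion of
# value-returning barrier variants (__syncthreads_count/and/or).
SYNC_IDS  = {34} | set(range(3718, 3721)) | set(range(3731, 3737))
FENCE_IDS = {4046, 4242}               # membar/fence: never elided here
VALUE_RETURNING = {3734, 3735, 3736}   # return values carry lane data

def is_sync_intrinsic(iid):
    return iid in SYNC_IDS

def is_fence_intrinsic(iid):
    return iid in FENCE_IDS

def is_removal_candidate(iid):
    # Fences and value-returning barrier variants are never removed.
    return is_sync_intrinsic(iid) and iid not in VALUE_RETURNING
```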

Configuration Knobs

All registered in ctor_525_0 at 0x563730. All are cl::opt<> with hidden visibility.

| Knob | Type | Default | Description |
|---|---|---|---|
| dump-branch-dist | bool | false | Emit diagnostic output on each removed sync |
| ignore-call-safety | bool | true | Treat function calls as non-memory-accessing (aggressive) |
| ignore-variance-cond | int | 0 | Ignore warp divergence on branch conditions |
| ignore-address-space-check | int | 0 | Treat all memory accesses as requiring sync (conservative) |
| ignore-phi-overhead | int | 0 | Ignore PHI node overhead from sync removal in cost model |
| disable-complex-branch-dist | int | 0 | Disable inter-block CFG propagation (Phase 2) |
| no-branch-dist | string | (empty) | Comma-separated list of function names to skip |
| branch-dist-func-limit | int | -1 | Max functions to process (-1 = unlimited) |
| branch-dist-block-limit | int | -1 | Max blocks per function (-1 = unlimited) |
| branch-dist-norm | int | 0 | Dataflow meet operator mode (0 = OR, non-zero = AND) |

The default for ignore-call-safety is notably true (aggressive): device function calls are assumed not to access shared/global memory unless proven otherwise. This is reasonable for typical CUDA kernels where helper functions operate on registers and local memory.

Diagnostic Strings

Diagnostic strings recovered from p2b.3-01-branchdist.txt. All runtime diagnostics are gated by the dump-branch-dist knob (default false).

| String | Source | Category | Trigger |
|---|---|---|---|
| "[filename:line] Removed dead synch: Read above: X, Write above: Y, Read below: Z, Write below: W in function NAME" | sub_1C47810 phase 3 | Debug | dump-branch-dist enabled and a barrier is removed; prints the four read/write flags and the function name |
| "Dump information from Branch Distribution" | ctor_525_0 at 0x563730 | Knob | dump-branch-dist knob description |
| "Ignore calls safety in branch Distribution" | ctor_525_0 | Knob | ignore-call-safety knob description |
| "Ignore variance condition in branch Distribution" | ctor_525_0 | Knob | ignore-variance-cond knob description |
| "Ignore address-space checks in branch Distribution" | ctor_525_0 | Knob | ignore-address-space-check knob description |
| "Ignore the overhead due to phis" | ctor_525_0 | Knob | ignore-phi-overhead knob description |
| "Disable more complex branch Distribution" | ctor_525_0 | Knob | disable-complex-branch-dist knob description |
| "Do not do Branch Distribution on some functions" | ctor_525_0 | Knob | no-branch-dist knob description (value format: "function1,function2,...") |
| "Control number of functions to apply" | ctor_525_0 | Knob | branch-dist-func-limit knob description |
| "Control number of blocks to apply" | ctor_525_0 | Knob | branch-dist-block-limit knob description |
| "Control normalization for branch dist" | ctor_525_0 | Knob | branch-dist-norm knob description |

Data Structures

The pass allocates a large state object (~696 bytes, 87 QWORDs) containing 13 red-black tree maps organized in three tiers:

| Maps | Keys | Values | Purpose |
|---|---|---|---|
| a1[3..14] (2 maps) | Block pointer | bool | Has-sync-above/below per block |
| a1[15..38] (4 maps) | Block pointer | bool | Propagated read/write above/below (Phase 2 output) |
| a1[39..62] (4 maps) | Block pointer | bool | Initial read/write above/below (Phase 1 output) |
| a1[63..86] (4 maps) | Instruction pointer | bool | Per-instruction read/write above/below (Phase 3) |

All maps are std::map-like red-black trees with 48-byte nodes (left/right/parent pointers + key + 1-byte boolean value at offset 40). Tree operations are implemented in sub_1C46280 (insert-or-find for block maps), sub_1C47760 (insert-or-find for instruction maps), sub_1C45B10 (erase), and sub_1C45C70/sub_1C45940 (recursive destructors).

Function Map

| Function | Address | Size | Role |
|---|---|---|---|
| -- | 0x1C47810 | 2357L | Core algorithm: classify + propagate + remove |
| -- | 0x1C49D10 | 179L | Pass wrapper: init state, call core, cleanup |
| -- | 0x1C46330 | 197L | Phase 1: forward/backward instruction scan |
| -- | 0x1C46620 | 1157L | Phase 2: CFG successor propagation (fixed-point) |
| -- | 0x1C45690 | 117L | Instruction classifier: determines R/W flags |
| -- | 0x1C458C0 | 28L | Helper: classify all instructions in a block |
| -- | 0x1C46280 | 38L | Map insert-or-find (block-level maps) |
| -- | 0x1C47760 | 37L | Map insert-or-find (instruction-level maps) |
| -- | 0x1C475C0 | 43L | Map lower_bound lookup |
| -- | 0x1C47660 | 50L | Map find with hint |
| -- | 0x1C45B10 | 113L | Map erase operation |
| -- | 0x1C45C70 | 133L | Tree destructor (recursive free) |
| -- | 0x1C45940 | 133L | Tree destructor (recursive free, alt type) |
| -- | 0x1C301F0 | 15L | Is-sync-intrinsic predicate |
| -- | 0x1C30240 | 13L | Is-fence-intrinsic predicate |
| -- | 0x563730 | 493L | CLI knob registration (ctor_525_0) |

Common Pitfalls

These are mistakes a reimplementor is likely to make when building an equivalent dead barrier elimination pass using CFG dataflow.

1. Using address-level tracking instead of boolean per-category flags. The pass tracks four boolean flags per block (reads_above, writes_above, reads_below, writes_below) for shared/global memory, not specific addresses. A reimplementation that attempts to track precise addresses ("smem[0] is only written above, smem[1] is only read below") will appear to find more dead barriers but is fundamentally unsound for GPU execution. Different threads access different addresses through the same pointer expression (smem[tid] vs smem[tid-1]), making address-based alias analysis across threads impossible at compile time. The boolean-per-category approach is the correct conservative abstraction.

2. Not excluding __syncthreads_count/and/or (IDs 3734--3736) from removal. These barrier variants return a value that encodes lane participation information (__syncthreads_count returns the number of threads that passed a non-zero predicate). Even when no memory hazard exists across the barrier, the return value carries data that the program depends on. A reimplementation that removes these barriers based solely on memory analysis will break programs that use the return value for algorithmic purposes (e.g., warp-level voting patterns, early-exit counting).

3. Treating the ignore-call-safety default as conservative. The default for ignore-call-safety is true (aggressive): function calls are assumed not to access shared/global memory. This is correct for typical CUDA helper functions that operate on registers and local memory, but a reimplementation that uses false as the default will retain nearly all barriers in code that calls device functions, defeating the optimization. Conversely, a reimplementation that uses true but does not also check the callee's isSharedMemoryAccess attribute when available will miss cases where a called function does access shared memory through a pointer argument.

4. Not restarting the analysis after removing a barrier. The pass restarts from Phase 1 (goto LABEL_2) after each barrier deletion because removing one barrier merges the regions it separated, potentially exposing adjacent barriers as dead. A reimplementation that collects all dead barriers in one pass and removes them simultaneously will miss cascading redundancies. Worse, it may remove barriers in the wrong order: if barrier B2 is dead only because barrier B1 separates it from a hazard, removing both simultaneously removes B1's protection while the hazard still exists.

5. Conflating address space filtering with memory visibility. The pass considers only shared and global memory accesses (address spaces > 511 and not in the 0x3xx range) as relevant for barrier justification. Local/private memory (per-thread, invisible to other threads) is correctly excluded. A reimplementation that includes local memory accesses in the analysis will never remove any barrier in code that uses local arrays, since every function with local variables would show "read+write above and below." The address space filter is essential for the optimization to have any effect.

GPU-Specific Motivation

On NVIDIA GPUs, __syncthreads() forces all threads in a thread block to reach the barrier before any can proceed. This is one of the most expensive control flow operations in CUDA -- it serializes warp execution and creates a pipeline stall. In practice, CUDA programmers insert barriers conservatively (every shared memory access pattern gets a barrier "just in case"), leading to significant over-synchronization. This pass recovers the performance lost to unnecessary barriers by proving, through static dataflow analysis, that specific barriers protect no actual memory hazard.

The ignore-variance-cond knob connects to warp divergence analysis: when a branch condition is provably uniform (all lanes take the same path), synchronization across that branch is trivially unnecessary regardless of memory access patterns. This is a common case in well-structured CUDA code where control flow depends on blockIdx or compile-time constants.