Branch Distribution (Dead Synchronization Elimination)
Despite its name, the branch-dist pass does not distribute or restructure branches. It is a GPU-specific dead synchronization elimination pass that removes __syncthreads() barriers and fence intrinsics when no actual memory hazard exists across the barrier boundary. In CUDA kernels, programmers often insert barriers conservatively to guarantee correctness, but many of these barriers protect code regions that have no conflicting read/write patterns on shared or global memory. Removing them eliminates warp serialization points and reduces the latency cost of unnecessary thread coordination.
The pass works by classifying every instruction in the function as a shared/global memory read, a write, or neither. It then propagates this information through the control flow graph using a standard dataflow fixed-point iteration. For each synchronization instruction, it examines the memory access patterns above and below the barrier; if no read-after-write, write-after-read, or write-after-write hazard exists, the barrier is dead and is deleted. Because removing one barrier may expose others as redundant, the entire analysis restarts after each deletion until no more dead barriers remain.
Pipeline Position
| Field | Value |
|---|---|
| Pass name | branch-dist |
| Pass type | FunctionPass (NVIDIA-custom, not in upstream LLVM) |
| Registration | New PM #377, line 2217 in sub_2342890 |
| Runtime positions | Tier 1/2/3 #78, #82 (NVVMBranchDist via sub_1CB73C0, gated by !opts[2080] && !opts[2120]); see Pipeline |
| Core function | sub_1C47810 (2357 lines) |
| Pass wrapper | sub_1C49D10 (179 lines) |
| Knob constructor | ctor_525_0 at 0x563730 (493 lines) |
| Global enable flag | byte_4FBB6C0 (initialized to 0 in ctor_261) |
The pass runs during the NVIDIA IR optimization pipeline. The global enable flag at byte_4FBB6C0 is set by the pipeline setup when appropriate for the current optimization level.
IR Before/After Example
The pass removes __syncthreads() barriers that protect no actual shared/global memory hazard.
Before (conservative barrier placement):
define void @kernel(ptr addrspace(3) %smem) {
entry:
  %tid = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %x = add i32 %tid, 1              ; pure register computation
  %y = mul i32 %x, 42               ; pure register computation
  call void @llvm.nvvm.barrier0()   ; __syncthreads() -- no shared/global R/W above
  %z = add i32 %y, %x               ; pure register computation
  ret void
}
After (dead barrier removed):
define void @kernel(ptr addrspace(3) %smem) {
entry:
  %tid = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %x = add i32 %tid, 1
  %y = mul i32 %x, 42
  ; barrier removed: no shared/global reads or writes above or below
  %z = add i32 %y, %x
  ret void
}
When the dataflow analysis determines that neither side of the barrier accesses shared or global memory, the barrier is dead and removed. The pass restarts after each removal since deleting one barrier may expose another as redundant.
Algorithm
Phase 1: Instruction Classification (sub_1C46330)
The classifier (sub_1C45690, 117 lines) examines each instruction's opcode byte at offset +16 and determines whether it reads or writes shared/global memory:
| Opcode | Hex | Meaning | Action |
|---|---|---|---|
| 0x36 | '6' | Load | Check address space; mark as read if shared/global |
| 0x37 | '7' | Store | Check address space; mark as write |
| 0x3A | ':' | Memory op | Check address space |
| 0x3B | ';' | Memory op | Check address space |
| 0x4E | 'N' (78) | Call | Complex analysis: filter sync intrinsics, check callee attributes |
The classifier is invoked twice per basic block:
- a3=1 scan: iterates from the last instruction backward until it reaches the first sync instruction. Everything after the sync is recorded in the block's "above" maps.
- a3=0 scan: iterates from the first instruction forward until it reaches the first sync instruction. Everything before the sync is recorded in the block's "below" maps.
This produces four boolean flags per block, stored in red-black tree maps: reads_above, writes_above, reads_below, writes_below.
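The per-block split can be sketched as follows. This is a simplified model, not the recovered code: instructions are reduced to tag strings, the struct and function names are invented, and a single linear walk stands in for the pass's two directional scans.

```cpp
#include <string>
#include <vector>

// Simplified model of Phase 1 (the real classifier, sub_1C45690,
// inspects opcode bytes and address spaces; instructions here are
// reduced to tags). Neutral field names are used to stay clear of the
// pass's map-naming convention.
struct BlockAccess {
    bool read_pre = false,  write_pre = false;    // before the first sync
    bool read_post = false, write_post = false;   // after the first sync
};

BlockAccess classifyBlock(const std::vector<std::string>& insts) {
    BlockAccess f;
    bool seenSync = false;                 // flips once the sync is seen
    for (const auto& op : insts) {
        if (op == "sync") { seenSync = true; continue; }
        if (op == "load")  (seenSync ? f.read_post  : f.read_pre)  = true;
        if (op == "store") (seenSync ? f.write_post : f.write_pre) = true;
    }
    return f;
}
```

For a block shaped like the IR example above (`load`, `sync`, `store`), the sketch yields a pre-sync read and a post-sync write, and nothing else.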
Phase 2: CFG Propagation (sub_1C46620)
A classic dataflow fixed-point iteration propagates memory access information through successor edges. For each basic block, the read/write flags from its successors' "below" maps are OR-combined into the current block's "above" maps. The iteration repeats until no flags change (convergence). This ensures that a barrier's necessity accounts for memory accesses reachable through any control flow path, not just the local block.
The branch-dist-norm knob modifies the dataflow meet operator: the default (0) uses OR-propagation (conservative), while a non-zero value likely switches to AND-normalization (more aggressive, requiring all paths to access memory before considering a sync necessary).
Phase 3: Dead Sync Identification and Removal
After propagation, the main function (sub_1C47810) iterates over all blocks and instructions. For each synchronization intrinsic, it looks up the four per-instruction flags:
ra = inst_read_above[I] wa = inst_write_above[I]
rb = inst_read_below[I] wb = inst_write_below[I]
A sync is dead (removable) when any of these conditions holds:
| Condition | Meaning |
|---|---|
| !ra && !wa | Nothing above the barrier accesses shared/global memory |
| !rb && !wb | Nothing below the barrier accesses shared/global memory |
| !ra && !wb | No write-after-read or write-after-write hazard (no read above, no write below) |
| !wa && !rb | No read-after-write or write-after-write hazard (no write above, no read below) |
When a sync is removed, the pass calls sub_15F20C0 to delete it from the IR, then restarts the entire algorithm (goto LABEL_2). This restart is necessary because removing one barrier may cause another to become dead.
Special Cases
Barrier variants that carry data -- __syncthreads_count, __syncthreads_and, __syncthreads_or (intrinsic IDs 3734--3736) -- are explicitly excluded from removal. Their return values encode thread participation information (e.g., how many threads presented a non-zero predicate), so they cannot be elided even when no memory hazard exists.
Address Space Filtering
The pass only considers memory accesses to shared and global address spaces as relevant for synchronization. The address space check in sub_1C45690:
- Address space IDs <= 0x1FF (511) or in the 0x300 range: considered local/private; these do not require synchronization.
- Address space IDs > 511 and not in the 0x3xx range: considered shared/global; these are the accesses that justify keeping a barrier.
This distinction is critical: local memory is per-thread and never visible to other threads in the warp, so barriers protecting only local accesses are always dead.
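The filter reduces to a small predicate over address-space IDs. This is a sketch using the ranges recovered above; the function name is invented.

```cpp
#include <cstdint>

// Hedged sketch of the address-space filter in sub_1C45690: only
// shared/global-like address spaces justify keeping a barrier alive.
bool justifiesBarrier(uint32_t addrSpaceId) {
    bool localLike = addrSpaceId <= 0x1FF                            // <= 511
                  || (addrSpaceId >= 0x300 && addrSpaceId <= 0x3FF); // 0x3xx range
    return !localLike;   // per-thread memory never needs a barrier
}
```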
Intrinsic Classification
Two predicates classify synchronization-related intrinsics:
sub_1C301F0 (is-sync-intrinsic): Returns true for intrinsic IDs representing barrier operations:
| ID | Likely Mapping |
|---|---|
| 34 | llvm.nvvm.barrier0 (basic __syncthreads) |
| 3718--3720 | barrier.sync / bar.warp.sync variants |
| 3731--3736 | __syncthreads_count/and/or, bar.arrive |
sub_1C30240 (is-fence-intrinsic): Returns true for IDs 4046 and 4242, which are memory fence/membar intrinsics. These are excluded from the sync test -- they impose memory ordering but are not full barriers that can be elided by this pass.
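Both predicates, plus the never-remove exclusion described under Special Cases, reduce to simple ID-range checks. The function names below are descriptive stand-ins for sub_1C301F0 and sub_1C30240; the IDs are the recovered values.

```cpp
#include <cstdint>

// Stand-in for sub_1C301F0: barrier-style sync intrinsics.
bool isSyncIntrinsic(uint32_t id) {
    return id == 34                         // llvm.nvvm.barrier0
        || (id >= 3718 && id <= 3720)       // barrier.sync / bar.warp.sync variants
        || (id >= 3731 && id <= 3736);      // __syncthreads_count/and/or, bar.arrive
}

// Stand-in for sub_1C30240: memory fence / membar intrinsics,
// excluded from the dead-sync test.
bool isFenceIntrinsic(uint32_t id) {
    return id == 4046 || id == 4242;
}

// Data-carrying barrier variants that must never be removed.
bool isDataCarryingBarrier(uint32_t id) {
    return id >= 3734 && id <= 3736;        // __syncthreads_count/and/or
}
```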
Configuration Knobs
All registered in ctor_525_0 at 0x563730. All are cl::opt<> with hidden visibility.
| Knob | Type | Default | Description |
|---|---|---|---|
| dump-branch-dist | bool | false | Emit diagnostic output on each removed sync |
| ignore-call-safety | bool | true | Treat function calls as non-memory-accessing (aggressive) |
| ignore-variance-cond | int | 0 | Ignore warp divergence on branch conditions |
| ignore-address-space-check | int | 0 | Treat all memory accesses as requiring sync (conservative) |
| ignore-phi-overhead | int | 0 | Ignore PHI node overhead from sync removal in cost model |
| disable-complex-branch-dist | int | 0 | Disable inter-block CFG propagation (Phase 2) |
| no-branch-dist | string | (empty) | Comma-separated list of function names to skip |
| branch-dist-func-limit | int | -1 | Max functions to process (-1 = unlimited) |
| branch-dist-block-limit | int | -1 | Max blocks per function (-1 = unlimited) |
| branch-dist-norm | int | 0 | Dataflow meet operator mode (0 = OR, non-zero = AND) |
The default for ignore-call-safety is notably true (aggressive): device function calls are assumed not to access shared/global memory unless proven otherwise. This is reasonable for typical CUDA kernels where helper functions operate on registers and local memory.
Diagnostic Strings
Diagnostic strings recovered from p2b.3-01-branchdist.txt. All runtime diagnostics are gated by the dump-branch-dist knob (default false).
| String | Source | Category | Trigger |
|---|---|---|---|
"[filename:line] Removed dead synch: Read above: X, Write above: Y, Read below: Z, Write below: W in function NAME" | sub_1C47810 phase 3 | Debug | dump-branch-dist enabled and a barrier is removed; prints the four read/write flags and the function name |
"Dump information from Branch Distribution" | ctor_525_0 at 0x563730 | Knob | dump-branch-dist knob description |
"Ignore calls safety in branch Distribution" | ctor_525_0 | Knob | ignore-call-safety knob description |
"Ignore variance condition in branch Distribution" | ctor_525_0 | Knob | ignore-variance-cond knob description |
"Ignore address-space checks in branch Distribution" | ctor_525_0 | Knob | ignore-address-space-check knob description |
"Ignore the overhead due to phis" | ctor_525_0 | Knob | ignore-phi-overhead knob description |
"Disable more complex branch Distribution" | ctor_525_0 | Knob | disable-complex-branch-dist knob description |
"Do not do Branch Distribution on some functions" | ctor_525_0 | Knob | no-branch-dist knob description (value format: "function1,function2,...") |
"Control number of functions to apply" | ctor_525_0 | Knob | branch-dist-func-limit knob description |
"Control number of blocks to apply" | ctor_525_0 | Knob | branch-dist-block-limit knob description |
"Control normalization for branch dist" | ctor_525_0 | Knob | branch-dist-norm knob description |
Data Structures
The pass allocates a large state object (~696 bytes, 87 QWORDs) containing 14 red-black tree maps organized in three tiers: block-level sync flags, block-level read/write flags (initial and propagated), and instruction-level read/write flags:
| Maps | Keys | Values | Purpose |
|---|---|---|---|
a1[3..14] (2 maps) | Block pointer | bool | Has-sync-above/below per block |
a1[15..38] (4 maps) | Block pointer | bool | Propagated read/write above/below (Phase 2 output) |
a1[39..62] (4 maps) | Block pointer | bool | Initial read/write above/below (Phase 1 output) |
a1[63..86] (4 maps) | Instruction pointer | bool | Per-instruction read/write above/below (Phase 3) |
All maps are std::map-like red-black trees with 48-byte nodes (left/right/parent pointers + key + 1-byte boolean value at offset 40). Tree operations are implemented in sub_1C46280 (insert-or-find for block maps), sub_1C47760 (insert-or-find for instruction maps), sub_1C45B10 (erase), and sub_1C45C70/sub_1C45940 (recursive destructors).
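The layout can be pictured as an ordinary struct of std::map fields. This is an illustrative reconstruction, not recovered source: the field names are invented, and the 48-bytes-per-map arithmetic assumes the libstdc++ x86-64 ABI.

```cpp
#include <map>

// Hypothetical reconstruction of the state object. QWORD index ranges
// follow the a1[...] offsets in the table above:
// 3 header QWORDs (24 bytes) + 14 maps * 48 bytes = 696 bytes.
struct BranchDistState {
    void* header[3];                                     // a1[0..2]
    // Per-block has-sync flags                          // a1[3..14]
    std::map<void*, bool> sync_above, sync_below;
    // Propagated per-block flags (Phase 2 output)       // a1[15..38]
    std::map<void*, bool> prop_read_above, prop_write_above,
                          prop_read_below, prop_write_below;
    // Initial per-block flags (Phase 1 output)          // a1[39..62]
    std::map<void*, bool> init_read_above, init_write_above,
                          init_read_below, init_write_below;
    // Per-instruction flags (Phase 3)                   // a1[63..86]
    std::map<void*, bool> inst_read_above, inst_write_above,
                          inst_read_below, inst_write_below;
};
```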
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| -- | 0x1C47810 | 2357L | Core algorithm: classify + propagate + remove |
| -- | 0x1C49D10 | 179L | Pass wrapper: init state, call core, cleanup |
| -- | 0x1C46330 | 197L | Phase 1: forward/backward instruction scan |
| -- | 0x1C46620 | 1157L | Phase 2: CFG successor propagation (fixed-point) |
| -- | 0x1C45690 | 117L | Instruction classifier: determines R/W flags |
| -- | 0x1C458C0 | 28L | Helper: classify all instructions in a block |
| -- | 0x1C46280 | 38L | Map insert-or-find (block-level maps) |
| -- | 0x1C47760 | 37L | Map insert-or-find (instruction-level maps) |
| -- | 0x1C475C0 | 43L | Map lower_bound lookup |
| -- | 0x1C47660 | 50L | Map find with hint |
| -- | 0x1C45B10 | 113L | Map erase operation |
| -- | 0x1C45C70 | 133L | Tree destructor (recursive free) |
| -- | 0x1C45940 | 133L | Tree destructor (recursive free, alt type) |
| -- | 0x1C301F0 | 15L | Is-sync-intrinsic predicate |
| -- | 0x1C30240 | 13L | Is-fence-intrinsic predicate |
| -- | 0x563730 | 493L | CLI knob registration (ctor_525_0) |
Common Pitfalls
These are mistakes a reimplementor is likely to make when building an equivalent dead barrier elimination pass using CFG dataflow.
1. Using address-level tracking instead of boolean per-category flags. The pass tracks four boolean flags per block (reads_above, writes_above, reads_below, writes_below) for shared/global memory, not specific addresses. A reimplementation that attempts to track precise addresses ("smem[0] is only written above, smem[1] is only read below") will appear to find more dead barriers but is fundamentally unsound for GPU execution. Different threads access different addresses through the same pointer expression (smem[tid] vs smem[tid-1]), making address-based alias analysis across threads impossible at compile time. The boolean-per-category approach is the correct conservative abstraction.
2. Not excluding __syncthreads_count/and/or (IDs 3734--3736) from removal. These barrier variants return a value that encodes thread participation information (__syncthreads_count returns the number of threads that passed a non-zero predicate). Even when no memory hazard exists across the barrier, the return value carries data that the program depends on. A reimplementation that removes these barriers based solely on memory analysis will break programs that use the return value for algorithmic purposes (e.g., warp-level voting patterns, early-exit counting).
3. Treating the ignore-call-safety default as conservative. The default for ignore-call-safety is true (aggressive): function calls are assumed not to access shared/global memory. This is correct for typical CUDA helper functions that operate on registers and local memory, but a reimplementation that uses false as the default will retain nearly all barriers in code that calls device functions, defeating the optimization. Conversely, a reimplementation that uses true but does not also check the callee's isSharedMemoryAccess attribute when available will miss cases where a called function does access shared memory through a pointer argument.
4. Not restarting the analysis after removing a barrier. The pass restarts from Phase 1 (goto LABEL_2) after each barrier deletion because removing one barrier merges the regions it separated, potentially exposing adjacent barriers as dead. A reimplementation that collects all dead barriers in one pass and removes them simultaneously will miss cascading redundancies. Worse, it may remove barriers in the wrong order: if barrier B2 is dead only because barrier B1 separates it from a hazard, removing both simultaneously removes B1's protection while the hazard still exists.
5. Conflating address space filtering with memory visibility. The pass considers only shared and global memory accesses (address spaces > 511 and not in the 0x3xx range) as relevant for barrier justification. Local/private memory (per-thread, invisible to other threads) is correctly excluded. A reimplementation that includes local memory accesses in the analysis will never remove any barrier in code that uses local arrays, since every function with local variables would show "read+write above and below." The address space filter is essential for the optimization to have any effect.
GPU-Specific Motivation
On NVIDIA GPUs, __syncthreads() forces all threads in a thread block to reach the barrier before any can proceed. This is one of the most expensive control flow operations in CUDA -- it serializes warp execution and creates a pipeline stall. In practice, CUDA programmers insert barriers conservatively (every shared memory access pattern gets a barrier "just in case"), leading to significant over-synchronization. This pass recovers the performance lost to unnecessary barriers by proving, through static dataflow analysis, that specific barriers protect no actual memory hazard.
The ignore-variance-cond knob connects to warp divergence analysis: when a branch condition is provably uniform (all lanes take the same path), synchronization across that branch is trivially unnecessary regardless of memory access patterns. This is a common case in well-structured CUDA code where control flow depends on blockIdx or compile-time constants.