
Dead Synchronization Elimination

The dead synchronization elimination engine at sub_2C84BA0 is the largest NVIDIA-custom pass in cicc at 96KB (~3,400 decompiled lines). It removes __syncthreads() barriers that provably do not order any memory hazard, reducing warp stall cycles in CUDA kernels without affecting correctness. The algorithm performs a bidirectional fixed-point dataflow analysis across the entire function's CFG, tracking four memory access categories per basic block through eight red-black tree maps. After convergence, it evaluates every barrier against the computed access sets and deletes those that protect no actual hazard. Each deletion triggers a full restart of the analysis, handling cascading redundancies at the cost of quadratic worst-case complexity.

This pass is distinct from the lightweight basic-dbe pass (slot 376, llvm::BasicDeadBarrierEliminationPass) and from the branch-dist pass. All three target dead barriers, but only this engine performs full inter-block dataflow with complete restart -- the other two handle simpler local or single-pass cases.

Key Facts

| Property | Value |
| --- | --- |
| Entry point | sub_2C84BA0 |
| Binary size | 96KB (~3,400 decompiled lines) |
| Pass type | Module-level NVIDIA custom (not registered in New PM) |
| Callers | sub_2C88020, sub_2C883F0, self-recursive |
| Barrier predicate | sub_2C83D20 |
| Access classifier | sub_2C83AE0 |
| Per-BB analysis | sub_2C84640 (bidirectional, parameterized by direction) |
| State object | 12 red-black tree maps at known offsets in a1 |
| Diagnostic | " Removed dead synch: " with per-category read/write counts |
| Upstream equivalent | None -- entirely NVIDIA-proprietary |

Five-Phase Algorithm

Phase 1: Barrier Identification (sub_2C83D20)

The helper sub_2C83D20 classifies whether a given instruction is a synchronization barrier. The check is a conjunction of five conditions:

function isSyncBarrier(inst) -> bool:
    if inst.opcode != 85:                       // internal call opcode
        return false
    callee = inst.field[-32]                     // callee pointer at offset -32
    if callee == null:
        return false
    if callee.byte[0] != 0:                     // byte 0 == 0 means intrinsic (not user-defined)
        return false
    if callee.field[24] != inst.field[80]:       // scope match
        return false
    if !(callee.byte[33] & 0x20):               // convergent attribute flag
        return false
    return CEA1A0(callee.field[36])              // confirm barrier intrinsic ID

The convergent attribute flag (bit 0x20 at byte+33) is the key discriminator. LLVM marks barrier intrinsics as convergent to prevent optimizations from moving them across control flow boundaries. The final sub_CEA1A0 call validates that the intrinsic ID falls within the known barrier ID range, distinguishing barriers from other convergent intrinsics (e.g., warp vote operations).

Phase 2: Memory Access Classification (sub_2C83AE0)

For every non-barrier instruction, sub_2C83AE0 determines whether it reads from or writes to memory that could create a hazard across a barrier. It outputs two boolean flags via pointer parameters a2 (read) and a3 (write).

| Opcode | Value | Instruction | Classification |
| --- | --- | --- | --- |
| 61 | 0x3D | Store | Write, if element size > 0x1FF bits |
| 62 | 0x3E | Load | Read, with same large-type gate |
| 65 | 0x41 | Atomic | Read + Write |
| 66 | 0x42 | AtomicCmpXchg | Write |
| 85 | 0x55 | Call/Intrinsic | Context-dependent (see below) |

For call instructions (opcode 85), the classifier applies recursive analysis:

  1. Check if the callee has intrinsic flag 0x20 set.
  2. For barrier-like intrinsics with opcode 25 and field+96 == 0: classify as Read only.
  3. For general calls: invoke sub_B49E00 (isSharedMemoryAccess) to determine whether the callee accesses shared/global memory. If yes: Read + Write.

The element size gate (> 0x1FF bits, i.e., > 511 bits) filters out trivially small memory operations that target scalar types in registers rather than actual memory-backed storage. Loads and stores of types narrower than 512 bits are assumed to operate on register-promoted values and do not participate in cross-thread hazards.
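The opcode dispatch and size gate condense to a small decision function. A minimal Python sketch, assuming the opcode values and 0x1FF-bit threshold from the table above; the function name and parameters are illustrative, not the decompiled signatures:

```python
# Illustrative sketch of the Phase 2 classifier (sub_2C83AE0).
# Opcode values and the 0x1FF-bit gate come from the reverse-engineered
# table above; the parameter model is a stand-in.

OP_STORE, OP_LOAD, OP_ATOMIC, OP_CMPXCHG, OP_CALL = 0x3D, 0x3E, 0x41, 0x42, 0x55
SIZE_GATE_BITS = 0x1FF  # loads/stores of <= 511 bits are treated as register ops

def classify_access(op, elem_bits=0, callee_touches_shared=False):
    """Return (reads, writes) flags for one instruction."""
    if op == OP_STORE:
        return (False, elem_bits > SIZE_GATE_BITS)
    if op == OP_LOAD:
        return (elem_bits > SIZE_GATE_BITS, False)
    if op == OP_ATOMIC:
        return (True, True)
    if op == OP_CMPXCHG:
        return (False, True)
    if op == OP_CALL:
        # Conservative fallback: an opaque callee that may touch
        # shared/global memory counts as both read and write.
        return (True, True) if callee_touches_shared else (False, False)
    return (False, False)
```

Note how the size gate makes small loads and stores invisible to the hazard analysis, exactly as described above.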

Phase 3: Bidirectional Fixed-Point Dataflow

Complexity. Let B = number of basic blocks, S = number of barrier instructions, and I = total instructions across all blocks. Phase 1 (barrier identification) is O(S). Phase 2 (access classification) is O(I). The dataflow fixed-point iterates until no boolean in the 4 * B * 2 lattice positions flips from 0 to 1; since the lattice has height 1, convergence is bounded by O(B) iterations, each costing O(B + I) for the forward and backward scans, giving O(B * (B + I)) per convergence cycle. Phase 4 (elimination decision) is O(S). Phase 5 restarts the entire analysis from Phase 3 on each removal, yielding a worst-case total of O(S * B * (B + I)). In practice, CUDA kernels have B < 100, S < 20, and convergence in 2--3 iterations, so the pass behaves as near-linear in typical use. The red-black tree maps contribute O(log B) per insert/lookup, but this is dominated by the iteration cost.

This is the core of the pass and accounts for the majority of its 96KB size. The algorithm maintains eight red-black tree maps organized into forward and backward analysis sets, plus four bridge maps for the final elimination decision.

Map Layout

| Offset range | Direction | Contents |
| --- | --- | --- |
| a1[15..20] | Forward | ReadAbove per basic block |
| a1[21..26] | Forward | WriteAbove per basic block |
| a1[27..32] | Forward | ReadBelow per basic block |
| a1[33..38] | Forward | WriteBelow per basic block |
| a1[39..44] | Backward | ReadAbove per basic block |
| a1[45..50] | Backward | WriteAbove per basic block |
| a1[51..56] | Backward | ReadBelow per basic block |
| a1[57..62] | Backward | WriteBelow per basic block |
| a1[63..68] | Bridge | ReadAbove crossing barrier |
| a1[69..74] | Bridge | WriteAbove crossing barrier |
| a1[75..80] | Bridge | ReadBelow crossing barrier |
| a1[81..86] | Bridge | WriteBelow crossing barrier |

Each map is a std::map-style red-black tree (48-byte nodes: left/right/parent pointers, key = basic block pointer, value = 1-byte boolean at offset 40). The helper sub_2C84590 performs map insertion; sub_2C84AF0 is a variant for a different node type used in the bridge maps.
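A reimplementation need not use red-black trees; ordinary dictionaries keyed by basic block (or barrier instruction, for the bridge maps) give the same semantics. A minimal Python model of the 12-map state object (the class and attribute names are invented for illustration):

```python
from collections import defaultdict

class SyncElimState:
    """Stand-in for the 12 maps held at known offsets in a1.
    A dict-of-bool replaces each 48-byte-node red-black tree;
    missing keys default to False, matching an absent map entry."""
    CATEGORIES = ("ReadAbove", "WriteAbove", "ReadBelow", "WriteBelow")

    def __init__(self):
        # forward/backward map sets: keyed by basic block
        self.forward  = {c: defaultdict(bool) for c in self.CATEGORIES}
        self.backward = {c: defaultdict(bool) for c in self.CATEGORIES}
        # bridge map set: keyed by barrier instruction
        self.bridge   = {c: defaultdict(bool) for c in self.CATEGORIES}
```

The defaultdict default of False mirrors the monotone lattice bottom: a block not yet inserted behaves as "no access observed".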

Iteration Algorithm

The analysis loop is implemented as a goto-based iteration between labels LABEL_2 and LABEL_178 in the decompiled output:

function analyzeBarriers(F, state):
    LABEL_2:  // restart point after barrier removal

    // --- Forward pass ---
    for each BB in F:
        sub_2C84640(state, BB, direction=1)  // scan BB forward
            // For each instruction from BB start toward first barrier:
            //   classify as read/write via sub_2C83AE0
            //   OR the flags into forward maps [15..38]
            // Propagate successor BBs' flags backward if they
            // contain already-analyzed barriers

    // --- Forward convergence check ---
    changed_fwd = false
    for each BB in F:
        if forward_maps[BB] != previous_forward_maps[BB]:
            changed_fwd = true
            break

    // --- Backward pass ---
    for each BB in F:
        sub_2C84640(state, BB, direction=0)  // scan BB backward
            // For each instruction from BB end toward last barrier:
            //   classify as read/write
            //   OR into backward maps [39..62]
            // Propagate predecessor BBs' flags forward

    // --- Backward convergence check ---
    changed_bwd = false
    for each BB in F:
        if backward_maps[BB] != previous_backward_maps[BB]:
            changed_bwd = true
            break

    // If either direction changed, iterate
    if changed_fwd or changed_bwd:
        goto LABEL_2_inner  // re-run dataflow (not full restart)

    // Both converged -- proceed to Phase 4
    goto elimination_phase

The sub_2C84640 helper is the per-BB analysis workhorse. It takes a direction parameter:

  • direction=1 (forward): scans from block entry toward the first barrier, accumulating ReadAbove/WriteAbove. Propagates read/write information from successor blocks.
  • direction=0 (backward): scans from block exit toward the last barrier, accumulating ReadBelow/WriteBelow. Propagates information from predecessor blocks.

The convergence check compares the entire map contents (all four categories for every BB) against their values from the previous iteration. If any single boolean flipped from 0 to 1, the changed flag is set. Since the analysis is monotone (booleans can only transition from 0 to 1, never back), convergence is guaranteed in at most O(|BB|) iterations, though in practice it converges in 2--3 iterations for typical CUDA kernels.

Phase 4: Elimination Decision

After the dataflow converges, the pass examines every barrier instruction and checks the bridge maps (a1[63..86]) which represent the combined read/write sets crossing barrier boundaries.

A barrier is redundant (dead) if any of the following holds:

| Condition | Interpretation |
| --- | --- |
| ReadAbove == 0 AND WriteAbove == 0 | No shared-memory accesses reach this barrier from above; the barrier orders nothing |
| ReadBelow == 0 AND WriteBelow == 0 | No accesses reach from below |
| ReadAbove == 0 AND WriteBelow == 0 | No WAR or WAW hazard across the barrier |
| WriteAbove == 0 AND ReadBelow == 0 | No RAW or WAW hazard across the barrier |

The first two conditions capture the case where one side of the barrier has no memory traffic at all. The latter two capture the case where both sides access memory, but the access patterns cannot conflict.

Special Case: Intrinsic IDs 8260--8262

For call instructions (opcode 85) where the callee's intrinsic ID satisfies (ID - 8260) <= 2 (i.e., IDs 8260, 8261, or 8262), the pass applies an additional test via sub_BD3660 (hasOneUse). If the barrier-like intrinsic has only a single use, it is considered removable even if the standard dataflow check would keep it. These IDs likely correspond to specialized barrier variants (__syncthreads_count, __syncthreads_and, __syncthreads_or) where the return value is used as data. When the return value has only one use, the compiler can reason that the data-carrying aspect is trivially handled and the barrier itself may still be dead from a memory ordering perspective.
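The decision table plus the single-use special case condenses to a few lines. A hedged sketch (flag names mirror the table above; `intrinsic_id` and `num_uses` are illustrative parameters standing in for the callee's intrinsic ID and the sub_BD3660 hasOneUse check):

```python
def is_dead_barrier(ra, wa, rb, wb, intrinsic_id=None, num_uses=0):
    """ra/wa/rb/wb: ReadAbove/WriteAbove/ReadBelow/WriteBelow bridge
    flags for this barrier. Returns True if the barrier is removable."""
    if not ra and not wa:
        return True   # nothing above: the barrier orders nothing
    if not rb and not wb:
        return True   # nothing below
    if not ra and not wb:
        return True   # only WriteAbove/ReadBelow can remain
    if not wa and not rb:
        return True   # only ReadAbove/WriteBelow can remain
    # __syncthreads_count/and/or (IDs 8260-8262): removable when the
    # returned value has exactly one use (hasOneUse special case).
    if intrinsic_id is not None and 0 <= intrinsic_id - 8260 <= 2:
        return num_uses == 1
    return False
```

A barrier surviving all four checks is kept unless it is one of the special counting variants with a single-use result.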

Phase 5: Removal and Complete Restart

When a barrier is identified as dead, the pass:

  1. Emits a diagnostic string (if the controlling dump flag is enabled):

     Removed dead synch: [filename:line] in function <name>
     Read above: N, Write above: N, Read below: N, Write below: N

    where N is 0 or 1 for each category.

  2. Calls sub_B43D60 (Instruction::eraseFromParent) to delete the barrier instruction from the IR.

  3. Restarts from Phase 3 (goto LABEL_2) -- a complete re-analysis of the entire function.

The restart is not optional. Removing a barrier changes the memory access pattern visible between adjacent barriers: what was previously two separate "above/below" regions separated by a barrier now becomes a single merged region. This merging may cause an adjacent barrier to lose its hazard justification, making it dead as well. The cascading effect can propagate through a chain of barriers.
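The remove-and-restart loop itself is simple; all the cost is in the re-analysis. A sketch of the driver, assuming `analyze` recomputes the bridge flags (Phase 3), `dead` applies the Phase 4 conditions, and `erase` deletes the instruction -- all three are caller-supplied placeholders, not the decompiled helpers:

```python
def eliminate_dead_barriers(func, analyze, dead, erase):
    """Remove-then-restart driver mirroring the goto LABEL_2 structure.
    func is any object exposing a .barriers list."""
    removed = 0
    while True:                        # LABEL_2: full restart point
        flags = analyze(func)          # Phase 3: fixed-point dataflow
        victim = next((b for b in func.barriers if dead(flags[b])), None)
        if victim is None:
            return removed             # no dead barrier left: done
        erase(func, victim)            # Phase 5: delete one barrier...
        removed += 1                   # ...then restart the whole analysis
```

Because the loop re-runs `analyze` after every deletion, a removal that exposes an adjacent barrier as dead is caught on the next cycle.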

Dataflow Analysis Detail

The Eight Red-Black Tree Maps

The forward and backward map sets track identical categories but are populated from different CFG traversal directions. This bidirectional approach ensures that the analysis captures memory accesses reachable from any control flow path, not just straight-line code.

Forward maps (a1[15..38]):

  • Populated by scanning each BB from entry toward the first barrier.
  • ReadAbove[BB] = 1 if any instruction before the first barrier in BB reads shared/global memory, OR if any successor BB contributes a read.
  • WriteAbove[BB] = same for writes.
  • ReadBelow/WriteBelow[BB] = propagated from successor blocks' analysis.

Backward maps (a1[39..62]):

  • Populated by scanning each BB from exit toward the last barrier.
  • ReadBelow[BB] = 1 if any instruction after the last barrier in BB reads memory, OR if any predecessor BB contributes a read.
  • WriteBelow[BB] = same for writes.
  • ReadAbove/WriteAbove[BB] = propagated from predecessor blocks.

Bridge maps (a1[63..86]):

  • Keyed by barrier instruction pointer (not BB pointer).
  • Represent the combined access sets that cross the specific barrier boundary.
  • Populated during the final pass over barrier instructions after dataflow convergence.

Monotone Dataflow Framework

The analysis is a classic monotone dataflow problem on a Boolean lattice:

  • Domain: {0, 1} per (basic-block, category) pair.
  • Transfer function: OR of local classification with propagated values.
  • Meet operator: OR (any path contributing an access sets the flag).
  • Direction: Bidirectional (forward pass propagates from successors, backward pass propagates from predecessors).
  • Convergence: Guaranteed because the lattice has height 1 (a value can only change from 0 to 1, never back). The fixed point is reached when no additional propagation changes any value.

In the worst case, each iteration may set one new bit, and there are 4 * |BB| bits per direction, so convergence takes at most 4 * |BB| iterations per direction. In practice, CUDA kernels have shallow CFGs and the iteration converges in 2--3 rounds.
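The framework reduces to repeated OR-propagation until stable. A minimal sketch of one direction (the CFG is given as a successor map and block-local flags are assumed precomputed by the classifier; both inputs are hypothetical):

```python
def fixed_point_or(blocks, succs, local):
    """Propagate boolean access flags from successors until stable.
    local[bb] is True if bb itself contains a qualifying access."""
    flag = dict(local)
    changed = True
    while changed:                     # height-1 lattice: bits only go 0 -> 1
        changed = False
        for bb in blocks:
            merged = flag[bb] or any(flag[s] for s in succs.get(bb, ()))
            if merged and not flag[bb]:
                flag[bb] = True        # monotone update; never cleared
                changed = True
    return flag
```

Monotonicity guarantees termination: each pass either sets at least one new bit or the loop exits, so the iteration count is bounded by the number of bits.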

Cascading Restart Logic

The most expensive aspect of the algorithm is the complete restart after each barrier removal. Consider a function with N barriers:

B0 -- barrier_1 -- B1 -- barrier_2 -- B2 -- barrier_3 -- B3

If barrier_2 is removed first, blocks B1 and B2 merge into a single region. If B1 contained only reads and B2 contained only writes, barrier_1 was previously justified by the RAW hazard between B0's writes and B1's reads. After merging, B1+B2 contains both reads and writes, and barrier_3 might become dead if B3 has no memory accesses. This cascading effect requires full re-analysis.

Worst-case complexity: O(N_barriers * N_BBs * convergence_iterations), where convergence_iterations is bounded by 4 * |BB| but is typically 2--3. For a kernel with N barriers removed in sequence, the total work is O(N * F * C), where F is the per-iteration cost of the dataflow and C is the convergence bound.

In practice, CUDA kernels rarely have more than 10--20 barriers, and cascading removals are uncommon (typically 0--3 restarts), so the theoretical quadratic cost is not a bottleneck.

Relationship to basic-dbe and branch-dist

CICC contains three passes that eliminate dead synchronization barriers. They differ in scope, cost, and the cases they handle:

| Property | basic-dbe | branch-dist | Dead Sync Elimination |
| --- | --- | --- | --- |
| Pass name | basic-dbe | branch-dist | (unnamed, called from module pass) |
| Entry point | llvm::BasicDeadBarrierEliminationPass | sub_1C47810 | sub_2C84BA0 |
| Registration | New PM slot 376 | New PM slot (function pass) | Module-level caller |
| Scope | Single BB / local | Function-level with CFG propagation | Function-level with full restart |
| Dataflow | None (pattern match) | Fixed-point, 13 rb-tree maps | Fixed-point, 12 rb-tree maps |
| Restart on removal | No | Yes (goto LABEL_2) | Yes (goto LABEL_2) |
| Binary size | Small (ctor_261) | 63KB core + helpers | 96KB core + helpers |
| Knobs | basic-dbe enable flag | 10 knobs (ctor_525) | None known (controlled by caller) |

basic-dbe handles trivially dead barriers detectable without dataflow analysis -- cases where the barrier is immediately adjacent to another barrier, or where the enclosing block contains no memory operations at all. It runs in the standard function pass pipeline and is cheap.

branch-dist performs full CFG propagation with 13 red-black tree maps and restart-on-removal, but it uses NVVM IR opcodes (0x36/0x37/0x3A/0x3B/0x4E) rather than the generic LLVM IR opcodes (61/62/65/66/85) used by the full engine. It also has its own address space filtering logic and 10 configurable knobs.

The full dead synchronization elimination engine (sub_2C84BA0) is the most aggressive of the three. It uses the LLVM IR opcode set, applies the element-size gate for loads/stores, and handles the special intrinsic IDs 8260--8262. It runs separately from the New PM function pass pipeline, invoked from module-level callers sub_2C88020 and sub_2C883F0.

Configuration

No dedicated knobs have been identified for the full engine at sub_2C84BA0. Its behavior is controlled entirely by its callers (sub_2C88020, sub_2C883F0), which determine when and whether the engine runs. This is in contrast to branch-dist, which has 10 knobs, and basic-dbe, which has at least an enable flag.

The diagnostic output is gated by an internal condition in the caller, not by a standalone dump knob.

Diagnostic Strings

" Removed dead synch: "
"Read above: "
", Write above: "
", Read below: "
", Write below: "
" in function "
"dbg"

The complete diagnostic message, assembled from these fragments:

 Removed dead synch: [filename:line] in function <name>
Read above: 0, Write above: 0, Read below: 1, Write below: 1

The numeric values are the boolean (0/1) access flags for each category. When the pass removes a barrier, the diagnostic shows exactly why it was safe: which of the four access categories was absent.
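Reassembling the message from the string fragments listed above is mechanical; a sketch (the location string and flag values are illustrative inputs, and the function name is invented):

```python
def format_diagnostic(loc, fn, ra, wa, rb, wb):
    """Assemble the dead-synch diagnostic from its string fragments,
    matching the fragment order found in the binary."""
    return (" Removed dead synch: " + loc + " in function " + fn + "\n"
            + "Read above: "    + str(int(ra))
            + ", Write above: " + str(int(wa))
            + ", Read below: "  + str(int(rb))
            + ", Write below: " + str(int(wb)))
```

Each flag prints as 0 or 1, so a removed barrier's line documents exactly which hazard categories were absent.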

Function Map

| Function | Address | Size | Role |
| --- | --- | --- | --- |
| -- | sub_2C84BA0 | 96KB (3,400 lines) | Main engine: 5-phase algorithm |
| -- | sub_2C83D20 | small | isSyncBarrier predicate |
| -- | sub_2C83AE0 | small | classifyMemoryAccess (read/write classification) |
| -- | sub_2C84640 | medium | Per-BB analysis (bidirectional, direction parameter) |
| -- | sub_2C84590 | small | Red-black tree insert (forward/backward maps) |
| -- | sub_2C84AF0 | small | Red-black tree insert (bridge maps, different node type) |
| -- | sub_2C84080 | small | Map lookup / convergence check helper |
| -- | sub_2C83F20 | small | Map initialization / clear helper |
| -- | sub_2C83D50 | small | Map destructor / cleanup |
| -- | sub_BD3660 | small | hasOneUse -- used for intrinsic IDs 8260--8262 special case |
| -- | sub_CEA1A0 | small | Barrier intrinsic ID confirmation |
| -- | sub_B49E00 | small | isSharedMemoryAccess -- CUDA address space check |
| -- | sub_B43D60 | small | Instruction::eraseFromParent -- barrier deletion |
| -- | sub_B46E30 | small | getNumSuccessors -- CFG successor count |
| -- | sub_B46EC0 | small | getSuccessor(i) -- i-th successor retrieval |
| -- | sub_CB6200 | small | raw_ostream::write -- diagnostic string output |
| -- | sub_B91420 | small | Debug location extraction (filename/line) |
| -- | sub_B91F50 | small | Debug info accessor |
| -- | sub_BD5D20 | small | Type/value accessor |
| -- | sub_22409D0 | small | IR utility (instruction manipulation) |
| -- | sub_CB59D0 | small | raw_ostream integer write |
| -- | sub_CB59F0 | small | raw_ostream integer write (variant) |
| -- | sub_2C88020 | -- | Caller: module-level pass invoking the engine |
| -- | sub_2C883F0 | -- | Caller: module-level pass invoking the engine (variant) |

Common Pitfalls

These are mistakes a reimplementor is likely to make when building an equivalent dead synchronization elimination engine.

1. Removing a barrier that protects a cross-thread shared memory hazard invisible to single-thread analysis. The most dangerous mistake is treating the analysis as a single-thread dataflow problem. The pass classifies memory accesses as read/write per thread, but the barrier's purpose is to order accesses across threads. If thread A writes to smem[tid] above the barrier and thread B reads smem[tid-1] below it, a single-thread view sees no RAW hazard (different addresses). The correct analysis must conservatively assume that any shared memory write above and any shared memory read below constitutes a hazard -- the pass uses boolean flags (not address tracking) precisely because aliasing across threads is unknowable at compile time. A reimplementation that attempts to be "smarter" by tracking addresses will remove barriers that are needed.

2. Not restarting the full analysis after each barrier removal. When a barrier is deleted, the two regions it separated merge into one. This merged region may expose an adjacent barrier as dead (it no longer has memory accesses on one side). A reimplementation that removes all identified dead barriers in a single pass and then stops will miss these cascading redundancies. The restart is mandatory: the pass deliberately uses a goto back to Phase 3 after each removal, re-analyzing the entire function from scratch.

3. Incorrectly classifying call instructions as non-memory-accessing. The access classifier (sub_2C83AE0) must recursively analyze callees to determine if they access shared/global memory. A reimplementation that conservatively marks all calls as read+write will be correct but will retain too many barriers (poor optimization). Conversely, one that ignores calls entirely will remove barriers protecting memory accesses hidden inside called functions. The correct behavior checks the isSharedMemoryAccess predicate on the callee and falls back to read+write if the callee is opaque.

4. Treating __syncthreads_count/and/or (IDs 8260--8262) the same as plain __syncthreads. These barrier variants return a value (lane participation count/and/or). Even when the barrier is dead from a memory-ordering perspective, the return value may be used as data by the program. The pass applies a special hasOneUse check for these IDs. A reimplementation that blindly removes them when the dataflow says "no hazard" will break programs that depend on the return value for algorithmic purposes.

5. Applying the element-size gate too aggressively. The pass filters out loads/stores of types narrower than 512 bits (> 0x1FF), assuming they are register-promoted scalars. A reimplementation that raises this threshold (e.g., to 1024 bits) will miss legitimate memory operations that should keep a barrier alive. Conversely, lowering it to 0 will make the analysis overly conservative, retaining dead barriers for trivial register operations.

Test This

The following kernel contains consecutive __syncthreads() barriers with no shared memory accesses between them. The dead synchronization elimination pass should remove the redundant barriers.

__global__ void dead_sync_test(float* out, int n) {
    __shared__ float smem[256];

    smem[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();    // barrier 1: needed (write above, read below)

    float val = smem[threadIdx.x ^ 1];
    __syncthreads();    // barrier 2: dead -- no write above it (after barrier 1) and no read below it

    __syncthreads();    // barrier 3: consecutive with barrier 2 -- trivially dead

    out[threadIdx.x] = val;
}

What to look for in PTX:

  • Count the number of bar.sync 0; instructions. The kernel has three __syncthreads() calls in source, but only one should survive: barrier 1 (which orders the write to smem against the read from smem[tid^1]). Barriers 2 and 3 have no shared memory hazard to protect.
  • The diagnostic "Removed dead synch:" (visible with internal dump flags) shows the per-category access flags that justified removal: Read above: 0, Write above: 0 means no memory accesses reach the barrier from above.
  • To verify the pass preserves necessary barriers, move the float val = smem[...] read to between barriers 2 and 3. Now barrier 2 orders the write against this read and must survive -- expect two bar.sync instructions.
  • The cascading restart behavior is observable with 5 consecutive __syncthreads() with no memory between them. The pass removes one, restarts the analysis, removes the next, and repeats until only one remains.

Reimplementation Checklist

  1. Barrier identification predicate. Implement the five-condition conjunction: opcode == 85 (internal call), non-null callee, byte[0] == 0 (intrinsic flag), scope match (callee.field[24] == inst.field[80]), convergent attribute (bit 0x20 at byte+33), and barrier intrinsic ID confirmation.
  2. Memory access classifier. Classify every non-barrier instruction as read/write/both/neither based on opcode (store=0x3D, load=0x3E, atomic=0x41, cmpxchg=0x42, call=0x55), with the element-size gate (>511 bits) for loads/stores and recursive analysis for call instructions including shared-memory-access checks.
  3. Bidirectional fixed-point dataflow. Maintain eight red-black tree maps (forward ReadAbove/WriteAbove/ReadBelow/WriteBelow per BB, backward same) populated by scanning each BB in both directions, propagating from successors (forward) and predecessors (backward), iterating until no boolean flips from 0 to 1.
  4. Bridge map construction. After dataflow convergence, populate four bridge maps keyed by barrier instruction pointer, representing the combined read/write access sets crossing each specific barrier boundary.
  5. Elimination decision logic. A barrier is dead if: (ReadAbove==0 AND WriteAbove==0), OR (ReadBelow==0 AND WriteBelow==0), OR (ReadAbove==0 AND WriteBelow==0), OR (WriteAbove==0 AND ReadBelow==0). Handle the special case for intrinsic IDs 8260--8262 (__syncthreads_count/and/or) where single-use return values allow additional removal.
  6. Complete restart after removal. After each barrier deletion, restart the entire dataflow analysis from scratch to handle cascading redundancies where removing one barrier makes adjacent barriers dead.

Cross-References