Synchronization & Warp Intrinsics

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

This page documents how ptxas v13.0.88 handles the lowering of synchronization primitives and warp-level intrinsic operations from PTX source through Ori IR to final SASS instructions. The coverage spans warp vote, shuffle, match, and redux operations; thread-block barriers; memory barriers and fences; warp-level synchronization; asynchronous barriers (mbarrier); and atomic/reduction intrinsic lowering.

| Item | Details |
|---|---|
| PTX codegen handlers | sub_580E50 (vote), sub_5801D0 (shfl), sub_58A730 (match), sub_567680 (redux), sub_524FB0 (bar), sub_570290 (barrier), sub_500BF0 (bar.arrive), sub_570940 (barrier.arrive), sub_52D590 (bar.red), sub_5889B0 (barrier.red), sub_56A5A0 (bar.warp), sub_4DB410 (membar), sub_6C0D90 (atom/red) |
| Intrinsic IDs | 0x01--0x11 (reduxsync, 17), 0x89--0x1FA (sm70 warp/barrier/wmma, 370) |
| Ori IR opcodes | 96 (barrier), 119 (vote), 130 (HSET2 in ROT13; sync internal / shared-mem LEA), 314 (atom/red) |
| SASS opcodes | VOTE, VOTEU, BAR, BAR_INDEXED, MATCH, MEMBAR, WARPSYNC, BSYNC, BSSY, DEPBAR, ERRBAR, ELECT, NANOSLEEP, ATOM, ATOMG, ATOMS, RED |
| Blackwell additions | FENCE_G, FENCE_S, FENCE_T, CGABAR_ARV, CGABAR_GET, CGABAR_SET, CGABAR_WAIT, CGAERRBAR, SYNCS_BASIC, SYNCS_LD_UNIFM |
| Optimizer phases | 25 (StageAndFence), 26 (OriRemoveRedundantBarriers), 42 (ExpandMbarrier), 71 (OptimizeSyncInstructions), 72 (LateExpandSyncInstructions), 99, 100, 114 |
| Intrinsic detection | sub_A9A410 (sm70 warp-sync prefix matcher), sub_A94440 (mbarrier classifier) |
| Related EIATTR | EIATTR_NUM_BARRIERS, EIATTR_NUM_MBARRIERS, EIATTR_MBARRIER_INSTR_OFFSETS, EIATTR_SYNC_STACK, EIATTR_SW_WAR_MEMBAR_SYS_INSTR_OFFSETS, EIATTR_GEN_ERRBAR_AT_EXIT |
| CLI options | --assume-extern-functions-do-not-sync, --no-membermask-overlap, --print-potentially-overlapping-membermasks |
| Related pages | Synchronization & Barriers (passes), Intrinsic Table |

Instruction Classification

ptxas groups synchronization and warp operations into three Ori IR opcode categories that govern scheduling, dependency tracking, and optimization treatment throughout the pipeline.

| Ori Opcode | Category | Instructions | WAR Resource Mask |
|---|---|---|---|
| 96 | Barrier | bar.sync, bar.red, bar.arrive, barrier.* | 0x200001 (bit 0 + bit 21) |
| 119 | Vote | vote.{all,any,uni,ballot}, match.*, redux.*, activemask, elect.sync | 0x1 (bit 0 only) |
| 130 (HSET2) | Sync internal | BAR/MEMBAR pseudo-ops during optimization (actual SASS BAR = 61, MEMBAR = 111) | Used by phases 26, 71 for redundancy analysis |

The WAR resource mask at opcode 96 (2097153 = 0x200001) uniquely identifies barrier instructions to the scoreboard tracker. For all other opcodes, the base mask is 1. The scheduler uses this to insert appropriate stall cycles between barrier producers and consumers.
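A minimal sketch of this mask scheme, assuming only the values stated above (0x200001 for Ori opcode 96, 0x1 for everything else); the helper names are illustrative, not ptxas symbols:

```python
# Hypothetical model of the WAR resource masks described above.
BARRIER_WAR_MASK = 0x200001  # bit 0 + bit 21

def war_mask_for_opcode(ori_opcode: int) -> int:
    """Return the WAR resource mask the scoreboard tracker would use."""
    return BARRIER_WAR_MASK if ori_opcode == 96 else 0x1

def is_barrier(ori_opcode: int) -> bool:
    # Bit 21 uniquely identifies barriers; bit 0 is common to all opcodes.
    return bool(war_mask_for_opcode(ori_opcode) & (1 << 21))
```

Testing bit 21 (rather than the full mask) is what makes barriers distinguishable, since every opcode carries bit 0.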

Warp Vote

Warp vote operations evaluate a per-lane predicate across the active threads in a warp and return a collective result. In ptxas these are lowered through the vote codegen handler at sub_580E50 (approximately 3.2KB of decompiled code).

PTX to SASS Mapping

| PTX Instruction | SASS Opcode | Result | Membermask |
|---|---|---|---|
| vote.sync.all.pred p, q, membermask | VOTE.ALL | Predicate: true iff all active lanes have q=true | Explicit 32-bit mask |
| vote.sync.any.pred p, q, membermask | VOTE.ANY | Predicate: true iff any active lane has q=true | Explicit 32-bit mask |
| vote.sync.uni.pred p, q, membermask | VOTE.UNI | Predicate: true iff q is uniform across active lanes | Explicit 32-bit mask |
| vote.sync.ballot.b32 d, q, membermask | VOTE.BALLOT | Register: 32-bit ballot mask of lanes where q=true | Explicit 32-bit mask |
| activemask.b32 d | CS2R (read SR_LANEMASK_ACTIVE) | Register: current active lane mask | Implicit (all active) |
| elect.sync d, membermask | ELECT.SYNC | Predicate: true for exactly one active lane | Explicit 32-bit mask (sm75+) |

On sm100+ (Blackwell), VOTEU is available as a uniform-register variant of VOTE for cases where the result feeds only uniform consumers.
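The vote semantics in the table can be modeled lane-by-lane in pure Python; this is a clarity sketch of the architectural behavior, not how the hardware executes VOTE:

```python
# Emulate warp-vote semantics over a list of 32 per-lane predicates
# and an explicit 32-bit membermask.
def vote(mode: str, preds: list, membermask: int):
    lanes = [i for i in range(32) if membermask & (1 << i)]
    if mode == "all":
        return all(preds[i] for i in lanes)
    if mode == "any":
        return any(preds[i] for i in lanes)
    if mode == "uni":
        return len({preds[i] for i in lanes}) <= 1
    if mode == "ballot":
        return sum(1 << i for i in lanes if preds[i])
    raise ValueError(mode)
```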

Codegen Handler Structure -- sub_580E50

The vote handler follows the standard intrinsic codegen pattern: allocates a 50,000-byte scratch buffer via sub_424070, then builds an inline PTX function body through sequential sprintf() calls. The handler dispatches on three architecture tiers:

sub_580E50(ctx, string_table):
    instr = *(ctx + 1096)
    buf = alloc(50000)

    // Common prologue: function header + parameter declarations
    sprintf(buf, string_table[309261...])   // ".reg .pred %p; ..."

    // Feature check: sub_70B6E0(instr) -- has membermask operand?
    if has_membermask:
        mask_val = sub_70B780(instr)
        sprintf(buf, format_string, mask_val)

    // Architecture dispatch:
    if (sub_70FA00(instr, 11) || SM > 89) && sub_7081E0(instr) == 1:
        // Path 1: sm90+ with sync variant
        // Emits: vote.sync.{mode} with full operand set
        // Reads operands 0,1,2 via sub_70B8E0, sub_70CA70, sub_709E80, sub_70B4F0

    else if SM > 69 && sub_7081E0(instr) == 1:
        if sub_70FA00(instr, 10) || sub_709E60(instr) == 1:
            // Path 2a: sm70+ with explicit sync
            // Reads operands via sub_70B510, sub_70B8E0

        else:
            // Path 2b: sm70+ standard path
            // Checks sub_70FA00(instr, 17) -- has predicate output
            // Checks sub_70BA10(instr) -- ballot mode
            // SM > 75 branch: different register conventions
            // sub_70A910(instr) == 1: uniform result path

    // Epilogue: closing braces, return buffer

The accessor sub_70FA00(instr, 0) returns the SM architecture level (e.g., 70, 75, 80, 89, 90). The value at parameter 11 checks for a specific feature flag (cluster/sync extension). sub_7081E0(instr) returns the instruction variant (1 = sync form).

Intrinsic Registration

Vote intrinsics are registered as part of the sm70 block (0x89--0x1FA) with four entries:

| Intrinsic Name | PTX Equivalent |
|---|---|
| __cuda_sm70_votesync_all | vote.sync.all.pred |
| __cuda_sm70_votesync_any | vote.sync.any.pred |
| __cuda_sm70_votesync_ballot | vote.sync.ballot.b32 |
| __cuda_sm70_votesync_uni | vote.sync.uni.pred |

Detection of these intrinsics at the IR level is handled by sub_A9A410 (194 bytes binary, 908 bytes decompiled), which performs prefix matching against three string patterns:

// sub_A9A410 -- IntrinsicDetector::isSM70WarpSync
static const char* prefixes[] = {
    "__cuda_sm70_warpsync",
    "__cuda_sm70_votesync_",
    "__cuda_sm70_matchsync_",
};
for (each prefix) {
    name = getSymbolName(instr.symbol_id);
    if (!strncmp(prefix, name, strlen(prefix)))
        return 1;
}
return 0;

This function is called during instruction lowering (subsystem 6 at 0xA9F000--0xAA8000) to identify warp-synchronous call sites that need special handling during barrier optimization.
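The matcher's logic translates almost directly to Python; the three prefix strings are taken verbatim from the decompiled code above, and `is_sm70_warp_sync` is an illustrative name:

```python
# Python rendering of the sub_A9A410 prefix matcher.
SM70_WARPSYNC_PREFIXES = (
    "__cuda_sm70_warpsync",
    "__cuda_sm70_votesync_",
    "__cuda_sm70_matchsync_",
)

def is_sm70_warp_sync(symbol_name: str) -> bool:
    # strncmp(prefix, name, strlen(prefix)) == 0 is exactly startswith().
    return symbol_name.startswith(SM70_WARPSYNC_PREFIXES)
```

Note that shuffle intrinsics are absent from the prefix list, so they are not flagged by this detector.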

Warp Shuffle

Warp shuffle moves data between lanes within a warp. The codegen handler sub_5801D0 (approximately 3.3KB decompiled) generates inline PTX for the four shuffle modes.

PTX to SASS Mapping

| PTX Instruction | SASS Opcode | Data Movement |
|---|---|---|
| shfl.sync.idx.b32 d\|p, a, b, c, membermask | SHF / SHFL | d = lane[b] (direct indexed) |
| shfl.sync.up.b32 d\|p, a, b, c, membermask | SHF / SHFL | d = lane[laneid - b] (shift up) |
| shfl.sync.down.b32 d\|p, a, b, c, membermask | SHF / SHFL | d = lane[laneid + b] (shift down) |
| shfl.sync.bfly.b32 d\|p, a, b, c, membermask | SHF / SHFL | d = lane[laneid ^ b] (butterfly XOR) |

The c operand packs the clamp value and width: c = ((width - 1) << 8) | clamp. The optional predicate output p indicates whether the source lane was within bounds.
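Small helpers for the c-operand packing just described (`c = ((width - 1) << 8) | clamp`); the function names are illustrative:

```python
# Pack/unpack the shfl.sync c operand: high byte = width - 1, low byte = clamp.
def pack_shfl_c(width: int, clamp: int) -> int:
    return ((width - 1) << 8) | clamp

def unpack_shfl_c(c: int):
    return ((c >> 8) & 0xFF) + 1, c & 0xFF  # (width, clamp)
```

For a full-warp shuffle, `pack_shfl_c(32, 31)` yields the familiar 0x1F1F constant.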

Codegen Handler -- sub_5801D0

The shuffle handler is structurally similar to vote. It reads up to 5 operands (source value, lane offset, clamp/width, membermask, and optional predicate output) through the accessor chain:

sub_5801D0(ctx, string_table):
    // string_table offsets start at 311376
    // Prologue with 4 sprintf calls for parameter declarations

    if (SM >= 90 || feature_flag_11) && variant == 1:
        // sm90+ path: reads operands 0..4
        // sub_70B960(instr) -- gets shuffle mode enum
        // sub_70B450(instr) -- gets data type
        // Emits shfl.sync.{mode}.b32 with full 8-operand format

    else if SM > 69 && variant == 1:
        if feature_10 || sub_709E60(instr) == 1:
            // sm70+ explicit sync path
            // 7-operand format

        else:
            // Standard sm70 path
            // Checks sub_70BA10 for predicate output
            // SM > 75: different operand packing

The shuffle mode is obtained via sub_70B960(instr), returning an enum: 0=idx, 1=up, 2=down, 3=bfly. sub_70B450(instr) returns the data type (b32 for standard shuffles).

Intrinsic Registration

| Intrinsic Name | PTX Equivalent |
|---|---|
| __cuda_sm70_shflsync_idx | shfl.sync.idx.b32 |
| __cuda_sm70_shflsync_up | shfl.sync.up.b32 |
| __cuda_sm70_shflsync_down | shfl.sync.down.b32 |
| __cuda_sm70_shflsync_bfly | shfl.sync.bfly.b32 |

Each has a variant with predicate output (e.g., __cuda_sm70_shflsync_idx_pred).

Warp Match

Warp match instructions compare a value across lanes and return which lanes hold matching values. The codegen handler sub_58A730 (approximately 4.5KB decompiled) is the largest in this group.

PTX to SASS Mapping

| PTX Instruction | SASS Opcode | Result |
|---|---|---|
| match.sync.any.b32 d, a, membermask | MATCH.ANY | d: mask of lanes holding the same value as the calling lane |
| match.sync.all.b32 d\|p, a, membermask | MATCH.ALL | d: mask of lanes holding the same value; p: true if ALL active lanes match |
| match.sync.any.b64 d, a, membermask | MATCH.ANY | 64-bit value comparison variant |
| match.sync.all.b64 d\|p, a, membermask | MATCH.ALL | 64-bit value comparison variant |

Codegen Handler -- sub_58A730

The match handler has three architecture tiers and handles both b32 and b64 operand widths:

sub_58A730(ctx, string_table):
    // string_table offsets start at 323786

    if (SM >= 90 || feature_flag_11) && variant == 1:
        // sm90+ path: 7-operand format
        // Reads: sub_70B4F0, sub_709E80, sub_70CA70, sub_70B8E0(0..2)

    else if SM > 69 && variant == 1:
        // sm70+ path
        // Checks feature_10 and sub_709E60 for explicit sync
        // sub_70B940(instr) -- has match predicate output?
        // sub_70D1F0(instr, 0) -- gets operand by index
        // sub_70B950(instr) -- gets comparison mode

Intrinsic Registration

| Intrinsic Name | Variants |
|---|---|
| __cuda_sm70_matchsync_any_b32 | Standard |
| __cuda_sm70_matchsync_any_b64 | 64-bit |
| __cuda_sm70_matchsync_all_b32 | With predicate output |
| __cuda_sm70_matchsync_all_b64 | With predicate output, 64-bit |

Detection uses the same sub_A9A410 prefix matcher with "__cuda_sm70_matchsync_".

Warp Redux

Warp redux performs a warp-wide reduction operation and returns the result to all participating lanes. The codegen handler sub_567680 (approximately 2.0KB decompiled) is relatively compact.

PTX to SASS Mapping

| PTX Instruction | SASS Functional Unit | Operation |
|---|---|---|
| redux.sync.add.s32 d, a, membermask | redux | Warp-wide signed integer addition |
| redux.sync.min.s32 d, a, membermask | redux | Warp-wide signed integer minimum |
| redux.sync.max.s32 d, a, membermask | redux | Warp-wide signed integer maximum |
| redux.sync.min.u32 d, a, membermask | redux | Warp-wide unsigned integer minimum |
| redux.sync.max.u32 d, a, membermask | redux | Warp-wide unsigned integer maximum |
| redux.sync.add.u32 d, a, membermask | redux | Warp-wide unsigned integer addition |
| redux.sync.and.b32 d, a, membermask | redux | Warp-wide bitwise AND |
| redux.sync.or.b32 d, a, membermask | redux | Warp-wide bitwise OR |
| redux.sync.xor.b32 d, a, membermask | redux | Warp-wide bitwise XOR |
| redux.sync.min.f32.NaN d, a, membermask | redux | Warp-wide float minimum (NaN-propagating) |
| redux.sync.max.f32.NaN d, a, membermask | redux | Warp-wide float maximum (NaN-propagating) |
| redux.sync.min.f32.abs d, a, membermask | redux | Warp-wide float absolute minimum |

The scheduler tracks redux operations on the dedicated redux functional unit pipeline, alongside adu, alu, cbu, fma2x, fma, half, transcendental, ipa, lsu, schedDisp, tex, ttu, udp, and the various MMA pipelines.

Codegen Handler -- sub_567680

sub_567680(ctx, string_table):
    // Prologue: function header

    if (SM >= 90 || feature_flag_11) && variant == 1:
        // sm90+ path: 8-operand format
        // Reads: sub_709E80, sub_70CA70, sub_707530, sub_7087C0,
        //        sub_707630, sub_70B8E0(0..2)
        // sub_707630 -- gets reduction operation type
        // sub_7087C0 -- gets data type qualifier

    else if SM > 79 && variant == 1:
        // sm80+ path
        // Two sub-branches:
        //   - feature_10 || explicit_sync || feature_19: simplified format
        //   - Standard: full format with sub_707650, sub_7087F0,
        //     sub_707540, sub_70D1F0

The accessor sub_707630(instr) returns the reduction operation enum (add/min/max/and/or/xor), while sub_7087C0(instr) returns the data type qualifier (s32/u32/b32/f32). Note that redux requires sm80+ hardware; the sm70 block in the intrinsic table registers redux-sync intrinsics as software emulation routines.

Redux Sync Intrinsic Registration (IDs 0x01--0x11)

The earliest 17 intrinsic IDs are dedicated to software-emulated redux-sync operations for pre-sm80 targets:

| ID | Intrinsic Name | Operation |
|---|---|---|
| 0x01 | __cuda_reduxsync_b32_and | Bitwise AND reduction |
| 0x02 | __cuda_reduxsync_b32_or | Bitwise OR reduction |
| 0x03 | __cuda_reduxsync_b32_xor | Bitwise XOR reduction |
| 0x04 | __cuda_reduxsync_f32_max | Float maximum |
| 0x05 | __cuda_reduxsync_f32_min | Float minimum |
| 0x06 | __cuda_reduxsync_f32_max_abs | Float absolute maximum |
| 0x07 | __cuda_reduxsync_f32_min_abs | Float absolute minimum |
| 0x08 | __cuda_reduxsync_f32_max_NaN | Float maximum (NaN-propagating) |
| 0x09 | __cuda_reduxsync_f32_min_NaN | Float minimum (NaN-propagating) |
| 0x0A | __cuda_reduxsync_s32_add | Signed integer sum |
| 0x0B | __cuda_reduxsync_s32_max | Signed integer maximum |
| 0x0C | __cuda_reduxsync_s32_min | Signed integer minimum |
| 0x0D | __cuda_reduxsync_u32_add | Unsigned integer sum |
| 0x0E | __cuda_reduxsync_u32_max | Unsigned integer maximum |
| 0x0F | __cuda_reduxsync_u32_min | Unsigned integer minimum |
| 0x10--0x11 | (additional variants) | Extended operations |

On sm80+, redux PTX instructions lower directly to hardware SASS instructions and bypass the software emulation path.
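A pre-sm80 software redux is typically built from log2(32) butterfly exchanges (the same xor pattern the shfl.sync.bfly instruction provides). The sketch below models that structure over a full 32-lane warp; it is an assumption about the emulation strategy, not a disassembly of the `__cuda_reduxsync_*` routines:

```python
# Butterfly all-reduce over 32 lanes: after 5 rounds of xor-distance
# exchanges (1, 2, 4, 8, 16), every lane holds the reduction result.
def redux_emulate(op, values):
    vals = list(values)          # one entry per lane
    step = 1
    while step < 32:
        vals = [op(vals[i], vals[i ^ step]) for i in range(32)]
        step <<= 1
    return vals
```

The pattern only works for associative, commutative operators, which covers every operation in the redux table.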

Thread-Block Barriers

Thread-block barriers synchronize all threads within a CTA (Cooperative Thread Array). ptxas provides codegen handlers for three PTX barrier families plus their PTX 8.0 cluster-aware equivalents.

PTX Barrier Family

| PTX Handler | Address | Size | PTX Instructions |
|---|---|---|---|
| sub_524FB0 | 0x524FB0 | 1.8KB | bar.sync, bar |
| sub_500BF0 | 0x500BF0 | 1.2KB | bar.arrive |
| sub_52D590 | 0x52D590 | 1.5KB | bar.red.{and,or,popc} |
| sub_570290 | 0x570290 | 2.5KB | barrier.sync, barrier |
| sub_570940 | 0x570940 | -- | barrier.arrive |
| sub_5889B0 | 0x5889B0 | 4.8KB | barrier.red |
| sub_56A5A0 | 0x56A5A0 | 1.9KB | bar.warp.sync |

PTX to SASS Mapping

| PTX Instruction | SASS Opcode | Behavior |
|---|---|---|
| bar.sync N | BAR.SYNC | Block until all CTA threads arrive at barrier N |
| bar.sync N, count | BAR.SYNC | Block until count threads arrive at barrier N |
| bar.arrive N | BAR.ARV | Signal arrival at barrier N without blocking |
| bar.arrive N, count | BAR.ARV | Signal arrival with thread count |
| bar.red.and N, p | BAR.RED.AND | Barrier + warp-level AND reduction on predicate |
| bar.red.or N, p | BAR.RED.OR | Barrier + warp-level OR reduction |
| bar.red.popc N, d | BAR.RED.POPC | Barrier + warp-level population count |
| barrier.cta.sync N | BAR.SYNC | PTX 8.0 cluster-aware CTA barrier |
| barrier.cta.arrive N | BAR.ARV | PTX 8.0 cluster-aware CTA arrive |
| barrier.cta.red N | BAR.RED | PTX 8.0 cluster-aware CTA reduction |

The hardware provides 16 named barriers (indices 0--15). The EIATTR_NUM_BARRIERS metadata records the maximum barrier index used per kernel, which the driver uses to partition the convergence barrier file at launch.

Codegen Handler Details -- sub_524FB0

The bar.sync / bar handler dispatches across three architecture generations:

sub_524FB0(ctx, string_table):
    // string_table offsets start at 294205

    if feature_flag_11 || SM > 89:
        // sm90+ path: 4 format strings for prologue
        // Checks sub_70B930(instr) for count variant:
        //   count != 2: single-operand format (barrier index only)
        //   count == 2: two-operand format (barrier index + thread count)

    else if SM > 69:
        if !feature_13 || sub_70B860(instr) > 69:
            // sm70+ standard path
            // Same count dispatch as sm90
        else:
            // sm70 with feature_13: extended format
            // Reads sub_709400 (barrier scope) and sub_708200 (barrier type)

    else:  // SM <= 69 (pre-Volta)
        // Legacy path
        // sub_709400 -- barrier scope identifier
        // sub_708200 -- barrier operation type
        // count dispatch with 3-operand format strings

The accessor sub_70B930(instr) returns the operand count mode: 1 for single-operand (barrier index only), 2 for two-operand (barrier index + thread count). sub_70C180(instr) returns a special value (-1 for default thread count).

Named Barrier Intrinsic Registration

The sm70 intrinsic block registers barrier operations combinatorially:

| Sub-Category | Count | Combinatorial Source |
|---|---|---|
| __cuda_sm70_barrier_arrive_{0..15} | 16 | 16 barrier indices |
| __cuda_sm70_barrier_arrive_{0..15}_cnt | 16 | With explicit thread count |
| __cuda_sm70_barrier_sync_{0..15} | 16 | 16 barrier indices |
| __cuda_sm70_barrier_sync_{0..15}_cnt | 16 | With explicit thread count |
| __cuda_sm70_barrier_red_and_{0..15} | 16 | AND reduction per barrier |
| __cuda_sm70_barrier_red_and_{0..15}_cnt | 16 | With thread count |
| __cuda_sm70_barrier_red_or_{0..15} | 16 | OR reduction per barrier |
| __cuda_sm70_barrier_red_or_{0..15}_cnt | 16 | With thread count |
| __cuda_sm70_barrier_red_popc_{0..15} | 16 | POPC reduction per barrier |
| __cuda_sm70_barrier_red_popc_{0..15}_cnt | 16 | With thread count |

This combinatorial explosion produces 160 intrinsic entries for barriers alone (16 indices x 2 count variants x 5 operation types).
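The registration can be reproduced with a short generator, which also double-checks the 160-entry count:

```python
# Enumerate the sm70 barrier intrinsic names described above:
# 5 operations x 16 barrier indices x 2 count variants = 160 entries.
ops = ["arrive", "sync", "red_and", "red_or", "red_popc"]
names = [
    f"__cuda_sm70_barrier_{op}_{idx}{suffix}"
    for op in ops
    for idx in range(16)
    for suffix in ("", "_cnt")
]
```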

Barrier Codegen Pattern -- sub_570290

The barrier (PTX 8.0 form) handler at sub_570290 (2.5KB) is the most complex barrier handler. It adds cluster-awareness for sm90+ and handles the barrier.cta.* variants. The handler has an elaborate multi-level dispatch:

sub_570290(ctx, string_table):
    // sm90+ path: additional CTA scope parameters
    //   sub_709E80(instr) -- barrier scope enum
    //   sub_70B8E0(instr, 0) -- barrier index operand
    //   sub_70B8E0(instr, 1) -- thread count operand

    // sm70+ with feature_10 or explicit sync:
    //   Separate code paths for count=1 vs count=2

    // sm70+ standard (no explicit sync, no feature_10):
    //   Six format strings building an elaborate inline PTX body
    //   Handles sub_70B930 count modes 1 and 2
    //   Checks sub_70C180 for default-count (-1) vs explicit count
    //   Generates bar.sync N [, count]; or bar.arrive N [, count];

Memory Barriers and Fences

Memory Barriers -- sub_4DB410

The membar codegen handler at sub_4DB410 (84 lines decompiled) is the smallest sync handler. It generates memory barrier instructions across three scope levels.

| PTX Instruction | SASS Opcode | Scope |
|---|---|---|
| membar.cta | MEMBAR.CTA | Thread block (CTA) |
| membar.gpu | MEMBAR.GPU | Device (GPU) |
| membar.sys | MEMBAR.SYS | System (all agents) |

sub_4DB410(ctx, string_table):
    mode = sub_709FE0(instr)

    if mode != 2 && mode != 4:
        // Standard 3-operand format
        // sub_70B710(instr) -- scope qualifier
        // sub_709FF0(instr) -- barrier type
        // sub_70A530(instr) -- additional qualifier

    else:  // mode 2 or 4 (cta or sys)
        if SM > 49:
            // sm50+ uses 3-operand format with scope
        else:
            // Pre-sm50: 2-operand membar + separate scope annotation

The EIATTR_SW_WAR_MEMBAR_SYS_INSTR_OFFSETS metadata records the byte offset of every membar.sys instruction in the output SASS. This allows the driver to apply software workarounds (WAR patches) at specific membar.sys locations for known hardware errata.

Fence Operations

Fence operations enforce ordering between different memory proxy domains. These are not exposed as separate PTX codegen handlers but are inserted by the compiler's synchronization passes (phases 25 and 72).

| PTX Instruction | SASS Opcode (sm100+) | Purpose |
|---|---|---|
| fence.proxy.alias | (expanded inline) | Orders generic/alias memory accesses |
| fence.proxy.async | (expanded inline) | Orders async copy completion visibility |
| fence.proxy.async.global | (expanded inline) | Global memory async fence |
| fence.sc.cta | FENCE_S | Sequentially-consistent fence, CTA scope |
| fence.sc.gpu | FENCE_G | Sequentially-consistent fence, GPU scope |
| fence.acq_rel.cta | FENCE_T | Acquire-release fence, CTA scope |

On Blackwell (sm100+), dedicated FENCE_G, FENCE_S, and FENCE_T SASS opcodes replace the older pattern of MEMBAR + proxy annotation sequences.

StageAndFence (Phase 25)

The StageAndFence pass (sub_1392E30, 166 bytes entry, sub_1390B30 8,956 bytes core) inserts fence instructions after loop unrolling to re-establish memory ordering correctness. When loop unrolling replicates memory operations that crossed a synchronization boundary in the original loop body, this pass inserts fence.proxy or MEMBAR pseudo-instructions at the boundaries of unrolled iterations.

The core function takes floating-point parameters (double/__m128d), indicating it incorporates latency and throughput heuristics when deciding fence placement -- preferring to merge adjacent fences or delay them to overlap with independent computation.

Warp-Level Synchronization

WARPSYNC

WARPSYNC mask synchronizes the threads in a warp identified by the lane mask. This is the fundamental warp-level sync primitive on sm70+ (Volta and later).

| PTX | SASS | Purpose |
|---|---|---|
| bar.warp.sync membermask | WARPSYNC mask | Synchronize warp lanes specified by mask |

The intrinsic __cuda_sm70_warpsync (single entry in the sm70 block) is the simplest warp-sync intrinsic, and is detected by the same sub_A9A410 prefix matcher that handles vote and match.

BSSY / BSYNC (Convergence Barriers)

The BSSY / BSYNC instruction pair replaces the pre-Volta implicit reconvergence stack. The compiler must insert these pairs explicitly at divergence/reconvergence points:

| SASS Opcode | Purpose |
|---|---|
| BSSY B, target | Push a synchronization barrier; target is the reconvergence point |
| BSYNC B | Pop and wait at the convergence barrier B |

These are not directly exposed as PTX instructions -- they are inserted by the compiler during phase 72 (LateExpandSyncInstructions) and the architecture-specific sync expansion passes (phases 99, 100, 114). The EIATTR_SYNC_STACK metadata records the convergence barrier stack depth.

ELECT

ELECT.SYNC (sm75+) elects a single active lane from the warp, setting a predicate register to true for exactly one thread.

In the SASS opcode table, ELECT appears among the Blackwell-era additions alongside ENDCOLLECTIVE, PREXIT, SETMAXREG, and SETSMEMSIZE. The ELECT opcode is used for leader-based algorithms where only one thread per warp performs a shared operation.

Asynchronous Barriers (MBARRIER)

Introduced with sm90 (Hopper), mbarrier provides hardware-accelerated asynchronous barriers in shared memory. These are critical for async copy (cp.async.bulk), TMA operations, and warpgroup-level synchronization.

MBARRIER Operation Classification

ptxas classifies mbarrier operations through sub_A94440 (MBarrierDetector, 412 bytes binary) and sub_A9A5F0 (MBarrierClassifier). The classifier resolves the mbarrier symbol name (prefix %mbarrier_) and returns an operation type enum:

| Type ID | Suffix | Operation |
|---|---|---|
| 0 | INIT | Initialize barrier object in shared memory |
| 1 | ARRIVE | Signal arrival (non-blocking) |
| 2 | ARRIVE_NOCOMPLETE | Arrive without completing the phase |
| 3 | ARRIVE_DROP | Arrive and decrement expected count |
| 4 | ARRIVE_DROP_NOCOMPLETE | Arrive, drop, no completion |
| 5 | TEST_WAIT | Test if barrier phase is complete |
| 6 | TEST_WAIT_PARITY | Phase-parity-based completion test |
| 7 | CP_ASYNC_ARRIVE | cp.async arrival notification |
| 8 | INVAL | Invalidate barrier |
| 9 | TRY_WAIT | Wait with timeout |
| 10 | TRY_WAIT_PARITY | Phase-parity-based wait with timeout |
| 11 | EXPECT_TX | Set expected transaction byte count |
| 12 | COMPLETE_TX | Mark transaction bytes as complete |

The "trivial" mbarrier operations (types 0--8) are handled inline; "non-trivial" operations (types 9--12, including EXPECT_TX and the extended TRY_WAIT variants) require more complex lowering.

Mbarrier Symbol Naming

Mbarrier objects are tracked through shared memory symbols following the pattern:

%mbarrier_{basename}_{operation}

The resolver at sub_A9A920 extracts the base name from the full symbol (e.g., %mbarrier_pipeline0_ARRIVE yields base name pipeline0). The format string "%%mbarrier_%s_%s" at sub_AA33C0 is used for symbol construction during mbarrier expansion.
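The symbol construction and base-name extraction can be mirrored in Python; `mbarrier_symbol` uses the same `"%%mbarrier_%s_%s"` format string (which renders the leading `%%` as a literal `%`), while `mbarrier_base` is an illustrative inverse of the sub_A9A920 resolver:

```python
# Construct and parse mbarrier shared-memory symbol names.
def mbarrier_symbol(base: str, operation: str) -> str:
    return "%%mbarrier_%s_%s" % (base, operation)

def mbarrier_base(symbol: str, operation: str) -> str:
    # e.g. "%mbarrier_pipeline0_ARRIVE" -> "pipeline0"
    prefix, suffix = "%mbarrier_", "_" + operation
    assert symbol.startswith(prefix) and symbol.endswith(suffix)
    return symbol[len(prefix):-len(suffix)]
```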

Reserved shared memory regions for TMA pipeline mbarriers:

  • __nv_reservedSMEM_tmem_allocation_pipeline_mbarrier
  • __nv_reservedSMEM_tmem_allocation_pipeline_mbarrier_parity

ExpandMbarrier (Phase 42)

Phase 42 expands mbarrier pseudo-instructions into native barrier sequences through architecture vtable dispatch:

mov    rdi, [rsi+0x630]     ; rdi = ctx->arch_backend (offset 1584)
mov    rax, [rdi]            ; rax = arch_backend->vtable
jmp    [rax+0x168]           ; call vtable[45] -- ExpandMbarrier impl

The expansion is architecture-specific:

  • Pre-sm90: No mbarrier pseudo-ops exist; the phase is a no-op
  • sm90 (Hopper): Expands to hardware mbarrier instruction sequences, resolves shared memory addresses, inserts fence.proxy for coherence
  • sm100+ (Blackwell): Extended semantics for tcgen05.fence, cluster-level barriers, and async pipeline operations

Expansion pattern:

Before (Ori pseudo-ops):           After (native SASS):
MBARRIER_INIT  %mbar, count       MBARRIER.INIT  [smem], count
MBARRIER_ARRIVE_EXPECT_TX          MBARRIER.ARRIVE.EXPECT_TX  [smem], bytes
  %mbar, bytes
CP.ASYNC.BULK.TENSOR               CP.ASYNC.BULK.TENSOR  [dst], [src], [smem]
  dst, src, %mbar
MBARRIER_TRY_WAIT_PARITY           MBARRIER.TRY_WAIT.PARITY  pred, [smem],
  %mbar, parity, pred                parity

EIATTR Metadata

| EIATTR | Purpose |
|---|---|
| EIATTR_NUM_MBARRIERS | Count of mbarrier objects used by the kernel |
| EIATTR_MBARRIER_INSTR_OFFSETS | Byte offsets of mbarrier instructions for driver patching |

Blackwell CGA Barriers

On sm100+ (Blackwell), a new class of CGA (Cooperative Grid Array) barriers extends the mbarrier concept to cluster-level synchronization:

| SASS Opcode | Purpose |
|---|---|
| CGABAR_ARV | CGA barrier arrive |
| CGABAR_GET | Query CGA barrier state |
| CGABAR_SET | Set CGA barrier parameters |
| CGABAR_WAIT | Wait on CGA barrier |
| CGAERRBAR | CGA error barrier |

Atomic/Reduction Intrinsic Lowering

The OCG atomic/reduction handler at sub_6C0D90 (19KB, 813 decompiled lines) lowers atom.* and red.* intrinsic calls into Ori IR opcode 314 instructions. Unlike the warp-level sync handlers (which generate inline PTX via sprintf), this function works at the Ori IR level directly: it parses a sub-op parameter array, validates all qualifier combinations, resolves operands to register references, and emits the final instruction through sub_92C240. All diagnostics use error code 7308 and the prefix "Instrinsic - \"%s\"" (the typo is in the binary).

Parameter Array Parsing

The intrinsic name is decomposed by the OCG name parser (sub_6C9BC0) into an integer token array stored at *(ctx+10704). Each token encodes one qualifier dimension of the atomic operation. The handler reads tokens sequentially through a switch-case loop:

| Token | Variable | Semantic | Decoded Value |
|---|---|---|---|
| 0 | memory_order=4 | Memory ordering | Relaxed |
| 1 | domain=12, is_mmio=1 | Memory domain | MMIO (global, mapped) |
| 2 | domain=5 | Memory domain | Shared (_shared) |
| 3 | scope=5 | Visibility scope | .cta |
| 4 | scope=6 | Visibility scope | .gpu |
| 5 | is_noreturn=1 | Return behavior | Reduction (fire-and-forget, no return value) |
| 6 | data_size=2 | Operand width | 64-bit (u64) |
| 7 | data_size=4 | Operand width | Vector (v2/v4) |
| 8 | data_type=12 | Data type | .f32 |
| 9 | data_type=11 | Data type | .s32 |
| 10 | data_type=10 | Data type | .u32 (also .u64 when size=2) |
| 11 | op=0 | Operation | .add |
| 12 | op=1 | Operation | .min |
| 13 | op=2 | Operation | .max |
| 14 | op=3 | Operation | .inc |
| 15 | op=4 | Operation | .dec |
| 16 | op=5 | Operation | .and |
| 17 | op=6 | Operation | .or |
| 18 | op=7 | Operation | .xor |

Tokens not matching any case are silently skipped. If the parameter array is empty (no tokens), all values take defaults: data_type=1 (unspecified), op=-1 (unspecified), data_size=1 (32-bit), and all flags zero.
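A minimal model of that token loop, covering a subset of the qualifier tokens from the table; the defaults match the empty-array case above, and the dictionary field names are descriptive rather than ptxas's internal variable names:

```python
# Sketch of the switch-case token parser for the atom/red sub-op array.
def parse_atom_tokens(tokens):
    st = {"data_type": 1, "op": -1, "data_size": 1,
          "scope": 0, "memory_order": 0, "is_noreturn": 0}
    for t in tokens:
        if t == 0:    st["memory_order"] = 4   # relaxed
        elif t == 3:  st["scope"] = 5          # .cta
        elif t == 4:  st["scope"] = 6          # .gpu
        elif t == 5:  st["is_noreturn"] = 1    # red.* form
        elif t == 6:  st["data_size"] = 2      # 64-bit
        elif t == 8:  st["data_type"] = 12     # .f32
        elif t == 9:  st["data_type"] = 11     # .s32
        elif t == 10: st["data_type"] = 10     # .u32
        elif 11 <= t <= 18: st["op"] = t - 11  # add..xor
        # unrecognized tokens are silently skipped, as in the binary
    return st
```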

Modifier Word Encoding

The parsed qualifiers are packed into a single 32-bit modifier word that accompanies the Ori instruction through the pipeline to ISel:

Bit [14:13] = type encoding:  00=u32  01=s32  10=u64/f32  11=invalid
Bit [12:10] = operation:      0=add  1=min  2=max  3=inc  4=dec  5=and  6=or  7=xor
Bit [8]     = no-return:      1=reduction (red.*)  0=atomic (atom.*)
Bit [7:5]   = memory order:   4=relaxed (only value supported here)
Bit [4:2]   = scope:          5=cta  6=gpu
Bit [1:0]   = operand flags:  bit0=(addr_type==u32)  bit1=(data_type==u32)
Top nibble  = 0x6 (constant marker: 0x60000000)

The type encoding bits [14:13] are set during cross-validation: s32 with 32-bit width sets 0x2000, u32/f32 with 64-bit width sets 0x4000, and invalid combinations set 0xE000 (with an error).
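The bit layout above can be packed mechanically; this encoder is a sketch that fixes the memory order to relaxed (the only value supported here) and takes the pre-shifted type encoding (0x2000 or 0x4000) as an argument:

```python
# Pack the 32-bit modifier word per the layout described above.
def encode_modifier(op, scope, no_return, type_bits,
                    addr_is_u32=False, data_is_u32=False):
    word = 0x60000000                 # constant top-nibble marker
    word |= type_bits                 # bits [14:13], pre-shifted
    word |= (op & 0x7) << 10          # bits [12:10]: operation
    word |= (no_return & 1) << 8      # bit 8: red.* vs atom.*
    word |= 4 << 5                    # bits [7:5]: relaxed order
    word |= (scope & 0x7) << 2        # bits [4:2]: scope
    word |= int(addr_is_u32) | (int(data_is_u32) << 1)
    return word
```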

Validation Chain

The handler enforces a strict 10-phase validation sequence. Each failure emits error 7308 with a descriptive message:

| Phase | Check | Error Message |
|---|---|---|
| 1 | Domain must be _shared (5) or _mmio (12); otherwise global is assumed but errors if explicitly set to something else | "Domain param '_shared' or '_global' required" |
| 2 | Vector type operand count must match expected operand count from data_size | "Vector type does not match number of subops" |
| 3 | Data type must be explicitly set (not default 1) | "Type {u32, s32, u64} not specified" |
| 4 | Vector width (v12>1) requires u32 (10) or f32 (12) type | "Vector supported only for {u32, u64}" |
| 5 | Operation must be explicitly set (not default -1); emitted twice | "Op {add, min, max, inc, dec, and, or, xor} not specified" |
| 6 | Shared-domain reductions only support .add | "Unsupported non _add global memory reduction" |
| 7a | Scope without memory order is deprecated | "Deprecated scope without memory order semantics" (warning) |
| 7b | Memory order requires scope | "Required scope with memory order semantics" |
| 8 | MMIO semantics require global domain | "Domain param '_global' required for mmio semantics" |
| 9 | s32 requires 32-bit; f32+64-bit only with add; otherwise invalid | "Invalid data type / op combination" or "Invalid vector / data type combination" |
| 10 | Each data operand's type field must match declared type; address operand must be u32 (10) or f32 (12) | "Operand type does not match specified type" / "Unexpected instrinsic type (%s) in param (%d)" |

Operand Resolution and Shared Memory Address Materialization

After validation, the handler resolves up to three operand slots:

  1. Destination/address: Resolved via sub_926370 into a 24-bit register ID, then tagged with 0x50000000 (output register class marker).

  2. Source data operand: Read from the operand descriptor array at *(ctx+10728). Routing depends on bits [30:28] of the operand word:

    • Value 5 (shared memory pointer): Allocates a temporary register in class 6 via sub_91BF30, then emits an Ori opcode 130 pseudo-instruction via sub_934630 to materialize the shared memory address into a general register. This extra LEA/MOV is necessary because ATOMS requires an explicit register operand, not a symbolic shared-memory reference.
    • Value 1 with !(operand[1] & 0x1000000): Direct register reference (24-bit register ID from low bits).
    • Otherwise: Full register resolution through sub_91D150 + operand legalization through sub_7DEFA0.
  3. Second data operand (MMIO only, v109=1): Same three-way resolution for the second source, reading from operand descriptor offset +12.

The final instruction is emitted as:

sub_92C240(output, ctx, 314, data_type, operand_count, operand_buffer, 1)

where Ori opcode 314 represents the unified ATOM/RED operation.

SASS Opcode Selection

The Ori opcode 314 instruction flows through the optimizer pipeline and reaches ISel (sub_C0EB10), which selects the final SASS opcode based on the domain and no-return flag encoded in the modifier word:

| Modifier Bits | SASS Opcode | ROT13 | Table Entry | PTX Equivalent |
|---|---|---|---|---|
| domain=global, no-return=0 | ATOMG | NGBZT | 103 | atom.global.* |
| domain=shared, no-return=0 | ATOMS | NGBZF | 105 | atom.shared.* |
| domain=generic, no-return=0 | ATOM | NGBZ | 102 | atom.* |
| no-return=1 | RED | ERQ | 104 | red.* |

The operation bits [12:10] further select the SASS sub-opcode qualifier (.ADD, .MIN, .MAX, .INC, .DEC, .AND, .OR, .XOR). The type bits [14:13] determine the data type suffix (.32, .64, .S32, .U32, .F32).
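The opcode choice reduces to a small dispatch on domain and the no-return bit; the domain is shown symbolically here since the text does not give its exact bit position in the modifier word:

```python
# Select the final SASS opcode per the ISel table above.
def select_sass_opcode(domain: str, no_return: bool) -> str:
    if no_return:
        return "RED"                      # red.* ignores the domain split
    return {"global": "ATOMG",
            "shared": "ATOMS",
            "generic": "ATOM"}[domain]
```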

Scope and Memory Order (sm70+)

When scope and memory order are both present, the modifier word carries them through to ISel where they become SASS instruction modifiers:

  • Scope .cta (token 3, value 5): Atomic is visible only within the CTA
  • Scope .gpu (token 4, value 6): Atomic is visible to all thread blocks on the device
  • Memory order relaxed (token 0, value 4): No ordering guarantees beyond atomicity

The handler does not encode acquire, release, or acq_rel memory orders -- these are handled by the separate memory fence/order handler at sub_6C1CF0. The deprecation warning for scope-without-order indicates ptxas is transitioning toward requiring explicit memory order qualifiers for all scoped atomics.

Limitations and Notable Behavior

  1. No CAS/EXCH tokens: The parameter array parser has no tokens for .cas (compare-and-swap) or .exch (exchange). These operations are either encoded through a different OCG intrinsic or use a distinct sub-op encoding not visible in this function's switch-case.

  2. Shared-memory restriction: Only atom.shared.add is supported as a reduction. All other shared-memory reduction operations (red.shared.{min,max,and,or,xor}) are rejected.

  3. MMIO path: Token 1 (domain=MMIO) enables a separate code path that processes two data operands instead of one. This supports the MMIO atomic semantics where both the address and a data value must be explicitly resolved.

  4. Error message bug: The message "Unsupported non _add global memory reduction" fires for shared-memory non-add reductions despite saying "global". This is likely a copy-paste artifact in the ptxas source.

Synchronization Pipeline Summary

The complete synchronization processing pipeline spans 8 optimizer phases:

| Phase | Name | Category | Purpose |
|---|---|---|---|
| 25 | StageAndFence | Lowering | Insert fences after loop unrolling |
| 26 | OriRemoveRedundantBarriers | Optimization | Dataflow-driven barrier elimination (multi-function) |
| 42 | ExpandMbarrier | Lowering | Expand mbarrier pseudo-ops via arch vtable |
| 71 | OptimizeSyncInstructions | Optimization | Sync instruction redundancy elimination |
| 72 | LateExpandSyncInstructions | Lowering | Final sync pseudo-op expansion to SASS |
| 99 | (Architecture-specific) | Lowering | Post-RA sync expansion |
| 100 | (Architecture-specific) | Lowering | Architecture vtable dispatch for sync |
| 114 | (Post-scheduling) | Fixup | Post-scheduling dependency barrier fixup |

The progression is: early fence insertion (25) -> cross-function barrier elimination (26) -> mbarrier expansion (42) -> optimization within partial-SSA (71) -> final expansion (72) -> post-RA architecture hooks (99, 100) -> post-scheduling fixup (114).

Ori IR Opcode 130 -- Sync Analysis Target

The optimizer phases 26 and 71 identify synchronization instructions by checking for Ori opcode 130 (HSET2 in the ROT13 name table; used as an internal Ori IR marker for barrier/sync instructions -- actual SASS BAR is opcode 61, MEMBAR is opcode 111). For each barrier instruction found:

  1. Extract the destination operand from field+84
  2. Resolve the register through the register table at context+88
  3. Test whether the register's use-count (reg+24) indicates consumption
  4. If the barrier result is dead (no thread observes it before the next dominating barrier), eliminate the barrier
  5. At block boundaries, attempt to merge barriers with compatible scopes
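The core of step 4 reduces to: a sync marker whose destination register has no remaining consumers is removed. A minimal sketch under that reading, with a flat instruction list and a use-count map standing in for the field+84 / context+88 / reg+24 structures:

```python
def eliminate_dead_barriers(instrs, use_count):
    """Drop opcode-130 sync markers whose destination register is unused.

    `instrs` is a list of (opcode, dest_reg) pairs; `use_count` maps a
    register ID to its consumer count (stand-in for the reg+24 field).
    """
    SYNC_MARKER = 130  # internal Ori IR barrier/sync marker
    kept = []
    for opcode, dest in instrs:
        if opcode == SYNC_MARKER and use_count.get(dest, 0) == 0:
            continue  # barrier result is dead: no observer consumes it
        kept.append((opcode, dest))
    return kept
```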

Knobs and Gating

| Knob / Flag | Effect |
|---|---|
| Knob 487 | Master gate for sync optimization (checked by phases 25, 71, 72) |
| Knob 358 | Sync mode selector (0=none, 1=conservative, 2=aggressive, 3+=full multi-function) |
| Knob 472 | Barrier liveness analysis mode |
| context+1368 bit 0 | Global sync optimization enable |
| context+1397 bits [6:7] | Architecture-specific sync configuration |
| context+1398 bits [3:4] | Sync expansion mode (architecture-dependent) |
| DisableErrbarAfterMembar | Suppress error barrier insertion after membar |

SASS Opcode Summary Table

Complete mapping of all synchronization and warp SASS opcodes, with their ROT13-encoded internal names as found in the ptxas binary:

| ROT13 (Binary) | SASS Opcode | Table Offset | Category |
|---|---|---|---|
| IBGR | VOTE | 4600 | Warp vote |
| IBGRH | VOTEU | -- | Uniform warp vote (sm100+) |
| ONE | BAR | 5160 | Thread-block barrier |
| ONE_VAQRKRQ | BAR_INDEXED | -- | Indexed barrier variant |
| REEONE | ERRBAR | 4184 | Error barrier |
| QRCONE | DEPBAR | -- | Dependency barrier (scoreboard) |
| ZNGPU | MATCH | -- | Warp match |
| ZRZONE | MEMBAR | -- | Memory barrier |
| JNECFLAP | WARPSYNC | -- | Warp synchronize |
| OFLAP | BSYNC | -- | Convergence barrier sync |
| OFFL | BSSY | -- | Convergence barrier set |
| E2O | R2B | 5128 | Register to barrier transfer |
| -- | ELECT | -- | Warp lane election (sm75+) |
| -- | NANOSLEEP | -- | Nanosleep hint |
| -- | FENCE_G | -- | Global fence (sm100+) |
| -- | FENCE_S | -- | Shared fence (sm100+) |
| -- | FENCE_T | -- | Texture fence (sm100+) |
| -- | CGABAR_ARV | -- | CGA barrier arrive (sm100+) |
| -- | CGABAR_GET | -- | CGA barrier query (sm100+) |
| -- | CGABAR_SET | -- | CGA barrier set (sm100+) |
| -- | CGABAR_WAIT | -- | CGA barrier wait (sm100+) |
| -- | CGAERRBAR | -- | CGA error barrier (sm100+) |
| -- | SYNCS_BASIC | -- | Basic sync (sm100+) |
| -- | SYNCS_LD_UNIFM | -- | Sync with uniform load (sm100+) |
| NGBZ | ATOM | 102 | Atomic (generic address space) |
| NGBZT | ATOMG | 103 | Atomic (global memory) |
| ERQ | RED | 104 | Reduction (fire-and-forget) |
| NGBZF | ATOMS | 105 | Atomic (shared memory) |
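The ROT13 relationship can be checked mechanically: letters rotate by 13 while digits and underscores pass through, which matches entries like E2O -> R2B and ONE_VAQRKRQ -> BAR_INDEXED. For example:

```python
import codecs

def decode_name(rot13_name):
    # Python's built-in rot13 codec rotates A-Z/a-z and leaves
    # digits and punctuation (e.g. '2', '_') untouched.
    return codecs.decode(rot13_name, "rot13")

print(decode_name("IBGR"))         # VOTE
print(decode_name("ONE_VAQRKRQ"))  # BAR_INDEXED
print(decode_name("NGBZT"))        # ATOMG
```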

Key Function Reference

| Address | Identity | Size | Purpose |
|---|---|---|---|
| sub_580E50 | VoteCodegen | ~3.2KB | PTX vote.* to inline PTX body |
| sub_5801D0 | ShflCodegen | ~3.3KB | PTX shfl.* to inline PTX body |
| sub_58A730 | MatchCodegen | ~4.5KB | PTX match.* to inline PTX body |
| sub_567680 | ReduxCodegen | ~2.0KB | PTX redux.* to inline PTX body |
| sub_524FB0 | BarSyncCodegen | ~1.8KB | PTX bar.sync / bar |
| sub_570290 | BarrierCodegen | ~2.5KB | PTX barrier.* (PTX 8.0) |
| sub_500BF0 | BarArriveCodegen | ~1.2KB | PTX bar.arrive |
| sub_570940 | BarrierArriveCodegen | -- | PTX barrier.arrive |
| sub_52D590 | BarRedCodegen | ~1.5KB | PTX bar.red.{and,or,popc} |
| sub_5889B0 | BarrierRedCodegen | ~4.8KB | PTX barrier.red |
| sub_56A5A0 | BarWarpCodegen | ~1.9KB | PTX bar.warp.sync |
| sub_4DB410 | MembarCodegen | ~0.8KB | PTX membar.* |
| sub_A9A410 | isSM70WarpSync | 194B | Intrinsic prefix detection |
| sub_A94440 | isNonTrivialMBarrier | 412B | Mbarrier operation classifier |
| sub_A9A5F0 | classifyMBarrier | -- | Mbarrier type enum resolver |
| sub_A9A920 | resolveMBarrierBaseName | -- | Extract mbarrier base name from symbol |
| sub_AA33C0 | constructMBarrierSymbol | -- | Build %%mbarrier_%s_%s symbol |
| sub_1392E30 | StageAndFence (phase 25) | 166B entry | Post-unroll fence insertion |
| sub_1390B30 | StageAndFence core | 8,956B | Main fence insertion logic |
| sub_790A40 | RemoveRedundantBarriers | 2,288B | Cross-function barrier elimination |
| sub_90A340 | OptimizeSyncInstructions | 1,670B | Sync instruction optimization |
| sub_1381DA0 | LateExpandSync | 1,517B | Final sync expansion |
| sub_6C0D90 | LowerAtomicRedIntrinsic | ~19KB | OCG atom.*/red.* to Ori opcode 314 |
| sub_C0EB10 | InstructionSelector | 185KB | Main ISel (handles all sync opcodes) |