Warp-Level Operation Builtins
Warp-level builtins provide lane-to-lane communication within a 32-thread warp. They cover four major categories: shuffle (data exchange between lanes), vote (predicate aggregation), match (value matching across lanes), and redux (warp-wide reductions). The shuffle operations also serve as the lowering target for the WMMA fragment load/store operations described in the tensor core page.
Shuffle Operations (IDs 413--416)
The __shfl_sync family enables direct register-to-register communication between warp lanes. Four shuffle modes exist, each registered as a _sync variant:
| ID | Builtin | Mode | Description |
|---|---|---|---|
| 413 | __nvvm_shfl_up_sync | Up | Lane reads from lane - delta |
| 414 | __nvvm_shfl_down_sync | Down | Lane reads from lane + delta |
| 415 | __nvvm_shfl_bfly_sync | Butterfly | Lane reads from lane XOR delta |
| 416 | __nvvm_shfl_idx_sync | Index | Lane reads from arbitrary srcLane |
Shuffle Dispatch via Table Lookup
All shuffle builtins route through sub_12B3540 (EDG) / sub_954F10 (NVVM), the table-based lowering handler. Three groups of 8 IDs each cover the complete shuffle interface:
| ID Range | Group | Description |
|---|---|---|
| 302--309 | Legacy __shfl | Non-sync variants (4 modes x 2 types: i32/f32) |
| 338--345 | __shfl_sync | Sync variants with mask (4 modes x 2 types) |
| 395--402 | __shfl_*_sync | Newer SM interface (4 modes x 2 types) |
Within each group of 8, the layout is:
| Offset | Mode | i32 Variant | f32 Variant |
|---|---|---|---|
| +0, +1 | shfl_up | offset +0 | offset +1 |
| +2, +3 | shfl_down | offset +2 | offset +3 |
| +4, +5 | shfl_xor | offset +4 | offset +5 |
| +6, +7 | shfl_idx | offset +6 | offset +7 |
The handler builds the argument list (mask, value, delta/lane, width), looks up the target intrinsic by shuffle mode and data type from its red-black tree map, and emits a function call.
Vote Operations (IDs 351--358)
Warp vote builtins aggregate a boolean predicate across all participating lanes. Both legacy (non-sync) and sync variants are registered.
| ID | Builtin | Operation | Sync |
|---|---|---|---|
| 351 | __nvvm_vote_all | All predicates true? | No |
| 352 | __nvvm_vote_any | Any predicate true? | No |
| 353 | __nvvm_vote_uni | All predicates equal? | No |
| 354 | __nvvm_vote_ballot | Bitmask of predicates | No |
| 355 | __nvvm_vote_all_sync | All predicates true? | Yes |
| 356 | __nvvm_vote_any_sync | Any predicate true? | Yes |
| 357 | __nvvm_vote_uni_sync | All predicates equal? | Yes |
| 358 | __nvvm_vote_ballot_sync | Bitmask of predicates | Yes |
Vote Lowering
The handler sub_12ABB90 (EDG) / sub_94D570 (NVVM) takes parameters:
(result, ctx, vote_op, args, is_ballot, is_sync)
The vote_op encoding: 0 = all, 1 = any, 2 = uni, 3 = ballot.
When is_sync=1, an extra mask argument is consumed from the call arguments. For non-sync variants, the handler looks up intrinsic 5301 (llvm.nvvm.vote). For sync variants, it generates an inline predicate pattern. The ballot variant (vote_op=3) sets is_ballot=1, which changes the return type from i1 (predicate) to i32 (bitmask).
Match Operations (IDs 361--364)
Match builtins find lanes with equal values and return a bitmask of matching lanes. Available in 32-bit and 64-bit variants with two matching modes.
| ID | Builtin | Width | Mode | Intrinsic |
|---|---|---|---|---|
| 361 | __match32_any_sync | 32-bit | Any match | 0x1011 |
| 362 | __match64_any_sync | 64-bit | Any match | 0x1011 |
| 363 | __match32_all_sync | 32-bit | All match | 0x100F |
| 364 | __match64_all_sync | 64-bit | All match | 0x100F |
The handler sub_12AD230 (EDG) dispatches on two opcodes: 0x1011 for any-match and 0x100F for all-match. The NVVM-side handler sub_94F430 uses intrinsic pairs 0x2017 / 0x2018 with mode variants 0, 1, 2 to encode the width and match type.
Warp Redux (IDs 413--416 range, via sub_12ADD20)
Warp-wide reduction operations perform arithmetic reductions across all active lanes in a single instruction. These are dispatched through sub_12ADD20 (EDG) / sub_94F250 (NVVM).
| ID | Operation | NVVM Intrinsic | Description |
|---|---|---|---|
| redux.sync.add | 0x24F5 (9461) | Sum reduction | Sum of values across warp |
| redux.sync.min | 0x24ED (9453) | Minimum reduction | Minimum value across warp |
| redux.sync.max | 0x24E9 (9449) | Maximum reduction | Maximum value across warp |
| redux.sync.or | 0x24F1 (9457) | Bitwise OR reduction | OR of values across warp |
The EDG side uses intrinsic codes 0x2332 and 0x2330 for the two redux variant families.
Activemask and Lanemask
The active mask and per-lane mask builtins are handled through sub_12ADB00 (EDG) / sub_94CF30 (NVVM):
These builtins return the set of currently active lanes (__activemask()) or per-lane positional masks (__lanemask_lt(), __lanemask_le(), __lanemask_eq(), __lanemask_ge(), __lanemask_gt()). They compile to PTX special register reads (%lanemask_*).
Predicate-Register Conversion (IDs 411--412)
Two builtins convert between predicate registers and general-purpose registers:
| ID | Builtin | Direction | Description |
|---|---|---|---|
| 411 | __nv_p2r | Predicate -> Register | Pack predicates into a 32-bit register |
| 412 | __nv_r2p | Register -> Predicate | Unpack a 32-bit register into predicates |
The handler generates element-wise operations: sub_9483E0 iterates over vector elements using sub_39FAC40 to compute the element count, then builds per-element extractelement + store (for p2r) or load + insertelement (for r2p) chains.
Nanosleep and CP.Async
Warp-adjacent utility builtins handled through sub_12AD230 / sub_94ED50:
| ID Range | Operation | Description |
|---|---|---|
| 367--369 | __nv_memcpy_async_shared_global_{4,8,16}_impl | Asynchronous copy (cp.async) |
These builtins combine data movement with implicit synchronization and are lowered through sub_12AB730 / sub_94C5F0, which builds the cp.async PTX instruction with the specified transfer size (4, 8, or 16 bytes).
Architecture Requirements
| Feature | Minimum SM | Notes |
|---|---|---|
__shfl (legacy, non-sync) | SM 30+ | Deprecated; requires full warp convergence |
__shfl_sync | SM 70+ (Volta) | Explicit mask; independent thread scheduling |
| Vote (non-sync) | SM 30+ | Deprecated |
Vote (_sync) | SM 70+ | Explicit mask required |
Match (_sync) | SM 70+ | Warp-level value matching |
Redux (redux.sync.*) | SM 80+ (Ampere) | Hardware-accelerated warp reduction |
| Elect sync | SM 90+ (Hopper) | Single-lane election from active mask |
cp.async | SM 80+ | Asynchronous shared memory copy |