Atomic Operations Builtins

Atomic builtins constitute the largest and most complex category in the NVVM builtin system, spanning over 130 IDs across two distinct subsystems: the legacy NVVM intrinsic atomics (IDs 207--275, 370--379) and the C++11-model atomics (IDs 366, 417--473). Both families converge in the lowering layer at sub_12AE930 (EDG) / sub_9502D0 (NVVM), a 1495-line handler that generates inline PTX assembly with explicit memory ordering and scope annotations.

Two Atomic Subsystems

The compiler maintains two parallel atomic APIs that reflect CUDA's historical evolution. The legacy NVVM atomics (__nvvm_atom_*) predate the C++ memory model and encode scope directly in the builtin name (e.g., __nvvm_atom_cta_add_gen_i for block-scoped integer add). The C++11 atomics (__nv_atomic_*) accept ordering and scope as runtime parameters, matching the cuda::atomic_ref interface.

Both subsystems lower to identical PTX instructions. The distinction matters only during the EDG frontend phase, where sub_6BBC40 generates the mangled __nv_atomic_* names from C++ source, and the NVVM lowering layer sub_12B3FD0 dispatches them by ID.

Legacy NVVM Atomics (IDs 207--275)

These 69 builtins encode the operation, scope, and type directly in the name. The lowering dispatches through sub_12AA9B0 for exchange-style operations and sub_12ADE80 for load/store/fetch operations. Each operation exists in three scope variants: default (device), _cta_ (block), and _sys_ (system).

ID Range   Operation   Builtin Pattern                                       PTX Mnemonic
--------   ---------   ---------------------------------------------------   -----------------------------
207--218   Add         __nvvm_atom_{,cta_,sys_}add_gen_{i,ll,f,d}            atom.add
219--227   Exchange    __nvvm_atom_{,cta_,sys_}xchg_gen_{i,ll,128}           atom.exch
228--251   Min/Max     __nvvm_atom_{,cta_,sys_}{min,max}_gen_{i,ll,ui,ull}   atom.min / atom.max
252--257   Inc/Dec     __nvvm_atom_{,cta_,sys_}{inc,dec}_gen_ui              atom.inc / atom.dec
258--275   Bitwise     __nvvm_atom_{,cta_,sys_}{and,or,xor}_gen_{i,ll}       atom.and / atom.or / atom.xor

Legacy CAS (IDs 370--379)

Compare-and-swap builtins include 128-bit variants for SM 70+ targets. The handler sub_12AA280 builds an AtomicCmpXchg IR node with acquire ordering on both success and failure paths and weak exchange semantics.

ID Range   Operation   Builtin Pattern
--------   ---------   ---------------------------------------------
370--379   CAS         __nvvm_atom_{,cta_,sys_}cas_gen_{i,ll,us,128}

Half-Precision Atomics (IDs 459--468)

Added for SM 90+ (Hopper), these support f16x2 and f16x4 packed atomic adds:

ID Range   Operation   Builtin Pattern                      SM Gate
--------   ---------   ----------------------------------   --------------------
459--461   f16x2 add   __nvvm_atom_{,cta_,sys_}add_gen_f2   SM 90+
466--468   f16x4 add   __nvvm_atom_{,cta_,sys_}add_gen_f4   SM 100+ (Blackwell)

C++11 Atomics (IDs 366, 417--473)

These 57 builtins implement the CUDA C++ atomic model with explicit memory ordering and scope parameters. The EDG frontend generator at sub_6BBC40 constructs the mangled names using a __nv_atomic_fetch_{op}_{width}_{type} pattern, where width is the byte count (1, 2, 4, 8, or 16) and the type suffix is _u (unsigned), _s (signed), or _f (float).

Thread Fence (ID 366)

__nv_atomic_thread_fence emits either a volatile fence (SM <= 69) or an explicit fence.{ordering}.{scope}; PTX instruction (SM 70+). Ordering and scope are extracted from constant operand parameters at compile time.

Load/Store (IDs 417--428)

ID         Builtin                          Width         PTX
--------   ------------------------------   -----------   ----------------------------
417        __nv_atomic_load                 generic       ld.{ordering}.{scope}.{type}
418--422   __nv_atomic_load_{1,2,4,8,16}    1--16 bytes   same
423        __nv_atomic_store                generic       st.{ordering}.{scope}.{type}
424--428   __nv_atomic_store_{1,2,4,8,16}   1--16 bytes   same

Fetch-Op (IDs 429--458)

Arithmetic and bitwise fetch operations are registered with width and type suffixes. Bitwise operations (and, or, xor) omit the type suffix since signedness is irrelevant for bitwise logic.

ID Range   Operation          Builtin Pattern
--------   ----------------   ------------------------------------
429--434   fetch_add          __nv_atomic_fetch_add_{4,8}_{u,s,f}
435--440   fetch_sub          __nv_atomic_fetch_sub_{4,8}_{u,s,f}
441--446   fetch_and/or/xor   __nv_atomic_fetch_{and,or,xor}_{4,8}
447--452   fetch_max          __nv_atomic_fetch_max_{4,8}_{u,s,f}
453--458   fetch_min          __nv_atomic_fetch_min_{4,8}_{u,s,f}

For fetch_sub with floating-point types (IDs 437, 440), the lowering negates the operand and emits atom.add rather than a dedicated subtraction instruction.
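The negate-and-add lowering can be sketched with a toy memory model. This is a minimal Python illustration, not the compiler's code; atom_add and fetch_sub are hypothetical stand-ins for the emitted PTX operations:

```python
def atom_add(mem, addr, val):
    """Toy atom.add: returns the old value, stores old + val."""
    old = mem[addr]
    mem[addr] = old + val
    return old

def fetch_sub(mem, addr, val):
    # PTX has no atom.sub: the lowering negates the operand
    # and reuses atom.add instead.
    return atom_add(mem, addr, -val)

mem = {0: 10.0}
assert fetch_sub(mem, 0, 2.5) == 10.0   # fetch returns old value
assert mem[0] == 7.5                    # memory holds old - val
```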

Exchange and CAS (IDs 462--473)

ID Range   Operation   Builtin Pattern
--------   ---------   -------------------------------------------
462--465   Exchange    __nv_atomic_exchange{,_4,_8,_16}
469--473   CAS         __nv_atomic_compare_exchange{,_2,_4,_8,_16}

PTX Inline Assembly Generation

The atomic codegen handler at sub_12AE930 (address 0x12AE930, 41KB) generates PTX inline assembly strings at compile time. The generated instruction format depends on the target SM:

Pre-SM 70 (volatile mode, unk_4D045E8 <= 0x45):

ld.volatile.b32 $0, [$1];
atom.add.volatile.u32 $0, [$1], $2;

SM 70+ (explicit memory model):

ld.acquire.gpu.b32 $0, [$1];
st.release.sys.b32 [$0], $1;
atom.add.acq_rel.cta.u32 $0, [$1], $2;
atom.cas.relaxed.gpu.b64 $0, [$1], $2, $3;

The sub_12AE930 / sub_9502D0 Algorithm in Detail

Both the EDG-side handler (sub_12AE930, 0x12AE930) and its NVVM-side twin (sub_9502D0, 0x9502D0) follow identical logic. They accept five parameters: (result, codegen_state, builtin_id, call_arg_list, type_info). The algorithm proceeds in six phases.

Phase 1: SM Version Check and Path Selection

v186 = (unk_4D045E8 <= 0x45)    // SM <= 69 -> volatile mode

When v186 is true, the handler enters the pre-SM 70 "volatile" path. All atomic operations receive a .volatile qualifier instead of explicit memory ordering and scope qualifiers. The 128-bit atomics emit diagnostic 0xEB4 (3764) and are rejected entirely.

When v186 is false (SM 70+), the handler enters the memory model path, which constructs the full {mnemonic}.{ordering}.{scope}.{type} format.

Phase 2: Operand Extraction and Builtin ID Dispatch

The handler extracts between 2 and 5 operands from the call argument list (pointer, value, compare-value for CAS, plus the ordering and scope parameters encoded as compile-time constants). The builtin ID selects the PTX mnemonic via a switch:

switch (builtin_id) {
    case 417..422:  mnemonic = "ld";          // atomic load
    case 423..428:  mnemonic = "st";          // atomic store
    case 429..434:  mnemonic = "atom.add";    // fetch-add (unsigned, signed, float)
    case 435..440:  mnemonic = "atom.add";    // fetch-sub (negated; see below)
    case 441..442:  mnemonic = "atom.and";    // fetch-and
    case 443..444:  mnemonic = "atom.or";     // fetch-or
    case 445..446:  mnemonic = "atom.xor";    // fetch-xor
    case 447..452:  mnemonic = "atom.max";    // fetch-max
    case 453..458:  mnemonic = "atom.min";    // fetch-min
    case 462..465:  mnemonic = "atom.exch";   // exchange
    case 469..473:  mnemonic = "atom.cas";    // compare-and-swap
    default:        fatal("unexpected atomic builtin function");
}

For IDs 435--440 (fetch_sub), the handler does not emit atom.sub (which does not exist in PTX). Instead, for integer types it negates the operand and emits atom.add; for float types it negates via fneg and emits atom.add.f.

For thread fence (ID 366), the handler branches to sub_12AE0E0 (volatile fence, pre-SM 70) or sub_12AE4B0 (explicit fence, SM 70+) and returns immediately, bypassing the rest of the atomic pipeline.

Phase 3: Memory Ordering Resolution

The ordering parameter is extracted from the first constant operand of the C++11 atomic call via sub_620EE0. The value (0--5) maps to a PTX qualifier string:

Value   C++ Ordering                   PTX Qualifier                       Applies To
-----   ----------------------------   ---------------------------------   --------------
0       relaxed / monotonic            relaxed                             All operations
1       consume (treated as acquire)   acquire                             Loads, RMW
2       acquire                        acquire                             Loads, RMW
3       release                        release                             Stores, RMW
4       acq_rel                        acq_rel                             RMW operations
5       seq_cst                        acquire (loads), release (stores)   All

Sequential consistency (value 5) is downgraded: loads get acquire, stores get release, and RMW operations get acq_rel. True seq_cst semantics are achieved by inserting explicit fences around the operation (see "Fence Insertion for Seq_Cst" below).

Store-specific validation. For store builtins (IDs 423--428), only ordering values 0, 3, and 5 are legal. Any other value triggers fatal("unexpected memory order."). Value 5 is treated as relaxed for the store instruction itself, with the seq_cst fence handling the ordering guarantee externally.

Load-specific validation. For load builtins (IDs 417--422), values 3 (release) and 4 (acq_rel) are illegal and trigger the same fatal error.
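The ordering rules above can be condensed into a small Python sketch. The function names and table are illustrative, not the handler's actual symbols; the legal/illegal sets and the seq_cst downgrades follow the validation rules described in this phase:

```python
PTX_ORDER = {0: "relaxed", 1: "acquire", 2: "acquire",
             3: "release", 4: "acq_rel"}

def load_qualifier(order):
    # release (3) and acq_rel (4) are illegal on atomic loads
    if order in (3, 4):
        raise ValueError("unexpected memory order.")
    if order == 5:               # seq_cst downgrades to acquire on loads
        return "acquire"
    return PTX_ORDER[order]

def store_qualifier(order):
    # only relaxed (0), release (3), and seq_cst (5) are legal on stores
    if order not in (0, 3, 5):
        raise ValueError("unexpected memory order.")
    # seq_cst is treated as relaxed here; external fences restore ordering
    return "release" if order == 3 else "relaxed"

assert load_qualifier(2) == "acquire"
assert store_qualifier(5) == "relaxed"
```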

Phase 4: Scope Resolution

The scope parameter is extracted from the second constant operand via sub_620EE0. The value (0--4) maps to a PTX scope qualifier:

switch (scope_value) {
    case 0:  // fall through
    case 1:  scope_str = "cta";      break;   // thread block
    case 2:
        if (unk_4D045E8 > 0x59)               // SM > 89
            scope_str = "cluster";             // SM 90+ (Hopper)
        else
            scope_str = "gpu";                 // SM <= 89: fallback
        break;
    case 3:  scope_str = "gpu";      break;   // device
    case 4:  scope_str = "sys";      break;   // system
    default: fatal("unexpected atomic operation scope.");
}

The cluster scope fallback is the critical SM gate at line 255 / 424 of sub_12AE930 / sub_9502D0: when the SM version is 89 or below, scope value 2 ("cluster") silently degrades to gpu. No diagnostic is emitted; the scope is simply rewritten. On SM 90+ (Hopper and later), cluster passes through to the PTX output.
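The scope switch, including the silent cluster fallback, can be mirrored in Python (a sketch; resolve_scope is a hypothetical name, and the SM version is passed directly rather than read from unk_4D045E8):

```python
def resolve_scope(scope_value, sm_version):
    """Map a C++11 atomic scope operand (0-4) to a PTX scope string."""
    if scope_value in (0, 1):
        return "cta"                 # thread block
    if scope_value == 2:
        # cluster requires SM 90+; on SM <= 89 it silently
        # degrades to gpu with no diagnostic
        return "cluster" if sm_version >= 90 else "gpu"
    if scope_value == 3:
        return "gpu"                 # device
    if scope_value == 4:
        return "sys"                 # system
    raise ValueError("unexpected atomic operation scope.")

assert resolve_scope(2, 90) == "cluster"
assert resolve_scope(2, 89) == "gpu"     # silent fallback
```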

Phase 5: Type Suffix Construction

The type suffix is built from two components: a type-class letter and a byte-width number. The type-class lookup uses a 4-entry table stored in local variable v196:

v196[0] = 'b'    // bitwise   (for exch, and, or, xor, cas)
v196[1] = 'u'    // unsigned  (for add, inc, dec, max, min on unsigned)
v196[2] = 's'    // signed    (for max, min on signed)
v196[3] = 'f'    // float     (for add on float/double)

The type-class index is derived from the LLVM type of the atomic operand:

  • Integer type with unsigned semantics: index 1 (u)
  • Integer type with signed semantics: index 2 (s)
  • Floating-point type: index 3 (f)
  • All other cases (exchange, CAS, bitwise): index 0 (b)

The byte-width is the size of the atomic operand in bytes. Valid sizes are validated against the bitmask 0x10116:

valid = ((1LL << byte_size) & 0x10116) != 0

This bitmask has bits set at positions 1, 2, 4, 8, and 16, accepting exactly the byte widths {1, 2, 4, 8, 16}. Any other size triggers fatal("unexpected size1").

The resulting suffix is the letter concatenated with the bit width (byte_size * 8): .u32, .s64, .f32, .b128, etc.
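The suffix construction and the 0x10116 size check translate directly into a short Python sketch (type_suffix is an illustrative name; the table and bitmask are the ones described above):

```python
TYPE_LETTER = ['b', 'u', 's', 'f']   # bitwise, unsigned, signed, float

def type_suffix(type_index, byte_size):
    """Build a PTX type suffix like .u32 / .s64 / .f32 / .b128."""
    # bitmask 0x10116 has bits 1, 2, 4, 8, 16 set: the valid byte widths
    if (1 << byte_size) & 0x10116 == 0:
        raise ValueError("unexpected size1")
    return ".%c%d" % (TYPE_LETTER[type_index], byte_size * 8)

assert type_suffix(1, 4) == ".u32"
assert type_suffix(3, 8) == ".f64"
assert type_suffix(0, 16) == ".b128"
```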

Phase 6: Inline ASM String Assembly and Emission

The handler assembles the final PTX string by concatenating the components. Two string buffers are maintained throughout: v190 (ordering string) and v193 (scope string), set during phases 3 and 4.

For SM 70+ (memory model mode):

// Loads:
sprintf(buf, "ld.%s.%s.%c%d $0, [$1];", v190, v193, type_letter, bit_width);
// Stores:
sprintf(buf, "st.%s.%s.%c%d [$0], $1;", v190, v193, type_letter, bit_width);
// RMW atomics:
sprintf(buf, "%s.%s.%s.%c%d $0, [$1], $2;", mnemonic, v190, v193, type_letter, bit_width);
// CAS:
sprintf(buf, "%s.%s.%s.%c%d $0, [$1], $2, $3;", mnemonic, v190, v193, type_letter, bit_width);

For pre-SM 70 (volatile mode):

// Loads:
sprintf(buf, "ld.volatile.%c%d $0, [$1];", type_letter, bit_width);
// RMW atomics:
sprintf(buf, "%s.volatile.%c%d $0, [$1], $2;", mnemonic, type_letter, bit_width);

Constraint string construction. The LLVM inline ASM constraint string is built dynamically to match the operand pattern:

Pattern          Constraint String                      Meaning
--------------   ------------------------------------   ----------------------------------------------------
Load (ld)        "=r,l,~{memory}" or "=l,l,~{memory}"   result in reg, address in 64-bit reg, memory clobber
Store (st)       "l,r,~{memory}" or "l,l,~{memory}"     address in 64-bit reg, value in reg, memory clobber
RMW (atom.*)     "=r,l,r,~{memory}"                     result, address, operand, memory clobber
CAS (atom.cas)   "=r,l,r,r,~{memory}"                   result, address, compare, swap, memory clobber

The register class for result and value operands is r for 32-bit types and l for 64-bit types. 128-bit types use l with pair operands.

The assembled PTX string and constraint string are passed to sub_B41A60 (NVVM side) or the equivalent EDG-side helper, which creates an LLVM InlineAsm node. The node is then emitted via sub_921880 / sub_1285290.
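Phases 3 through 6 compose into a single string-assembly step. The following Python sketch ties them together for the RMW case (rmw_asm is a hypothetical name; the format strings match the sprintf patterns shown above):

```python
def rmw_asm(mnemonic, ordering, scope, type_letter, bit_width, sm):
    """Assemble the inline-PTX string for an RMW atomic."""
    if sm < 70:
        # volatile path: no ordering/scope qualifiers
        return "%s.volatile.%c%d $0, [$1], $2;" % (
            mnemonic, type_letter, bit_width)
    # memory-model path: {mnemonic}.{ordering}.{scope}.{type}
    return "%s.%s.%s.%c%d $0, [$1], $2;" % (
        mnemonic, ordering, scope, type_letter, bit_width)

assert rmw_asm("atom.add", "acq_rel", "cta", 'u', 32, 70) == \
       "atom.add.acq_rel.cta.u32 $0, [$1], $2;"
assert rmw_asm("atom.add", "acq_rel", "cta", 'u', 32, 60) == \
       "atom.add.volatile.u32 $0, [$1], $2;"
```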

Fence Insertion for Seq_Cst

When the memory ordering is sequential consistency (value 5) and the SM version supports explicit fences (SM 70+), the handler does not simply emit atom.sc.{scope}. Instead, it implements seq_cst through a fence-bracketed pattern:

  1. Pre-fence: If the operation is a store or RMW and ordering >= release, the handler calls sub_94F9E0 (membar) or sub_94FDF0 (fence) to emit a leading fence:

    • sub_94F9E0 emits membar.{scope}; as inline PTX
    • sub_94FDF0 emits fence.sc.{scope}; or fence.acq_rel.{scope};
  2. The atomic operation: Emitted with downgraded ordering (acquire for loads, release for stores, acq_rel for RMW).

  3. Post-fence: If the operation is a load or RMW and ordering >= acquire, a trailing fence is emitted.

The fence scope matches the atomic operation's scope. The decision to emit membar vs fence depends on the SM version and the specific ordering level: membar is used for the pre-SM 70 path (though that path should not reach this code), and fence.sc / fence.acq_rel for SM 70+.

The pre/post-fence logic is gated by two conditions in the NVVM-side handler:

PRE-FENCE:  if (v186 && (v187 - 3) <= 2)    // v187 is ordering; range [3,5] = release, acq_rel, seq_cst
POST-FENCE: if (!v175 && v169 == 5)          // v175 = is_store; v169 = ordering = seq_cst

The Volatile Fence Handler (sub_12AE0E0)

For thread fence on SM <= 69, sub_12AE0E0 emits a volatile memory barrier. The function takes an ASM buffer and fence configuration parameters. It produces:

membar.{scope};

where the scope is derived from the fence's scope parameter (cta / gl / sys). This is the pre-memory-model equivalent of the explicit fence path.

The Explicit Fence Handler (sub_12AE4B0)

For thread fence on SM 70+, sub_12AE4B0 constructs an explicit fence.{ordering}.{scope}; instruction. The ordering for fences is a restricted set compared to atomics:

Ordering Value   Fence Qualifier
--------------   ---------------------------------
3                sc (sequentially consistent)
4                acq_rel
5                sc (same as 3)
Other            fatal("unexpected memory order.")

The scope string follows the same rules as atomics. The assembled string is emitted as LLVM inline ASM with a ~{memory} clobber.
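The fence string construction is small enough to sketch in full (fence_asm is an illustrative name; the ordering table and scope rules are the ones documented above):

```python
def fence_qualifier(order):
    """Restricted ordering set for fences: 3/5 -> sc, 4 -> acq_rel."""
    if order in (3, 5):
        return "sc"
    if order == 4:
        return "acq_rel"
    raise ValueError("unexpected memory order.")

def fence_asm(order, scope):
    """Assemble the explicit fence instruction (SM 70+ path)."""
    return "fence.%s.%s;" % (fence_qualifier(order), scope)

assert fence_asm(5, "gpu") == "fence.sc.gpu;"
assert fence_asm(4, "cta") == "fence.acq_rel.cta;"
```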

Memory Ordering Encoding

The ordering parameter (values 0--5) maps to PTX qualifiers:

Value   Ordering   Used For
-----   --------   -----------------------------------
0       relaxed    Default / monotonic
1, 2    acquire    Loads, RMW
3       release    Stores
4       acq_rel    RMW operations
5       acquire    Sequential consistency (downgraded)

Scope Encoding

The scope parameter (values 0--4) maps to PTX scope qualifiers:

Value   Scope     PTX        SM Requirement
-----   -------   --------   -----------------------------------------------
0, 1    Block     .cta       All
2       Cluster   .cluster   SM 90+ (Hopper); falls back to .gpu on SM <= 89
3       Device    .gpu       All
4       System    .sys       All

Type Suffix Construction

The type suffix is built from a 4-entry table: b (bitwise), u (unsigned), s (signed), f (float). Combined with the byte size, this produces suffixes like .u32, .f64, .b128. Valid sizes are validated against the bitmask 0x10116 (bits for 1, 2, 4, 8, and 16 bytes).

The 13 Atomic Operations at PTX Emission

The PTX emission layer at sub_21E5E70 (base) and sub_21E6420 (L2-hinted) implements the final encoding from the NVPTX MachineInstr opcode to the PTX text. The instruction operand word at this stage encodes both scope and operation:

bits[7:4]    — scope:  0 = gpu (default), 1 = cta, 2 = sys
bits[23:16]  — atomic operation opcode (BYTE2)

The 13-entry dispatch table:

Opcode   PTX Suffix   L2-Hinted Suffix         Description
------   ----------   ----------------------   --------------------
0x00     .exch.b      .exch.L2::cache_hint.b   Bitwise exchange
0x01     .add.u       .add.L2::cache_hint.u    Unsigned add
0x02     (missing)    (missing)                No .add.s in PTX ISA
0x03     .and.b       .and.L2::cache_hint.b    Bitwise AND
0x04     (missing)    (missing)                Unused slot
0x05     .or.b        .or.L2::cache_hint.b     Bitwise OR
0x06     .xor.b       .xor.L2::cache_hint.b    Bitwise XOR
0x07     .max.s       .max.L2::cache_hint.s    Signed max
0x08     .min.s       .min.L2::cache_hint.s    Signed min
0x09     .max.u       .max.L2::cache_hint.u    Unsigned max
0x0A     .min.u       .min.L2::cache_hint.u    Unsigned min
0x0B     .add.f       .add.L2::cache_hint.f    Float add
0x0C     .inc.u       .inc.L2::cache_hint.u    Unsigned increment
0x0D     .dec.u       .dec.L2::cache_hint.u    Unsigned decrement
0x0E     .cas.b       .cas.L2::cache_hint.b    Compare-and-swap

Opcodes 0x02 and 0x04 are unoccupied. There is no signed atomic add in PTX (signed add uses .add.u since two's-complement wrapping is identical). Slot 0x04 is simply skipped.

The scope prefix is emitted before the operation suffix:

bits[7:4] & 0xF:
    0  ->  (nothing; implicit .gpu scope)
    1  ->  ".cta"
    2  ->  ".sys"

Full PTX emission format:

atom[.scope].{op}.{type}{size}

Example: atom.cta.add.u32, atom.sys.cas.b64, atom.exch.b32.
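Decoding the operand word into the final mnemonic can be sketched as follows (emit_atom is an illustrative name; the bit fields and dispatch table are the ones documented above):

```python
SCOPE = {0: "", 1: ".cta", 2: ".sys"}   # bits[7:4]; 0 = implicit gpu scope
OP_SUFFIX = {                            # bits[23:16], the 13-entry table
    0x00: ".exch.b", 0x01: ".add.u", 0x03: ".and.b", 0x05: ".or.b",
    0x06: ".xor.b", 0x07: ".max.s", 0x08: ".min.s", 0x09: ".max.u",
    0x0A: ".min.u", 0x0B: ".add.f", 0x0C: ".inc.u", 0x0D: ".dec.u",
    0x0E: ".cas.b",
}

def emit_atom(word, bit_width):
    """Decode scope and operation from the instruction operand word."""
    scope = (word >> 4) & 0xF
    op = (word >> 16) & 0xFF        # BYTE2
    return "atom%s%s%d" % (SCOPE[scope], OP_SUFFIX[op], bit_width)

assert emit_atom((0x01 << 16) | (1 << 4), 32) == "atom.cta.add.u32"
assert emit_atom((0x0E << 16) | (2 << 4), 64) == "atom.sys.cas.b64"
assert emit_atom(0x00 << 16, 32) == "atom.exch.b32"
```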

L2 Cache Hint System (SM 80+ / Ampere)

sub_21E6420 (address 0x21E6420) is a parallel version of the base atomic emitter sub_21E5E70. It inserts .L2::cache_hint between the operation and type suffix for all 13 atomic operations:

atom[.scope].{op}.L2::cache_hint.{type}{size}

The L2 cache hint instructs the GPU's L2 cache to retain (or evict) the atomic target data after the operation completes. This is a PTX 7.3+ feature introduced with Ampere (SM 80+).

The L2-hinted path is selected when bit 0x400 is set in the instruction's encoding flags. The hint is applied at the MachineInstr level during instruction selection, not during the inline ASM generation phase of sub_12AE930. Both paths produce identical scope and type encoding; the L2 path adds exactly the .L2::cache_hint substring.

String emission uses SSE (xmm) register loads from precomputed constant data at addresses xmmword_435F590 through xmmword_435F620 to fast-copy the 16-byte prefix of each operation string, then patches the remaining bytes. This avoids branch-heavy string concatenation for the 13 cases.

AtomicExpandPass: IR-Level Expansion (sub_20C9140)

Before sub_12AE930 handles the C++11 atomics, and separately from the legacy builtin lowering, an LLVM FunctionPass named "Expand Atomic instructions" (pass ID "atomic-expand", registered at sub_20CA900) runs on LLVM IR to decide which atomic operations the NVPTX target can handle natively and which must be expanded into CAS loops.

Expansion Decision Tree

For each atomic instruction in the function:

  1. shouldExpandAtomicCmpXchgInIR (vtable +0x258): Default expands all cmpxchg to LL/SC or CAS-based loops. The NVPTX override may keep native i32/i64 cmpxchg on SM 70+.

  2. shouldExpandAtomicRMWInIR (vtable +0x280):

    • i32 xchg/add/min/max: kept native on all SM.
    • i64 xchg/add: kept native on SM 70+.
    • i32/i64 sub/nand: always expanded to CAS loop (no native PTX instruction).
    • i8/i16 (any operation): always expanded via partword masking.
    • Float atomicAdd: native on SM 70+ (fp32), SM 80+ (fp16/bf16).
  3. shouldExpandAtomicLoadInIR (vtable +0x270): Native for aligned i32/i64. Expanded for i8/i16 (widen to i32 load + extract) and i128+ (decompose to multiple loads).

  4. shouldExpandAtomicStoreInIR (vtable +0x278): Native for aligned i32/i64. Expanded for sub-word and >64-bit types.

Sub-Word Atomic Expansion (sub_20CB200)

No NVIDIA GPU architecture through SM 120 supports native sub-word (i8/i16) atomics. The pass generates mask-and-shift wrappers around word-sized CAS loops. The mask generation function sub_20CB200 (2896 bytes) produces a 6-field output struct:

Field   Name          Purpose
-----   -----------   ----------------------------------------------------------
+0x00   AlignedAddr   Pointer masked to word boundary: ptr & ~(word_size - 1)
+0x08   AlignedType   Always i32
+0x10   PtrLSB        Low address bits: ptr & (word_size - 1)
+0x18   ShiftAmt      Bit position within the word: PtrLSB * 8 (little-endian)
+0x20   Inv_Mask      Inverted mask: ~(((1 << (type_size * 8)) - 1) << ShiftAmt)
+0x28   Mask          Mask: (1 << (type_size * 8)) - 1

The CAS loop (sub_20CBD50, 1646 bytes) then:

  1. Shifts the new value into position: ValOperand_Shifted = new_val << ShiftAmt.
  2. Loops: loads the word, applies the RMW operation on the masked sub-word, attempts CAS on the full word.
  3. On success: extracts the sub-word result via shift + mask.
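The six mask fields can be computed in a few lines of Python (subword_mask_info is an illustrative name; the formulas are the ones from the field table, restricted to a 32-bit word):

```python
def subword_mask_info(ptr, type_size, word_size=4):
    """Compute the partword-expansion fields for an i8/i16 atomic."""
    aligned_addr = ptr & ~(word_size - 1)       # word-aligned pointer
    ptr_lsb = ptr & (word_size - 1)             # offset within the word
    shift_amt = ptr_lsb * 8                     # little-endian bit position
    mask = (1 << (type_size * 8)) - 1           # unshifted value mask
    inv_mask = ~(mask << shift_amt) & 0xFFFFFFFF
    return aligned_addr, ptr_lsb, shift_amt, mask, inv_mask

# an i8 atomic at byte 3 of its containing word
a, lsb, sh, m, inv = subword_mask_info(0x1003, 1)
assert (a, lsb, sh) == (0x1000, 3, 24)
assert m == 0xFF and inv == 0x00FFFFFF
```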

CAS Loop Generation (sub_20C96A0)

For operations that cannot be handled natively, the pass builds a compare-and-swap loop with three basic blocks:

entry -> "atomicrmw.start" -> (CAS failure) -> "atomicrmw.start" (retry)
                           -> (CAS success) -> "atomicrmw.end"

Steps:

  1. Load current value from pointer.
  2. Compute new value using the RMW operation (dispatched through an 11-case switch at sub_20CC690: Xchg, Add, Sub, And, Or, Xor [implied], Nand, Max, Min, UMax, UMin, FMin, FMax).
  3. Emit cmpxchg with packed success+failure orderings.
  4. Branch back to start on failure, fall through to end on success.

Ordering-to-Fence Table (address 0x428C1E0)

The pass uses a 7-entry fence decision table indexed by LLVM AtomicOrdering enum:

Ordering                 Index   Release Fence Before?   Acquire Fence After?
----------------------   -----   ---------------------   --------------------
NotAtomic                0       No                      No
Unordered                1       No                      No
Monotonic                2       No                      No
Acquire                  3       No                      Yes
Release                  4       Yes                     No
AcquireRelease           5       Yes                     Yes
SequentiallyConsistent   6       Yes                     Yes (+ barrier)

Fence emission calls sub_15F9C80 which creates an LLVM fence instruction with the specified ordering and sync scope.

Memory Barrier and Fence Emission

The PTX emission layer has two dedicated handlers for barriers and fences, separate from the atomic operation emitters.

Memory Barrier (sub_21E94F0)

Emits membar instructions based on a 4-bit operand encoding:

Value   Instruction              Scope
-----   ----------------------   ----------------
0       membar.gpu               Device
1       membar.cta               Thread block
2       membar.sys               System
4       fence.sc.cluster         Cluster (SM 90+)
3       fatal("Bad membar op")   Invalid

NVVM-Side Membar (sub_94F9E0)

At the NVVM lowering level, sub_94F9E0 handles membar emission with a different scope encoding:

Scope Value   Scope String   PTX Instruction
-----------   ------------   -------------------------------------------
0, 1          cta            membar.cta;
2, 3          gl             membar.gl;
4             sys            membar.sys;
Other                        fatal("unexpected atomic operation scope.")

NVVM-Side Fence (sub_94FDF0)

Constructs fence.{ordering}.{scope}; from a state array. The ordering mapping is:

Value   Ordering String
-----   ---------------------------------
3       sc
4       acq_rel
5       sc
Other   fatal("unexpected memory order.")

Both membar and fence are emitted as inline PTX assembly (not LLVM IR fence instructions) because PTX-level memory ordering semantics have no direct LLVM IR equivalent at the precision NVIDIA requires.

Architecture Gates

SM Threshold                     Effect
------------------------------   --------------------------------------------------------------
SM <= 59                         Diagnostic 0xEB6 warning for certain atomic patterns
SM 60--69                        Diagnostic 0xEB2 (3762) for specific atomic patterns
SM <= 69                         Volatile mode; 128-bit atomics not supported (diagnostic 0xEB4)
SM 70+                           Explicit ordering/scope in PTX output
SM <= 89                         Scope value 2 silently falls back from cluster to gpu
SM <= 89                         Half-precision (2-byte FP) atomics not supported
SM 90+ (Hopper)                  Cluster scope (.cluster) becomes available
SM 90+                           f16x2 packed atomic add (IDs 459--461)
SM 90+                           fence.sc.cluster becomes available
SM 100+ (Blackwell datacenter)   f16x4 packed atomic add (IDs 466--468)

EDG Frontend Name Construction

The EDG atomic builtin generator sub_6BBC40 (address 0x6BBC40, 1251 lines) constructs internal function names from C++ cuda::atomic_ref calls. The algorithm uses a dispatch key v165 = *(uint16_t*)(type_node + 176), the EDG "builtin kind" tag, to select the operation:

v165 (hex)       v165 (dec)     Operation
--------------   ------------   ----------------
0x6241, 0x6242   25153, 25154   compare_exchange
0x6248, 0x6249   25160, 25161   exchange
0x624F, 0x6250   25167, 25168   fetch_add
0x6257, 0x6258   25175, 25176   fetch_sub
0x625F, 0x6260   25183, 25184   fetch_and
0x6263, 0x6264   25187, 25188   fetch_xor
0x6267, 0x6268   25191, 25192   fetch_or
0x626B, 0x626C   25195, 25196   fetch_max
0x6273, 0x6274   25203, 25204   fetch_min
0x627B, 0x627C   25211, 25212   load
0x6280, 0x6281   25216, 25217   store
0x6286           25222          thread_fence

Within each pair, the odd ID is the "generic" overload that enters the renaming path; the even ID has its base name string set explicitly via strcpy.

Name Construction Algorithm (lines 877--996 of sub_6BBC40)

Step 1 -- Base name. Copy the EDG source name, then overwrite with the canonical base for the seven fetch-op builtins:

v165     Base name
------   ---------------------------
0x6250   "__nv_atomic_fetch_add"
0x6258   "__nv_atomic_fetch_sub"
0x6260   "__nv_atomic_fetch_and"
0x6264   "__nv_atomic_fetch_xor"
0x6268   "__nv_atomic_fetch_or"
0x626C   "__nv_atomic_fetch_max"
0x6274   "__nv_atomic_fetch_min"

Step 2 -- Width suffix. Append "_%u" formatted with the type size in bytes from *(uint32_t*)(type_node + 128). For fetch-op builtins, the size is validated as (type_size - 4) <= 4, accepting only 4 and 8 bytes.

Step 3 -- Type suffix (only for add/sub/max/min; lines 960--996). Reads type_kind = *(uint8_t*)(type_node + 140):

type_kind   Meaning             Suffix   Condition
---------   -----------------   ------   ---------------------------------------------
2           integer             _s       byte_4B6DF90[signedness_byte] != 0 (signed)
2           integer             _u       byte_4B6DF90[signedness_byte] == 0 (unsigned)
3           float               _f       Always
6           unsigned explicit   _u       Always

byte_4B6DF90 is a 256-entry lookup table that maps the EDG "integer kind" sub-tag (at type_node + 160) to a boolean: 1 = signed, 0 = unsigned.

Bitwise operations (and/or/xor) omit the type suffix entirely.

Naming Pattern Summary

__nv_atomic_fetch_{op}_{width}[_{type}]

{op}    = add | sub | and | xor | or | max | min
{width} = 4 | 8  (bytes)
{type}  = _s (signed), _u (unsigned), _f (float), or omitted (bitwise)

For load/store/exchange/compare_exchange, only the width suffix is appended; no type suffix.
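The naming scheme can be sketched end to end in Python (mangle_fetch_op is an illustrative name; the suffix rules are the ones summarized above):

```python
def mangle_fetch_op(op, byte_size, is_float=False, signed=False):
    """Build the __nv_atomic_fetch_* mangled name for a fetch-op builtin."""
    name = "__nv_atomic_fetch_%s_%u" % (op, byte_size)
    if op in ("and", "or", "xor"):
        return name                  # bitwise: no type suffix
    if is_float:
        return name + "_f"
    return name + ("_s" if signed else "_u")

assert mangle_fetch_op("add", 4, signed=True) == "__nv_atomic_fetch_add_4_s"
assert mangle_fetch_op("xor", 8) == "__nv_atomic_fetch_xor_8"
assert mangle_fetch_op("min", 8, is_float=True) == "__nv_atomic_fetch_min_8_f"
```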

Validation Diagnostics

Diagnostic   Hex     Condition
----------   -----   --------------------------------------------
852          0x354   Unsupported atomic operation for target
1645         0x66D   Wrong return type for builtin
1646         0x66E   Unsupported type size (not in {1,2,4,8,16})
3745         0xEA1   Atomic not supported for given type
3746         0xEA2   First param scope exceeds range (>5)
3747         0xEA3   Return param scope exceeds range (>4)
3748         0xEA4   fetch_op type size not 4 or 8 bytes
3749         0xEA5   Store with type_size <= 1 (too small)
3750         0xEA6   Load with type_size > 3 (too large)
3756         0xEAC   CAS parameter type mismatch
3757         0xEAD   Exchange parameter type mismatch
3759         0xEAF   Float return not supported below SM 90
3762         0xEB2   SM 60--69 atomic variant diagnostic
3763         0xEB3   Return type on store (SM <= 89)
3764         0xEB4   128-bit store/load not supported on this SM
3765         0xEB5   16-bit store not supported on SM <= 69
3766         0xEB6   Generic warning for SM <= 59
3767         0xEB7   Type size not in {1,2,4,8,16} bitmask
3769         0xEB9   Null argument list error

EDG Type Node Field Map

Offset   Size     Field
------   ------   ------------------------------------------------------------------------------
+128     8        type_size (byte count: 1, 2, 4, 8, 16)
+140     1        type_kind (0=void, 2=integer, 3=float, 6=unsigned, 8=pointer, 12=typedef)
+160     varies   For type_kind 12 (typedef): pointer to underlying type. For type_kind 2
                  (integer): uint8_t signedness sub-tag indexed into byte_4B6DF90.
+168     8        Pointer chain (for struct/compound types)
+176     2        builtin_kind (the v165 dispatch tag, uint16_t)

NVPTX MachineInstr Atomic Opcodes

At the SelectionDAG / MachineInstr level, atomic operations map to NVPTX-specific opcodes distinct from the inline ASM emission:

MachineInstr Opcode   PTX Operation
-------------------   --------------------------------
149                   ATOMIC_LOAD
294--297              atom.add (f32 / f64 / i32 / i64)
302--305              atom.min (s32 / s64 / u32 / u64)
314--317              atom.max (s32 / s64 / u32 / u64)
462                   atom.cas (generic)

These opcodes are emitted by the SelectionDAG lowering for native atomic operations that survive the AtomicExpandPass without expansion.

Function Map

Function      Address     Size          Role
-----------   ---------   -----------   ----------------------------------------------------
sub_6BBC40    0x6BBC40    ~1251 lines   EDG atomic builtin name generator
sub_12AA280   0x12AA280                 Legacy CAS IR node builder
sub_12AA9B0   0x12AA9B0                 Legacy atomic exchange handler
sub_12ADE80   0x12ADE80                 Scoped atomic load/store/fetch handler
sub_12AE010   0x12AE010                 Fence acquire/release emitter (EDG only; BUG on NVVM)
sub_12AE0E0   0x12AE0E0                 Volatile fence emitter (pre-SM 70)
sub_12AE4B0   0x12AE4B0                 Explicit fence emitter (SM 70+)
sub_12AE930   0x12AE930   41KB          PTX inline ASM atomic codegen (EDG side)
sub_12B3FD0   0x12B3FD0   103KB         Main builtin lowering mega-switch
sub_20C7CE0   0x20C7CE0   1399          AtomicExpandPass: recursive type walker
sub_20C84C0   0x20C84C0   1656          AtomicExpandPass: address space checker
sub_20C9140   0x20C9140   1204          AtomicExpandPass: runOnFunction
sub_20C96A0   0x20C96A0   1814          AtomicExpandPass: CAS loop generation
sub_20CA900   0x20CA900   218           AtomicExpandPass: registration
sub_20CB200   0x20CB200   2896          AtomicExpandPass: sub-word mask generation
sub_20CBD50   0x20CBD50   1646          AtomicExpandPass: partword RMW expansion
sub_20CC690   0x20CC690   43            AtomicExpandPass: 11-case operation dispatch
sub_20CD3E0   0x20CD3E0   6030          AtomicExpandPass: partword CmpXchg expansion
sub_20CEB70   0x20CEB70   10640         AtomicExpandPass: full CmpXchg LL/SC expansion
sub_21E5E70   0x21E5E70                 PTX emission: base atomic opcode emitter
sub_21E6420   0x21E6420                 PTX emission: L2-hinted atomic opcode emitter
sub_21E8EA0   0x21E8EA0                 PTX emission: cluster barrier emitter
sub_21E94F0   0x21E94F0                 PTX emission: membar/fence emitter
sub_9502D0    0x9502D0    55KB          PTX inline ASM atomic codegen (NVVM side)
sub_94F9E0    0x94F9E0                  NVVM membar emitter
sub_94FDF0    0x94FDF0                  NVVM fence emitter

Cross-References