Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

NVVM Builtin Table Structure

770 builtins mapped to integer IDs (1--770) in a wyhash open-addressing hash table. Dual tables exist: pre-optimization (sub_90AEE0) and post-optimization (sub_126A910), both with identical content but separate address spaces.

Pre-opt table buildersub_90AEE0 (109 KB, populates all 770 entries)
Pre-opt dispatchersub_913450 (name -> ID lookup)
Post-opt table buildersub_126A910 (123 KB)
Post-opt dispatchersub_12731E0 (name -> ID lookup)
Hash functionsub_CBF760 (wyhash v4 family)
Hash table insertsub_90ADD0 -> sub_C92610 -> sub_C92740
Hash table findsub_C92860 (find-only, quadratic probing)
Rehashsub_C929D0 (75% load factor trigger)
Total builtins770 (IDs 1--770)
StorageOpen-addressing at context+480 (20-byte header)

Architecture

sub_913450 (public API: name -> builtin ID)
  |
  +-- Guard: context+492 == 0?
  |    +-- sub_90AEE0 (lazy init: populate all 770 entries, once)
  |
  +-- strlen(name)
  +-- sub_C92610(name, len)         -> compute wyhash
  +-- sub_C92860(context+480, ...)  -> quadratic probe find
  |
  +-- return *(uint32*)(entry + 8)  -> the builtin ID

Hash Table Infrastructure

The builtin name table uses a specialized 20-byte hash table header at context+480 with a parallel hash cache array and wyhash-v4 string hashing. The table employs quadratic probing with triangular-number increments and grows at 75% load factor. For 770 entries the capacity sequence is 16 -> 32 -> 64 -> 128 -> 256 -> 512 -> 1024.

Full structural details -- table layout, bucket format, string entry format, wyhash length-dispatch table with pseudocode, probing algorithm, triple-gated comparison guard, rehash procedure, and sentinel values -- are documented in Hash Table and Collection Infrastructure. The "wyhash v4 String Hasher" and "Probing Strategy" sections on that page are the canonical references.

Complete Builtin ID Inventory

Synchronization & Compiler Intrinsics (IDs 1–7)

IDName
1__syncthreads
2__nvvm_bar0
3__nvvm_membar_cta
4__nvvm_membar_gl
5__nvvm_membar_sys
6__builtin_is_constant_evaluated
7__builtin_unreachable

Cluster Operations — SM 90+ (IDs 8–14)

IDName
8__nv_clusterDimIsSpecifed_impl
9__nv_clusterRelativeBlockRank_impl
10__nv_clusterSizeInBlocks_impl
11__nv_cluster_barrier_arrive_impl
12__nv_cluster_barrier_wait_impl
13__nv_cluster_barrier_arrive_relaxed_impl
14__nv_threadfence_cluster_impl

Barrier Extensions (IDs 15–20)

IDName
15–17__nvvm_bar0_{popc,and,or}
18–20__nvvm_bar{_sync_all,rier_sync,_warp_sync}

Bit Manipulation (IDs 21–26)

__nvvm_clz_{i,ll}, __nvvm_popc_{i,ll}, __nvvm_brev{32,64}

Math — Rounding/Abs/Saturate (IDs 27–56)

__nvvm_{floor,ceil,abs,fabs,round,trunc,saturate}_{ftz_f,f,d}, __nvvm_{ex2,lg2,sin,cos}_approx_{ftz_f,f,d}

Reciprocal / Sqrt / Rsqrt (IDs 57–87)

__nvvm_rcp_{rn,rz,rm,rp}_{ftz_f,f,d}, __nvvm_sqrt_{f,rn,rz,rm,rp}_{ftz_f,f,d}, __nvvm_rsqrt_approx_{ftz_f,f,d}

Type Conversions (IDs 88–184)

97 entries covering all float↔int, double↔int, float↔half, bitcast combinations with all four rounding modes and FTZ variants.

Address Space & Memory Queries (IDs 185–204)

IDName
185__nv_isGlobal_impl
186–188__nv_bswap{16,32,64}_impl
189–192__nv_is{Shared,Constant,Local,GridConstant}_impl
193–200__nv_cvta_{generic_to,to_generic}_{global,shared,constant,local}_impl
201__builtin_assume
202__nv_isClusterShared_impl
203__nv_cluster_query_shared_rank_impl
204__nv_associate_access_property_impl

Atomic Operations — Legacy NVVM (IDs 207–275)

69 entries: __nvvm_atom_{,cta_,sys_}{add,xchg,min,max,inc,dec,and,or,xor}_gen_{i,ll,f,d,ui,ull,128}

FP Arithmetic (IDs 276–349)

__nvvm_{min,max}_{i,ui,ll,ull}, __nvvm_f{min,max}_{f,ftz_f,d}, __nvvm_mulhi_{i,ui,ll,ull}, __nvvm_mul_{rn,rz,rm,rp}_{ftz_f,f,d}, __nvvm_div_*, __nvvm_add_*

Vote Operations (IDs 351–358)

__nvvm_vote_{all,any,uni,ballot} + _sync variants

Match Operations (IDs 361–364)

__match{32,64}_{any,all}_sync

FMA (IDs 383–403)

__nvvm_fma_{rn,rz,rm,rp}_{ftz_f,f,d,ftz_f2,f2}

C++11 Atomics (IDs 417–473)

Sized variants: __nv_atomic_{load,store,fetch_add,fetch_sub,fetch_and,fetch_or,fetch_xor,fetch_max,fetch_min,exchange,compare_exchange}_{1,2,4,8,16}_{u,s,f}

Surface Stores — sust (IDs 474–638)

165 entries covering __nvvm_sust_b_{1d,1d_array,2d,2d_array,3d}_{i8,...,v4i32}_{clamp,trap,zero}.

Pattern: sust_b_<dim>_<type>_<oob_mode> across 5 dimensions × 11 types × 3 OOB modes.

CUDA Varargs (IDs 639–642)

__cu_va_{start,end,arg,copy}

Tex/Surf Handler (ID 647)

__nv_tex_surf_handler — generic dispatch for texture/surface reads (surface stores use the dedicated sust builtins above).

C++ ABI (IDs 648–677)

__cxa_vec_{ctor,cctor,dtor,new2,new,new3,delete2,delete,delete3}, __gen_nvvm_mem{cpy,set}_*, _Znw{j,m,y}, _Zna{j,m,y}, _ZdlPv{,m,y}, _ZdaPv{,m,y}

WMMA Tensor Core — SM 70+ (IDs 678–707)

30 entries: __hmma_m{16n16k16,32n8k16,8n32k16}_{ld_a,ld_b,ld_c_f16,ld_c_f32,st_c_f16,st_c_f32,mma_f16f16,mma_f32f16,mma_f16f32,mma_f32f32}

Integer/Binary Tensor Core — SM 75+ (IDs 708–745)

38 entries: __imma_m{16n16k16,32n8k16,8n32k16}_{ld_a,ld_b,ld_c,st_c,mma}_{s8,u8}, __imma_m8n8k32_{s4,u4}, __bmma_m8n8k128_{b1}

Extended Tensor Core — SM 80+ (IDs 746–764)

__dmma_m8n8k4_mma_f64, __mma_tf32_m16n16k8_mma_f32, __mma_bf16_m*_mma_f32 + load/store variants

WGMMA — SM 90+ (IDs 765–768)

__wgmma_mma_async_{f16,bf16,tf32,f8}

Alloca (IDs 769–770)

_alloca, __builtin_alloca

Category Summary

CategoryID RangeCount
Sync/barriers/cluster1–2020
Bit manipulation21–266
Math (floor/ceil/abs/round/etc)27–5630
Reciprocal/sqrt/rsqrt57–8731
Type conversions88–18497
Address space queries/cvta185–20420
Atomic ops (NVVM legacy)207–27569
FP min/max, mulhi, arithmetic276–34974
Vote + match operations351–36412
Compare-and-swap370–37910
FMA383–40321
Shuffle + misc404–41613
C++11 atomics (sized)417–47357
Surface stores (sust)474–638165
CUDA varargs + math shim639–6468
Tex/surf handler6471
C++ ABI + memgen + new/delete648–67730
WMMA tensor core (f16)678–70730
IMMA/BMMA tensor core708–74538
Extended tensor (dmma/tf32/bf16)746–76419
WGMMA (SM 90+ warpgroup)765–7684
Alloca769–7702
TOTAL770

SM Generation Coverage

GenerationFeatures Enabled
SM 70 (Volta)WMMA (half-precision tensor core)
SM 75 (Turing)IMMA (integer), BMMA (binary)
SM 80 (Ampere)DMMA (double), TF32, BF16
SM 90 (Hopper)WGMMA (warpgroup), cluster ops, f8

All 770 builtins are registered regardless of target SM. Architecture gating happens in the lowering layer that consumes the builtin IDs.

Key Observations

  • Lazy initialization: The entire table is built on first lookup. Guard: context+492 != 0.
  • No texture reads (suld): Only surface store builtins are registered. Texture/surface reads go through __nv_tex_surf_handler (ID 647).
  • Write-once table: Tombstone mechanics exist but deletions never occur for the builtin table.
  • Duplicate prefix optimization: IDA shows SSE xmmword constant loads for long common prefixes (__nvvm_sust_b_2d_array_*) — this is compiler optimization of string literal loads, not a different code path.