OpenMP Runtime Declaration Table

cicc embeds a 194-entry table of OpenMP runtime function declarations at sub_312CF50 (0x312CF50, 117 KB decompiled). This single function is the authoritative source for every __kmpc_*, omp_*, and __tgt_* device-runtime call the compiler can emit into NVPTX IR. It defines the complete ABI contract between compiler-generated GPU code and the OpenMP device runtime library (libomptarget / libomp). The function takes an integer case index (0--193), constructs the corresponding FunctionType, checks whether the symbol already exists in the module via Module::getNamedValue, and, if it is absent, creates the declaration via Function::Create with ExternalLinkage. The result is registered into a context-local array so that any later codegen pass can reference a runtime function by its numeric index without reconstructing the type.

Upstream LLVM defines the same runtime function set declaratively in llvm/include/llvm/Frontend/OpenMP/OMPKinds.def using the __OMP_RTL macro, which the OMPIRBuilder expands at construction time. cicc's table is a procedural equivalent: a giant switch(a3) with 194 cases that does exactly what OMPKinds.def + OMPIRBuilder::initialize() do, but compiled into the binary rather than generated from a .def file. The ordering of cases 0--193 matches the upstream OMPRTL_ enum one-to-one, confirming that cicc v13.0 tracks LLVM 18.x's OpenMP runtime interface.

Key Facts

| Property | Value |
|---|---|
| Entry point | sub_312CF50 @ 0x312CF50 |
| Decompiled size | 117 KB |
| Total entries | 194 (indices 0--193) |
| Sentinel | index 193 = __last (void function, marks table end) |
| Varargs entries | 2: index 7 (__kmpc_fork_call), index 118 (__kmpc_fork_teams) |
| Linkage for all entries | ExternalLinkage (encoded as 0x103 = 259) |
| Special attribute | Attribute #26 applied to indices 7 and 118 post-creation |
| Registration helper | sub_3122A50(context, index, funcDecl) |
| Type construction | sub_BCF480 = FunctionType::get |
| Symbol lookup | sub_BA8CB0 = Module::getNamedValue |
| Function creation | sub_B2C660 = Function::Create |
| Upstream equivalent | OMPKinds.def __OMP_RTL entries + OMPIRBuilder::initialize() |
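The lookup-or-create protocol can be modeled compactly. The sketch below is a simplified Python model of the flow described above; the class, dictionary, and variable names are illustrative, not symbols from the binary, and the real code operates on LLVM Module/Function objects rather than dicts:

```python
# Simplified model of sub_312CF50's lookup-or-create protocol.
# All names here are illustrative; the real code manipulates LLVM
# Module / Function objects and a context-local pointer array.

RTL_NAMES = {0: "__kmpc_barrier", 7: "__kmpc_fork_call", 193: "__last"}
VARARGS_INDICES = {7, 118}          # receive attribute #26 post-creation

class RuntimeTable:
    def __init__(self):
        self.module_symbols = {}    # Module::getNamedValue equivalent
        self.slots = {}             # context-local array, keyed by index

    def get_or_create(self, index):
        name = RTL_NAMES[index]
        decl = self.module_symbols.get(name)       # symbol lookup
        if decl is None:                            # absent: create it
            decl = {"name": name,
                    "linkage": 0x103,               # ExternalLinkage encoding
                    "varargs": index in VARARGS_INDICES}
            self.module_symbols[name] = decl        # Function::Create
        self.slots[index] = decl                    # sub_3122A50 registration
        return decl

table = RuntimeTable()
a = table.get_or_create(0)
b = table.get_or_create(0)   # second lookup returns the same declaration
```

A second request for the same index returns the cached declaration rather than creating a duplicate, which is exactly why later passes can refer to runtime functions by numeric slot.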

Context Object Type Cache

The first parameter a1 points to the OpenMP runtime context object. Starting at offset +2600, it contains a pre-allocated cache of LLVM types used to construct function signatures, avoiding redundant Type::get* calls:

| Offset | Type | LLVM equivalent |
|---|---|---|
| +2600 | void | Type::getVoidTy |
| +2608 | i1 | Type::getInt1Ty |
| +2616 | i8 | Type::getInt8Ty |
| +2624 | i16 | Type::getInt16Ty |
| +2632 | i32 | Type::getInt32Ty |
| +2640 | i64 | Type::getInt64Ty |
| +2648 | i8* | PointerType::get(i8, 0) |
| +2664 | i32* | PointerType::get(i32, 0) |
| +2672 | i64* | PointerType::get(i64, 0) |
| +2680 | double | Type::getDoubleTy |
| +2688 | i64 / size_t | DataLayout::getIntPtrType |
| +2704 | i8* (generic ptr) | PointerType::get(i8, 0) |
| +2712 | i8** | PointerType::get(i8*, 0) |
| +2720 | i8*** | PointerType::get(i8**, 0) |
| +2752 | kmp_critical_name* | [8 x i32]* |
| +2784 | ident_t* | {i32, i32, i32, i32, i8*}* |
| +2800 | __tgt_kernel_arguments* | 13-field struct pointer |
| +2816 | __tgt_async_info* | {i8*}* |
| +2896 | KernelEnvironmentTy* | {ConfigEnv, ident_t*, DynEnv*}* |
| +2912 | KernelLaunchEnvironmentTy* | {i32, i32}* |
| +2928 | kmpc_micro | void(i32*, i32*, ...)* (varargs microtask) |
| +2944 | kmp_reduce_func | void(i8*, i8*)* |
| +2960 | kmp_copy_func | void(i8*, i8*)* |
| +3008 | kmpc_ctor | i8*(i8*)* |
| +3024 | kmp_routine_entry_t | i32(i32, i8*)* |
| +3040 | kmp_ShuffleReductFctPtr | void(i8*, i16, i16, i16)* |
| +3056 | kmp_InterWarpCopyFctPtr | void(i8*, i32)* |
| +3072 | kmp_ListGlobalFctPtr | void(i8*, i32, i8*)* |

This layout mirrors the OMP_TYPE, OMP_STRUCT_TYPE, and OMP_FUNCTION_TYPE sections of upstream OMPKinds.def. The struct type definitions for ident_t, KernelEnvironmentTy, and __tgt_kernel_arguments match the upstream __OMP_STRUCT_TYPE declarations exactly.

Execution Modes: SPMD vs Generic

GPU OpenMP kernels operate in one of two execution modes, and the choice fundamentally determines which runtime functions the compiler emits:

| Mode | Value | Description | Worker threads |
|---|---|---|---|
| Generic | 1 | Master-worker state machine. Only thread 0 runs serial code; workers spin in a polling loop (__kmpc_barrier_simple_generic). Parallel regions are dispatched via __kmpc_kernel_prepare_parallel / __kmpc_kernel_parallel. | Idle until parallel region |
| SPMD | 2 | All threads execute the same code from kernel entry. Serial sections between parallel regions are guarded by tid == 0 checks with shared-memory output promotion and __kmpc_barrier_simple_spmd barriers. | Active from first instruction |
| Generic-SPMD | 3 | Transient state during the Generic-to-SPMD transformation. Never observed at runtime. | N/A |

The execution mode is encoded in a bit-vector attached to the kernel function's metadata. The runtime function __kmpc_target_init (index 155) reads the KernelEnvironmentTy struct which embeds the ConfigurationEnvironmentTy -- the first byte of that inner struct encodes the execution mode. __kmpc_is_spmd_exec_mode (index 186) queries it at runtime.

The SPMD-vs-Generic distinction affects which runtime calls appear in the generated IR:

  • Generic mode kernels call __kmpc_kernel_prepare_parallel, __kmpc_kernel_parallel, __kmpc_kernel_end_parallel, __kmpc_barrier_simple_generic, and the full __kmpc_fork_call microtask dispatch.
  • SPMD mode kernels call __kmpc_parallel_51 (index 158) for nested parallelism, __kmpc_barrier_simple_spmd for synchronization, and __kmpc_alloc_shared / __kmpc_free_shared for shared-memory output promotion between guarded and parallel sections.
  • Both modes call __kmpc_target_init / __kmpc_target_deinit for kernel lifecycle management.

Call Generation Infrastructure

When any codegen pass needs a runtime function, it calls sub_312CF50(omp_context + 400, existing_value, case_index). The omp_context object (typically at a2+208 in the pass state) contains both the type cache (+2600..+3072) and the runtime function array. If Module::getNamedValue finds the symbol already declared, it is returned immediately; otherwise a new declaration is created and registered.

Once a declaration is obtained, sub_921880 (create runtime library call instruction) builds the CallInst node with the argument list from current SSA values, attaches debug/source location metadata, and inserts it at the specified basic block position.

Primary Consumers

| Pass | Address | Size | Runtime Entries Used |
|---|---|---|---|
| Generic-to-SPMD transform | sub_26968A0 | 61 KB | 6 (thread ID), 180 (alloc_shared), 181 (free_shared), 187 (barrier_simple_spmd) |
| State machine generation | sub_2678420 | 41 KB | 155 (target_init), 156 (target_deinit), 171 (kernel_parallel), 172 (kernel_end_parallel), 188 (barrier_simple_generic) |
| Parallel region outliner | sub_313D1B0 | 47 KB | 7 (fork_call), 158 (parallel_51) |
| Parallel region merging | sub_2680940 | 52 KB | 180 (alloc_shared), 181 (free_shared), 187 (barrier_simple_spmd) |
| Attributor OpenMP driver | sub_269F530 | 63 KB | All -- identifies/folds known runtime calls by index |

Complete Runtime Function Table

All 194 entries, organized by functional category. The "Index" column is the switch case in sub_312CF50 and the slot in the context's runtime function array. Signatures use LLVM IR type syntax. The "Call Generation" column describes how and when cicc emits each call.

Standard OpenMP Runtime (0--13)

| Index | Function | Signature | Purpose | Call Generation |
|---|---|---|---|---|
| 0 | __kmpc_barrier | void(ident_t*, i32) | Explicit barrier | Emitted for #pragma omp barrier. On GPU compiles to __syncthreads(). OpenMPOpt may replace with index 187 (SPMD barrier) |
| 1 | __kmpc_cancel | i32(ident_t*, i32, i32) | Cancel construct | Third param: cancel kind (1=parallel, 2=sections, 3=for, 4=taskgroup). Returns nonzero if cancellation pending |
| 2 | __kmpc_cancel_barrier | void(ident_t*, i32) | Implicit barrier + cancel check | Generated at end of worksharing constructs when cancel is possible |
| 3 | __kmpc_error | void(ident_t*, i32, i8*) | Runtime error | Second param: severity (1=warning, 2=fatal). Third: message string pointer |
| 4 | __kmpc_flush | void(ident_t*) | Memory fence | #pragma omp flush. On GPU: __threadfence() or scope-specific fence |
| 5 | __kmpc_global_thread_num | i32(ident_t*) | Get global thread ID | On GPU: blockIdx*blockDim+threadIdx. Emitted at start of every region needing a thread identifier |
| 6 | __kmpc_get_hardware_thread_id_in_block | i32() | threadIdx.x equivalent | Direct PTX %tid.x wrapper. Used by SPMD transform (sub_26968A0) to build tid==0 guards. Lookup: sub_312CF50(..., 6) |
| 7 | __kmpc_fork_call | void(ident_t*, i32, kmpc_micro, ...) | Fork parallel region (varargs) | Second param: shared variable count. Third: outlined microtask pointer. Remaining: shared variables. On GPU Generic mode triggers worker state machine dispatch. Attribute #26 applied post-create |
| 8 | __kmpc_fork_call_if | void(ident_t*, i32, i32, i8*, i32) | Conditional fork | Third param: if-clause condition. If false, region executes serially |
| 9 | __kmpc_omp_taskwait | void(ident_t*, i32) | Taskwait | #pragma omp taskwait |
| 10 | __kmpc_omp_taskyield | i32(ident_t*, i32, i32) | Task yield point | Third param: end-of-task flag |
| 11 | __kmpc_push_num_threads | void(ident_t*, i32, i32) | Set thread count | num_threads(N) clause. Pushes count for next parallel region |
| 12 | __kmpc_push_proc_bind | void(ident_t*, i32, i32) | Set affinity | proc_bind(spread/close/master). Third param encodes binding policy |
| 13 | __kmpc_omp_reg_task_with_affinity | i32(ident_t*, i32, i8*, i32, i8*) | Register task with affinity info | OMP 5.0 affinity clause |

Index 7 (__kmpc_fork_call) and index 118 (__kmpc_fork_teams) are the only two varargs entries. Both receive special post-processing: sub_B994D0 sets function attribute #26 (likely the convergent attribute or a varargs-related marker), checked via sub_B91C10. This prevents the optimizer from incorrectly splitting, duplicating, or removing these calls.
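The varargs convention of __kmpc_fork_call -- an outlined microtask pointer followed by the region's shared variables -- can be modeled with a small Python sketch. The function and parameter names below are illustrative; the real ABI passes global_tid/bound_tid by pointer:

```python
# Toy model of the __kmpc_fork_call convention: the compiler outlines the
# parallel region into a microtask taking (global_tid, bound_tid, shared...),
# and the fork call forwards the varargs shared variables to every thread.

def kmpc_fork_call(num_threads, microtask, *shared_vars):
    results = []
    for tid in range(num_threads):
        # global_tid / bound_tid are pointers in the real ABI; plain ints here
        results.append(microtask(tid, tid, *shared_vars))
    return results

# Outlined body of: #pragma omp parallel { partial[tid] = a[tid] + b[tid]; }
def microtask(global_tid, bound_tid, a, b):
    return a[global_tid] + b[global_tid]

out = kmpc_fork_call(4, microtask, [1, 2, 3, 4], [10, 20, 30, 40])
# out == [11, 22, 33, 44]
```

Because the shared-variable count and types vary per call site, the declaration must be varargs -- which is precisely why indices 7 and 118 need the protective attribute.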

Hardware Query (14--16)

| Index | Function | Signature | Purpose |
|---|---|---|---|
| 14 | __kmpc_get_hardware_num_blocks | i32() | gridDim.x equivalent |
| 15 | __kmpc_get_hardware_num_threads_in_block | i32() | blockDim.x equivalent |
| 16 | __kmpc_get_warp_size | i32() | Warp size (32 on NVIDIA) |

These three functions have no parameters -- they are direct wrappers around PTX special registers (%nctaid.x, %ntid.x, and a compile-time constant 32).

OMP Standard Library API (17--45)

| Index | Function | Signature | Purpose |
|---|---|---|---|
| 17 | omp_get_thread_num | i32() | Thread ID within team |
| 18 | omp_get_num_threads | i32() | Threads in current team |
| 19 | omp_get_max_threads | i32() | Max threads available |
| 20 | omp_in_parallel | i32() | Inside parallel region? |
| 21 | omp_get_dynamic | i32() | Dynamic adjustment enabled? |
| 22 | omp_get_cancellation | i32() | Cancellation enabled? |
| 23 | omp_get_nested | i32() | Nested parallelism enabled? |
| 24 | omp_get_schedule | void(i32*, i32*) | Query loop schedule |
| 25 | omp_get_thread_limit | i32() | Max total threads |
| 26 | omp_get_supported_active_levels | i32() | Max supported nesting |
| 27 | omp_get_max_active_levels | i32() | Current max nesting |
| 28 | omp_get_level | i32() | Current nesting depth |
| 29 | omp_get_ancestor_thread_num | i32(i32) | Ancestor thread ID |
| 30 | omp_get_team_size | i32(i32) | Team size at nesting level |
| 31 | omp_get_active_level | i32() | Active parallel nesting |
| 32 | omp_in_final | i32() | Inside final task? |
| 33 | omp_get_proc_bind | i32() | Current binding policy |
| 34 | omp_get_num_places | i32() | Number of places |
| 35 | omp_get_num_procs | i32() | Available processors |
| 36 | omp_get_place_proc_ids | void(i32, i32*) | Processor IDs in place |
| 37 | omp_get_place_num | i32() | Current place number |
| 38 | omp_get_partition_num_places | i32() | Places in partition |
| 39 | omp_get_partition_place_nums | void(i32*) | Place numbers in partition |
| 40 | omp_get_wtime | double() | Wall clock time |
| 41 | omp_set_num_threads | void(i32) | Set thread count |
| 42 | omp_set_dynamic | void(i32) | Enable/disable dynamic |
| 43 | omp_set_nested | void(i32) | Enable/disable nesting |
| 44 | omp_set_schedule | void(i32, i32) | Set loop schedule |
| 45 | omp_set_max_active_levels | void(i32) | Set max nesting |

These are the user-facing OpenMP API functions. On GPU, most return compile-time constants or trivial register reads. The Attributor-based OpenMP driver (sub_269F530) can fold many of these to constants when the execution mode and team configuration are statically known -- for example, omp_get_num_threads folds to the blockDim.x launch parameter.
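A minimal Python model of this folding step, under the assumption that the pass is handed a statically known launch configuration (the `FOLDABLE` map and `fold_call` helper are illustrative names, not structures from the binary):

```python
# Illustrative model of Attributor-style constant folding: when the launch
# configuration is statically known, certain runtime queries can be
# replaced by literals instead of emitting a call.

FOLDABLE = {
    "omp_get_num_threads":            lambda cfg: cfg["block_dim"],
    "__kmpc_get_hardware_num_blocks": lambda cfg: cfg["grid_dim"],
    "__kmpc_get_warp_size":           lambda cfg: 32,  # constant on NVIDIA
}

def fold_call(callee, launch_cfg):
    """Return a constant to substitute for the call, or None to keep it."""
    folder = FOLDABLE.get(callee)
    return folder(launch_cfg) if folder else None

cfg = {"block_dim": 128, "grid_dim": 80}
folded = fold_call("omp_get_num_threads", cfg)   # -> 128
```

Calls with runtime-dependent results (barriers, task submission) return `None` and are left in the IR untouched.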

Begin/End (53--54)

| Index | Function | Signature | Purpose |
|---|---|---|---|
| 53 | __kmpc_begin | void(ident_t*, i32) | Library initialization (rarely used on GPU) |
| 54 | __kmpc_end | void(ident_t*) | Library shutdown |

Master/Masked Constructs (46--49)

| Index | Function | Signature | Purpose | Call Generation |
|---|---|---|---|---|
| 46 | __kmpc_master | i32(ident_t*, i32) | Enter master region | Returns 1 for master thread (thread 0), 0 for all others. IRGen wraps user code in if(__kmpc_master(..)) {...} |
| 47 | __kmpc_end_master | void(ident_t*, i32) | Exit master region | Called at end of master block |
| 48 | __kmpc_masked | i32(ident_t*, i32, i32) | Enter masked region (OMP 5.1) | Third param is the filter ID (which specific thread executes). Replaces master in OMP 5.1 |
| 49 | __kmpc_end_masked | void(ident_t*, i32) | Exit masked region | Called at end of masked block |

Critical Sections (50--52)

| Index | Function | Signature | Purpose | Call Generation |
|---|---|---|---|---|
| 50 | __kmpc_critical | void(ident_t*, i32, kmp_critical*) | Enter critical section | On GPU: atomic spin-lock acquire on the 32-byte lock variable |
| 51 | __kmpc_critical_with_hint | void(ident_t*, i32, i32, kmp_critical*) | Enter with lock hint | Hint encodes contention strategy (uncontended, contended, speculative, non-speculative) |
| 52 | __kmpc_end_critical | void(ident_t*, i32, kmp_critical*) | Exit critical section | Atomic release on lock variable |

On GPU, critical sections use atomic operations on global memory. The kmp_critical_name type is [8 x i32] (32 bytes), used as an atomic lock variable. The _with_hint variant accepts a contention hint that the GPU runtime maps to different atomic strategies.

Reduction (55--58)

| Index | Function | Signature | Purpose |
|---|---|---|---|
| 55 | __kmpc_reduce | i32(ident_t*, i32, i32, i64, i8*, kmp_reduce_func, kmp_critical*) | Begin reduction (blocking) |
| 56 | __kmpc_reduce_nowait | i32(ident_t*, i32, i32, i64, i8*, kmp_reduce_func, kmp_critical*) | Begin reduction (non-blocking) |
| 57 | __kmpc_end_reduce | void(ident_t*, i32, kmp_critical*) | End reduction (blocking) |
| 58 | __kmpc_end_reduce_nowait | void(ident_t*, i32, kmp_critical*) | End reduction (non-blocking) |

These are the standard reduction protocol entries. On GPU, the compiler typically prefers the NVIDIA-specific shuffle-based reductions (indices 176--178) which are significantly faster.

Static Loop Scheduling (61--70)

| Index | Function | Signature |
|---|---|---|
| 61--64 | __kmpc_for_static_init_{4,4u,8,8u} | void(ident_t*, i32, i32, i32*, {i32,i64}*, {i32,i64}*, {i32,i64}*, {i32,i64}, {i32,i64}) |
| 65 | __kmpc_for_static_fini | void(ident_t*, i32) |
| 66--69 | __kmpc_distribute_static_init_{4,4u,8,8u} | Same 9-param shape as 61--64 |
| 70 | __kmpc_distribute_static_fini | void(ident_t*, i32) |

The _4 / _4u / _8 / _8u suffixes indicate signed-32, unsigned-32, signed-64, and unsigned-64 loop variable types respectively. All static_init functions take 9 parameters: location, thread ID, schedule type, a pointer to the is-last flag, pointers to the lower bound, upper bound, and stride, then the increment and chunk size by value.
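What the runtime computes through those out-parameters can be sketched for the simplest case, schedule(static) with no chunk clause, where each thread receives one contiguous block of iterations. This Python model returns the out-params as a tuple; the exact partitioning below is the common blocked scheme, assumed rather than lifted from the binary:

```python
# Sketch of what __kmpc_for_static_init_4 computes for schedule(static):
# out-params (is_last, lower, upper, stride) returned as a tuple.

def for_static_init(tid, num_threads, lower, upper, incr=1):
    trip = (upper - lower) // incr + 1              # inclusive bounds
    span = (trip + num_threads - 1) // num_threads  # iterations per thread
    my_lower = lower + tid * span * incr
    my_upper = min(my_lower + (span - 1) * incr, upper)
    is_last = 1 if my_upper == upper else 0
    stride = span * incr
    return is_last, my_lower, my_upper, stride

# 100 iterations over 4 threads: thread 0 gets [0,24], thread 3 gets [75,99]
```

The generated loop then simply iterates from its private lower to upper bound; __kmpc_for_static_fini is called once the loop retires.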

Dynamic Dispatch (71--87)

Indices 71--74 handle distribute + dynamic dispatch initialization. Indices 75--82 handle standard dispatch_init and dispatch_next for the four integer widths. Indices 83--87 are dispatch finalization. Total: 17 entries covering the full dynamic loop scheduling interface.
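The dispatch protocol is a pull loop: after dispatch_init seeds the shared iteration space, each thread repeatedly calls dispatch_next until it reports no remaining work. A minimal Python model (single-threaded; the real counter is updated atomically across threads):

```python
# Sketch of the dispatch_init / dispatch_next protocol for schedule(dynamic):
# threads pull fixed-size chunks from a shared iteration counter until
# dispatch_next reports no work left.

class DispatchState:
    def __init__(self, lower, upper, chunk):   # __kmpc_dispatch_init_4
        self.next, self.upper, self.chunk = lower, upper, chunk

    def dispatch_next(self):                   # __kmpc_dispatch_next_4
        if self.next > self.upper:
            return None                        # returns 0 in the real ABI
        lo = self.next
        hi = min(lo + self.chunk - 1, self.upper)
        self.next = hi + 1
        return lo, hi                          # out-params p_lb / p_ub

state = DispatchState(0, 9, 4)
chunks = []
while (c := state.dispatch_next()) is not None:
    chunks.append(c)
# chunks == [(0, 3), (4, 7), (8, 9)]
```

The generated IR wraps the user loop body in exactly this while-dispatch_next structure, with dispatch_fini handling early exit paths.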

Team Static & Combined Distribute-For (88--95)

Indices 88--91 (__kmpc_team_static_init_{4,4u,8,8u}) handle team-level static work distribution. Indices 92--95 (__kmpc_dist_for_static_init_{4,4u,8,8u}) are the combined distribute parallel for static init, taking 10 parameters (the extra parameter is the distribute upper bound pointer).

Tasking (98--116)

19 entries covering the full OpenMP tasking interface:

| Index | Function | Signature | Purpose |
|---|---|---|---|
| 98 | __kmpc_omp_task_alloc | i8*(ident_t*, i32, i32, i64, i64, kmp_routine_entry_t) | Allocate task descriptor (6 params). Returns kmp_task_t*. Params: flags, sizeof_task, sizeof_shareds, task_entry |
| 99 | __kmpc_omp_task | i32(ident_t*, i32, i8*) | Submit allocated task for execution. Third param is the kmp_task_t* from task_alloc |
| 100 | __kmpc_end_taskgroup | void(ident_t*, i32) | End #pragma omp taskgroup |
| 101 | __kmpc_taskgroup | void(ident_t*, i32) | Begin taskgroup |
| 102 | __kmpc_omp_task_begin_if0 | void(ident_t*, i32, i8*) | Begin immediate task (when if clause evaluates to false) |
| 103 | __kmpc_omp_task_complete_if0 | void(ident_t*, i32, i8*) | Complete immediate task |
| 104 | __kmpc_omp_task_with_deps | i32(ident_t*, i32, i8*, i32, i8*, i32, i8*) | Task with dependency list (7 params). Params: task, ndeps, dep_list, ndeps_noalias, noalias_list |
| 105 | __kmpc_taskloop | void(ident_t*, i32, i8*, i32, i64*, i64*, i64, i32, i32, i64, i8*) | #pragma omp taskloop (11 params). Params: task, if_val, lb_p, ub_p, st, nogroup, sched, grainsize, task_dup |
| 106 | __kmpc_taskloop_5 | void(ident_t*, i32, i8*, i32, i64*, i64*, i64, i32, i32, i64, i8*, i32) | OMP 5.1 taskloop (12 params). Extra param: modifier |
| 107 | __kmpc_omp_target_task_alloc | i8*(ident_t*, i32, i32, i64, i64, kmp_routine_entry_t, i64) | Target-offload task allocation (7 params). Extra i64: device_id |
| 108 | __kmpc_taskred_modifier_init | i8*(ident_t*, i32, i32, i32, i8*) | Init task reduction with modifier (5 params). Params: is_ws, num, data |
| 109 | __kmpc_taskred_init | i8*(i32, i32, i8*) | Init task reduction (basic) |
| 110 | __kmpc_task_reduction_modifier_fini | void(ident_t*, i32, i32) | Finalize task reduction |
| 111 | __kmpc_task_reduction_get_th_data | i8*(i32, i8*, i8*) | Get thread-local reduction data |
| 112 | __kmpc_task_reduction_init | i8*(i32, i32, i8*) | Init task reduction (alternate path) |
| 113 | __kmpc_task_reduction_modifier_init | i8*(i8*, i32, i32, i32, i8*) | Init with full modifier (5 params) |
| 114 | __kmpc_proxy_task_completed_ooo | void(i8*) | Out-of-order proxy task completion. Used for detached tasks |
| 115 | __kmpc_omp_wait_deps | void(ident_t*, i32, i32, i8*, i32, i8*) | Wait on task dependencies (6 params) |
| 116 | __kmpc_omp_taskwait_deps_51 | void(ident_t*, i32, i32, i8*, i32, i8*, i32) | OMP 5.1 dependency wait (7 params). Extra param: nowait modifier |

Index 106 (__kmpc_taskloop_5) and index 116 (__kmpc_omp_taskwait_deps_51) are OMP 5.1 additions with an extra modifier parameter compared to their predecessors.

Teams and Cancellation (117--121)

| Index | Function | Signature | Purpose |
|---|---|---|---|
| 117 | __kmpc_cancellationpoint | i32(ident_t*, i32, i32) | Cancellation point check |
| 118 | __kmpc_fork_teams | void(ident_t*, i32, kmpc_micro, ...) | Fork teams region (varargs) |
| 119 | __kmpc_push_num_teams | void(ident_t*, i32, i32, i32) | Set team count |
| 120 | __kmpc_push_num_teams_51 | void(ident_t*, i32, i32, i32, i32) | Set team count (OMP 5.1, 5 params) |
| 121 | __kmpc_set_thread_limit | void(ident_t*, i32, i32) | Set per-team thread limit |

Copyprivate and Threadprivate (122--124)

| Index | Function | Signature | Purpose |
|---|---|---|---|
| 122 | __kmpc_copyprivate | void(ident_t*, i32, i64, i8*, kmp_copy_func, i32) | #pragma omp copyprivate. Broadcasts private data from single thread to all others. 6 params |
| 123 | __kmpc_threadprivate_cached | i8*(ident_t*, i32, i8*, i64, i8***) | Get/allocate threadprivate variable data. 5 params |
| 124 | __kmpc_threadprivate_register | void(ident_t*, i8*, kmpc_ctor, void*, void*) | Register threadprivate with ctor, copy-ctor, dtor callbacks |

Doacross Synchronization (125--128)

Cross-iteration dependencies for #pragma omp ordered depend(source/sink).

| Index | Function | Signature | Purpose |
|---|---|---|---|
| 125 | __kmpc_doacross_init | void(ident_t*, i32, i32, i8*) | Init doacross tracking. Params: num_dims, dims_info |
| 126 | __kmpc_doacross_post | void(ident_t*, i32, i64*) | Post (source): signal iteration completion |
| 127 | __kmpc_doacross_wait | void(ident_t*, i32, i64*) | Wait (sink): wait for iteration to complete |
| 128 | __kmpc_doacross_fini | void(ident_t*, i32) | Finalize doacross tracking |

Memory Allocators (129--136)

| Index | Function | Signature | Purpose |
|---|---|---|---|
| 129 | __kmpc_alloc | i8*(i32, i64, i8*) | OpenMP allocator alloc. Params: gtid, size, allocator |
| 130 | __kmpc_aligned_alloc | i8*(i32, i64, i64, i8*) | Aligned allocation. Params: gtid, align, size, allocator |
| 131 | __kmpc_free | void(i32, i8*, i8*) | Free allocated memory. Params: gtid, ptr, allocator |
| 132 | __tgt_interop_init | void(ident_t*, i32, i8**, i32, i32, i32, i8*, i32) | OMP 5.1 foreign runtime interop init (8 params) |
| 133 | __tgt_interop_destroy | void(ident_t*, i32, i8**, i32, i32, i32, i8*) | Destroy interop object (7 params) |
| 134 | __tgt_interop_use | void(ident_t*, i32, i8**, i32, i32, i32, i8*) | Use interop object (7 params) |
| 135 | __kmpc_init_allocator | i8*(i32, i32, i8*, i8*) | Init OpenMP allocator. Params: gtid, memspace, num_traits, traits |
| 136 | __kmpc_destroy_allocator | void(i32, i8*) | Destroy allocator |

Target Offloading (137--153)

17 entries implementing the host-side target offloading protocol. These are primarily used when cicc compiles host code that launches GPU kernels, not within device code itself:

| Index | Function | Signature | Params | Purpose |
|---|---|---|---|---|
| 137 | __kmpc_push_target_tripcount_mapper | void(ident_t*, i64, i64) | 3 | Set iteration count for target region. Params: device_id, trip_count |
| 138 | __tgt_target_mapper | i32(ident_t*, i64, i8*, i32, i8**, i8**, i64*, i64*, i8**, i8**) | 10 | Launch target region with data mapping |
| 139 | __tgt_target_nowait_mapper | (14 params) | 14 | Async target launch. Adds depobj count/list, noalias count/list |
| 140 | __tgt_target_teams_mapper | (12 params) | 12 | Target teams launch. Adds num_teams, thread_limit, mappers |
| 141 | __tgt_target_teams_nowait_mapper | (16 params) | 16 | Async target teams. Most complex host-side offload call |
| 142 | __tgt_target_kernel | i32(ident_t*, i64, i32, i32, i8*, __tgt_kernel_args*) | 6 | New-style kernel launch (takes __tgt_kernel_arguments*) |
| 143 | __tgt_target_kernel_nowait | (10 params) | 10 | Async new-style launch. Adds depobj info |
| 144 | __tgt_target_data_begin_mapper | (9 params) | 9 | Map data to device |
| 145 | __tgt_target_data_begin_nowait_mapper | (13 params) | 13 | Async map-to |
| 146 | __tgt_target_data_begin_mapper_issue | (10 params) | 10 | Split-phase issue for async map-to |
| 147 | __tgt_target_data_begin_mapper_wait | void(i64, __tgt_async_info*) | 2 | Split-phase wait for async map-to |
| 148 | __tgt_target_data_end_mapper | (9 params) | 9 | Map data from device |
| 149 | __tgt_target_data_end_nowait_mapper | (13 params) | 13 | Async map-from |
| 150 | __tgt_target_data_update_mapper | (9 params) | 9 | Data update (host-to-device or device-to-host) |
| 151 | __tgt_target_data_update_nowait_mapper | (13 params) | 13 | Async data update |
| 152 | __tgt_mapper_num_components | i64(i8*) | 1 | Query user-defined mapper component count |
| 153 | __tgt_push_mapper_component | void(i8*, i8*, i8*, i64, i64, i8*) | 6 | Register mapper component. Params: handle, base, begin, size, type, name |

Task Completion Event (154)

| Index | Function | Signature | Purpose |
|---|---|---|---|
| 154 | __kmpc_task_allow_completion_event | i8*(ident_t*, i32, i8*) | Allow completion event for detached tasks (OMP 5.0) |

GPU Kernel Lifecycle (155--158)

These are the most important entries for device-side GPU OpenMP code.

| Index | Function | Signature | Purpose | Call Generation |
|---|---|---|---|---|
| 155 | __kmpc_target_init | i32(KernelEnvironmentTy*, KernelLaunchEnvironmentTy*) | Kernel entry | First call in every GPU OpenMP kernel. State machine generator (sub_2678420) emits this at entry. KernelEnvironmentTy carries ConfigurationEnvironmentTy (first byte = execution mode) |
| 156 | __kmpc_target_deinit | void() | Kernel exit | Last call in every GPU OpenMP kernel. Emitted by state machine generator |
| 157 | __kmpc_kernel_prepare_parallel | void(i8*) | Generic: signal workers | Master thread writes outlined function pointer to shared memory, then signals workers to execute it. Replaced by __kmpc_parallel_51 after SPMD conversion |
| 158 | __kmpc_parallel_51 | void(ident_t*, i32, i32, i32, i32, i8*, i8*, i8**, i64) | OMP 5.1 GPU parallel dispatch | 9 params: if_expr, num_threads, proc_bind, fn, wrapper_fn, shared_args, num_shared_args. Used by parallel region outliner (sub_313D1B0) on SPMD kernels. Replaces fork_call for GPU |

__kmpc_target_init is the first runtime call in every GPU OpenMP kernel. In Generic mode, it returns -1 for worker threads (which should enter the polling loop) and 0 for the master thread. In SPMD mode, it returns 0 for all threads. The KernelEnvironmentTy struct carries the ConfigurationEnvironmentTy which encodes the execution mode, team sizes, and runtime configuration.
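The branch every kernel builds around that return value can be modeled directly. This Python sketch follows the convention just described (-1 sends Generic-mode workers to the polling loop, 0 falls through); function names are illustrative:

```python
# Sketch of the control flow built around __kmpc_target_init's return value.
SPMD, GENERIC = 2, 1

def target_init(exec_mode, tid):
    if exec_mode == SPMD:
        return 0                  # every thread executes the kernel body
    return 0 if tid == 0 else -1  # Generic: only the master falls through

def kernel_entry(exec_mode, tid):
    rc = target_init(exec_mode, tid)
    if rc == -1:
        return "worker_state_machine"   # poll for dispatched parallel work
    return "kernel_body"                # serial/master (or SPMD) user code
```

In SPMD mode the worker branch is statically dead, which is what lets the state-machine rewrite delete it after SPMDization.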

New-Style Static Loops, OMP 5.1+ (159--170)

12 entries implementing the callback-based loop interface introduced in OpenMP 5.1:

| Index | Function | Signature |
|---|---|---|
| 159--162 | __kmpc_for_static_loop_{4,4u,8,8u} | void(ident_t*, i8*, i8*, {i32,i64}, {i32,i64}, {i32,i64}) |
| 163--166 | __kmpc_distribute_static_loop_{4,4u,8,8u} | void(ident_t*, i8*, i8*, {i32,i64}, {i32,i64}) |
| 167--170 | __kmpc_distribute_for_static_loop_{4,4u,8,8u} | void(ident_t*, i8*, i8*, {i32,i64}, {i32,i64}, {i32,i64}, {i32,i64}) |

Unlike the old-style _init/_fini pairs, these new-style loops take function pointer callbacks (i8* for the loop body and data pointer) and handle initialization + execution + finalization in a single call.
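The inversion of control is the key difference, and it can be shown in a few lines of Python. The loop driver below uses a cyclic distribution purely for illustration; the actual distribution the DeviceRTL uses is not taken from the binary:

```python
# Sketch of the new-style callback loop interface: the runtime receives the
# outlined loop body plus its context pointer and drives the iterations
# itself (init + execution + finalization in one call).

def for_static_loop(body, data, tripcount, tid, num_threads):
    # cyclic distribution for illustration: thread tid handles
    # iterations tid, tid + num_threads, tid + 2*num_threads, ...
    for i in range(tid, tripcount, num_threads):
        body(i, data)

acc = []
for_static_loop(lambda i, d: d.append(i * i), acc, 8, tid=0, num_threads=2)
# thread 0 of 2 handles the even iterations: acc == [0, 4, 16, 36]
```

Keeping the whole loop inside one runtime call lets the device runtime pick the distribution without the compiler hard-coding bounds arithmetic into the kernel.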

Legacy Kernel-Mode Parallel (171--174)

| Index | Function | Signature | Purpose |
|---|---|---|---|
| 171 | __kmpc_kernel_parallel | i1(i8**) | Generic mode: worker checks if parallel work available |
| 172 | __kmpc_kernel_end_parallel | void() | Generic mode: worker signals completion |
| 173 | __kmpc_serialized_parallel | void(ident_t*, i32) | Execute parallel region serially (if(0) parallel) |
| 174 | __kmpc_end_serialized_parallel | void(ident_t*, i32) | End serialized parallel |

These are the Generic-mode worker-side functions. __kmpc_kernel_parallel returns true when the master thread has dispatched work via __kmpc_kernel_prepare_parallel, writing the outlined function pointer into the output parameter.
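A deterministic, single-threaded simulation of that handshake (real workers poll concurrently and synchronize with barriers; the class and slot names here are illustrative):

```python
# Simulation of the Generic-mode protocol: the master publishes an outlined
# function via kernel_prepare_parallel, each worker's kernel_parallel poll
# picks it up, and kernel_end_parallel clears the work slot.

class GenericState:
    def __init__(self):
        self.work_fn = None          # shared-memory function-pointer slot

    def kernel_prepare_parallel(self, fn):   # master side
        self.work_fn = fn

    def kernel_parallel(self):               # worker side: poll for work
        return self.work_fn                  # None models "no work yet"

    def kernel_end_parallel(self):           # workers done with the region
        self.work_fn = None

state = GenericState()
log = []
state.kernel_prepare_parallel(lambda tid: log.append(tid))
for tid in (1, 2, 3):                        # workers wake and run the region
    fn = state.kernel_parallel()
    if fn:
        fn(tid)
state.kernel_end_parallel()
# log == [1, 2, 3]; the work slot is cleared afterwards
```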

Warp-Level Primitives (175, 179, 189--190)

| Index | Function | Signature | Purpose |
|---|---|---|---|
| 175 | __kmpc_shuffle_int32 | i32(i32, i16, i16) | Warp shuffle for 32-bit value |
| 179 | __kmpc_shuffle_int64 | i64(i64, i16, i16) | Warp shuffle for 64-bit value |
| 189 | __kmpc_warp_active_thread_mask | i64() | Active lane mask (PTX activemask) |
| 190 | __kmpc_syncwarp | void(i64) | Warp-level barrier with mask |

The shuffle functions take (value, lane_offset, warp_size) and implement butterfly-pattern data exchange for intra-warp reductions. These compile down to PTX shfl.sync instructions.
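The butterfly pattern itself is easy to see in a Python model where the shuffle is replaced by direct array indexing (on hardware each step is one shfl.sync across live registers):

```python
# Butterfly (XOR-pattern) warp reduction: at each step every lane adds the
# value held by its partner `offset` lanes away, halving the offset until
# every lane holds the full warp sum.

WARP_SIZE = 32

def warp_reduce_sum(vals):
    vals = list(vals)
    offset = WARP_SIZE // 2
    while offset > 0:
        # each lane reads its butterfly partner (the shuffle) and accumulates
        vals = [vals[lane] + vals[lane ^ offset] for lane in range(WARP_SIZE)]
        offset //= 2
    return vals          # every lane now holds the total

result = warp_reduce_sum(range(WARP_SIZE))   # sum of 0..31 in every lane
```

Five steps (offsets 16, 8, 4, 2, 1) suffice for a 32-lane warp, with no shared memory traffic at all.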

NVIDIA Device Reduction (176--178)

| Index | Function | Signature | Purpose |
|---|---|---|---|
| 176 | __kmpc_nvptx_parallel_reduce_nowait_v2 | i32(ident_t*, i64, i8*, ShuffleReductFctPtr, InterWarpCopyFctPtr) | Intra-CTA parallel reduction |
| 177 | __kmpc_nvptx_teams_reduce_nowait_v2 | i32(ident_t*, i32, i8*, i64, i8*, ShuffleReductFctPtr, InterWarpCopyFctPtr, ListGlobalFctPtr, ListGlobalFctPtr, ListGlobalFctPtr, ListGlobalFctPtr) | Cross-CTA team reduction (11 params) |
| 178 | __kmpc_reduction_get_fixed_buffer | i8*() | Get global reduction scratch buffer |

These are the GPU-specific reduction entries -- the single most important performance-critical runtime calls for OpenMP on NVIDIA GPUs. The parallel reduction (index 176) uses a two-phase approach: (1) intra-warp reduction via shuffle, then (2) inter-warp reduction via shared memory copy. The compiler generates the ShuffleReductFctPtr and InterWarpCopyFctPtr callback functions as outlined helpers that the runtime calls during the reduction tree.

The teams reduction (index 177) adds four ListGlobalFctPtr callbacks for managing global memory buffers across CTAs, plus an extra size parameter. This is the most complex runtime call in the entire table, with 11 parameters.
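The two-phase shape of the parallel reduction can be sketched sequentially. In this Python model the intra-warp shuffle phase is abbreviated to a per-warp sum, and the inter-warp phase is the shared-memory combine; all names are illustrative:

```python
# Sketch of the two-phase protocol behind __kmpc_nvptx_parallel_reduce_nowait_v2
# for a 128-thread CTA: (1) each warp reduces internally via shuffles
# (ShuffleReductFctPtr), (2) each warp's lane 0 writes its partial result to
# shared memory and warp 0 combines them after a barrier (InterWarpCopyFctPtr).

WARP = 32

def cta_reduce_sum(values, num_threads=128):
    # phase 1: intra-warp shuffle reduction (modeled as a per-warp sum)
    warps = [values[w:w + WARP] for w in range(0, num_threads, WARP)]
    partials = [sum(w) for w in warps]     # each warp's lane-0 result
    # phase 2: lane 0 of each warp stores its partial to shared memory;
    # after a barrier, warp 0 reduces the partials
    shared_mem = partials
    return sum(shared_mem)

total = cta_reduce_sum(list(range(128)))   # 0 + 1 + ... + 127 == 8128
```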

Shared Memory Management (180--184)

| Index | Function | Signature | Purpose |
|---|---|---|---|
| 180 | __kmpc_alloc_shared | i8*(i64) | Dynamic shared memory allocation |
| 181 | __kmpc_free_shared | void(i8*, i64) | Free shared memory |
| 182 | __kmpc_begin_sharing_variables | void(i8***, i64) | Begin variable sharing protocol |
| 183 | __kmpc_end_sharing_variables | void() | End sharing protocol |
| 184 | __kmpc_get_shared_variables | i8**() | Get shared variable array |

__kmpc_alloc_shared / __kmpc_free_shared are heavily used in the SPMD transformation's guarded output mechanism: values computed by the master thread that are needed by all threads are stored into dynamically-allocated shared memory, synchronized via barrier, then loaded by all threads.
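One round of that guarded-output pattern, modeled sequentially in Python (a dict stands in for the __kmpc_alloc_shared slot, and the barrier is implicit in the sequential ordering; names are illustrative):

```python
# Sketch of the SPMD guarded-output pattern: a value computed only by
# thread 0 is promoted to shared storage, published across a barrier,
# and loaded by every thread.

def guarded_section(tids, compute):
    shared = {}                        # __kmpc_alloc_shared slot
    for tid in tids:                   # guarded serial section
        if tid == 0:
            shared["out"] = compute()
    # __kmpc_barrier_simple_spmd: all threads wait for the store
    loaded = {tid: shared["out"] for tid in tids}
    # __kmpc_free_shared would release the slot here
    return loaded

vals = guarded_section(range(4), lambda: 42)
# every thread observes the master's value: {0: 42, 1: 42, 2: 42, 3: 42}
```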

SPMD Mode Detection (185--188)

| Index | Function | Signature | Purpose |
|---|---|---|---|
| 185 | __kmpc_parallel_level | i16(ident_t*, i32) | Current parallel nesting depth |
| 186 | __kmpc_is_spmd_exec_mode | i8() | Returns 1 if SPMD, 0 if Generic |
| 187 | __kmpc_barrier_simple_spmd | void(ident_t*, i32) | Lightweight barrier for SPMD mode (bar.sync) |
| 188 | __kmpc_barrier_simple_generic | void(ident_t*, i32) | State-machine barrier for Generic mode |

The two barrier variants reflect the fundamental mode difference. __kmpc_barrier_simple_spmd compiles to a single bar.sync instruction. __kmpc_barrier_simple_generic involves polling a shared-memory flag because workers are in a state-machine loop that must check for new work after each barrier.

Profiling (191--192) and Sentinel (193)

| Index | Function | Signature | Purpose |
|---|---|---|---|
| 191 | __llvm_profile_register_function | void(i8*) | PGO: register function for profiling |
| 192 | __llvm_profile_register_names_function | void(i8*, i64) | PGO: register name table |
| 193 | __last | void() | Sentinel marking table end |

The two __llvm_profile_* entries support profile-guided optimization instrumentation on GPU. The __last sentinel at index 193 is a void-to-void function that marks the end of the table; it is never called at runtime.

Declaration Construction Protocol

For each runtime function, sub_312CF50 follows an identical protocol:

// Pseudocode for a typical case (e.g., case 0: __kmpc_barrier)
case 0: {
    // 1. Build parameter type array from cached types
    Type *params[] = { ctx->ident_t_ptr, ctx->i32_ty };  // a1+2784, a1+2632

    // 2. Construct FunctionType
    FunctionType *fty = FunctionType::get(
        ctx->void_ty,   // return type (a1+2600)
        params, 2,       // param array + count
        /*isVarArg=*/false
    );

    // 3. Check whether the symbol already exists in the module
    //    (the decompiled code also compares against the caller-supplied a2)
    Value *existing = module->getNamedValue("__kmpc_barrier");
    if (existing)
        return existing;

    // 4. Create new function declaration
    Function *decl = Function::Create(
        fty,
        259,             // linkage = ExternalLinkage (0x103)
        "__kmpc_barrier",
        module
    );

    // 5. Register in context table
    registerRuntimeFunction(a1, /*index=*/0, decl);  // sub_3122A50

    return decl;
}

The linkage value 259 (0x103) decodes as ExternalLinkage with the DLLImport storage class flag set. This is consistent across all 194 entries.

For the two varargs entries (indices 7 and 118), the FunctionType::get call passes isVarArg=true, and after Function::Create, the code calls sub_B994D0 to add attribute #26 and sub_B91C10 to verify it was applied. Attribute #26 likely corresponds to a convergent-or-varargs marker that prevents the optimizer from incorrectly transforming these calls.

Comparison with Upstream LLVM OMPKinds.def

cicc's table maps one-to-one with the __OMP_RTL entries in LLVM 18.x's OMPKinds.def. The ordering is identical: the enum OMPRTL___kmpc_barrier = 0 corresponds to cicc's case 0, and so on through OMPRTL___last = 193 at case 193.

Key differences from upstream:

  1. Procedural vs declarative. Upstream uses X-macros (__OMP_RTL) expanded by OMPIRBuilder::initialize() to lazily create declarations on first use. cicc's sub_312CF50 is a compiled switch statement that eagerly creates declarations when requested by case index.

  2. Type representation. Upstream uses opaque pointer types (PointerType::get(Ctx, 0)) throughout. cicc preserves typed pointers (i8*, i32*, i64*, struct pointers) in its type cache, consistent with LLVM's pre-opaque-pointer era. This is because cicc's internal IR (NVVM IR) still uses typed pointers even though upstream LLVM has migrated to opaque pointers.

  3. Missing entries. cicc lacks __kmpc_push_num_threads_strict (present in latest upstream) and uses __kmpc_parallel_51 where upstream LLVM 18.x defines __kmpc_parallel_60 with a slightly different signature. The _51 name indicates cicc v13.0 targets the OMP 5.1 runtime ABI, not the OMP 6.0 draft.

  4. Attribute handling. Upstream OMPKinds.def includes extensive attribute sets (GetterAttrs, SetterAttrs, etc.) that annotate runtime functions with nounwind, nosync, nofree, willreturn, and memory effect attributes for optimization. cicc applies only attribute #26 to the two varargs functions and otherwise relies on the OpenMPOpt pass to infer attributes.

  5. The __tgt_interop_* entries (indices 132--134) in cicc take a slightly different parameter list than upstream: cicc includes an extra i32 parameter at the end that upstream encodes differently, reflecting a minor ABI divergence in the interop interface.

Configuration Knobs

All LLVM cl::opt knobs related to OpenMP optimization, as found in the cicc binary:

| Knob | Type | Default | Effect |
|---|---|---|---|
| openmp-opt-disable | bool | false | Disable all OpenMP optimizations |
| openmp-opt-enable-merging | bool | false | Enable parallel region merging |
| openmp-opt-disable-internalization | bool | false | Skip function internalization |
| openmp-opt-disable-deglobalization | bool | false | Skip global-to-local promotion |
| openmp-opt-disable-spmdization | bool | false | Skip Generic-to-SPMD transformation |
| openmp-opt-disable-folding | bool | false | Skip ICV folding |
| openmp-opt-disable-state-machine-rewrite | bool | false | Skip state machine optimization |
| openmp-opt-disable-barrier-elimination | bool | false | Skip redundant barrier removal |
| openmp-opt-inline-device | bool | varies | Inline device runtime calls |
| openmp-opt-verbose-remarks | bool | false | Emit detailed optimization remarks |
| openmp-opt-max-iterations | int | varies | Fixed-point iteration limit for analysis |
| openmp-opt-shared-limit | int | varies | Max shared memory for SPMD output promotion |
| openmp-opt-print-module-after | bool | false | Dump module IR after OpenMP optimization |
| openmp-opt-print-module-before | bool | false | Dump module IR before OpenMP optimization |
| openmp-deduce-icv-values | bool | varies | Deduce Internal Control Variable values |
| openmp-print-icv-values | bool | false | Print deduced ICV values |
| openmp-print-gpu-kernels | bool | false | Print identified GPU kernels |
| openmp-hide-memory-transfer-latency | bool | false | Overlap data transfers with computation |

The openmp-opt-shared-limit knob is particularly relevant for the SPMD transformation: it caps the total amount of shared memory allocated for guarded output promotion. If the serial sections between parallel regions produce too many live-out values, the SPMD transformation may be abandoned when the shared memory budget is exceeded.
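The budget check can be sketched as follows. fitsSharedBudget and LiveOut are hypothetical names; the real pass works on LLVM values and their in-memory sizes rather than a vector of byte counts.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One live-out value of a guarded serial section, reduced to its size.
struct LiveOut {
  uint64_t sizeInBytes;
};

// Returns true if every live-out of the serial sections fits within the
// shared-memory budget (openmp-opt-shared-limit). If the running total
// exceeds the budget, the SPMD transformation would be abandoned and the
// kernel kept in Generic mode.
bool fitsSharedBudget(const std::vector<LiveOut> &liveOuts,
                      uint64_t sharedLimitBytes) {
  uint64_t total = 0;
  for (const auto &v : liveOuts) {
    total += v.sizeInBytes;
    if (total > sharedLimitBytes)
      return false; // budget exceeded: stay in Generic mode
  }
  return true;
}
```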

Diagnostic Strings

The OpenMP subsystem emits two diagnostics during SPMD transformation:

| Code | Severity | Message |
|---|---|---|
| OMP120 | Remark | "Transformed generic-mode kernel to SPMD-mode." |
| OMP121 | Warning | "Value has potential side effects preventing SPMD-mode execution. Add [[omp::assume(\"ompx_spmd_amenable\")]] to the called function to override" |

OMP120 is emitted by sub_26968A0 on successful Generic-to-SPMD conversion. OMP121 is emitted for each call instruction that references a function not in the SPMD-amenable set, explaining why the transformation failed and providing the user with the override attribute.
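A minimal model of this per-call check (checkSPMDAmenable and Diag are illustrative names, not symbols recovered from cicc):

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// One emitted diagnostic: the code (e.g. OMP121) and the offending callee.
struct Diag {
  std::string code;
  std::string callee;
};

// Every callee reached from the kernel must be in the SPMD-amenable set;
// each miss produces an OMP121 diagnostic and vetoes the transformation.
bool checkSPMDAmenable(const std::vector<std::string> &callees,
                       const std::set<std::string> &amenable,
                       std::vector<Diag> &diags) {
  bool ok = true;
  for (const auto &f : callees) {
    if (!amenable.count(f)) {
      diags.push_back({"OMP121", f}); // one diagnostic per offending call
      ok = false;
    }
  }
  return ok; // true -> the kernel is converted and OMP120 is emitted
}
```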

Pipeline Integration

The OpenMP passes are registered in the pipeline under three names:

| Pipeline ID | Pass Name | Level | Description |
|---|---|---|---|
| 75 | openmp-opt | Module | Pre-link OpenMP optimization |
| 76 | openmp-opt-postlink | Module | Post-link OpenMP optimization |
| 154 | openmp-opt-cgscc | CGSCC | Call-graph-level OpenMP optimization |

The runtime declaration table (sub_312CF50) is invoked lazily from any of these passes when they need to emit a runtime call. The SPMD transformation is part of the module-level openmp-opt pass.

Execution Mode Call Patterns

The execution mode fundamentally determines which runtime functions appear in generated IR. These pseudocode patterns show the exact call sequences emitted by the state machine generator (sub_2678420) and the SPMD transformation (sub_26968A0).

Generic Mode Kernel (mode byte = 1)

```
entry:
    ret = __kmpc_target_init(KernelEnv, LaunchEnv)    // [155]
    if (ret == -1) goto worker_loop                   // worker threads
    // master thread: user code
    __kmpc_kernel_prepare_parallel(outlined_fn_ptr)   // [157]
    __kmpc_barrier_simple_generic(loc, gtid)          // [188]
    // ... more serial + parallel sections ...
    __kmpc_target_deinit()                            // [156]
worker_loop:
    while (true) {
        __kmpc_barrier_simple_generic(loc, gtid)      // [188]
        if (__kmpc_kernel_parallel(&fn)) {            // [171]
            fn(args);
            __kmpc_kernel_end_parallel();             // [172]
        }
        __kmpc_barrier_simple_generic(loc, gtid)      // [188]
    }
```

SPMD Mode Kernel -- Simple (mode byte = 2, single parallel region)

After successful Generic-to-SPMD transformation:

```
entry:
    __kmpc_target_init(KernelEnv, LaunchEnv)          // [155], returns 0 for all
    tid = __kmpc_get_hardware_thread_id_in_block()    // [6]
    is_main = (tid == 0)
    br is_main, user_code, exit.threads
user_code:
    // all threads: user code
    __kmpc_parallel_51(loc, gtid, ...)                // [158], for nested
    __kmpc_barrier_simple_spmd(loc, gtid)             // [187]
exit.threads:
    __kmpc_target_deinit()                            // [156]
```

SPMD Mode Kernel -- Complex (guarded regions, multiple parallel regions)

```
entry:
    __kmpc_target_init(...)                           // [155]
region.check.tid:
    tid = __kmpc_get_hardware_thread_id_in_block()    // [6]
    cmp = icmp eq tid, 0
    br cmp, region.guarded, region.barrier
region.guarded:
    ... master-only serial code ...
    shared_ptr = __kmpc_alloc_shared(sizeof(result))  // [180]
    store result -> shared_ptr
region.guarded.end:
    br region.barrier
region.barrier:
    __kmpc_barrier_simple_spmd(loc, gtid)             // [187]
    result = load from shared_ptr
    __kmpc_barrier_simple_spmd(loc, gtid)             // [187], post-load
    __kmpc_free_shared(shared_ptr, size)              // [181]
    ... all threads continue with result ...
exit:
    __kmpc_target_deinit()                            // [156]
```

The SPMD transformation eliminates the worker state machine entirely. Workers no longer idle-spin in a polling loop; they participate in computation from the kernel's first instruction. Serial sections between parallel regions are wrapped in tid==0 guards with shared-memory output promotion and barriers.

SPMD-Amenable Function Table

The SPMD transformation maintains a hash set of functions that are safe to call from all threads simultaneously; its base pointer lives at *(omp_context + 208) + 34952 and its capacity at *(omp_context + 208) + 34968.

| Property | Value |
|---|---|
| Hash function | Open-addressing with linear probing |
| Slot computation | ((addr >> 9) ^ (addr >> 4)) & (capacity - 1) |
| Sentinel | -4096 (empty slot marker) |
| Contents | Functions pre-analyzed or annotated with [[omp::assume("ompx_spmd_amenable")]] |

When a call instruction references a function not in this set, the SPMD transformation fails for that kernel and emits OMP121: "Value has potential side effects preventing SPMD-mode execution. Add [[omp::assume(\"ompx_spmd_amenable\")]] to the called function to override".
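The recovered probe logic can be reimplemented directly. AmenableSet is an illustrative wrapper; in cicc the table lives inside the context object at the offsets given above, keyed by function addresses.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Empty-slot marker recovered from the binary.
constexpr intptr_t kEmptySlot = -4096;

// Open-addressing hash set over function addresses, using the recovered
// slot formula ((addr >> 9) ^ (addr >> 4)) & (capacity - 1).
struct AmenableSet {
  std::vector<intptr_t> slots; // capacity must be a power of two

  explicit AmenableSet(size_t capacity) : slots(capacity, kEmptySlot) {}

  static size_t slotFor(intptr_t addr, size_t capacity) {
    return (size_t)((addr >> 9) ^ (addr >> 4)) & (capacity - 1);
  }

  void insert(intptr_t fnAddr) {
    size_t i = slotFor(fnAddr, slots.size());
    while (slots[i] != kEmptySlot && slots[i] != fnAddr)
      i = (i + 1) & (slots.size() - 1); // linear probing
    slots[i] = fnAddr;
  }

  bool contains(intptr_t fnAddr) const {
    size_t i = slotFor(fnAddr, slots.size());
    while (slots[i] != kEmptySlot) {
      if (slots[i] == fnAddr)
        return true;
      i = (i + 1) & (slots.size() - 1);
    }
    return false; // probe chain ended at an empty slot
  }
};
```

Shifting by 9 and 4 mixes the high and low bits of the (typically 16-byte-aligned) function address before masking, which keeps low-entropy aligned pointers from clustering in adjacent slots.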

Functional Category Summary

| Category | Count | Indices |
|---|---|---|
| Thread hierarchy and hardware query | 20 | 0--6, 14--16, 17--45 |
| Work sharing / loop scheduling | 48 | 61--95, 159--170 |
| Tasking | 19 | 98--116, 154 |
| Synchronization | 12 | 0, 2, 4, 50--52, 59--60, 96--97, 187--188, 190 |
| Target offloading / data mapping | 18 | 137--153 |
| GPU execution mode | 10 | 155--158, 171--174, 185--186 |
| Warp primitives | 4 | 175, 179, 189--190 |
| NVIDIA device reduction | 3 | 176--178 |
| Shared memory management | 5 | 180--184 |
| Memory allocators | 8 | 129--136 |
| Copyprivate / threadprivate | 3 | 122--124 |
| Doacross synchronization | 4 | 125--128 |
| Teams / cancellation | 5 | 117--121 |
| Master / masked | 4 | 46--49 |
| Reduction (standard) | 4 | 55--58 |
| Begin / end | 2 | 53--54 |
| Profiling | 2 | 191--192 |
| Sentinel | 1 | 193 |
| Total | 194 | |

Function Map

| Function | Address | Size | Role |
|---|---|---|---|
| sub_312CF50 | 0x312CF50 | -- | OpenMP runtime declaration factory (194-case switch) |
| sub_3122A50 | 0x3122A50 | -- | registerRuntimeFunction(context, index, funcDecl) |
| sub_2686D90 | 0x2686D90 | 215 KB | OpenMP runtime declaration table (outer wrapper) |
| sub_26968A0 | 0x26968A0 | 61 KB | Generic-to-SPMD transformation |
| sub_2680940 | 0x2680940 | 52 KB | Parallel region merging |
| sub_2678420 | 0x2678420 | 41 KB | State machine generation for Generic mode |
| sub_269F530 | 0x269F530 | 63 KB | Attributor-based OpenMP optimization driver |
| sub_313D1B0 | 0x313D1B0 | 47 KB | Parallel region outliner |
| sub_BCF480 | 0xBCF480 | -- | FunctionType::get(retTy, paramTys, count, isVarArg) |
| sub_BA8CB0 | 0xBA8CB0 | -- | Module::getNamedValue(name) |
| sub_B2C660 | 0xB2C660 | -- | Function::Create(funcTy, linkage, name, module) |
| sub_B994D0 | 0xB994D0 | -- | addAttribute(26, value): set function attribute |
| sub_B91C10 | 0xB91C10 | -- | hasAttribute(26): check function attribute |
| sub_B9C770 | 0xB9C770 | -- | Attribute construction (varargs attribute) |
| sub_B8C960 | 0xB8C960 | -- | Attribute kind construction |
| sub_B2BE50 | 0xB2BE50 | -- | Function::getContext() |
| sub_921880 | 0x921880 | -- | Create runtime library call instruction |
| sub_5FB5C0 | 0x5FB5C0 | -- | OpenMP variant processing (%s$$OMP_VARIANT%06d) |

OpenMP Variant Processing

cicc also supports OpenMP variant dispatch during EDG front-end processing. The function sub_5FB5C0 at 0x5FB5C0 handles mangled names of the form %s$$OMP_VARIANT%06d, which the front-end generates for #pragma omp declare variant constructs. This machinery is separate from the runtime declaration table and operates on the source-level AST rather than on LLVM IR.
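The mangling itself is a straightforward printf-style format. This sketch shows the format string in action; mangleVariant is a hypothetical helper name, not a symbol from cicc.

```cpp
#include <cstdio>
#include <string>

// Build a variant-dispatch mangled name: base name, the literal "$$OMP_VARIANT"
// separator, then a zero-padded six-digit variant index.
std::string mangleVariant(const std::string &baseName, int variantId) {
  char buf[256];
  std::snprintf(buf, sizeof(buf), "%s$$OMP_VARIANT%06d",
                baseName.c_str(), variantId);
  return buf;
}
```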

Cross-References

  • Generic-to-SPMD Transformation -- the primary consumer of the runtime table, performing mode conversion using entries 6, 155, 156, 180, 181, 187, 188
  • Pipeline & Ordering -- where openmp-opt (ID 75), openmp-opt-postlink (ID 76), and openmp-opt-cgscc (ID 154) sit in the pass pipeline
  • CLI Flags -- compiler flags that control OpenMP code generation
  • LLVM Knobs -- the openmp-opt-* knobs listed above
  • Kernel Metadata -- how KernelEnvironmentTy and execution mode are set during IR generation
  • Hash Infrastructure -- the open-addressing hash table pattern used by the SPMD-amenable function set
  • GPU Execution Model -- broader context on SPMD vs Generic execution