
Concurrent Compilation

CICC implements a two-phase concurrent compilation model that is entirely absent from upstream LLVM. The optimizer runs twice over the same module: Phase I performs whole-module analysis and early IR optimizations on a single thread, then Phase II runs per-function backend optimization in parallel across a thread pool. The design exploits the fact that most backend passes (instruction selection prep, register pressure reduction, peephole) are function-local and do not require cross-function information once Phase I has completed interprocedural analysis.

The two-phase protocol lives in sub_12E7E70 (9,405 bytes), which calls the same master pipeline function sub_12E54A0 twice, discriminated only by a TLS phase counter. The concurrency infrastructure spans the 0x12D4000--0x12EA000 address range and includes a GNU Make jobserver integration for build-system-aware parallelism throttling -- a feature that allows make -j8 to correctly limit total system load even when each cicc invocation itself wants to spawn threads.

| Role | Symbol |
|---|---|
| Phase I/II orchestrator | sub_12E7E70 (9,405 bytes) |
| Phase counter (TLS) | qword_4FBB3B0 -- values 1, 2, 3 |
| Concurrency eligibility | sub_12D4250 (626 bytes) |
| Function sorting | sub_12E0CA0 (23,422 bytes) |
| Concurrent entry | sub_12E1EF0 (51,325 bytes) |
| Worker entry | sub_12E7B90 (2,997 bytes) |
| Per-function callback | sub_12E8D50 |
| Per-function optimizer | sub_12E86C0 (7,687 bytes) |
| GNU jobserver init | sub_16832F0 |
| MAKEFLAGS parser | sub_1682BF0 |
| Thread pool create | sub_16D4AB0 |
| Thread pool enqueue | sub_16D5230 |
| Thread pool join | sub_16D4EC0 |
| Disable env var | LIBNVVM_DISABLE_CONCURRENT_API -- byte_4F92D70 |
| Pipeline function | sub_12E54A0 (49,800 bytes) -- called by both phases |

Two-Phase Architecture

Both phases call the same optimization pipeline function sub_12E54A0(context, input, output, opts, errCb). The only difference is the value stored in the TLS variable qword_4FBB3B0 before each call. Individual optimization passes read this TLS variable to decide whether to run: Phase I passes fire when the counter equals 1; Phase II passes fire when it equals 2. This avoids running codegen-oriented passes during analysis and vice versa.

Phase Counter Protocol

The phase counter qword_4FBB3B0 is a TLS variable accessed via sub_16D40E0 (set) and sub_16D40F0 (get). It stores a pointer to a heap-allocated 4-byte integer. Three values are defined:

| Value | Meaning | Set point |
|---|---|---|
| 1 | Phase I active -- analysis + early IR optimization | Before first sub_12E54A0 call |
| 2 | Phase II active -- backend optimization + codegen prep | Before second sub_12E54A0 call |
| 3 | Compilation complete for this module | After second sub_12E54A0 returns |

Sequential Path (sub_12E7E70)

When verbose logging is disabled and the module contains only one defined function, the orchestrator takes a fast path:

// Single-function fast path: no phase counter set at all
if (!verbose && num_defined_functions <= 1) {
    sub_12E54A0(ctx, input, output, opts, errCb);  // single un-phased call
    return;
}

This means the optimizer runs both phases in a single invocation -- passes see no phase counter and run unconditionally. For multi-function modules or when verbose logging is active, the full two-phase protocol engages:

// Phase I
int *phase = malloc(4);
*phase = 1;
tls_set(qword_4FBB3B0, phase);
sub_12E54A0(ctx, input, output, opts, errCb);

if (error_reported(errCb))
    return;  // abort on Phase I error

// Concurrency decision
bool concurrent = sub_12D4250(ctx, opts);
// Diagnostic: "Concurrent=Yes" or "Concurrent=No"

// Phase II
*phase = 2;
tls_set(qword_4FBB3B0, phase);
sub_12E54A0(ctx, input, output, opts, errCb);

// Done
*phase = 3;
tls_set(qword_4FBB3B0, phase);

The diagnostic string construction between phases is notable: v46 = 3LL - (v41 == 0) computes the annotation length -- 3 for "Yes" when v41 is nonzero, 3 - 1 = 2 for "No" when v41 is zero -- then logs "Phase II" with the "Concurrent=Yes/No" annotation appended.

Concurrent Path (sub_12E7B90)

When the thread count exceeds 1, the orchestrator dispatches to sub_12E7B90 instead of running Phase II sequentially:

sub_12E7B90(ctx, module_ptr, thread_count, opts, ...)
    |
    |-- Phase I: *phase=1, sub_12E54A0(...)        // whole-module, single thread
    |-- sub_12D4250(ctx, opts)                     // eligibility check
    |
    +-- if eligible (>1 defined function):
    |     sub_12E1EF0(...)                         // concurrent Phase II
    |     *phase = 3
    |
    +-- else (single defined function):
          *phase = 2, sub_12E54A0(...)             // sequential Phase II
          *phase = 3

Phase I always runs single-threaded on the whole module because interprocedural analyses (alias analysis, call graph construction, inlining decisions) require a consistent global view. Only after Phase I completes does the system split the module into per-function chunks for parallel Phase II processing.

Eligibility Check

sub_12D4250 (626 bytes) determines whether the module qualifies for concurrent compilation. The check is straightforward:

int sub_12D4250(Module *mod, Options *opts) {
    int defined_count = 0;
    for (Function &F : mod->functions()) {
        if (!sub_15E4F60(&F))    // !isDeclaration()
            defined_count++;
    }
    if (defined_count <= 1)
        return 0;                // not eligible: only 0 or 1 defined function

    byte force = *(byte*)(opts + 4064);   // NVVMPassOptions slot 201 (0xC9)
    if (force != 0)
        return force;            // user-forced concurrency setting

    return sub_12D3FC0(mod, opts);  // auto-determine thread count
}

The key gate is defined_count > 1. A module with a single kernel and no device functions will always compile sequentially regardless of thread count settings. The opts + 4064 byte (NVVMPassOptions slot 201, type BOOL_COMPACT, default 0) allows the user to force concurrent mode on or off. When zero (default), sub_12D3FC0 auto-determines the thread count based on module characteristics.

Function Priority Sorting

Before distributing functions to worker threads, sub_12E0CA0 (23,422 bytes) sorts them by compilation priority. This step is critical for load balancing: larger or more complex functions should start compiling first so they don't become tail stragglers.

Sorting Algorithm

The sort uses a hybrid strategy consistent with libstdc++ std::sort:

| Input size | Algorithm | Function |
|---|---|---|
| Small N | Insertion sort | sub_12D48A0 |
| Large N | Introsort (quicksort + heapsort fallback) | sub_12D57D0 |

The threshold between insertion sort and introsort is 256 bytes of element data (consistent with the libstdc++ template instantiation pattern observed elsewhere in the binary).

Priority Source

Priority values come from function attributes extracted by sub_12D3D20 (585 bytes). The sorted output is a vector of (name_ptr, name_len, priority) tuples with 32-byte stride, used directly by the per-function dispatch loop to determine compilation order. Functions with higher priority (likely larger or more critical kernels) are submitted to the thread pool first.

Enumeration Phase

Before sorting, sub_12E0CA0 enumerates all functions and globals via an iterator callback table:

| Callback | Address | Purpose |
|---|---|---|
| Next function | sub_12D3C60 | Advance to next function in module |
| Iterator advance | sub_12D3C80 | Step iterator forward |
| End check | sub_12D3CA0 | Test if iterator reached end |

For each function, the enumeration:

  1. Checks the node type discriminator at *(byte*)(node + 16) -- type 0 = Function, type 1 = GlobalVariable
  2. For functions: calls sub_15E4F60 (isDeclaration check), sub_12D3D20 (priority), sub_1649960 (name), inserts into v359 hash table (name to function) and v362 hash table (name to linkage type)
  3. For global variables: walks the parent/linked GlobalValue chain via sub_164A820, inserts callee references into v365 hash table for split-module tracking

GNU Jobserver Integration

When cicc is invoked by GNU Make with -j, it can participate in the make jobserver protocol to avoid oversubscribing the system. The jobserver flag is passed from nvcc via the -jobserver CLI flag, which sets opts + 3288 (NVVMPassOptions slot 163, type BOOL_COMPACT, default 0).

Initialization (sub_16832F0)

The jobserver init function allocates a 296-byte state structure and calls sub_1682BF0 to parse the MAKEFLAGS environment variable:

int sub_16832F0(JobserverState *state, int reserved) {
    memset(state, 0, 296);
    state->flags[8] = 1;                    // initialized marker

    int err = sub_1682BF0(state);            // parse MAKEFLAGS
    if (err) return err;

    pipe(state->local_pipe);                 // local token pipe
    // state+196 = read FD, state+200 = write FD

    pthread_create(&state->thread,           // state+208
                   NULL, token_manager, state);

    reserve_vector(state, token_count);
    return 0;  // success
}

MAKEFLAGS Parsing (sub_1682BF0)

The parser searches the MAKEFLAGS environment variable for --jobserver-auth= and supports two formats:

| Format | Example | Mechanism |
|---|---|---|
| Pipe FDs | --jobserver-auth=3,4 | Read FD = 3, Write FD = 4 (classic POSIX pipe) |
| FIFO path | --jobserver-auth=fifo:/tmp/gmake-jobserver-12345 | Named FIFO (GNU Make 4.4+) |

The pipe format uses comma-separated read/write file descriptors inherited from the parent make process. The FIFO format uses a named pipe in the filesystem. In both cases, the jobserver protocol works the same way: a thread reads tokens from the pipe/FIFO before starting each per-function compilation, and writes tokens back when the function completes. This ensures cicc never runs more concurrent compilations than make's -j level permits.

Error Handling

if (jobserver_init_error) {
    if (error_code == 5 || error_code == 6) {
        // Warning: jobserver pipe not accessible (probably not in make context)
        emit_warning(severity=1);
        // Fall through: continue without jobserver
    } else {
        // Fatal: "GNU Jobserver support requested, but an error occurred"
        sub_16BD130("GNU Jobserver support requested, but an error occurred", 1);
    }
}

Error codes 5 and 6 are non-fatal (the jobserver pipe may not be available if cicc is invoked outside a make context). All other errors are fatal.

Thread Pool Management

Creation (sub_16D4AB0)

The thread pool is LLVM's standard ThreadPool (the binary contains "llvm-worker-{0}" thread naming at sub_23CE0C0). Creation occurs at line 799 of sub_12E1EF0:

int actual_threads = min(requested_threads, num_functions);
sub_16D4AB0(thread_pool, actual_threads);

The thread count is clamped to the number of functions -- there is no point spawning more threads than there are work items.

Thread Count Resolution

Thread count is resolved through a fallback chain in sub_12E7E70:

int thread_count = opts[1026];    // NVVMPassOptions slot 203 (offset 4104), default -1
if (thread_count < 0)
    thread_count = opts[1036];    // NVVMPassOptions slot 205 (offset 4144), default -1
if (thread_count == 0)
    thread_count = sub_22420F0(); // get_nprocs() -- number of online CPUs

| Source | Slot | Offset | Default | Meaning |
|---|---|---|---|---|
| Primary | 203 (0xCB) | 4104 | -1 (auto) | Explicit thread count |
| Fallback | 205 (0xCD) | 4144 | -1 (auto) | Secondary thread count |
| System | -- | -- | -- | get_nprocs() return value |

The -1 sentinel means "auto-detect." When both slots are -1, the system falls back to the number of online CPUs. In practice, this means a 16-core machine will spawn 16 threads for concurrent Phase II compilation unless the user or calling tool overrides the count.

Per-Function Context Packaging

Before submitting each function to the thread pool, sub_12E1EF0 builds a self-contained compilation context (4,632 bytes) containing everything the worker needs:

  1. Bitcode extraction: sub_1AB9F40(&buffer, *module_bc, &iterators, sub_12D4BD0, &filter_state) -- extracts the bitcode for a single function using sub_12D4BD0 as a filter callback that checks the function name against the work hash table
  2. Bitcode parsing: sub_153BF40(module_data, &context, 0, 0, 0, 0) -- parses the extracted bitcode into an LLVM Module
  3. Context copy: Copies all iterator state (v391--v395), context/options/error callbacks (v405--v409), module name string, the full 4,480-byte NVVMPassOptions struct, and module bitcode vectors

Each function gets its own independent copy of the options struct and module -- there is no shared mutable state between worker threads during Phase II.

Worker Entry and Completion

Work Submission

Each function is submitted to the thread pool with two callbacks:

v373 = sub_12E8D50;    // completion callback (runs the optimizer)
v372 = sub_12D4D90;    // destructor callback (cleanup)
sub_16D5230(work_item, thread_pool, context);  // enqueue

Per-Function Callback (sub_12E8D50)

The completion callback extracts the module from the packaged context and calls the Phase II per-function optimizer:

void sub_12E8D50(Context *ctx) {
    Module *mod = extract_module(ctx);
    sub_12E86C0(ctx, function_index, opts, module_name);
}

Per-Function Phase II Optimizer (sub_12E86C0, 7,687 bytes)

This function sets the TLS phase counter to 2 and runs the pass pipeline on the individual function's module:

void sub_12E86C0(Context *ctx, int func_idx, Options *opts, StringRef name) {
    int *phase = malloc(4);
    *phase = 2;
    tls_set(qword_4FBB3B0, phase);
    // Run Phase II pass pipeline on this function's module
    sub_12E54A0(ctx, ...);
}

Because qword_4FBB3B0 is TLS, each worker thread has its own phase counter. All worker threads see phase=2 concurrently without interference.

Post-Compilation Merge

After all worker threads complete (sub_16D4EC0 joins the thread pool):

  1. Jobserver cleanup: sub_1682740 checks for jobserver errors and releases tokens
  2. Error check: If any per-function callback reported an error, the compilation fails
  3. Normal mode (opt_level >= 0): Appends a null byte to the output buffer (bitcode stream terminator)
  4. Split-compile mode (opt_level < 0): Re-reads each function's bitcode via sub_153BF40, links all per-function modules via sub_12F5610 (the LLVM module linker), and restores linkage attributes from the v362 hash table. Specifically:
    • Linkage values 7--8: set only low 6 bits (external linkage types)
    • Other values: set low 4 bits, then check (value & 0x30) != 0 for visibility bits
    • Sets byte+33 |= 0x40 (dso_local flag)

Configuration

Environment Variables

| Variable | Check | Effect |
|---|---|---|
| LIBNVVM_DISABLE_CONCURRENT_API | getenv() != NULL | Sets byte_4F92D70 = 1, disabling concurrent/thread-safe LibNVVM API usage entirely; any non-NULL value triggers it. Checked in global constructor ctor_104 at 0x4A5810. |
| MAKEFLAGS | Parsed by sub_1682BF0 | Searched for --jobserver-auth= to enable GNU Make jobserver integration |

NVVMPassOptions Slots

| Slot | Offset | Type | Default | Purpose |
|---|---|---|---|---|
| 163 (0xA3) | 3288 | BOOL_COMPACT | 0 | Jobserver integration requested (set by -jobserver flag) |
| 201 (0xC9) | 4064 | BOOL_COMPACT | 0 | Force concurrency on/off (0 = auto) |
| 203 (0xCB) | 4104 | INTEGER | -1 | Primary thread count (-1 = auto) |
| 205 (0xCD) | 4144 | INTEGER | -1 | Fallback thread count (-1 = auto) |

CLI Flags

| Flag | Route | Effect |
|---|---|---|
| -jobserver | opt "-jobserver" | Enables GNU jobserver integration (sets slot 163) |
| -split-compile=<N> | opt "-split-compile=<N>" | Enables split-module compilation (opt_level set to -1) |
| -split-compile-extended=<N> | opt "-split-compile-extended=<N>" | Extended split-compile (also sets +1644 = 1) |
| --sw2837879 | Internal | Concurrent ptxStaticLib workaround flag |

Phase State Machine

  START
    |
    v
  [phase=1] --> sub_12E54A0 (Phase I: whole-module analysis)
    |
    v
  error? --yes--> RETURN (abort)
    |no
    v
  count_defined_functions()
    |
    +--(1 func)--> [phase=2] --> sub_12E54A0 (Phase II sequential)
    |                                |
    |                                v
    |                            [phase=3] --> DONE
    |
    +--(N funcs, threads>1)--> sub_12E1EF0 (concurrent)
    |                             |
    |                             +-- sort functions by priority
    |                             +-- create thread pool
    |                             +-- init jobserver (if requested)
    |                             +-- for each function:
    |                             |     extract per-function bitcode
    |                             |     parse into independent Module
    |                             |     [phase=2] per-function (TLS)
    |                             |     submit to thread pool
    |                             +-- join all threads
    |                             +-- link split modules (if split-compile)
    |                             +-- [phase=3] --> DONE
    |
    +--(N funcs, threads<=1)--> [phase=2] --> sub_12E54A0 (sequential)
                                    |
                                    v
                                [phase=3] --> DONE

Differences from Upstream LLVM

Upstream LLVM has no two-phase compilation model. The standard LLVM pipeline runs all passes in a single invocation with no phase discrimination. CICC's approach is entirely custom:

  1. Phase counter TLS variable: Upstream LLVM passes have no concept of reading a global phase counter to decide whether to run. Every pass in CICC must check qword_4FBB3B0 and early-return if it belongs to the wrong phase.

  2. Per-function module splitting: Upstream LLVM's splitModule() (in llvm/Transforms/Utils/SplitModule.h) exists for ThinLTO and GPU offloading, but CICC's splitting at sub_1AB9F40 with the sub_12D4BD0 filter callback is a custom implementation integrated with the NVVMPassOptions system.

  3. GNU jobserver integration: No upstream LLVM tool participates in the GNU Make jobserver protocol. This is entirely NVIDIA-specific, implemented to play nicely with make -j in CUDA build systems.

  4. Function priority sorting: Upstream LLVM processes functions in module iteration order. CICC's priority-based sorting via sub_12E0CA0 ensures that expensive functions start compiling first, reducing tail latency in the thread pool.

Function Map

| Function | Address | Size |
|---|---|---|
| Function iterator: next | sub_12D3C60 | ~200 |
| Function iterator: advance | sub_12D3C80 | ~230 |
| Function iterator: end check | sub_12D3CA0 | ~260 |
| Function attribute/priority query | sub_12D3D20 | 585 |
| Auto thread count determination | sub_12D3FC0 | 3,600 |
| Concurrency eligibility check | sub_12D4250 | 626 |
| Insertion sort (small N) | sub_12D48A0 | -- |
| Per-function bitcode filter callback | sub_12D4BD0 | 2,384 |
| Work item destructor callback | sub_12D4D90 | 2,742 |
| Introsort (large N) | sub_12D57D0 | -- |
| Function sorting and enumeration | sub_12E0CA0 | 23,422 |
| Concurrent compilation top-level entry | sub_12E1EF0 | 51,325 |
| Master pipeline assembly (both phases) | sub_12E54A0 | 49,800 |
| Concurrent worker entry | sub_12E7B90 | 2,997 |
| Phase I/II orchestrator | sub_12E7E70 | 9,405 |
| Per-function Phase II optimizer | sub_12E86C0 | 7,687 |
| Per-function completion callback | sub_12E8D50 | -- |
| LLVM module linker (post-merge) | sub_12F5610 | 7,339 |
| Bitcode reader/verifier | sub_153BF40 | -- |
| isDeclaration() check | sub_15E4F60 | -- |
| Get function name | sub_1649960 | -- |
| Walk to parent GlobalValue | sub_164A820 | -- |
| Jobserver error check/cleanup | sub_1682740 | -- |
| MAKEFLAGS --jobserver-auth= parser | sub_1682BF0 | -- |
| GNU jobserver init (296-byte state) | sub_16832F0 | -- |
| TLS set (qword_4FBB3B0) | sub_16D40E0 | -- |
| TLS get (qword_4FBB3B0) | sub_16D40F0 | -- |
| Thread pool create | sub_16D4AB0 | -- |
| Thread pool join | sub_16D4EC0 | -- |
| Thread pool enqueue work item | sub_16D5230 | -- |
| Per-function bitcode extraction | sub_1AB9F40 | -- |
| get_nprocs() wrapper | sub_22420F0 | -- |

Cross-References

  • Entry Point & CLI -- pipeline dispatch that leads to the optimizer, including -jobserver flag routing
  • Optimizer Pipeline -- sub_12E54A0, the pipeline function called by both phases
  • NVVMPassOptions -- the 222-slot options table including thread count and jobserver slots
  • Environment Variables -- LIBNVVM_DISABLE_CONCURRENT_API and MAKEFLAGS
  • CLI Flags -- -jobserver, -split-compile, -split-compile-extended
  • Bitcode I/O -- sub_153BF40 bitcode reader used for per-function module extraction