
Concurrent Compilation

CICC implements a two-phase concurrent compilation model that is entirely absent from upstream LLVM. The optimizer runs twice over the same module: Phase I performs whole-module analysis and early IR optimizations on a single thread, then Phase II runs per-function backend optimization in parallel across a thread pool. The design exploits the fact that most backend passes (instruction selection prep, register pressure reduction, peephole) are function-local and do not require cross-function information once Phase I has completed interprocedural analysis.

The two-phase protocol lives in sub_12E7E70 (9,405 bytes), which calls the same master pipeline function sub_12E54A0 twice, discriminated only by a TLS phase counter. The concurrency infrastructure spans the 0x12D4000--0x12EA000 address range and includes a GNU Make jobserver integration for build-system-aware parallelism throttling -- a feature that allows make -j8 to correctly limit total system load even when each cicc invocation itself wants to spawn threads.

| Role | Symbol |
|---|---|
| Phase I/II orchestrator | sub_12E7E70 (9,405 bytes) |
| Phase counter (TLS) | qword_4FBB3B0 -- values 1, 2, 3 |
| Concurrency eligibility | sub_12D4250 (626 bytes) |
| Function sorting | sub_12E0CA0 (23,422 bytes) |
| Concurrent entry | sub_12E1EF0 (51,325 bytes) |
| Worker entry | sub_12E7B90 (2,997 bytes) |
| Per-function callback | sub_12E8D50 |
| Per-function optimizer | sub_12E86C0 (7,687 bytes) |
| GNU jobserver init | sub_16832F0 |
| MAKEFLAGS parser | sub_1682BF0 |
| Thread pool create | sub_16D4AB0 |
| Thread pool enqueue | sub_16D5230 |
| Thread pool join | sub_16D4EC0 |
| Disable env var | LIBNVVM_DISABLE_CONCURRENT_API -- byte_4F92D70 |
| Pipeline function | sub_12E54A0 (49,800 bytes) -- called by both phases |

Two-Phase Architecture

Both phases call the same optimization pipeline function sub_12E54A0(context, input, output, opts, errCb). The only difference is the value stored in the TLS variable qword_4FBB3B0 before each call. Individual optimization passes read this TLS variable to decide whether to run: Phase I passes fire when the counter equals 1; Phase II passes fire when it equals 2. This avoids running codegen-oriented passes during analysis and vice versa.

Phase Counter Protocol

The phase counter qword_4FBB3B0 is a TLS variable accessed via sub_16D40E0 (set) and sub_16D40F0 (get). It stores a pointer to a heap-allocated 4-byte integer. Three values are defined:

| Value | Meaning | Set point |
|---|---|---|
| 1 | Phase I active -- analysis + early IR optimization | Before first sub_12E54A0 call |
| 2 | Phase II active -- backend optimization + codegen prep | Before second sub_12E54A0 call |
| 3 | Compilation complete for this module | After second sub_12E54A0 returns |

Sequential Path (sub_12E7E70)

When verbose logging is disabled and the module contains only one defined function, the orchestrator takes a fast path:

// Single-function fast path: no phase counter set at all
if (!verbose && num_defined_functions <= 1) {
    sub_12E54A0(ctx, input, output, opts, errCb);  // single un-phased call
    return;
}

This means the optimizer runs both phases in a single invocation -- passes see no phase counter and run unconditionally. For multi-function modules or when verbose logging is active, the full two-phase protocol engages:

// Phase I
int *phase = malloc(4);
*phase = 1;
tls_set(qword_4FBB3B0, phase);
sub_12E54A0(ctx, input, output, opts, errCb);

if (error_reported(errCb))
    return;  // abort on Phase I error

// Concurrency decision
bool concurrent = sub_12D4250(ctx, opts);
// Diagnostic: "Concurrent=Yes" or "Concurrent=No"

// Phase II
*phase = 2;
tls_set(qword_4FBB3B0, phase);
sub_12E54A0(ctx, input, output, opts, errCb);

// Done
*phase = 3;
tls_set(qword_4FBB3B0, phase);

The diagnostic string construction between phases is notable: v46 = 3LL - (v41 == 0) computes the annotation length -- 3 for "Yes" when v41 is nonzero, 3 - 1 = 2 for "No" when v41 is zero -- then logs "Phase II" with the "Concurrent=Yes/No" annotation appended.

Concurrent Path (sub_12E7B90)

When the thread count exceeds 1, the orchestrator dispatches to sub_12E7B90 instead of running Phase II sequentially:

sub_12E7B90(ctx, module_ptr, thread_count, opts, ...)
    |
    |-- Phase I: *phase=1, sub_12E54A0(...)        // whole-module, single thread
    |-- sub_12D4250(ctx, opts)                     // eligibility check
    |
    +-- if eligible (>1 defined function):
    |     sub_12E1EF0(...)                         // concurrent Phase II
    |     *phase = 3
    |
    +-- else (single defined function):
          *phase = 2, sub_12E54A0(...)             // sequential Phase II
          *phase = 3

Phase I always runs single-threaded on the whole module because interprocedural analyses (alias analysis, call graph construction, inlining decisions) require a consistent global view. Only after Phase I completes does the system split the module into per-function chunks for parallel Phase II processing.

Eligibility Check

sub_12D4250 (626 bytes) determines whether the module qualifies for concurrent compilation. The check is straightforward:

int sub_12D4250(Module *mod, Options *opts) {
    int defined_count = 0;
    for (Function &F : mod->functions()) {
        if (!sub_15E4F60(&F))    // !isDeclaration()
            defined_count++;
    }
    if (defined_count <= 1)
        return 0;                // not eligible: only 0 or 1 defined function

    byte force = *(byte*)(opts + 4064);   // NVVMPassOptions slot 201 (0xC9)
    if (force != 0)
        return force;            // user-forced concurrency setting

    return sub_12D3FC0(mod, opts);  // auto-determine thread count
}

The key gate is defined_count > 1. A module with a single kernel and no device functions will always compile sequentially regardless of thread count settings. The opts + 4064 byte (NVVMPassOptions slot 201, type BOOL_COMPACT, default 0) allows the user to force concurrent mode on or off. When zero (default), sub_12D3FC0 auto-determines the thread count based on module characteristics.

Function Priority Sorting

Before distributing functions to worker threads, sub_12E0CA0 (23,422 bytes) sorts them by compilation priority. This step is critical for load balancing: larger or more complex functions should start compiling first so they don't become tail stragglers.

Sorting Algorithm

The sort uses a hybrid strategy consistent with libstdc++ std::sort:

| Input size | Algorithm | Function |
|---|---|---|
| Small N | Insertion sort | sub_12D48A0 |
| Large N | Introsort (quicksort + heapsort fallback) | sub_12D57D0 |

The threshold between insertion sort and introsort is 256 bytes of element data (consistent with the libstdc++ template instantiation pattern observed elsewhere in the binary).

Priority Source

Priority values come from function attributes extracted by sub_12D3D20 (585 bytes). The sorted output is a vector of (name_ptr, name_len, priority) tuples with 32-byte stride, used directly by the per-function dispatch loop to determine compilation order. Functions with higher priority (likely larger or more critical kernels) are submitted to the thread pool first.

Enumeration Phase

Before sorting, sub_12E0CA0 enumerates all functions and globals via an iterator callback table:

| Callback | Address | Purpose |
|---|---|---|
| Next function | sub_12D3C60 | Advance to next function in module |
| Iterator advance | sub_12D3C80 | Step iterator forward |
| End check | sub_12D3CA0 | Test if iterator reached end |

For each function, the enumeration:

  1. Checks the node type discriminator at *(byte*)(node + 16) -- type 0 = Function, type 1 = GlobalVariable
  2. For functions: calls sub_15E4F60 (isDeclaration check), sub_12D3D20 (priority), sub_1649960 (name), inserts into v359 hash table (name to function) and v362 hash table (name to linkage type)
  3. For global variables: walks the parent/linked GlobalValue chain via sub_164A820, inserts callee references into v365 hash table for split-module tracking

GNU Jobserver Integration

When cicc is invoked by GNU Make with -j, it can participate in the make jobserver protocol to avoid oversubscribing the system. The jobserver flag is passed from nvcc via the -jobserver CLI flag, which sets opts + 3288 (NVVMPassOptions slot 163, type BOOL_COMPACT, default 0).

Initialization (sub_16832F0)

The jobserver init function allocates a 296-byte state structure and calls sub_1682BF0 to parse the MAKEFLAGS environment variable:

int sub_16832F0(JobserverState *state, int reserved) {
    memset(state, 0, 296);
    state->flags[8] = 1;                    // initialized marker

    int err = sub_1682BF0(state);            // parse MAKEFLAGS
    if (err) return err;

    pipe(state->local_pipe);                 // local token pipe
    // state+196 = read FD, state+200 = write FD

    pthread_create(&state->thread,           // state+208
                   NULL, token_manager, state);

    reserve_vector(state, token_count);
    return 0;  // success
}

MAKEFLAGS Parsing (sub_1682BF0)

The parser searches the MAKEFLAGS environment variable for --jobserver-auth= and supports two formats:

| Format | Example | Mechanism |
|---|---|---|
| Pipe FDs | --jobserver-auth=3,4 | Read FD = 3, Write FD = 4 (classic POSIX pipe) |
| FIFO path | --jobserver-auth=fifo:/tmp/gmake-jobserver-12345 | Named FIFO (GNU Make 4.4+) |

The pipe format uses comma-separated read/write file descriptors inherited from the parent make process. The FIFO format uses a named pipe in the filesystem. In both cases, the jobserver protocol works the same way: a thread reads tokens from the pipe/FIFO before starting each per-function compilation, and writes tokens back when the function completes. This ensures cicc never runs more concurrent compilations than make's -j level permits.

Error Handling

if (jobserver_init_error) {
    if (error_code == 5 || error_code == 6) {
        // Warning: jobserver pipe not accessible (probably not in make context)
        emit_warning(severity=1);
        // Fall through: continue without jobserver
    } else {
        // Fatal: "GNU Jobserver support requested, but an error occurred"
        sub_16BD130("GNU Jobserver support requested, but an error occurred", 1);
    }
}

Error codes 5 and 6 are non-fatal (the jobserver pipe may not be available if cicc is invoked outside a make context). All other errors are fatal.

Thread Pool Management

Creation (sub_16D4AB0)

The thread pool is LLVM's standard ThreadPool (the binary contains "llvm-worker-{0}" thread naming at sub_23CE0C0). Creation occurs at line 799 of sub_12E1EF0:

int actual_threads = min(requested_threads, num_functions);
sub_16D4AB0(thread_pool, actual_threads);

The thread count is clamped to the number of functions -- there is no point spawning more threads than there are work items.

Thread Count Resolution

Thread count is resolved through a fallback chain in sub_12E7E70:

int thread_count = opts[1026];    // NVVMPassOptions slot 203 (offset 4104), default -1
if (thread_count < 0)
    thread_count = opts[1036];    // NVVMPassOptions slot 205 (offset 4144), default -1
if (thread_count == 0)
    thread_count = sub_22420F0(); // get_nprocs() -- number of online CPUs

| Source | Slot | Offset | Default | Meaning |
|---|---|---|---|---|
| Primary | 203 (0xCB) | 4104 | -1 (auto) | Explicit thread count |
| Fallback | 205 (0xCD) | 4144 | -1 (auto) | Secondary thread count |
| System | -- | -- | -- | get_nprocs() return value |

The -1 sentinel means "auto-detect." When both slots are -1, the system falls back to the number of online CPUs. In practice, this means a 16-core machine will spawn 16 threads for concurrent Phase II compilation unless the user or calling tool overrides the count.

Per-Function Context Packaging

Before submitting each function to the thread pool, sub_12E1EF0 builds a self-contained compilation context (4,632 bytes) containing everything the worker needs:

  1. Bitcode extraction: sub_1AB9F40(&buffer, *module_bc, &iterators, sub_12D4BD0, &filter_state) -- extracts the bitcode for a single function using sub_12D4BD0 as a filter callback that checks the function name against the work hash table
  2. Bitcode parsing: sub_153BF40(module_data, &context, 0, 0, 0, 0) -- parses the extracted bitcode into an LLVM Module
  3. Context copy: Copies all iterator state (v391--v395), context/options/error callbacks (v405--v409), module name string, the full 4,480-byte NVVMPassOptions struct, and module bitcode vectors

Each function gets its own independent copy of the options struct and module -- there is no shared mutable state between worker threads during Phase II.

Worker Entry and Completion

Work Submission

Each function is submitted to the thread pool with two callbacks:

v373 = sub_12E8D50;    // completion callback (runs the optimizer)
v372 = sub_12D4D90;    // destructor callback (cleanup)
sub_16D5230(work_item, thread_pool, context);  // enqueue

Per-Function Callback (sub_12E8D50)

The completion callback extracts the module from the packaged context and calls the Phase II per-function optimizer:

void sub_12E8D50(Context *ctx) {
    Module *mod = extract_module(ctx);
    sub_12E86C0(ctx, function_index, opts, module_name);
}

Per-Function Phase II Optimizer (sub_12E86C0, 7,687 bytes)

This function sets the TLS phase counter to 2 and runs the pass pipeline on the individual function's module:

void sub_12E86C0(Context *ctx, int func_idx, Options *opts, StringRef name) {
    int *phase = malloc(4);
    *phase = 2;
    tls_set(qword_4FBB3B0, phase);
    // Run Phase II pass pipeline on this function's module
    sub_12E54A0(ctx, ...);
}

Because qword_4FBB3B0 is TLS, each worker thread has its own phase counter. All worker threads see phase=2 concurrently without interference.

Post-Compilation Merge

After all worker threads complete (sub_16D4EC0 joins the thread pool):

  1. Jobserver cleanup: sub_1682740 checks for jobserver errors and releases tokens
  2. Error check: If any per-function callback reported an error, the compilation fails
  3. Normal mode (opt_level >= 0): Appends a null byte to the output buffer (bitcode stream terminator)
  4. Split-compile mode (opt_level < 0): Re-reads each function's bitcode via sub_153BF40, links all per-function modules via sub_12F5610 (the LLVM module linker), and restores linkage attributes from the v362 hash table. Specifically:
    • Linkage values 7--8: set only low 6 bits (external linkage types)
    • Other values: set low 4 bits, then check (value & 0x30) != 0 for visibility bits
    • Sets byte+33 |= 0x40 (dso_local flag)

Configuration

Environment Variables

| Variable | Check | Effect |
|---|---|---|
| LIBNVVM_DISABLE_CONCURRENT_API | getenv() != NULL | Sets byte_4F92D70 = 1, disabling concurrent/thread-safe LibNVVM API usage entirely; any non-NULL value triggers it. Checked in global constructor ctor_104 at 0x4A5810. |
| MAKEFLAGS | Parsed by sub_1682BF0 | Searched for --jobserver-auth= to enable GNU Make jobserver integration |

NVVMPassOptions Slots

| Slot | Offset | Type | Default | Purpose |
|---|---|---|---|---|
| 163 (0xA3) | 3288 | BOOL_COMPACT | 0 | Jobserver integration requested (set by -jobserver flag) |
| 201 (0xC9) | 4064 | BOOL_COMPACT | 0 | Force concurrency on/off (0 = auto) |
| 203 (0xCB) | 4104 | INTEGER | -1 | Primary thread count (-1 = auto) |
| 205 (0xCD) | 4144 | INTEGER | -1 | Fallback thread count (-1 = auto) |

CLI Flags

| Flag | Route | Effect |
|---|---|---|
| -jobserver | opt "-jobserver" | Enables GNU jobserver integration (sets slot 163) |
| -split-compile=<N> | opt "-split-compile=<N>" | Enables split-module compilation (opt_level set to -1) |
| -split-compile-extended=<N> | opt "-split-compile-extended=<N>" | Extended split-compile (also sets +1644 = 1) |
| --sw2837879 | Internal | Concurrent ptxStaticLib workaround flag |

Phase State Machine

  START
    |
    v
  [phase=1] --> sub_12E54A0 (Phase I: whole-module analysis)
    |
    v
  error? --yes--> RETURN (abort)
    |no
    v
  count_defined_functions()
    |
    +--(1 func)--> [phase=2] --> sub_12E54A0 (Phase II sequential)
    |                                |
    |                                v
    |                            [phase=3] --> DONE
    |
    +--(N funcs, threads>1)--> sub_12E1EF0 (concurrent)
    |                             |
    |                             +-- sort functions by priority
    |                             +-- create thread pool
    |                             +-- init jobserver (if requested)
    |                             +-- for each function:
    |                             |     extract per-function bitcode
    |                             |     parse into independent Module
    |                             |     [phase=2] per-function (TLS)
    |                             |     submit to thread pool
    |                             +-- join all threads
    |                             +-- link split modules (if split-compile)
    |                             +-- [phase=3] --> DONE
    |
    +--(N funcs, threads<=1)--> [phase=2] --> sub_12E54A0 (sequential)
                                    |
                                    v
                                [phase=3] --> DONE

Differences from Upstream LLVM

Upstream LLVM has no two-phase compilation model. The standard LLVM pipeline runs all passes in a single invocation with no phase discrimination. CICC's approach is entirely custom:

  1. Phase counter TLS variable: Upstream LLVM passes have no concept of reading a global phase counter to decide whether to run. Every pass in CICC must check qword_4FBB3B0 and early-return if it belongs to the wrong phase.

  2. Per-function module splitting: Upstream LLVM's splitModule() (in llvm/Transforms/Utils/SplitModule.h) exists for ThinLTO and GPU offloading, but CICC's splitting at sub_1AB9F40 with the sub_12D4BD0 filter callback is a custom implementation integrated with the NVVMPassOptions system.

  3. GNU jobserver integration: No upstream LLVM tool participates in the GNU Make jobserver protocol. This is entirely NVIDIA-specific, implemented to play nicely with make -j in CUDA build systems.

  4. Function priority sorting: Upstream LLVM processes functions in module iteration order. CICC's priority-based sorting via sub_12E0CA0 ensures that expensive functions start compiling first, reducing tail latency in the thread pool.

Function Map

| Function | Address | Size |
|---|---|---|
| Function iterator: next | sub_12D3C60 | ~200 |
| Function iterator: advance | sub_12D3C80 | ~230 |
| Function iterator: end check | sub_12D3CA0 | ~260 |
| Function attribute/priority query | sub_12D3D20 | 585 |
| Auto thread count determination | sub_12D3FC0 | 3,600 |
| Concurrency eligibility check | sub_12D4250 | 626 |
| Insertion sort (small N) | sub_12D48A0 | -- |
| Per-function bitcode filter callback | sub_12D4BD0 | 2,384 |
| Work item destructor callback | sub_12D4D90 | 2,742 |
| Introsort (large N) | sub_12D57D0 | -- |
| Function sorting and enumeration | sub_12E0CA0 | 23,422 |
| Concurrent compilation top-level entry | sub_12E1EF0 | 51,325 |
| Master pipeline assembly (both phases) | sub_12E54A0 | 49,800 |
| Concurrent worker entry | sub_12E7B90 | 2,997 |
| Phase I/II orchestrator | sub_12E7E70 | 9,405 |
| Per-function Phase II optimizer | sub_12E86C0 | 7,687 |
| Per-function completion callback | sub_12E8D50 | -- |
| LLVM module linker (post-merge) | sub_12F5610 | 7,339 |
| Bitcode reader/verifier | sub_153BF40 | -- |
| isDeclaration() check | sub_15E4F60 | -- |
| Get function name | sub_1649960 | -- |
| Walk to parent GlobalValue | sub_164A820 | -- |
| Jobserver error check/cleanup | sub_1682740 | -- |
| MAKEFLAGS --jobserver-auth= parser | sub_1682BF0 | -- |
| GNU jobserver init (296-byte state) | sub_16832F0 | -- |
| TLS set (qword_4FBB3B0) | sub_16D40E0 | -- |
| TLS get (qword_4FBB3B0) | sub_16D40F0 | -- |
| Thread pool create | sub_16D4AB0 | -- |
| Thread pool join | sub_16D4EC0 | -- |
| Thread pool enqueue work item | sub_16D5230 | -- |
| Per-function bitcode extraction | sub_1AB9F40 | -- |
| get_nprocs() wrapper | sub_22420F0 | -- |

Cross-References

  • Entry Point & CLI -- pipeline dispatch that leads to the optimizer, including -jobserver flag routing
  • Optimizer Pipeline -- sub_12E54A0, the pipeline function called by both phases
  • NVVMPassOptions -- the 222-slot options table including thread count and jobserver slots
  • Environment Variables -- LIBNVVM_DISABLE_CONCURRENT_API and MAKEFLAGS
  • CLI Flags -- -jobserver, -split-compile, -split-compile-extended
  • Bitcode I/O -- sub_153BF40 bitcode reader used for per-function module extraction