Split Compilation
Split compilation parallelizes the PTX-to-SASS assembly step during LTO. After libnvvm compiles linked NVVM IR into a single PTX stream, nvlink can split that PTX into multiple chunks and assemble them concurrently on a thread pool. This is the only point in the nvlink pipeline that uses multi-threading; all other phases (merge, layout, relocation, finalize) are single-threaded.
Two CLI flags control the feature:
| Flag | Global (value) | Global (state) | Meaning |
|---|---|---|---|
| `-split-compile=N` | dword_2A5B518 | dword_2A5F260 | Number of NVVM-level splits. Forwarded to libnvvm as `-split-compile=N`; libnvvm produces N separate PTX chunks from a single compiled IR |
| `-split-compile-extended=N` | dword_2A5B514 | -- | Number of threads for the ptxas assembly step. When > 1, nvlink spawns a thread pool of N workers to assemble the split PTX chunks in parallel |
Both values are integers. Neither flag has a short-form alias. When -split-compile-extended is not specified, dword_2A5B514 defaults to 1 (single-threaded).
Value Consensus
Split-compile values originate from the input objects, not the command line. Each fatbin member carries NVVM compilation options as an embedded string. During fatbin extraction (sub_42AF40 at 0x42AF40), nvlink parses -split-compile N from the embedded options and accumulates a consensus across all input modules using a state machine on dword_2A5F260:
| State | Value | Meaning |
|---|---|---|
| 0 | (none) | No module has declared a split-compile value yet |
| 1 | (none) | At least one module had no -split-compile option |
| 2 | N | First module declared -split-compile=N |
| 3 | N | Modules disagree on presence (some have it, some do not), but the value N is consistent |
| 4 | (conflict) | Two modules declared different split-compile values. Treated as a warning |
```
for each input module's embedded option string:
    val = parse "-split-compile " from option string
    if val found:
        if state == 0:   state = 2, value = val
        elif state == 1: state = 3, value = val
        elif state == 2 or state == 3:
            if val != value: state = 4   // conflict
    else:
        if state == 0:   state = 1
        elif state == 2: state = 3
```
If the final state is 4 (conflict), nvlink emits a warning diagnostic about inconsistent -split-compile values across inputs.
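The accumulation loop above can be sketched as a small C state machine. This is a hypothetical rendering: `sc_accumulate`, `sc_consensus`, and the enum names are illustrative labels for the states in the table, not symbols from the binary.

```c
#include <assert.h>

/* States mirror the dword_2A5F260 table above. */
enum { SC_NONE = 0, SC_ABSENT = 1, SC_PRESENT = 2, SC_MIXED = 3, SC_CONFLICT = 4 };

typedef struct {
    int state;  /* mirrors dword_2A5F260 */
    int value;  /* mirrors dword_2A5B518 */
} sc_consensus;

/* Feed one module's embedded options: has_opt says whether the module
   carried -split-compile, val is its N when present. */
static void sc_accumulate(sc_consensus *c, int has_opt, int val)
{
    if (has_opt) {
        switch (c->state) {
        case SC_NONE:    c->state = SC_PRESENT; c->value = val; break;
        case SC_ABSENT:  c->state = SC_MIXED;   c->value = val; break;
        case SC_PRESENT:
        case SC_MIXED:
            if (val != c->value) c->state = SC_CONFLICT;  /* disagreement */
            break;
        }
    } else {
        if (c->state == SC_NONE)         c->state = SC_ABSENT;
        else if (c->state == SC_PRESENT) c->state = SC_MIXED;
    }
}
```

Feeding a module with `-split-compile 4`, then one without the option, then one with `-split-compile 8` walks the machine through states 2, 3, and 4, matching the table.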
Option Forwarding to libnvvm
The option builder (sub_426CD0 at 0x426CD0) constructs the argument array passed to nvvmCompileProgram. The split-compile forwarding logic:
```
if split_compile_extended != 1:
    append "-split-compile-extended=<dword_2A5B514>"

if split_compile_value == 1:
    // skip -split-compile (implicitly 1)
elif split_compile_extended != 1:
    // warn: -split-compile vs -split-compile-extended conflict
else:
    if split_compile_value != 1:
        append "-split-compile=<dword_2A5B518>"
```
Both flags are only forwarded when their values differ from the default of 1. When -split-compile-extended is active, it takes precedence, and if -split-compile is also set to a non-default value that differs, nvlink emits a conflict warning via sub_467460.
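These rules can be rendered as a small C function. This is a hypothetical sketch of the decision logic only: `forward_split_flags`, `warn`, and the space-separated output buffer are illustrative, not nvlink's actual option-builder interface.

```c
#include <stdio.h>
#include <string.h>
#include <assert.h>

static int warned;  /* records that the conflict diagnostic fired */

static void warn(const char *msg) { warned = 1; (void)msg; }

/* Appends the forwarded flags (space-separated) to out, mirroring the
   precedence described in the text: extended wins, and a non-default
   -split-compile alongside it only produces a warning. */
static void forward_split_flags(int split_compile, int extended,
                                char *out, size_t outsz)
{
    out[0] = '\0';
    if (extended != 1)
        snprintf(out + strlen(out), outsz - strlen(out),
                 "-split-compile-extended=%d ", extended);
    if (split_compile == 1) {
        /* implicitly 1: nothing forwarded */
    } else if (extended != 1) {
        warn("-split-compile conflicts with -split-compile-extended");
    } else {
        snprintf(out + strlen(out), outsz - strlen(out),
                 "-split-compile=%d", split_compile);
    }
}
```

With `(-split-compile=4, -split-compile-extended=8)` only the extended flag is forwarded and the warning fires; with `(4, 1)` only `-split-compile=4` is forwarded.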
Dispatch Paths in main
After libnvvm compilation completes (via sub_4BC6F0), nvlink chooses one of three paths based on the compilation mode and dword_2A5B514:
Path 1: Whole-Program, Single-Threaded
When `-device-c` is not set and `split_compile_extended == 1`:

```c
dword_2A5B528 = byte_2A5F225 ? 6 : 0;   // SASS=6, normal=0
options = build_ptxas_options();        // sub_429BA0
result = ptxas_compile_whole(           // sub_4BD4E0
    &output, ptx_data, sm_version,
    addr64, is_64bit, debug, options, mode);
```
sub_4BD4E0 at 0x4BD4E0 calls the embedded ptxas to assemble the entire PTX stream in one shot. It uses the same option-setting and compilation API as the split path but expects a single output (the compiled PTX contains one module).
Path 2: Relocatable, Single-Threaded
When `-device-c` is set and `split_compile_extended == 1`:

```c
options = build_ptxas_options();        // sub_429BA0
result = ptxas_compile_split(           // sub_4BD760
    &output, ptx_data, sm_version,
    addr64, is_64bit, debug, options, mode);
```
sub_4BD760 at 0x4BD760 is the split-aware ptxas entry point. When called with a single split, it handles the single-chunk case: set arch, set options, add input, compile, retrieve output.
Path 3: Extended Split Compile (Multi-Threaded)
When `split_compile_extended > 1`:

```c
// Allocate work items: 40 bytes per split
work_items = arena_alloc(40 * num_splits);          // sub_426AA0
options = build_ptxas_options();                    // sub_429BA0

// Default thread count to nproc if not specified
if (split_compile_extended == 0)
    split_compile_extended = sysconf(_SC_NPROCESSORS_ONLN);  // sub_43FD90

// Create thread pool
pool = thread_pool_create(split_compile_extended);  // sub_43FDB0
if (!pool)
    fatal("Unable to create thread pool");

// Allocate output pointer array
outputs = arena_alloc(8 * num_splits);

// Submit one task per split
for (i = 0; i < num_splits; i++) {
    work_items[i].output_ptr = &outputs[i];    // offset 0
    work_items[i].ptx_data   = split_ptx[i];   // offset 8
    work_items[i].sm_version = sm_version;     // offset 16
    work_items[i].addr64     = addr64;         // offset 20
    work_items[i].is_64bit   = is_64bit;       // offset 21
    work_items[i].debug      = debug;          // offset 22
    work_items[i].options    = options;        // offset 24
    work_items[i].mode       = mode;           // offset 32
    thread_pool_submit(pool, split_compile_worker, &work_items[i]);
}

// Wait for all tasks, then tear down pool
thread_pool_wait(pool);     // sub_43FFE0
thread_pool_destroy(pool);  // sub_43FE70

// Merge results
for (i = 0; i < num_splits; i++) {
    check_error(work_items[i].result);          // offset 36
    validate_and_merge(elfw, outputs[i], "lto.cubin");
    if (sm > 89 && needs_mercury)
        post_link_transform(&outputs[i], ...);  // sub_4275C0
    merge_elf(elfw);                            // sub_45E7D0
}
```
Each work item is a 40-byte structure:
| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 8 | output_ptr | Pointer to where the compiled cubin will be stored |
| 8 | 8 | ptx_data | Pointer to the split PTX chunk (input) |
| 16 | 4 | sm_version | Target SM architecture number |
| 20 | 1 | addr64 | 64-bit addressing flag |
| 21 | 1 | is_64bit | 64-bit machine flag |
| 22 | 1 | debug | Debug info flag (-g) |
| 24 | 8 | options | Pointer to shared ptxas option string |
| 32 | 4 | mode | Compilation mode (0=normal, 6=SASS) |
| 36 | 4 | result | Return code from sub_4BD760 (written by worker) |
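The table above can be expressed as a C struct. Under the usual LP64 layout rules, the compiler inserts one byte of padding after `debug`, which is why `options` lands at offset 24. A sketch (field names are descriptive labels, not symbols recovered from the binary):

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

typedef struct work_item {
    void       **output_ptr;  /* offset 0:  where the cubin pointer is stored */
    const char  *ptx_data;    /* offset 8:  split PTX chunk (input) */
    uint32_t     sm_version;  /* offset 16: target SM architecture */
    uint8_t      addr64;      /* offset 20: 64-bit addressing flag */
    uint8_t      is_64bit;    /* offset 21: 64-bit machine flag */
    uint8_t      debug;       /* offset 22: debug info flag (-g) */
                              /* offset 23: one byte of padding */
    const char  *options;     /* offset 24: shared ptxas option string */
    uint32_t     mode;        /* offset 32: 0=normal, 6=SASS */
    int32_t      result;      /* offset 36: written by the worker */
} work_item;

_Static_assert(offsetof(work_item, options) == 24, "padding after debug");
_Static_assert(offsetof(work_item, result) == 36, "result at offset 36");
_Static_assert(sizeof(work_item) == 40, "40-byte work item");
```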
Work Item Lifecycle
A work item is a 40-byte structure that carries all inputs needed for one split's ptxas compilation and receives the result. This section traces a single work item through five phases: creation, submission, dequeue, compilation, and result collection.
Phase 1: Creation (main thread)
After sub_4BC6F0 (libnvvm compile) returns, the main thread has a contiguous PTX blob and a sizes array (v362). It allocates N work items as a contiguous block and an output-pointer array, then populates each work item from globals and per-split data.
```c
// Allocate work item array: 40 bytes * num_splits (arena memory)
work_items_base = arena_alloc(40 * num_splits);   // sub_426AA0
// Allocate output pointer array: 8 bytes * num_splits (arena memory)
outputs = arena_alloc(8 * num_splits);            // sub_426AA0
// Build shared ptxas option string (once, shared across all items)
options = build_ptxas_options();                  // sub_429BA0

// Populate each work item from globals and per-split PTX pointer
cursor = work_items_base;
for (i = 0; i < num_splits; i++) {
    // offset 0: pointer to this split's slot in the outputs array
    *(uint64_t *)(cursor + 0) = &outputs[i];
    // offset 8: pointer to this split's null-terminated PTX string
    // (copied from libnvvm's contiguous output into arena memory,
    // sliced by the sizes array v362)
    *(uint64_t *)(cursor + 8) = split_ptx[i];
    // offset 16: target SM architecture (e.g. 89, 100)
    *(uint32_t *)(cursor + 16) = dword_2A5F314;   // sm_version
    // offset 20: 64-bit addressing flag (inverted byte_2A5F2C0)
    *(uint8_t *)(cursor + 20) = (byte_2A5F2C0 == 0);
    // offset 21: 64-bit machine flag (dword_2A5F30C == 64)
    *(uint8_t *)(cursor + 21) = (dword_2A5F30C == 64);
    // offset 22: debug info flag (inverted byte_2A5F310)
    *(uint8_t *)(cursor + 22) = (byte_2A5F310 == 0);
    // offset 24: shared ptxas option string (same pointer for all items)
    *(uint64_t *)(cursor + 24) = options;
    // offset 32: compilation mode (0=normal, 6=SASS-only via byte_2A5F225)
    *(uint32_t *)(cursor + 32) = dword_2A5B528;
    // offset 36: result code -- not set during creation; worker writes here
    cursor += 40;
}
```
Key observations from the decompiled main at lines 1224--1248:

- The `addr64` flag at offset 20 is the inverse of `byte_2A5F2C0` (the `v85 = byte_2A5F2C0 == 0` pattern).
- The `debug` flag at offset 22 is the inverse of `byte_2A5F310` (same negation pattern).
- The `options` pointer at offset 24 is shared across all work items. It points to a single option string built by `sub_429BA0`. This is safe because workers only read it.
- The `mode` at offset 32 comes from `dword_2A5B528`, which was set earlier: 6 if `byte_2A5F225` (SASS mode), 0 otherwise.
Phase 2: Submission (main thread)
Each populated work item is submitted to the thread pool as a (function, arg) pair:
```c
for (i = 0; i < num_splits; i++) {
    if (!thread_pool_submit(pool, split_compile_worker, &work_items[i]))
        fatal("Call to ptxjit failed in extended split compile mode");
}
```
Inside `thread_pool_submit` (`sub_43FF50`), each call:

- Allocates a 24-byte task node via `malloc(0x18)`:
  - Offset 0: function pointer (`split_compile_worker` = `sub_4264B0`)
  - Offset 8: argument pointer (address of this work item within the contiguous block)
  - Offset 16: next pointer (set to NULL, unused by the priority queue)
- Locks the pool mutex
- Pushes the task node into the priority queue (`sub_44DD10`)
- Increments `pending_count`
- Broadcasts on `task_cond` to wake all sleeping workers
- Unlocks the pool mutex
The broadcast (not signal) wakes all workers, not just one. With N tasks submitted in rapid succession, this means up to N workers can wake simultaneously on the first submission. Subsequent submissions find workers already awake and dequeuing.
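The wake-all behavior can be demonstrated with a standalone pthreads sketch (illustrative, not nvlink's code): several waiters block on one condition variable behind a predicate, and a single `pthread_cond_broadcast` releases all of them. Joinable threads are used here for a deterministic check, unlike the pool's detached workers.

```c
#include <pthread.h>
#include <assert.h>

enum { NWORKERS = 4 };

static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
static int go = 0;      /* the predicate the waiters sleep on */
static int awake = 0;   /* how many waiters ran to completion */

static void *waiter(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&mu);
    while (!go)                      /* predicate loop guards spurious wakeups */
        pthread_cond_wait(&cv, &mu);
    awake++;                         /* every waiter proceeds after one broadcast */
    pthread_mutex_unlock(&mu);
    return NULL;
}

int broadcast_demo(void)
{
    pthread_t t[NWORKERS];
    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&t[i], NULL, waiter, NULL);

    pthread_mutex_lock(&mu);
    go = 1;
    pthread_cond_broadcast(&cv);     /* one call releases all NWORKERS waiters */
    pthread_mutex_unlock(&mu);

    for (int i = 0; i < NWORKERS; i++)
        pthread_join(t[i], NULL);
    return awake;                    /* equals NWORKERS */
}
```

A `pthread_cond_signal` in the same position would wake only one waiter per call, which is why the pool uses broadcast when tasks arrive in bursts.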
Phase 3: Dequeue (worker thread)
Each worker thread runs start_routine at 0x43FC80 in an infinite loop. On dequeue:
```c
// Worker is holding pool->mutex
task = priority_queue_pop(pool->task_queue);   // sub_44DE20
pool->pending_count--;
pool->active_count++;
pthread_mutex_unlock(&pool->mutex);
```

The task node contains the function pointer and the work item address. The worker calls:

```c
task->func(task->arg);   // split_compile_worker(&work_items[i])
free(task);              // frees the 24-byte malloc'd task node
```
After execution, the worker re-acquires the mutex, decrements active_count, and signals done_cond if both active_count and pending_count are zero (the "all done" condition).
Phase 4: Compilation (worker thread)
sub_4264B0 (the split-compile worker, 48 bytes) is a thin unpacker. It reads each field from the work item structure at its known offset and forwards them as arguments to sub_4BD760:
```c
// sub_4264B0 -- decompiled form
void split_compile_worker(work_item *item) {
    *(int32_t *)((char *)item + 36) = sub_4BD760(
        *(uint64_t *)((char *)item + 0),   // output_ptr
        *(uint64_t *)((char *)item + 8),   // ptx_data
        *(uint32_t *)((char *)item + 16),  // sm_version
        *(uint8_t *)((char *)item + 20),   // addr64
        *(uint8_t *)((char *)item + 21),   // is_64bit
        *(uint8_t *)((char *)item + 22),   // debug
        *(uint64_t *)((char *)item + 24),  // options
        *(uint32_t *)((char *)item + 32)   // mode
    );
}
```
The return value of sub_4BD760 is an elfLink error code (0--13), written directly into offset 36 of the work item. This is the only field the worker writes; all other fields are read-only inputs.
Inside sub_4BD760, the compilation proceeds through the embedded ptxas API:
```
sub_4CDD60(&ctx)                      // create compiler context
sub_4CE3B0(ctx, mode)                 // set compilation mode (0 or 6)
sub_4CE2F0(ctx, sm_version)           // set target SM architecture
sub_4CE380(ctx)                       // enable 64-bit addressing (if addr64)
sub_4CE640(ctx, 1)                    // enable 64-bit machine mode (if is_64bit)
sub_4CE3E0(ctx, options)              // add ptxas option string
sub_4CE070(ctx, ptx_data)             // add PTX input data
sub_4CE8C0(ctx)                       // compile
sub_4CE670(ctx, &buf, &count, &size)  // retrieve output chunks
```
After compilation, sub_4CE670 returns the output chunk count. Two paths:
Multi-chunk path (count != 1): The PTX was split by libnvvm. The function enters the setjmp-protected output copy path. It allocates arena memory via sub_4307C0, copies the compiled binary with memcpy, and stores the pointer through output_ptr (writing to outputs[i]). The setjmp/longjmp wrapper catches arena allocation failures or memcpy errors and maps them to error code 1 (ELFLINK_INTERNAL).
Single-chunk fallback (count == 1): The PTX was not actually split (e.g., libnvvm decided splitting was not beneficial). The function adds extra architecture options (-m64 or -m32) and a debug option (the string "-g", stored at address 30616008), then re-retrieves via sub_4BE350. This fallback handles the case where split-compile was requested but libnvvm produced only one chunk.
Return codes from sub_4BD760:
| Code | Condition |
|---|---|
| 0 | Success, output written to *output_ptr |
| 1 | Success with warning (longjmp recovery set a warning flag) |
| 5 | Compilation failed (sub_4CE8C0 error, or output retrieval failed, or option-setting failed) |
| 7 | NVVM error (sub_4CE8C0 returned 3, mapping to ELFLINK_NVVM_ERROR) |
| 8 | Output not relocatable (cubin retrieval via sub_4BE350 failed and no error string) |
Phase 5: Result Collection (main thread)
After all submissions, the main thread waits for completion and then iterates:
```c
thread_pool_wait(pool);     // sub_43FFE0 -- blocks until active==0 && pending==0
thread_pool_destroy(pool);  // sub_43FE70 -- shutdown flag, wait for threads, free

for (i = 0; i < num_splits; i++) {
    // 1. Check error code at offset 36 of each work item
    check_ptxas_error(work_items_base[i * 40 + 36], "<lto ptx>");    // sub_4297B0
    // 2. Retrieve compiled cubin from output array
    cubin = outputs[i];
    // 3. Validate and add to output ELF
    if (!validate_and_add(elfw, cubin, "lto.cubin", &is_mercury))    // sub_426570
        fatal("Ptxjit compilation failed in extended split compile mode");
    // 4. Mercury post-link transform (sm > 89 only)
    if (sm_version > 0x59 && (!sass_mode || needs_mercury(cubin)) && !is_mercury)
        mercury_post_link(&cubin, "lto.cubin", sm_version, &env, 0); // sub_4275C0
    // 5. Merge into output ELF
    merge_elf(elfw);                                                 // sub_45E7D0
    // 6. Free the split PTX string (arena memory)
    arena_free(split_ptx[i]);                                        // sub_431000
}

// Free the work items block and split PTX pointer array
arena_free(work_items_base);   // sub_431000
arena_free(split_ptx_array);   // sub_431000
```
The error check at step 1 (sub_4297B0) handles three cases:
- Code 0: success, no action
- Code 7 (`ELFLINK_NVVM_ERROR`): emits a fatal diagnostic unless `byte_2A5F298` is set (cudadevrt tolerance mode) or the source contains "cudadevrt"
- All other non-zero codes: translated via `sub_4BC270` to a human-readable string, followed by a fatal diagnostic
Steps 3--5 run serially in the main thread. The validate_and_add function (sub_426570) performs architecture verification: it checks the compiled cubin's ELF headers against the target SM, validates the machine class (32/64-bit), and verifies format compatibility. The Mercury post-link step (sub_4275C0) is only invoked for sm > 89 (Blackwell and later) and transforms the cubin for the Mercury execution model.
Lifecycle Diagram
```
main thread                                  worker threads (N)
===========                                  ==================
libnvvm compile
  sub_4BC6F0()
      |
      v
split PTX into N chunks
      |
      v
allocate work_items[N]                       (sleeping on task_cond)
allocate outputs[N]
      |
      v
for each split i:
  populate work_items[i]
  submit(worker, &items[i])  ----------->    wake on task_cond
      |                                      dequeue task
      |                                      call split_compile_worker()
      |                                        read offsets 0..32
      |                                        sub_4BD760() -- ptxas compile
      |                                        write offset 36 (result)
      |                                        write *output_ptr (cubin)
      |                                      free task node
      |                                      decrement active_count
      |                                      signal done_cond if all done
      v                                          |
thread_pool_wait()  <------(done_cond)-----------+
thread_pool_destroy()
      |
      v
for each split i:
  check work_items[i].result
  validate outputs[i]
  mercury transform (if sm>89)
  merge into output ELF
  free split PTX[i]
      |
      v
free work_items, free split_ptx_array
```
Thread Safety Notes
The design achieves thread safety through strict partitioning:
- **No shared mutable state between workers.** Each worker reads from its own 40-byte work item and writes to two locations: offset 36 (result code) in its own item, and `*output_ptr` (its own slot in the output array). No two workers touch the same memory.
- **Read-only shared data.** The `options` pointer at offset 24 points to a single string built by the main thread before any submission. All workers read it; none write it.
- **Arena allocator thread safety.** `sub_4BD760` allocates output memory via `sub_4307C0` (arena allocator). The arena uses per-memspace mutexes (visible at offset 7128 in the memspace structure in `sub_431000`), so concurrent arena allocations from different workers are safe.
- **No synchronization inside the worker.** `sub_4264B0` contains no locks, atomics, or synchronization primitives. It is a pure function of its input (the work item pointer) plus the thread-safe ptxas API calls.
- **Thread-local `setjmp`/`longjmp`.** The `env` buffer in `sub_4BD760` lives on each worker thread's stack, so concurrent longjmps do not interfere.
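The partitioning discipline can be illustrated with a standalone sketch (not the binary's code): each thread receives its own item, writes only its own result field and its own output slot, and the main thread needs no locks beyond joining the threads. Names here (`demo_item`, `partition_demo`) are invented for the example.

```c
#include <pthread.h>
#include <assert.h>

enum { NSPLITS = 4 };

typedef struct {
    int *output_ptr;  /* this worker's private slot in a shared array */
    int  input;
    int  result;      /* only this worker writes here */
} demo_item;

static void *demo_worker(void *arg)
{
    demo_item *it = (demo_item *)arg;
    *it->output_ptr = it->input * it->input;  /* stand-in for "compile" */
    it->result = 0;                           /* success code */
    return NULL;
}

/* Fills outputs[0..NSPLITS-1]; returns the OR of all result codes. */
int partition_demo(int outputs[NSPLITS])
{
    pthread_t t[NSPLITS];
    demo_item items[NSPLITS];
    for (int i = 0; i < NSPLITS; i++) {
        items[i] = (demo_item){ .output_ptr = &outputs[i], .input = i + 1 };
        pthread_create(&t[i], NULL, demo_worker, &items[i]);
    }
    int rc = 0;
    for (int i = 0; i < NSPLITS; i++) {
        pthread_join(t[i], NULL);
        rc |= items[i].result;    /* collect per-item results, like offset 36 */
    }
    return rc;
}
```

Because no two threads ever touch the same addresses, the code is race-free without any mutex in the workers, which is exactly the property the nvlink design relies on.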
Worker Function
sub_4264B0 at 0x4264B0 is the per-split worker. It unpacks the 40-byte work item and calls sub_4BD760:
```c
void split_compile_worker(work_item *item) {
    item->result = ptxas_compile_split(
        item->output_ptr,   // offset 0
        item->ptx_data,     // offset 8
        item->sm_version,   // offset 16
        item->addr64,       // offset 20
        item->is_64bit,     // offset 21
        item->debug,        // offset 22
        item->options,      // offset 24
        item->mode          // offset 32
    );
}
```
sub_4BD760 at 0x4BD760 is the split-compile ptxas entry point. Unlike the whole-program sub_4BD4E0, it produces multiple output chunks when libnvvm has generated split PTX. The function:
- Creates a ptxas compiler context (`sub_4CDD60`)
- Sets the target architecture (`sub_4CE3B0`, `sub_4CE2F0`)
- Optionally sets addr64 (`sub_4CE380`), 64-bit mode (`sub_4CE640`), and debug (`sub_4CE310`)
- Adds compilation options (`sub_4CE3E0`)
- Adds the PTX input data (`sub_4CE070`)
- Runs compilation (`sub_4CE8C0`)
- Retrieves output via `sub_4CE670` and checks the output count (`v35`)
- If the count is 1, the PTX was not split: adds architecture-specific options (`-m32`/`-m64`) and re-retrieves via `sub_4BE350`
- Copies the compiled binary into arena memory and returns it through the output pointer
Error handling uses setjmp/longjmp to catch compilation failures and return status codes.
Thread Pool Implementation
The thread pool is a custom implementation using pthreads. The pool control structure is 184 bytes (0xB8):
| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 8 | thread_array | Pointer to array of 16-byte thread entries [slot, pthread_t] |
| 8 | 8 | task_queue | Priority queue for pending tasks (min-heap) |
| 16 | 4 | pending_count | Number of tasks queued but not yet dequeued |
| 24 | 40 | mutex | pthread_mutex_t protecting the pool state |
| 64 | 48 | task_cond | pthread_cond_t signaled when a task is available |
| 112 | 48 | done_cond | pthread_cond_t signaled when a task completes or pool is shutting down |
| 160 | 8 | active_count | Number of workers currently executing a task |
| 168 | 8 | thread_count | Number of live worker threads (decremented as threads exit) |
| 176 | 1 | shutdown | Shutdown flag; set to 1 to signal workers to exit |
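The table above can be mirrored as a C struct. This is a hypothetical layout sketch whose field names are descriptive, not recovered symbols; the offsets only line up on x86-64 Linux with glibc, where `pthread_mutex_t` is 40 bytes and `pthread_cond_t` is 48 bytes.

```c
#include <pthread.h>
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

typedef struct thread_pool {
    void           *thread_array;   /* offset 0:   16-byte thread entries */
    void           *task_queue;     /* offset 8:   min-heap of task nodes */
    uint32_t        pending_count;  /* offset 16:  queued, not yet dequeued */
    uint32_t        pad;            /* offset 20:  alignment padding */
    pthread_mutex_t mutex;          /* offset 24:  protects pool state */
    pthread_cond_t  task_cond;      /* offset 64:  task available */
    pthread_cond_t  done_cond;      /* offset 112: all work finished */
    uint64_t        active_count;   /* offset 160: workers executing */
    uint64_t        thread_count;   /* offset 168: live worker threads */
    uint8_t         shutdown;       /* offset 176: exit flag */
} thread_pool;                      /* padded to 184 bytes (0xB8) */
```

The 40/48-byte primitive sizes explain the otherwise odd-looking gaps between `pending_count` (offset 16) and `mutex` (offset 24), and between the two condition variables.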
thread_pool_get_nproc (sub_43FD90 at 0x43FD90)
Returns the number of online processors via sysconf(_SC_NPROCESSORS_ONLN) (sysconf parameter 83). Used as the default thread count when -split-compile-extended is specified without a value (or is 0).
thread_pool_create (sub_43FDB0 at 0x43FDB0)
```c
pool_t *thread_pool_create(size_t num_threads) {
    pool = calloc(1, 0xB8);                        // 184-byte control block
    pool->thread_array = calloc(num_threads, 16);  // 16 bytes per thread
    pool->thread_count = num_threads;
    pool->pending_count = 0;
    pthread_mutex_init(&pool->mutex, NULL);
    pthread_cond_init(&pool->task_cond, NULL);
    pthread_cond_init(&pool->done_cond, NULL);
    pool->task_queue = priority_queue_create(comparator_always_true, 0);
    for (i = 0; i < num_threads; i++) {
        pthread_create(&pool->thread_array[i].thread, NULL, worker_main, pool);
        pthread_detach(pool->thread_array[i].thread);
    }
    return pool;
}
```
All threads are created detached. The priority queue uses sub_43FC70 as its comparator, which always returns 1 -- every comparison reports the elements as already ordered, so sift-up never reorders anything and the heap behaves as an approximately FIFO queue.
thread_pool_submit (sub_43FF50 at 0x43FF50)
```c
int thread_pool_submit(pool_t *pool, void (*func)(void *), void *arg) {
    if (!func || !pool) return 0;
    task = malloc(24);        // 24-byte task node
    task->func = func;        // offset 0: function pointer
    task->arg = arg;          // offset 8: argument pointer
    task->next = NULL;        // offset 16: unused (queue manages ordering)
    pthread_mutex_lock(&pool->mutex);
    priority_queue_push(task, pool->task_queue);  // sub_44DD10
    pool->pending_count++;
    pthread_cond_broadcast(&pool->task_cond);     // wake all waiting workers
    pthread_mutex_unlock(&pool->mutex);
    return 1;
}
```
The task node is 24 bytes, heap-allocated with malloc (not the arena allocator). The queue push (sub_44DD10) inserts into a min-heap backed by a dynamic array with doubling growth. Since the comparator always returns 1, elements are dequeued in approximately insertion order.
worker_main (start_routine at 0x43FC80)
```c
void *worker_main(pool_t *pool) {
    while (1) {
        pthread_mutex_lock(&pool->mutex);
        // Wait for work
        while (pool->pending_count == 0) {
            if (pool->shutdown) goto exit;
            pthread_cond_wait(&pool->task_cond, &pool->mutex);
        }
        // Dequeue task
        task = priority_queue_pop(pool->task_queue);  // sub_44DE20
        pool->pending_count--;
        pool->active_count++;
        pthread_mutex_unlock(&pool->mutex);
        // Execute task
        if (task) {
            task->func(task->arg);
            free(task);
        }
        // Signal completion
        pthread_mutex_lock(&pool->mutex);
        pool->active_count--;
        if (!pool->shutdown && pool->active_count == 0 && pool->pending_count == 0)
            pthread_cond_signal(&pool->done_cond);
        pthread_mutex_unlock(&pool->mutex);
    }
exit:
    pool->thread_count--;
    pthread_cond_signal(&pool->done_cond);
    pthread_mutex_unlock(&pool->mutex);
    return NULL;
}
```
The worker loops indefinitely, sleeping on task_cond when no work is available. When the shutdown flag is set, it decrements the thread count and signals done_cond before exiting. The completion signal at the end of each task is only fired when both active_count and pending_count are zero -- this is the condition that thread_pool_wait blocks on.
thread_pool_wait (sub_43FFE0 at 0x43FFE0)
```c
void thread_pool_wait(pool_t *pool) {
    if (!pool) return;
    pthread_mutex_lock(&pool->mutex);
    while (1) {
        if (pool->pending_count == 0) {
            if (pool->shutdown) {
                if (pool->thread_count == 0) break;
            } else {
                if (pool->active_count == 0) break;
            }
        }
        pthread_cond_wait(&pool->done_cond, &pool->mutex);
    }
    pthread_mutex_unlock(&pool->mutex);
}
```
Two wait modes: during normal operation, waits until pending_count == 0 && active_count == 0 (all submitted tasks finished). During shutdown, waits until pending_count == 0 && thread_count == 0 (all threads exited).
thread_pool_destroy (sub_43FE70 at 0x43FE70)
```c
void thread_pool_destroy(pool_t *pool) {
    if (!pool) return;
    // Signal shutdown
    pthread_mutex_lock(&pool->mutex);
    priority_queue_destroy(pool->task_queue);  // sub_44DC40
    pool->pending_count = 0;
    pool->shutdown = 1;
    pthread_cond_broadcast(&pool->task_cond);  // wake all workers
    pthread_mutex_unlock(&pool->mutex);
    // Wait for all threads to exit
    pthread_mutex_lock(&pool->mutex);
    while (pool->pending_count != 0 || pool->thread_count != 0)
        pthread_cond_wait(&pool->done_cond, &pool->mutex);
    pthread_mutex_unlock(&pool->mutex);
    // Cleanup
    pthread_mutex_destroy(&pool->mutex);
    pthread_cond_destroy(&pool->task_cond);
    pthread_cond_destroy(&pool->done_cond);
    free(pool->thread_array);
    free(pool);
}
```
The destroy sequence is two-phase: first set the shutdown flag and broadcast to wake all sleeping workers, then wait for every thread to decrement thread_count to zero. Only then are the synchronization primitives destroyed and memory freed. The pool itself and the thread array are calloc/free-managed (not arena-allocated), matching the calloc in thread_pool_create.
Priority Queue
The task queue (sub_44DC60 / sub_44DD10 / sub_44DE20) is a binary min-heap stored in a dynamic array:
| Offset | Size | Field |
|---|---|---|
| 0 | 8 | array -- pointer to element pointer array |
| 8 | 8 | count -- number of elements |
| 16 | 8 | capacity -- allocated slots |
| 24 | 8 | comparator -- function pointer for ordering |
Push (sub_44DD10) inserts at the end and sifts up. Pop (sub_44DE20) moves the last element to position 0 and sifts down. The comparator sub_43FC70 always returns 1, so every parent is always "less than or equal" to its children -- the heap degenerates into approximate FIFO behavior. Growth doubles the capacity when full, using sub_4313A0 (arena realloc).
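A generic sketch of this push/pop structure (illustrative only: heap-allocated rather than arena-backed, and shown with a real ordering comparator so the sift-up/sift-down mechanics are visible; `pqueue`, `pq_push`, and `pq_pop` are invented names):

```c
#include <stdlib.h>
#include <assert.h>

/* Comparator returns nonzero if a may stay above b in the heap. */
typedef int (*cmp_fn)(void *a, void *b);

typedef struct {
    void **array;               /* element pointer array */
    size_t count, capacity;
    cmp_fn cmp;
} pqueue;

static pqueue pq_create(cmp_fn cmp)
{
    pqueue q = { malloc(4 * sizeof(void *)), 0, 4, cmp };
    return q;
}

static void pq_push(pqueue *q, void *e)
{
    if (q->count == q->capacity)                   /* doubling growth */
        q->array = realloc(q->array, (q->capacity *= 2) * sizeof(void *));
    size_t i = q->count++;
    q->array[i] = e;
    while (i > 0) {                                /* sift up */
        size_t p = (i - 1) / 2;
        if (q->cmp(q->array[p], q->array[i])) break;
        void *t = q->array[p]; q->array[p] = q->array[i]; q->array[i] = t;
        i = p;
    }
}

static void *pq_pop(pqueue *q)
{
    if (q->count == 0) return NULL;
    void *top = q->array[0];
    q->array[0] = q->array[--q->count];            /* move last to root */
    size_t i = 0;
    for (;;) {                                     /* sift down */
        size_t l = 2 * i + 1, r = l + 1, best = i;
        if (l < q->count && !q->cmp(q->array[best], q->array[l])) best = l;
        if (r < q->count && !q->cmp(q->array[best], q->array[r])) best = r;
        if (best == i) break;
        void *t = q->array[i]; q->array[i] = q->array[best]; q->array[best] = t;
        i = best;
    }
    return top;
}
```

With nvlink's degenerate always-1 comparator, the sift loops above would never swap, which is how the heap collapses into the approximately-FIFO behavior described in the text.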
Error Handling
After all tasks complete:
```c
for (i = 0; i < num_splits; i++) {
    sub_4297B0(work_items[i].result, "<lto ptx>");  // check error code
    output = outputs[i];
    if (!validate_and_add(elfw, output, "lto.cubin", &is_mercury))
        fatal("Ptxjit compilation failed in extended split compile mode");
    // Mercury post-link if sm > 89
    merge_elf(elfw);
}
```
sub_4297B0 checks the return code from sub_4BD760. A non-zero result for any split causes a fatal error. Each compiled split is independently validated against the target architecture, potentially transformed by the Mercury finalizer, and then merged into the output ELF.
Function Map
| Address | Name | Size | Role |
|---|---|---|---|
| 0x43FD90 | thread_pool_get_nproc | 18 B | Returns CPU count via sysconf(83) |
| 0x43FDB0 | thread_pool_create | 416 B | Allocates pool, spawns N detached worker threads |
| 0x43FC80 | worker_main | 272 B | Worker loop: dequeue, execute, signal completion |
| 0x43FF50 | thread_pool_submit | 144 B | Enqueues a (function, arg) task into the pool |
| 0x43FFE0 | thread_pool_wait | 128 B | Blocks until all submitted tasks complete |
| 0x43FE70 | thread_pool_destroy | 224 B | Signals shutdown, waits for threads, frees resources |
| 0x4264B0 | split_compile_worker | 48 B | Unpacks work item, calls sub_4BD760 |
| 0x4BD760 | ptxas_compile_split | 2,656 B | Split-aware ptxas entry: compile one PTX chunk to cubin |
| 0x4BD4E0 | ptxas_compile_whole | 640 B | Whole-program ptxas entry: compile single PTX to cubin |
| 0x426CD0 | lto_option_builder | 4,768 B | Builds libnvvm argument array, forwards split-compile flags |
| 0x43FC70 | comparator_true | 8 B | Priority queue comparator, always returns 1 |
| 0x44DC60 | pqueue_create | 192 B | Creates priority queue with comparator and initial capacity |
| 0x44DD10 | pqueue_push | 224 B | Inserts element, sifts up |
| 0x44DE20 | pqueue_pop | 288 B | Removes root element, sifts down |
| 0x4297B0 | check_ptxas_error | ~320 B | Checks elfLink return code, emits fatal diagnostic on failure |
| 0x426570 | validate_and_add | ~1,200 B | Validates compiled cubin arch/class, adds to output ELF writer |
| 0x4275C0 | mercury_post_link | ~512 B | Mercury finalizer post-link transform (sm > 89) |
| 0x431000 | arena_free | ~720 B | Returns arena-allocated memory to the memspace free list |
| 0x4BC6F0 | libnvvm_compile | -- | Compiles linked NVVM IR, produces PTX and split metadata |
Key Globals
| Address | Name | Type | Description |
|---|---|---|---|
| dword_2A5B514 | split_compile_extended | int | Thread count for extended split compile. Default 1 (single-threaded). Value 0 triggers auto-detect via sysconf |
| dword_2A5B518 | split_compile_value | int | Number of splits requested from libnvvm |
| dword_2A5F260 | split_compile_state | int | Consensus state machine: 0=none, 1=absent, 2=present, 3=mixed, 4=conflict |
Cross-References
- LTO Overview -- pipeline context showing where split compilation fits (Step 3: PTX Assembly)
- Option Forwarding -- how `-split-compile` and `-split-compile-extended` are forwarded to cicc
- libnvvm Integration -- libnvvm produces the PTX that split compilation parallelizes
- Whole vs Partial LTO -- split compilation interacts with partial mode (Path 3 dispatch)
- Merge Phase -- per-split cubins are merged via `merge_elf` after assembly