Thread Pool & Concurrency
All addresses on this page apply to ptxas v13.0.88 (CUDA 13.0); other versions will differ.
ptxas compiles multiple entry functions (kernels) in a single PTX input file. When --split-compile or --allow-expensive-optimizations is active, the compilation driver dispatches each kernel to a worker thread for parallel DAGgen+OCG+ELF+DebugInfo processing. The threading infrastructure lives in two address ranges: the TLS system at 0x4280xx--0x4286xx (front-end) and the thread pool at 0x1CB17xx--0x1CB1Axx (shared infrastructure).
| Component | Function |
|---|---|
| TLS accessor | sub_4280C0 (597 bytes, 3,928 callers, 280-byte struct) |
| TLS key init | ctor_001 at 0x4094C0 (static constructor) |
| TLS destructor | destr_function at 0x427F10 |
| Thread pool ctor | sub_1CB18B0 (184-byte pool struct) |
| Task submit | sub_1CB1A50 (24-byte task node) |
| Worker thread | start_routine at 0x1CB1780 |
| Wait-all | sub_1CB1AE0 |
| Pool destroy | sub_1CB1970 |
| CPU count | sub_1CB1890 (sysconf(_SC_NPROCESSORS_CONF)) |
| Mutex init helper | sub_428620 (recursive mutex factory) |
| Global mutex lock | sub_4286A0 (lazy-init + lock) |
| Jobserver client | sub_1CC7300 (GNU Make integration) |
Architecture
┌─────────────────────────────────────────┐
│ Compilation Driver │
│ sub_446240 │
│ │
│ if (thread_count > 0): │
│ pool = sub_1CB18B0(thread_count) │
│ for each kernel: │
│ snapshot config → 360-byte buffer │
│ copy hash maps for isolation │
│ sub_1CB1A50(pool, sub_436DF0, buf) │
│ sub_1CB1AE0(pool) // wait-all │
│ sub_1CB1970(pool) // destroy │
└────────────┬────────────────────────────┘
│ enqueue tasks
▼
┌────────────────────────────────────┐
│ Thread Pool (184 bytes) │
│ │
│ mutex ──────── guards all fields │
│ cond_work ──── wake workers │
│ cond_done ──── signal completion │
│ work_queue ─── priority heap │
│ pending ────── pending task count │
│ shutdown ───── termination flag │
└───┬──────┬──────┬──────────────────┘
│ │ │
┌─────┘ │ └─────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Worker 0 │ │ Worker 1 │ │ Worker N │
│(detached)│  │(detached)│  │(detached)│
└──────────┘ └──────────┘ └──────────┘
│ │ │
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ TLS ctx │ │ TLS ctx │ │ TLS ctx │
│ 280 bytes│ │ 280 bytes│ │ 280 bytes│
│ + pool │ │ + pool │ │ + pool │
└──────────┘ └──────────┘ └──────────┘
CLI Activation
Two options activate the thread pool:
| Option | Type | Behavior |
|---|---|---|
| --split-compile N | int | Set thread count directly. 0 = CPU count via sysconf(83). 1 = serial (pool not created). N > 1 = N worker threads. |
| --allow-expensive-optimizations | bool | Auto-enabled at -O2 and above. Enables the thread pool with an automatically determined thread count. |
Both flags ultimately result in a non-zero value at offset +668 in the compilation driver's state block. The driver checks this field to decide between sequential per-kernel iteration and thread pool dispatch.
Two internal options fine-tune pool behavior:
| Option | Type | Purpose |
|---|---|---|
| --threads-dynamic-scheduling | bool | Enable dynamic scheduling of thread pool tasks (work stealing vs static partitioning). |
| --threads-min-section-size | int | Minimum section size for thread pool work partitioning; prevents excessive task granularity. |
CPU Count Detection
// sub_1CB1890 -- returns configured CPU count
__int64 sub_1CB1890() {
return sysconf(83); // _SC_NPROCESSORS_CONF on Linux
}
When --split-compile 0 is specified, the pool constructor receives the return value of sub_1CB1890() as its nmemb argument.
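The same query can be reproduced with the symbolic constant instead of the raw value 83 (which is glibc's encoding of _SC_NPROCESSORS_CONF). A minimal sketch; the function name and the serial fallback are ours, not from the binary:

```c
#include <unistd.h>
#include <assert.h>

/* Portable equivalent of sub_1CB1890. Note _SC_NPROCESSORS_CONF counts
 * configured processors, not currently-online ones (_SC_NPROCESSORS_ONLN). */
long detect_thread_count(void) {
    long n = sysconf(_SC_NPROCESSORS_CONF);
    return n > 0 ? n : 1;   /* assumption: fall back to serial on error */
}
```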
Thread-Local Storage (TLS)
The TLS system is the foundation of ptxas's concurrency model. Every thread -- main and workers alike -- gets its own 280-byte context struct, accessed through the single most-called function in the binary: sub_4280C0 (3,928 call sites).
Initialization: ctor_001 (0x4094C0)
The TLS key is created in a static constructor that runs before main:
// ctor_001 -- .init_array entry
int ctor_001() {
if (!qword_29FE0A0) { // one-time guard
pthread_key_create(&key, destr_function);
pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
pthread_mutex_init(&mutex, &attr);
dword_29FE0F4 = sched_get_priority_max(SCHED_RR);
dword_29FE0F0 = sched_get_priority_min(SCHED_RR);
qword_29FE0A0 = &sentinel_node; // marks "initialized"
// ... initialize linked list sentinels
}
__cxa_atexit(cleanup_func, ...);
}
The SCHED_RR priority range is queried but never used for thread creation (all threads are created with default attributes). This appears to be vestigial infrastructure for priority-based scheduling that was never activated.
TLS Accessor: sub_4280C0
char *sub_4280C0() {
if (!qword_29FE0A0)
goto init_once; // fallback init (race protection)
char *result = pthread_getspecific(key);
if (result)
return result; // fast path: already allocated
char *ctx = malloc(0x118); // 280 bytes
memset(ctx, 0, 0x118);
pthread_cond_init(ctx + 128, NULL);
pthread_mutex_init(ctx + 176, NULL);
sem_init(ctx + 216, 0, 0);
// Insert into global doubly-linked list (under global mutex)
pthread_mutex_lock(&mutex);
ctx->prev = sentinel;
ctx->next = sentinel->next;
sentinel->next->prev = ctx;
sentinel->next = ctx;
pthread_mutex_unlock(&mutex);
pthread_setspecific(key, ctx);
return ctx;
}
The global doubly-linked list at offsets +256 (prev) and +264 (next) allows the system to enumerate all live TLS contexts. This is used during cleanup to destroy all thread contexts at program exit.
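The accessor pattern can be sketched with standard pthreads primitives. This is a minimal illustration of the lazy-allocation scheme only; all names are ours, the context is reduced to one field, and the sync-primitive initialization and global list insertion of the real 280-byte struct are omitted:

```c
#include <pthread.h>
#include <stdlib.h>
#include <assert.h>

/* Illustrative sketch of the sub_4280C0 pattern: process-wide key created
 * once, per-thread context allocated lazily on first access. */
typedef struct tls_ctx { int has_error; } tls_ctx;

static pthread_key_t g_key;
static pthread_once_t g_once = PTHREAD_ONCE_INIT;

static void ctx_destroy(void *p) { free(p); }       /* fires on thread exit */
static void key_init(void) { pthread_key_create(&g_key, ctx_destroy); }

tls_ctx *get_tls_ctx(void) {
    pthread_once(&g_once, key_init);                /* race-safe one-time init */
    tls_ctx *ctx = pthread_getspecific(g_key);
    if (ctx)
        return ctx;                                 /* fast path */
    ctx = calloc(1, sizeof *ctx);                   /* lazy per-thread alloc */
    pthread_setspecific(g_key, ctx);
    return ctx;
}
```

Repeated calls from the same thread return the same pointer, which is what makes the 3,928 call sites cheap after the first access.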
TLS Context Layout (280 bytes)
The full byte-level layout, verified against the constructor (sub_4280C0), destructor (destr_function), and the two diagnostic reporters (sub_42F590, sub_42FBA0):
| Offset | Size | Type | Purpose | Accessor |
|---|---|---|---|---|
| 0 | 1 | byte | has_warning_or_error flag -- set to 1 when severity > 2 | sub_42F590 (direct write) |
| 1 | 1 | byte | has_fatal_error flag -- set to 1 when severity > 4 | sub_42F590 (direct write) |
| 2 | 6 | -- | padding (zeroed by memset) | -- |
| 8 | 8 | jmp_buf* | Error recovery longjmp target (installed by sub_431ED0 before _setjmp) | sub_431ED0 (save/restore) |
| 16 | 8 | void* | Error descriptor pointer -- set to the faulting error descriptor on longjmp | sub_42F590 (write on fatal), sub_431ED0 (propagate on re-throw) |
| 24 | 8 | void* | Per-thread memory pool pointer (used by sub_424070) | sub_42BDD0 (swap), sub_4287D0 (read) |
| 32 | 8 | char* | Program name string (e.g. "ptxas") -- prepended to diagnostic messages | sub_430590 (set), sub_430570 (get) |
| 40 | 8 | char* | Diagnostic suffix string -- appended to message body when non-NULL | sub_42F590 (read, format as " %s") |
| 48 | 1 | byte | Info suppression flag -- suppresses severity-2 (info) messages | sub_42F590 (check) |
| 49 | 1 | byte | Diagnostic suppression flag -- suppresses severity-3 (warning) messages | sub_430560 (set) |
| 50 | 1 | byte | Werror promotion flag (--Werror) -- promotes warnings to errors | sub_430550 (set) |
| 51 | 1 | byte | Machine-readable annotation flag -- adds @E@/@W@/@O@/@I@ severity tags | sub_4305A0 (get) |
| 52 | 1 | byte | Multi-line continuation flag -- suppresses ". " prefix on wrapped lines | sub_4305C0 (set) |
| 53 | 75 | -- | padding (zeroed by memset) | -- |
| 128 | 48 | pthread_cond_t | Per-thread condition variable | constructor: pthread_cond_init |
| 176 | 40 | pthread_mutex_t | Per-thread mutex | constructor: pthread_mutex_init |
| 216 | 32 | sem_t | Semaphore for cross-thread signaling | constructor: sem_init(0, 0) |
| 248 | 8 | void* | Saved semaphore pointer (from pool, for destr_function to sem_post) | destr_function (read at qword[31]) |
| 256 | 8 | void* | Linked list prev pointer (global TLS chain) | constructor (qword[32]) |
| 264 | 8 | void* | Linked list next pointer (global TLS chain) | constructor (qword[33]) |
| 272 | 1 | byte | Destroyed flag (prevents double-destroy) | destr_function (set to 1) |
| 273 | 7 | -- | padding (zeroed by memset) | -- |
The fields fall into seven functional groups:
Error state (offsets 0--1). Two byte flags set by the diagnostic reporter sub_42F590. Byte 0 records whether any error-or-above diagnostic was emitted; byte 1 records fatal errors specifically. The compilation driver reads these after each kernel compilation to determine the process exit code.
Error recovery (offsets 8--16). A setjmp/longjmp mechanism for non-local error exits. sub_431ED0 saves the current jmp_buf* and error byte flags, installs a fresh jmp_buf, then enters the compiler. On a fatal diagnostic, sub_42F590 stores the error descriptor at offset +16 and calls longjmp to the target at offset +8. If no jmp_buf is installed, the handler calls sub_4275E0 (abort).
Per-thread allocator (offset 24). The most performance-critical field. The pool allocator sub_424070 reads this pointer on every allocation (accessed as sub_4280C0()[3]). When non-NULL, allocations go to the calling thread's own slab without any locking. sub_42BDD0 provides an atomic swap primitive that replaces the pool pointer and returns the old value -- used during pool migration at compilation boundaries. This is used pervasively: 3,928 call sites to sub_4280C0 are predominantly pool allocator calls that need the current thread's arena.
Diagnostic context (offsets 32, 40). The program name at +32 (e.g. "ptxas") is prepended to all diagnostic messages. The suffix at +40 is appended after the message body. Both are set per-thread to support library mode where multiple tool names coexist in the same process.
Diagnostic flags (offsets 48--52). Five single-byte flags controlling diagnostic formatting and filtering. The info suppression flag (+48) silences informational messages. The diagnostic suppression flag (+49) silences warnings entirely. The Werror flag (+50) promotes warnings to errors. The annotation flag (+51) enables machine-readable severity tags (@E@, @W@, @O@, @I@). The continuation flag (+52) enables multi-line continuation mode where wrapped lines omit the ". " prefix.
Synchronization primitives (offsets 128--248). The condvar, mutex, and semaphore are used by the thread pool for task coordination and cross-thread signaling. The saved semaphore pointer at +248 is set by the pool when assigning work to a thread; on thread exit, the destructor calls sem_post on it to notify the pool's shutdown logic.
Global linked list (offsets 256--272). A doubly-linked list threading through all live TLS contexts, protected by the global mutex at 0x29FE0xx. Used by the atexit handler to enumerate and destroy all contexts. The destroyed flag at +272 prevents double-destroy when contexts move to the free list for recycling.
TLS Destructor: destr_function (0x427F10)
When a worker thread terminates, the POSIX TLS destructor fires:
void destr_function(char *ctx) {
if (!ctx) return;
pthread_mutex_lock(&global_mutex);
if (ctx[272]) { // already destroyed?
pthread_mutex_unlock(&global_mutex);
return;
}
sem_t *sem = ctx->saved_semaphore; // offset +248
// Unlink from global TLS chain
ctx->prev->next = ctx->next;
ctx->next->prev = ctx->prev;
// Destroy sync primitives
pthread_cond_destroy(ctx + 128);
pthread_mutex_destroy(ctx + 176);
sem_destroy(ctx + 216);
// Move to free list (recycling, not freed)
ctx[272] = 1; // mark destroyed
ctx->next = free_list_sentinel;
ctx->prev = free_list_head;
free_list_head->next = ctx;
free_list_head = ctx;
pthread_mutex_unlock(&global_mutex);
if (sem)
sem_post(sem); // notify pool that worker exited
}
The destructor does not free() the 280-byte struct. Instead, it moves it to a free list rooted at a second sentinel node (unk_29FDC40 / unk_29FDD60). This is a deliberate choice: the destructor runs during pthread_exit() or thread detachment, and ptxas reuses TLS structs across multiple pool lifetimes within a single process invocation (library mode).
The sem_post at the end notifies the pool shutdown code (sub_1CB1970) that a worker has fully terminated.
Thread Pool Implementation
Pool Struct (184 bytes)
struct ThreadPool { // calloc(1, 0xB8) = 184 bytes
void *thread_handles; // +0: array of (pthread_t, 16 bytes each)
void *work_queue; // +8: priority heap (from sub_1CBEC10)
int32_t pending; // +16: count of tasks awaiting execution
// padding
pthread_mutex_t mutex; // +24: guards all mutable pool state (40 bytes)
pthread_cond_t cond_work; // +64: broadcast when new work arrives (48 bytes)
pthread_cond_t cond_done; // +112: signaled when all work completes (48 bytes)
int64_t active_count; // +160: workers currently executing tasks
int64_t max_threads; // +168: total worker count (= nmemb)
int8_t shutdown; // +176: set to 1 to terminate all workers
};
Constructor: sub_1CB18B0
ThreadPool *sub_1CB18B0(size_t nmemb) {
ThreadPool *pool = calloc(1, 0xB8); // 184 bytes, zero-init
pool->thread_handles = calloc(nmemb, 0x10); // 16 bytes per thread
pool->max_threads = nmemb;
pool->pending = 0;
pthread_mutex_init(&pool->mutex, NULL); // default (non-recursive)
pthread_cond_init(&pool->cond_work, NULL);
pthread_cond_init(&pool->cond_done, NULL);
// Create priority heap for task ordering
pool->work_queue = sub_1CBEC10(sub_1CB1770, 0);
// Spawn nmemb detached worker threads
for (int i = 0; i < nmemb; i++) {
pthread_create(&pool->thread_handles[i].tid, NULL,
start_routine, pool);
pthread_detach(pool->thread_handles[i].tid);
}
return pool;
}
Workers are detached immediately after creation. This means the pool does not use pthread_join for cleanup -- instead, it relies on the cond_done / max_threads countdown protocol in sub_1CB1970.
Work Queue: Priority Heap
The work queue at pool offset +8 is not a simple linked list -- it is a binary min-heap (priority queue) backed by a dynamically-resized array.
Heap structure (32 bytes):
| Offset | Size | Type | Field |
|---|---|---|---|
| 0 | 8 | void** | Array of element pointers |
| 8 | 8 | int64 | Current element count |
| 16 | 8 | int64 | Allocated capacity |
| 24 | 8 | fn_ptr | Comparator function |
Constructor (sub_1CBEC10): Allocates the heap struct from the pool allocator, sets the comparator to sub_1CB1770 (which always returns 1, making the heap behave as a FIFO -- every parent "beats" every child, so new elements sink to the end).
Enqueue (sub_1CBECC0): Standard heap push with sift-up. Appends element, then bubbles up through parent comparisons. Auto-grows the backing array (doubles capacity) when full via sub_424C50 (realloc equivalent).
Dequeue (sub_1CBEDD0): Standard heap pop with sift-down. Extracts root, moves last element to root, then percolates down via comparator-guided child selection.
Since the comparator sub_1CB1770 unconditionally returns 1, the heap degenerates into FIFO order. Tasks are dispatched in submission order.
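For reference, the push/pop mechanics described above look like this with an ordinary min-comparator. This is an illustrative re-implementation of the 32-byte layout (elements, count, capacity, comparator), not the decompiled code; ptxas's constant comparator sub_1CB1770 would collapse this ordering into submission order:

```c
#include <stdlib.h>
#include <assert.h>

typedef int (*cmp_fn)(void *a, void *b);
typedef struct { void **elems; long count, cap; cmp_fn cmp; } heap;

/* Sample comparator for the usage below (illustrative). */
int int_min_cmp(void *a, void *b) { return *(int *)a - *(int *)b; }

heap *heap_new(cmp_fn cmp) {
    heap *h = calloc(1, sizeof *h);
    h->cap = 8;
    h->elems = malloc(h->cap * sizeof(void *));
    h->cmp = cmp;
    return h;
}

void heap_push(heap *h, void *e) {                  /* sift-up */
    if (h->count == h->cap)                         /* auto-grow: double */
        h->elems = realloc(h->elems, (h->cap *= 2) * sizeof(void *));
    long i = h->count++;
    h->elems[i] = e;
    while (i > 0 && h->cmp(h->elems[i], h->elems[(i - 1) / 2]) < 0) {
        void *t = h->elems[i];
        h->elems[i] = h->elems[(i - 1) / 2];
        h->elems[(i - 1) / 2] = t;
        i = (i - 1) / 2;
    }
}

void *heap_pop(heap *h) {                           /* sift-down */
    if (h->count == 0) return NULL;
    void *root = h->elems[0];
    h->elems[0] = h->elems[--h->count];             /* move last to root */
    long i = 0;
    for (;;) {
        long l = 2 * i + 1, r = l + 1, m = i;
        if (l < h->count && h->cmp(h->elems[l], h->elems[m]) < 0) m = l;
        if (r < h->count && h->cmp(h->elems[r], h->elems[m]) < 0) m = r;
        if (m == i) break;
        void *t = h->elems[i]; h->elems[i] = h->elems[m]; h->elems[m] = t;
        i = m;
    }
    return root;
}
```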
Task Nodes (24 bytes)
struct TaskNode { // malloc(0x18) = 24 bytes
void (*func)(void *arg); // +0: task function pointer
void *arg; // +8: argument passed to func
void *reserved; // +16: zeroed (unused)
};
Task Submit: sub_1CB1A50
int sub_1CB1A50(ThreadPool *pool, void (*func)(void*), void *arg) {
if (!func || !pool)
return 0;
TaskNode *task = malloc(0x18); // 24 bytes
task->func = func;
task->arg = arg;
task->reserved = NULL;
pthread_mutex_lock(&pool->mutex);
sub_1CBECC0(task, pool->work_queue); // heap push
++pool->pending;
pthread_cond_broadcast(&pool->cond_work);
pthread_mutex_unlock(&pool->mutex);
return 1;
}
The broadcast wakes all sleeping workers, not just one. This is correct for the use case: multiple tasks are typically submitted in a burst (one per kernel), and all idle workers should attempt to pick up work immediately.
Worker Thread: start_routine (0x1CB1780)
void *start_routine(ThreadPool *pool) {
pthread_mutex_t *mtx = &pool->mutex;
pthread_cond_t *done = &pool->cond_done;
while (1) {
pthread_mutex_lock(mtx);
// Wait for work or shutdown
while (pool->pending == 0 && !pool->shutdown)
pthread_cond_wait(&pool->cond_work, mtx);
if (pool->shutdown)
goto exit;
// Dequeue task
TaskNode *task = (TaskNode *)sub_1CBEDD0(pool->work_queue);
--pool->pending;
++pool->active_count;
pthread_mutex_unlock(mtx);
// Execute task outside the lock
if (task) {
task->func(task->arg);
free(task);
}
// Post-execution bookkeeping
pthread_mutex_lock(mtx);
--pool->active_count;
if (!pool->shutdown && pool->active_count == 0 && pool->pending == 0)
pthread_cond_signal(done); // all work complete
pthread_mutex_unlock(mtx);
}
exit:
--pool->max_threads; // decrement live worker count
pthread_cond_signal(done); // notify shutdown waiter
pthread_mutex_unlock(mtx);
return NULL;
}
The completion signal on cond_done fires only when both active_count == 0 and pending == 0. This is the condition that sub_1CB1AE0 (wait-all) blocks on.
Wait-All: sub_1CB1AE0
Blocks until all submitted tasks complete:
void sub_1CB1AE0(ThreadPool *pool) {
if (!pool) return;
pthread_mutex_lock(&pool->mutex);
while (pool->pending > 0 ||
(pool->shutdown ? pool->max_threads > 0 : pool->active_count > 0))
pthread_cond_wait(&pool->cond_done, &pool->mutex);
pthread_mutex_unlock(&pool->mutex);
}
In the non-shutdown case, it waits for pending == 0 && active_count == 0. During shutdown, it waits for max_threads == 0 (all workers have exited).
Destroy: sub_1CB1970
Initiates graceful shutdown:
void sub_1CB1970(ThreadPool *pool) {
if (!pool) return;
pthread_mutex_lock(&pool->mutex);
// Drain any remaining queued tasks
sub_1CBEBF0(pool->work_queue); // free heap contents
pool->pending = 0;
pool->shutdown = 1;
// Wake all workers so they see the shutdown flag
pthread_cond_broadcast(&pool->cond_work);
pthread_mutex_unlock(&pool->mutex);
// Wait for all workers to exit
pthread_mutex_lock(&pool->mutex);
while (pool->pending > 0 ||
(pool->shutdown ? pool->max_threads > 0 : pool->active_count > 0))
pthread_cond_wait(&pool->cond_done, &pool->mutex);
pthread_mutex_unlock(&pool->mutex);
// Destroy synchronization primitives
pthread_mutex_destroy(&pool->mutex);
pthread_cond_destroy(&pool->cond_work);
pthread_cond_destroy(&pool->cond_done);
free(pool->thread_handles);
free(pool);
}
The shutdown sequence is: (1) drain the queue, (2) set the shutdown flag, (3) broadcast on cond_work so all sleeping workers wake up and check the flag, (4) wait on cond_done until max_threads reaches zero (each exiting worker decrements it), (5) destroy synchronization primitives and free memory.
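The whole protocol (submit under the mutex with a broadcast, wait-all on the pending/active predicate, destroy via shutdown flag plus worker countdown) can be condensed into a runnable sketch. A fixed-size ring buffer stands in for the priority heap, and every name here is illustrative rather than recovered from the binary:

```c
#include <pthread.h>
#include <stdlib.h>
#include <assert.h>

typedef struct { void (*func)(void *); void *arg; } task;

typedef struct pool {
    pthread_mutex_t mutex;
    pthread_cond_t  cond_work, cond_done;
    task  queue[64];                      /* stand-in for the heap */
    int   head, tail, pending;
    long  active_count, live_workers;
    int   shutdown;
} pool;

static void *worker(void *p_) {
    pool *p = p_;
    for (;;) {
        pthread_mutex_lock(&p->mutex);
        while (p->pending == 0 && !p->shutdown)
            pthread_cond_wait(&p->cond_work, &p->mutex);
        if (p->shutdown)
            break;                        /* exit path: still holding lock */
        task t = p->queue[p->head];
        p->head = (p->head + 1) % 64;
        p->pending--;
        p->active_count++;
        pthread_mutex_unlock(&p->mutex);
        t.func(t.arg);                    /* execute outside the lock */
        pthread_mutex_lock(&p->mutex);
        p->active_count--;
        if (p->pending == 0 && p->active_count == 0)
            pthread_cond_signal(&p->cond_done);   /* wake wait-all */
        pthread_mutex_unlock(&p->mutex);
    }
    p->live_workers--;                    /* countdown for destroy */
    pthread_cond_signal(&p->cond_done);
    pthread_mutex_unlock(&p->mutex);
    return NULL;
}

pool *pool_create(int n) {
    pool *p = calloc(1, sizeof *p);
    pthread_mutex_init(&p->mutex, NULL);
    pthread_cond_init(&p->cond_work, NULL);
    pthread_cond_init(&p->cond_done, NULL);
    p->live_workers = n;
    for (int i = 0; i < n; i++) {
        pthread_t t;
        pthread_create(&t, NULL, worker, p);
        pthread_detach(t);                /* no join: countdown instead */
    }
    return p;
}

void pool_submit(pool *p, void (*f)(void *), void *arg) {
    pthread_mutex_lock(&p->mutex);
    p->queue[p->tail] = (task){ f, arg };
    p->tail = (p->tail + 1) % 64;
    p->pending++;
    pthread_cond_broadcast(&p->cond_work);        /* wake all idle workers */
    pthread_mutex_unlock(&p->mutex);
}

void pool_wait_all(pool *p) {
    pthread_mutex_lock(&p->mutex);
    while (p->pending > 0 || p->active_count > 0)
        pthread_cond_wait(&p->cond_done, &p->mutex);
    pthread_mutex_unlock(&p->mutex);
}

void pool_destroy(pool *p) {
    pthread_mutex_lock(&p->mutex);
    p->shutdown = 1;
    pthread_cond_broadcast(&p->cond_work);
    while (p->live_workers > 0)           /* wait for worker countdown */
        pthread_cond_wait(&p->cond_done, &p->mutex);
    pthread_mutex_unlock(&p->mutex);
    pthread_mutex_destroy(&p->mutex);
    pthread_cond_destroy(&p->cond_work);
    pthread_cond_destroy(&p->cond_done);
    free(p);
}

/* Demo task for the usage below: bumps a counter under its own lock. */
static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;
static int demo_count = 0;
static void demo_task(void *arg) {
    (void)arg;
    pthread_mutex_lock(&demo_lock);
    demo_count++;
    pthread_mutex_unlock(&demo_lock);
}
```

The countdown in pool_destroy mirrors the decrement of max_threads in start_routine: because workers are detached, this is the only way the destroyer learns that every worker has left the loop.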
Multi-Kernel Parallel Compilation
The compilation driver (sub_446240) decides between serial and parallel execution based on the thread count at offset +668:
Serial Path (default)
for each kernel in compile_unit:
sub_43A400(kernel) // target configuration
sub_43CC70(kernel) // DAGgen → OCG → ELF → DebugInfo
Parallel Path (--split-compile / --allow-expensive-optimizations)
pool = sub_1CB18B0(thread_count)
for each kernel in compile_unit:
sub_43A400(kernel) // target config (still serial)
buf = allocate 360-byte work buffer from pool
snapshot 15 config vectors into buf
deep-copy hash maps for thread isolation
sub_1CB1A50(pool, sub_436DF0, buf)
sub_1CB1AE0(pool) // block until all kernels done
sub_1CB1970(pool) // destroy pool
Each task runs sub_436DF0, which performs the per-kernel backend pipeline:
- Set thread-local program name via sub_430590
- Acquire jobserver token (if --jobserver active): sub_1CC6EC0()
- Record start time
- sub_432500 -- run the full DAGgen+OCG pipeline
- Record end time, write to timing array at a1->timing[112 * cu_index]
- Update peak wall-clock counter (under lock via sub_607D70/sub_607D90)
- Release jobserver token: sub_1CC7040()
- Free the 360-byte work buffer
Thread Isolation Strategy
Each worker thread operates on an isolated copy of compilation state:
| Resource | Isolation Mechanism |
|---|---|
| Memory pool | Per-thread pool pointer at TLS offset +24. Each thread's allocations go to a separate arena, eliminating heap contention. |
| Error state | Per-thread flags at TLS offsets 0--1 (error bytes), 8 (longjmp target), 16 (error descriptor), 48--52 (diagnostic control). Each thread tracks its own errors independently. |
| Hash maps | Deep-copied from the master compilation context before task submission. Workers never share mutable lookup tables. |
| Config vectors | Snapshot of 15 configuration vectors into a 360-byte per-task buffer. Workers read their own copy. |
| Timing data | Per-kernel slots in a pre-allocated timing array (112 bytes per kernel). Each worker writes only to its own kernel's slot. |
The only shared mutable state during parallel compilation is the peak wall-clock counter at offset +224 in the compilation driver's state block, protected by a global lock (lock index 6, via sub_607D70/sub_607D90). This lock is acquired briefly at the end of each per-kernel task to compare-and-update the maximum observed wall-clock time.
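The update itself reduces to a compare-and-update maximum under a lock. A sketch of that pattern; the function name and the process-global variable are ours (the real code indexes into a lock array and a driver state block):

```c
#include <pthread.h>
#include <assert.h>

static pthread_mutex_t peak_lock = PTHREAD_MUTEX_INITIALIZER;
static long peak_wallclock_us = 0;      /* stands in for driver offset +224 */

void record_kernel_time(long elapsed_us) {
    pthread_mutex_lock(&peak_lock);     /* lock index 6 in the real binary */
    if (elapsed_us > peak_wallclock_us)
        peak_wallclock_us = elapsed_us; /* compare-and-update maximum */
    pthread_mutex_unlock(&peak_lock);
}

long peak_wallclock(void) { return peak_wallclock_us; }
```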
GNU Make Jobserver Integration
When both --jobserver and --split-compile are active, ptxas participates in GNU Make's parallel job token protocol. The compilation driver creates the jobserver client object before spawning the thread pool, and each per-kernel worker task must acquire a token before starting and release it when done. This throttles ptxas to never exceed the make -j N budget, even when --split-compile would otherwise use more threads.
Jobserver Object (296 bytes)
The jobserver state is a 296-byte heap object allocated once per compilation, stored at global qword_29FE128. The constructor (sub_1CC7AF0) is called from the compilation driver (sub_4428E0) when *(_BYTE*)(context + 993) is set (the --jobserver CLI flag).
| Offset | Size | Type | Field |
|---|---|---|---|
| 0 | 4 | int32 (atomic) | State code (0=OK; see state table below) |
| 4 | 4 | int32 | Saved errno from last failed syscall |
| 8 | 1 | byte | Implicit token available (1=unconsumed) |
| 16 | 8 | int64 | Pending waiters (threads blocked in acquire) |
| 24 | 8 | int64 | Active count (tokens currently held) |
| 32 | 8 | void* | Token buffer base (std::vector<char> data) |
| 40 | 8 | void* | Token buffer cursor (stack top) |
| 48 | 8 | void* | Token buffer capacity end |
| 56 | 40 | pthread_mutex_t | Inner mutex (guards token accounting) |
| 96 | 40 | pthread_mutex_t | Write mutex (guards write() to Make pipe) |
| 136 | 48 | pthread_cond_t | Condition variable (wakes acquire waiters and reader thread) |
| 184 | 1 | byte | Token-ready flag (set by reader thread / release handoff) |
| 185 | 1 | byte | Last byte read from Make pipe |
| 188 | 4 | int32 | Read fd (Make pipe/FIFO read end) |
| 192 | 4 | int32 | Write fd (Make pipe/FIFO write end) |
| 196 | 4 | int32 | Internal pipe read fd (shutdown wakeup) |
| 200 | 4 | int32 | Internal pipe write fd (shutdown wakeup) |
| 204 | 1 | byte | Opened-fds flag (1=ptxas opened the Make fds itself) |
| 205 | 1 | byte | Shutdown flag |
| 208 | 8 | void* | Reader thread handle (std::thread) |
| 216 | 80 | bytes | Outer mutexes (serializing full acquire/release operations) |
MAKEFLAGS Parser: sub_1CC7300
Called during object construction to detect the Make jobserver channel:
// sub_1CC7300 -- parse MAKEFLAGS, open pipe/FIFO
void sub_1CC7300(JobserverObject *obj) {
char *flags = getenv("MAKEFLAGS");
if (!flags) {
CAS(&obj->state, 5, 0); // no MAKEFLAGS
return;
}
std::string s(flags);
size_t pos = s.find("--jobserver-auth=");
if (pos == npos) {
CAS(&obj->state, 6, 0); // no --jobserver-auth=
return;
}
size_t val = pos + 17; // skip "--jobserver-auth="
if (s.substr(val, 5) == "fifo:") {
// --- FIFO mode ---
std::string path = s.substr(val + 5, next_space);
int fd = open(path.c_str(), O_RDWR | O_NONBLOCK); // 0x802
if (fd == -1) { CAS(&obj->state, 7, 0); return; }
obj->read_fd = fd; // same fd for both
obj->write_fd = fd;
obj->opened_fds = 1;
} else {
// --- Pipe mode ---
// parse "R,W" -- e.g. "3,4"
std::string r_str = s.substr(val, comma_pos - val);
std::string w_str = s.substr(comma_pos + 1, ...);
// validate: digits only
if (r_str.find_first_not_of("0123456789") != npos ||
w_str.find_first_not_of("0123456789") != npos) {
CAS(&obj->state, 7, 0); return;
}
int rd = dup(stoi(r_str)); // private copy
if (fcntl(rd, F_SETFD, FD_CLOEXEC) == -1) {
CAS(&obj->state, 7, 0); return;
}
int wd = dup(stoi(w_str));
if (fcntl(wd, F_SETFD, FD_CLOEXEC) == -1) {
close(rd);
CAS(&obj->state, 7, 0); return;
}
obj->read_fd = rd;
obj->write_fd = wd;
obj->opened_fds = 1;
}
}
| Protocol | --jobserver-auth= value | Detection | fd Setup |
|---|---|---|---|
| FIFO | fifo:/path/to/fifo | Prefix match on fifo: | open(path, O_RDWR|O_NONBLOCK) -- single fd for both read and write |
| Pipe | R,W (e.g. 3,4) | Comma-separated integers after auth= | dup() each fd + fcntl(F_SETFD, FD_CLOEXEC) -- prevents fd leak to children |
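Before any fd work, the detection logic reduces to string classification. A hedged sketch of just that step; the function, return codes, and space-terminated value handling are our assumptions, and the dup()/fcntl() setup from the table is deliberately omitted:

```c
#include <string.h>
#include <stdlib.h>
#include <ctype.h>
#include <assert.h>

/* Classify a MAKEFLAGS string per the table above.
 * Returns 0 = no jobserver, 1 = FIFO (path written to path_out),
 * 2 = pipe (fds written to *r_out/*w_out), -1 = malformed. */
int classify_jobserver(const char *makeflags, char *path_out, size_t path_cap,
                       int *r_out, int *w_out) {
    const char *auth = makeflags ? strstr(makeflags, "--jobserver-auth=") : NULL;
    if (!auth)
        return 0;
    auth += strlen("--jobserver-auth=");          /* the 17-char prefix */
    if (strncmp(auth, "fifo:", 5) == 0) {         /* FIFO mode */
        auth += 5;
        size_t n = strcspn(auth, " ");            /* value ends at a space */
        if (n + 1 > path_cap) return -1;
        memcpy(path_out, auth, n);
        path_out[n] = '\0';
        return 1;
    }
    /* Pipe mode: "R,W", digits only on both sides */
    if (!isdigit((unsigned char)auth[0])) return -1;
    char *end;
    long r = strtol(auth, &end, 10);
    if (*end != ',') return -1;
    const char *wstr = end + 1;
    if (!isdigit((unsigned char)wstr[0])) return -1;
    long w = strtol(wstr, &end, 10);
    if (!(*end == '\0' || *end == ' ')) return -1;
    *r_out = (int)r;
    *w_out = (int)w;
    return 2;
}
```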
Object Construction: sub_1CC7AF0
After sub_1CC7300 succeeds (state == 0), the constructor continues:
- Creates an internal wakeup pipe via pipe() -- fds stored at +196/+200
- Spawns the reader thread (sub_1CC6720) -- passed as a std::thread functor via off_2406838
- Pre-allocates the token buffer vector to hold thread_count bytes
If state is 5 or 6 (no MAKEFLAGS, no auth string), the caller (sub_4428E0) emits: "GNU Jobserver support requested, but no compatible jobserver found. Ignoring '--jobserver'" and proceeds without throttling.
Reader Thread: sub_1CC6720
A dedicated background thread that reads tokens from the Make pipe/FIFO and buffers them for the acquire function:
loop:
if state != 0 or shutdown → exit
lock(mutex_inner)
while pending_waiters == 0 and not shutdown:
cond_wait(cond, mutex_inner) // sleep until someone needs a token
unlock(mutex_inner)
fd_set = { read_fd, internal_pipe_read }
select(max_fd + 1, &fd_set, NULL, NULL, NULL) // block
if shutdown → exit
n = read(read_fd, &byte, 1)
if n == 1:
if pending_waiters > 0:
lock(mutex_inner + mutex_write)
push byte onto token_buffer
token_ready = 1
unlock(mutex_write)
cond_signal(cond) // wake one acquire waiter
else:
write(write_fd, &byte, 1) // no waiter → return token immediately
else if errno == EAGAIN:
continue // expected for non-blocking fd
else:
state = 11; exit // I/O error
The select() monitors two fds simultaneously: the Make pipe (for incoming tokens) and the internal wakeup pipe (for shutdown notification). The internal pipe avoids a race between checking the shutdown flag and blocking in select().
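The core of that wakeup mechanism is the classic self-pipe pattern; a minimal sketch (the function name and return convention are ours):

```c
#include <unistd.h>
#include <sys/select.h>
#include <assert.h>

/* Block until either fd is readable.
 * Returns 1 if the wakeup pipe fired, 0 if data arrived, -1 on error. */
int wait_for_byte_or_wakeup(int data_fd, int wake_fd) {
    fd_set rd;
    FD_ZERO(&rd);
    FD_SET(data_fd, &rd);
    FD_SET(wake_fd, &rd);
    int maxfd = data_fd > wake_fd ? data_fd : wake_fd;
    if (select(maxfd + 1, &rd, NULL, NULL, NULL) <= 0)
        return -1;
    return FD_ISSET(wake_fd, &rd) ? 1 : 0;    /* wakeup wins a tie */
}
```

Writing one byte to the wakeup pipe's write end makes select() return even if no token ever arrives, which is exactly how sub_1CC6C20 unblocks the reader during shutdown.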
Token Acquire: sub_1CC6EC0
Called by each per-kernel worker before compilation begins. Returns 0 on success.
int sub_1CC6EC0(JobserverObject *obj) {
if (!obj) return 4;
lock(outer_mutex_0);
if (obj->state) { unlock; return obj->state; }
lock(mutex_inner);
if (obj->implicit_token_available) {
// Fast path: consume the implicit token (no pipe I/O)
obj->implicit_token_available = 0;
obj->active_count++;
unlock_all;
return 0;
}
// Slow path: wait for reader thread to supply a token
obj->pending_waiters++;
if (obj->pending_waiters == 1)
cond_signal(cond); // wake reader thread
while (!obj->token_ready && !obj->shutdown)
cond_wait(cond, mutex_inner);
if (obj->shutdown) { unlock_all; return 3; }
obj->token_ready = 0;
obj->pending_waiters--;
obj->active_count++;
unlock_all;
return 0;
}
The implicit token is the standard GNU Make convention: the parent Make gives the first child an implicit token (no byte in the pipe). The first worker to call acquire consumes it for free; subsequent workers must wait for bytes to be read from the pipe.
Token Release: sub_1CC7040
Called by each per-kernel worker after compilation completes. Returns 0 on success.
int sub_1CC7040(JobserverObject *obj) {
if (!obj) return 4;
lock(outer_mutex_1);
if (obj->state) { unlock; return obj->state; }
lock(mutex_inner);
lock(mutex_write);
if (token_buffer not empty) {
// Path A: write a buffered byte back to Make pipe
byte = pop(token_buffer);
if (write(obj->write_fd, &byte, 1) == 1) {
obj->active_count--;
unlock_all;
return 0;
}
// write error → set state 11 or 2
}
unlock(mutex_write);
if (obj->pending_waiters > 0) {
// Path B: hand off directly to a waiting acquirer
obj->token_ready = 1;
obj->active_count--;
cond_signal(cond);
unlock_all;
return 0;
}
if (!obj->implicit_token_available && obj->active_count == 1) {
// Path C: return the implicit token
obj->implicit_token_available = 1;
obj->active_count = 0;
unlock_all;
return 0;
}
// Protocol error (double-free)
CAS(&obj->state, 12, 0);
unlock_all;
return 12;
}
Release has three paths, in priority order:
| Path | Condition | Action |
|---|---|---|
| A | Token buffer non-empty | Pop byte, write() back to Make pipe |
| B | No buffered token but waiters exist | Set token_ready, signal condvar (avoids pipe round-trip) |
| C | No buffered token, no waiters, last active | Restore implicit token flag |
Per-Kernel Worker Integration
In sub_436DF0 (the per-kernel compilation task submitted to the thread pool):
void sub_436DF0(int64_t *task) {
sub_430590("ptxas", kernel_name); // set TLS program name
if (task[5] && sub_1CC6EC0(task[5])) // acquire token if jobserver active
sub_42F590(FATAL); // acquire failed → fatal error
// ... sub_432500(): full DAGgen + OCG pipeline ...
if (!task[5] || !sub_1CC7040(task[5])) // release token
return; // normal return
sub_42F590(FATAL); // release failed → fatal error
}
task[5] is populated from qword_29FE128 during task dispatch in sub_4428E0. When --jobserver is not active, task[5] == 0 and both acquire/release calls are skipped.
Destroy: sub_1CC6C20
Called after sub_1CB1AE0 (wait-all) and sub_1CB1970 (pool destroy) complete:
- Set shutdown flag (+205) via _InterlockedCompareExchange8
- Lock inner mutex, signal condvar (wake reader thread), unlock
- Write 1 byte to internal pipe write end (+200) -- unblocks select() in reader thread
- Join reader thread
- Lock inner mutex, drain all buffered tokens by writing each byte back to write_fd
- Unlock inner mutex
- Close Make fds if opened_fds is set (for FIFO: close once since read_fd == write_fd; for pipe: close both if different)
- Close internal pipe fds (+196, +200)
- Destroy condvar, free token buffer, free 296-byte object
State Machine
All state transitions use _InterlockedCompareExchange(state, new_value, 0) -- only the first error sticks; subsequent errors are silently dropped.
| State | Meaning | Set by |
|---|---|---|
| 0 | OK (operational) | Constructor |
| 2 | Unexpected I/O (short write/read) | Release, reader thread |
| 5 | No MAKEFLAGS environment variable | sub_1CC7300 |
| 6 | No --jobserver-auth= in MAKEFLAGS | sub_1CC7300 |
| 7 | open()/dup()/fcntl() failed | sub_1CC7300 |
| 11 | I/O error (errno recorded at +4) | Reader thread, release, constructor |
| 12 | Protocol error (double-free of token) | Release |
Throttling Semantics
With make -jN and --split-compile M where M > N:
ptxas creates M worker threads in the pool
but only N-1 pipe tokens exist + 1 implicit token = N total
workers that cannot acquire a token block in cond_wait
→ at most N kernels compile simultaneously, matching Make's budget
→ as each kernel finishes and releases its token, a blocked worker wakes
Without --jobserver, all M workers run freely with no external throttling.
Pool Allocator Thread Safety
The pool allocator (sub_424070) achieves thread safety through a combination of per-thread arenas and a global mutex:
- Per-thread arena: The TLS context at offset +24 holds a pointer to the current thread's memory pool. sub_424070 reads sub_4280C0()[3] to obtain this pointer. When non-NULL, allocations come from the thread-local slab without any locking.
- Global pool mutex: The pool struct contains a pthread_mutex_t at offset +7128 (within the ~7,200-byte pool descriptor). This mutex is acquired only for operations that modify the global pool state: slab acquisition from the parent pool, large-block allocation via mmap/malloc, and pool destruction.
- Slab-level lock-free: Within a thread-local slab (56-byte descriptor), bump-pointer allocation requires no synchronization. The allocator advances a pointer and returns; only when the slab is exhausted does it acquire the global lock to request a new slab.
Recursive Mutex Pattern
All mutexes created by sub_428620 (the mutex factory used throughout ptxas) are PTHREAD_MUTEX_RECURSIVE:
bool sub_428620(pthread_mutex_t *mutex) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    return pthread_mutex_init(mutex, &attr) == 0;
}
This is necessary because ptxas functions may re-enter locking code paths through the diagnostic emitter (sub_42FBA0) or pool allocator (sub_424070), both of which are called from virtually everywhere.
Global Synchronization Objects
Global TLS Mutex (mutex at 0x29FE0xx)
Protects the global doubly-linked list of TLS contexts. Acquired during:
- TLS context allocation (sub_4280C0)
- TLS context destruction (destr_function)
- sub_4286A0 (explicit lock for cross-thread operations)
This is a recursive mutex (initialized in ctor_001).
Global Lock Array (sub_607D70 / sub_607D90)
A global array of locks accessed by index. Lock index 6 is used to protect the peak wall-clock counter during parallel compilation. The total number of lock indices and their complete purpose map is not fully recovered; index 6 is the only one observed in the threading hot path.
sub_607D70(6); // acquire lock 6
// update peak wall-clock
sub_607D90(6); // release lock 6
Function Map
| Address | Size | Callers | Identity |
|---|---|---|---|
| 0x4094C0 | 204 B | 0 | ctor_001 -- TLS key + global mutex init (.init_array) |
| 0x427F10 | 376 B | 0 | destr_function -- TLS destructor (via pthread_key_create) |
| 0x4280C0 | 597 B | 3,928 | TLS context accessor (280-byte struct, lazy alloc) |
| 0x428600 | 27 B | -- | Mutex destroy + free wrapper |
| 0x428620 | 62 B | -- | Recursive mutex init factory |
| 0x428670 | 6 B | -- | pthread_mutex_destroy PLT thunk |
| 0x428680 | 6 B | -- | pthread_mutex_lock PLT thunk |
| 0x428690 | 6 B | -- | pthread_mutex_unlock PLT thunk |
| 0x4286A0 | 163 B | -- | Global mutex lazy-init + lock |
| 0x1CB1770 | 8 B | 1 | Priority comparator (always returns 1 = FIFO) |
| 0x1CB1780 | 202 B | 0 | start_routine -- worker thread main loop |
| 0x1CB1890 | 11 B | -- | CPU count via sysconf(_SC_NPROCESSORS_CONF) |
| 0x1CB18B0 | 159 B | -- | Thread pool constructor (184-byte struct) |
| 0x1CB1970 | 168 B | -- | Thread pool graceful destroy |
| 0x1CB1A50 | 90 B | -- | Task submit (24-byte task node, heap push, broadcast) |
| 0x1CB1AE0 | 109 B | -- | Wait-all (block until pending=0, active=0) |
| 0x1CBEBF0 | -- | 1 | Heap drain (free all queued elements) |
| 0x1CBEC10 | -- | 1 | Priority heap constructor (32-byte struct) |
| 0x1CBECC0 | -- | -- | Priority heap push (sift-up) |
| 0x1CBEDD0 | -- | -- | Priority heap pop (sift-down) |
| 0x1CC6720 | ~700 B | 1 | Jobserver reader thread (select loop, pushes tokens to buffer) |
| 0x1CC6C20 | ~300 B | 1 | Jobserver destroy (drain tokens, close fds, free 296-byte object) |
| 0x1CC6EC0 | 384 B | 1 | Jobserver token acquire (consume implicit or wait for pipe token) |
| 0x1CC7040 | ~280 B | 1 | Jobserver token release (write byte back or hand off to waiter) |
| 0x1CC7300 | 2,027 B | 1 | Jobserver MAKEFLAGS parser (FIFO vs pipe detection, fd setup) |
| 0x1CC7AF0 | ~700 B | 1 | Jobserver constructor (alloc 296B, spawn reader thread) |
Cross-References
- Entry Point & CLI -- ctor_001 TLS init, sub_446240 serial vs parallel dispatch
- CLI Options -- --split-compile, --allow-expensive-optimizations, --jobserver
- Memory Pool Allocator -- per-thread arena via TLS offset +24, global pool mutex at +7128
- Pipeline Overview -- per-kernel compilation phases run as pool tasks
- Code Generation Overview -- sub_436DF0 per-kernel worker, timing lock 6