
Thread Pool & Concurrency

All addresses on this page apply to ptxas v13.0.88 (CUDA 13.0); other versions will differ.

ptxas compiles multiple entry functions (kernels) in a single PTX input file. When --split-compile or --allow-expensive-optimizations is active, the compilation driver dispatches each kernel to a worker thread for parallel DAGgen+OCG+ELF+DebugInfo processing. The threading infrastructure lives in two address ranges: the TLS system at 0x4280xx--0x4286xx (front-end) and the thread pool at 0x1CB17xx--0x1CB1Axx (shared infrastructure).

| Role | Function |
|---|---|
| TLS accessor | sub_4280C0 (597 bytes, 3,928 callers, 280-byte struct) |
| TLS key init | ctor_001 at 0x4094C0 (static constructor) |
| TLS destructor | destr_function at 0x427F10 |
| Thread pool ctor | sub_1CB18B0 (184-byte pool struct) |
| Task submit | sub_1CB1A50 (24-byte task node) |
| Worker thread | start_routine at 0x1CB1780 |
| Wait-all | sub_1CB1AE0 |
| Pool destroy | sub_1CB1970 |
| CPU count | sub_1CB1890 (sysconf(_SC_NPROCESSORS_CONF)) |
| Mutex init helper | sub_428620 (recursive mutex factory) |
| Global mutex lock | sub_4286A0 (lazy-init + lock) |
| Jobserver client | sub_1CC7300 (GNU Make integration) |

Architecture

                         ┌─────────────────────────────────────────┐
                         │          Compilation Driver             │
                         │           sub_446240                    │
                         │                                         │
                         │  if (thread_count > 0):                 │
                         │    pool = sub_1CB18B0(thread_count)     │
                         │    for each kernel:                     │
                         │      snapshot config → 360-byte buffer  │
                         │      copy hash maps for isolation       │
                         │      sub_1CB1A50(pool, sub_436DF0, buf) │
                         │    sub_1CB1AE0(pool)   // wait-all      │
                         │    sub_1CB1970(pool)   // destroy       │
                         └────────────┬────────────────────────────┘
                                      │ enqueue tasks
                                      ▼
                     ┌─────────────────────────────────────┐
                     │       Thread Pool (184 bytes)       │
                     │                                     │
                     │  mutex ──────── guards all fields   │
                     │  cond_work ──── wake workers        │
                     │  cond_done ──── signal completion   │
                     │  work_queue ─── priority heap       │
                     │  pending ────── pending task count  │
                     │  shutdown ───── termination flag    │
                     └───┬──────┬──────┬───────────────────┘
                        │      │      │
                  ┌─────┘      │      └─────┐
                  ▼            ▼            ▼
            ┌──────────┐ ┌──────────┐ ┌──────────┐
            │ Worker 0 │ │ Worker 1 │ │ Worker N │
             │(detached)│ │(detached)│ │(detached)│
            └──────────┘ └──────────┘ └──────────┘
                  │            │            │
                  ▼            ▼            ▼
            ┌──────────┐ ┌──────────┐ ┌──────────┐
            │ TLS ctx  │ │ TLS ctx  │ │ TLS ctx  │
            │ 280 bytes│ │ 280 bytes│ │ 280 bytes│
            │ + pool   │ │ + pool   │ │ + pool   │
            └──────────┘ └──────────┘ └──────────┘

CLI Activation

Two options activate the thread pool:

| Option | Type | Behavior |
|---|---|---|
| --split-compile N | int | Sets the thread count directly. 0 = CPU count via sysconf(83). 1 = serial (pool not created). N > 1 = N worker threads. |
| --allow-expensive-optimizations | bool | Auto-enabled at -O2 and above. Enables the thread pool with an automatically determined thread count. |

Both flags ultimately result in a non-zero value at offset +668 in the compilation driver's state block. The driver checks this field to decide between sequential per-kernel iteration and thread pool dispatch.

Two internal options fine-tune pool behavior:

| Option | Type | Purpose |
|---|---|---|
| --threads-dynamic-scheduling | bool | Enable dynamic scheduling of thread pool tasks (work stealing vs. static partitioning). |
| --threads-min-section-size | int | Minimum section size for thread pool work partitioning; prevents excessive task granularity. |

CPU Count Detection

// sub_1CB1890 -- returns online CPU count
__int64 sub_1CB1890() {
    return sysconf(83);   // _SC_NPROCESSORS_CONF on Linux
}

When --split-compile 0 is specified, the pool constructor receives the return value of sub_1CB1890() as its nmemb argument.

Thread-Local Storage (TLS)

The TLS system is the foundation of ptxas's concurrency model. Every thread -- main and workers alike -- gets its own 280-byte context struct, accessed through the single most-called function in the binary: sub_4280C0 (3,928 call sites).

Initialization: ctor_001 (0x4094C0)

The TLS key is created in a static constructor that runs before main:

// ctor_001 -- .init_array entry
int ctor_001() {
    if (!qword_29FE0A0) {            // one-time guard
        pthread_key_create(&key, destr_function);
        pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
        pthread_mutex_init(&mutex, &attr);
        dword_29FE0F4 = sched_get_priority_max(SCHED_RR);
        dword_29FE0F0 = sched_get_priority_min(SCHED_RR);
        qword_29FE0A0 = &sentinel_node;  // marks "initialized"
        // ... initialize linked list sentinels
    }
    __cxa_atexit(cleanup_func, ...);
}

The SCHED_RR priority range is queried but never used for thread creation (all threads are created with default attributes). This appears to be vestigial infrastructure for priority-based scheduling that was never activated.

TLS Accessor: sub_4280C0

char *sub_4280C0() {
    if (!qword_29FE0A0)
        goto init_once;              // fallback init (race protection)
    
    char *result = pthread_getspecific(key);
    if (result)
        return result;               // fast path: already allocated
    
    char *ctx = malloc(0x118);       // 280 bytes
    memset(ctx, 0, 0x118);
    pthread_cond_init(ctx + 128, NULL);
    pthread_mutex_init(ctx + 176, NULL);
    sem_init(ctx + 216, 0, 0);
    
    // Insert into global doubly-linked list (under global mutex)
    pthread_mutex_lock(&mutex);
    ctx->prev = sentinel;
    ctx->next = sentinel->next;
    sentinel->next->prev = ctx;
    sentinel->next = ctx;
    pthread_mutex_unlock(&mutex);
    
    pthread_setspecific(key, ctx);
    return ctx;
}

The global doubly-linked list at offsets +256 (prev) and +264 (next) allows the system to enumerate all live TLS contexts. This is used during cleanup to destroy all thread contexts at program exit.

TLS Context Layout (280 bytes)

The full byte-level layout, verified against the constructor (sub_4280C0), destructor (destr_function), and the two diagnostic reporters (sub_42F590, sub_42FBA0):

| Offset | Size | Type | Purpose | Accessor |
|---|---|---|---|---|
| 0 | 1 | byte | has_warning_or_error flag -- set to 1 when severity > 2 | sub_42F590 (direct write) |
| 1 | 1 | byte | has_fatal_error flag -- set to 1 when severity > 4 | sub_42F590 (direct write) |
| 2 | 6 | -- | padding (zeroed by memset) | -- |
| 8 | 8 | jmp_buf* | Error recovery longjmp target (installed by sub_431ED0 before _setjmp) | sub_431ED0 (save/restore) |
| 16 | 8 | void* | Error descriptor pointer -- set to the faulting error descriptor on longjmp | sub_42F590 (write on fatal), sub_431ED0 (propagate on re-throw) |
| 24 | 8 | void* | Per-thread memory pool pointer (used by sub_424070) | sub_42BDD0 (swap), sub_4287D0 (read) |
| 32 | 8 | char* | Program name string (e.g. "ptxas") -- prepended to diagnostic messages | sub_430590 (set), sub_430570 (get) |
| 40 | 8 | char* | Diagnostic suffix string -- appended to message body when non-NULL | sub_42F590 (read, format as " %s") |
| 48 | 1 | byte | Info suppression flag -- suppresses severity-2 (info) messages | sub_42F590 (check) |
| 49 | 1 | byte | Diagnostic suppression flag -- suppresses severity-3 (warning) messages | sub_430560 (set) |
| 50 | 1 | byte | Werror promotion flag (--Werror) -- promotes warnings to errors | sub_430550 (set) |
| 51 | 1 | byte | Machine-readable annotation flag -- adds @E@/@W@/@O@/@I@ severity tags | sub_4305A0 (get) |
| 52 | 1 | byte | Multi-line continuation flag -- suppresses ". " prefix on wrapped lines | sub_4305C0 (set) |
| 53 | 75 | -- | padding (zeroed by memset) | -- |
| 128 | 48 | pthread_cond_t | Per-thread condition variable | constructor: pthread_cond_init |
| 176 | 40 | pthread_mutex_t | Per-thread mutex | constructor: pthread_mutex_init |
| 216 | 32 | sem_t | Semaphore for cross-thread signaling | constructor: sem_init(0, 0) |
| 248 | 8 | void* | Saved semaphore pointer (from pool, for destr_function to sem_post) | destr_function (read at qword[31]) |
| 256 | 8 | void* | Linked list prev pointer (global TLS chain) | constructor (qword[32]) |
| 264 | 8 | void* | Linked list next pointer (global TLS chain) | constructor (qword[33]) |
| 272 | 1 | byte | Destroyed flag (prevents double-destroy) | destr_function (set to 1) |
| 273 | 7 | -- | padding (zeroed by memset) | -- |

The fields fall into seven functional groups:

Error state (offsets 0--1). Two byte flags set by the diagnostic reporter sub_42F590. Byte 0 records whether any error-or-above diagnostic was emitted; byte 1 records fatal errors specifically. The compilation driver reads these after each kernel compilation to determine the process exit code.

Error recovery (offsets 8--16). A setjmp/longjmp mechanism for non-local error exits. sub_431ED0 saves the current jmp_buf* and error byte flags, installs a fresh jmp_buf, then enters the compiler. On a fatal diagnostic, sub_42F590 stores the error descriptor at offset +16 and calls longjmp to the target at offset +8. If no jmp_buf is installed, the handler calls sub_4275E0 (abort).

Per-thread allocator (offset 24). The most performance-critical field. The pool allocator sub_424070 reads this pointer on every allocation (accessed as sub_4280C0()[3]). When non-NULL, allocations go to the calling thread's own slab without any locking. sub_42BDD0 provides an atomic swap primitive that replaces the pool pointer and returns the old value -- used during pool migration at compilation boundaries. The field is used pervasively: most of the 3,928 call sites to sub_4280C0 are pool allocator calls that need the current thread's arena.

Diagnostic context (offsets 32, 40). The program name at +32 (e.g. "ptxas") is prepended to all diagnostic messages. The suffix at +40 is appended after the message body. Both are set per-thread to support library mode where multiple tool names coexist in the same process.

Diagnostic flags (offsets 48--52). Five single-byte flags controlling diagnostic formatting and filtering. The info suppression flag (+48) silences informational messages. The diagnostic suppression flag (+49) silences warnings entirely. The Werror flag (+50) promotes warnings to errors. The annotation flag (+51) enables machine-readable severity tags (@E@, @W@, @O@, @I@). The continuation flag (+52) enables multi-line continuation mode where wrapped lines omit the ". " prefix.

Synchronization primitives (offsets 128--248). The condvar, mutex, and semaphore are used by the thread pool for task coordination and cross-thread signaling. The saved semaphore pointer at +248 is set by the pool when assigning work to a thread; on thread exit, the destructor calls sem_post on it to notify the pool's shutdown logic.

Global linked list (offsets 256--272). A doubly-linked list threading through all live TLS contexts, protected by the global mutex at 0x29FE0xx. Used by the atexit handler to enumerate and destroy all contexts. The destroyed flag at +272 prevents double-destroy when contexts move to the free list for recycling.

TLS Destructor: destr_function (0x427F10)

When a worker thread terminates, the POSIX TLS destructor fires:

void destr_function(char *ctx) {
    if (!ctx) return;
    
    pthread_mutex_lock(&global_mutex);
    
    if (ctx[272]) {                    // already destroyed?
        pthread_mutex_unlock(&global_mutex);
        return;
    }
    
    sem_t *sem = ctx->saved_semaphore; // offset +248
    
    // Unlink from global TLS chain
    ctx->prev->next = ctx->next;
    ctx->next->prev = ctx->prev;
    
    // Destroy sync primitives
    pthread_cond_destroy(ctx + 128);
    pthread_mutex_destroy(ctx + 176);
    sem_destroy(ctx + 216);
    
    // Move to free list (recycling, not freed)
    ctx[272] = 1;                      // mark destroyed
    ctx->next = free_list_sentinel;
    ctx->prev = free_list_head;
    free_list_head->next = ctx;
    free_list_head = ctx;
    
    pthread_mutex_unlock(&global_mutex);
    
    if (sem)
        sem_post(sem);                 // notify pool that worker exited
}

The destructor does not free() the 280-byte struct. Instead, it moves it to a free list rooted at a second sentinel node (unk_29FDC40 / unk_29FDD60). This is a deliberate choice: the destructor runs during pthread_exit() or thread detachment, and ptxas reuses TLS structs across multiple pool lifetimes within a single process invocation (library mode).

The sem_post at the end notifies the pool shutdown code (sub_1CB1970) that a worker has fully terminated.

Thread Pool Implementation

Pool Struct (184 bytes)

struct ThreadPool {           // calloc(1, 0xB8) = 184 bytes
    void       *thread_handles;  // +0:   array of (pthread_t, 16 bytes each)
    void       *work_queue;      // +8:   priority heap (from sub_1CBEC10)
    int32_t     pending;         // +16:  count of tasks awaiting execution
    // padding
    pthread_mutex_t mutex;       // +24:  guards all mutable pool state (40 bytes)
    pthread_cond_t  cond_work;   // +64:  broadcast when new work arrives (48 bytes)
    pthread_cond_t  cond_done;   // +112: signaled when all work completes (48 bytes)
    int64_t     active_count;    // +160: workers currently executing tasks
    int64_t     max_threads;     // +168: total worker count (= nmemb)
    int8_t      shutdown;        // +176: set to 1 to terminate all workers
};

Constructor: sub_1CB18B0

ThreadPool *sub_1CB18B0(size_t nmemb) {
    ThreadPool *pool = calloc(1, 0xB8);            // 184 bytes, zero-init
    pool->thread_handles = calloc(nmemb, 0x10);    // 16 bytes per thread
    pool->max_threads = nmemb;
    pool->pending = 0;
    
    pthread_mutex_init(&pool->mutex, NULL);         // default (non-recursive)
    pthread_cond_init(&pool->cond_work, NULL);
    pthread_cond_init(&pool->cond_done, NULL);
    
    // Create priority heap for task ordering
    pool->work_queue = sub_1CBEC10(sub_1CB1770, 0);
    
    // Spawn nmemb detached worker threads
    for (int i = 0; i < nmemb; i++) {
        pthread_create(&pool->thread_handles[i].tid, NULL,
                       start_routine, pool);
        pthread_detach(pool->thread_handles[i].tid);
    }
    return pool;
}

Workers are detached immediately after creation. This means the pool does not use pthread_join for cleanup -- instead, it relies on the cond_done / max_threads countdown protocol in sub_1CB1970.

Work Queue: Priority Heap

The work queue at pool offset +8 is not a simple linked list -- it is a binary min-heap (priority queue) backed by a dynamically-resized array.

Heap structure (32 bytes):

| Offset | Size | Type | Field |
|---|---|---|---|
| 0 | 8 | void** | Array of element pointers |
| 8 | 8 | int64 | Current element count |
| 16 | 8 | int64 | Allocated capacity |
| 24 | 8 | fn_ptr | Comparator function |

Constructor (sub_1CBEC10): Allocates the heap struct from the pool allocator, sets the comparator to sub_1CB1770 (which always returns 1, making the heap behave as a FIFO -- every parent "beats" every child, so new elements sink to the end).

Enqueue (sub_1CBECC0): Standard heap push with sift-up. Appends element, then bubbles up through parent comparisons. Auto-grows the backing array (doubles capacity) when full via sub_424C50 (realloc equivalent).

Dequeue (sub_1CBEDD0): Standard heap pop with sift-down. Extracts root, moves last element to root, then percolates down via comparator-guided child selection.

Since the comparator sub_1CB1770 unconditionally returns 1, the heap degenerates into FIFO order. Tasks are dispatched in submission order.

Task Nodes (24 bytes)

struct TaskNode {              // malloc(0x18) = 24 bytes
    void (*func)(void *arg);   // +0:  task function pointer
    void  *arg;                // +8:  argument passed to func
    void  *reserved;           // +16: zeroed (unused)
};

Task Submit: sub_1CB1A50

int sub_1CB1A50(ThreadPool *pool, void (*func)(void*), void *arg) {
    if (!func || !pool)
        return 0;
    
    TaskNode *task = malloc(0x18);     // 24 bytes
    task->func     = func;
    task->arg      = arg;
    task->reserved = NULL;
    
    pthread_mutex_lock(&pool->mutex);
    sub_1CBECC0(task, pool->work_queue);  // heap push
    ++pool->pending;
    pthread_cond_broadcast(&pool->cond_work);
    pthread_mutex_unlock(&pool->mutex);
    return 1;
}

The broadcast wakes all sleeping workers, not just one. This is correct for the use case: multiple tasks are typically submitted in a burst (one per kernel), and all idle workers should attempt to pick up work immediately.

Worker Thread: start_routine (0x1CB1780)

void *start_routine(ThreadPool *pool) {
    pthread_mutex_t *mtx = &pool->mutex;
    pthread_cond_t  *done = &pool->cond_done;
    
    while (1) {
        pthread_mutex_lock(mtx);
        
        // Wait for work or shutdown
        while (pool->pending == 0 && !pool->shutdown)
            pthread_cond_wait(&pool->cond_work, mtx);
        
        if (pool->shutdown)
            goto exit;
        
        // Dequeue task
        TaskNode *task = (TaskNode *)sub_1CBEDD0(pool->work_queue);
        --pool->pending;
        ++pool->active_count;
        pthread_mutex_unlock(mtx);
        
        // Execute task outside the lock
        if (task) {
            task->func(task->arg);
            free(task);
        }
        
        // Post-execution bookkeeping
        pthread_mutex_lock(mtx);
        --pool->active_count;
        if (!pool->shutdown && pool->active_count == 0 && pool->pending == 0)
            pthread_cond_signal(done);   // all work complete
        pthread_mutex_unlock(mtx);
    }
    
exit:
    --pool->max_threads;               // decrement live worker count
    pthread_cond_signal(done);          // notify shutdown waiter
    pthread_mutex_unlock(mtx);
    return NULL;
}

The completion signal on cond_done fires only when both active_count == 0 and pending == 0. This is the condition that sub_1CB1AE0 (wait-all) blocks on.

Wait-All: sub_1CB1AE0

Blocks until all submitted tasks complete:

void sub_1CB1AE0(ThreadPool *pool) {
    if (!pool) return;
    
    pthread_mutex_lock(&pool->mutex);
    while (pool->pending > 0 ||
           (pool->shutdown ? pool->max_threads > 0 : pool->active_count > 0))
        pthread_cond_wait(&pool->cond_done, &pool->mutex);
    pthread_mutex_unlock(&pool->mutex);
}

In the non-shutdown case, it waits for pending == 0 && active_count == 0. During shutdown, it waits for max_threads == 0 (all workers have exited).

Destroy: sub_1CB1970

Initiates graceful shutdown:

void sub_1CB1970(ThreadPool *pool) {
    if (!pool) return;
    
    pthread_mutex_lock(&pool->mutex);
    
    // Drain any remaining queued tasks
    sub_1CBEBF0(pool->work_queue);     // free heap contents
    pool->pending = 0;
    pool->shutdown = 1;
    
    // Wake all workers so they see the shutdown flag
    pthread_cond_broadcast(&pool->cond_work);
    pthread_mutex_unlock(&pool->mutex);
    
    // Wait for all workers to exit
    pthread_mutex_lock(&pool->mutex);
    while (pool->pending > 0 ||
           (pool->shutdown ? pool->max_threads > 0 : pool->active_count > 0))
        pthread_cond_wait(&pool->cond_done, &pool->mutex);
    pthread_mutex_unlock(&pool->mutex);
    
    // Destroy synchronization primitives
    pthread_mutex_destroy(&pool->mutex);
    pthread_cond_destroy(&pool->cond_work);
    pthread_cond_destroy(&pool->cond_done);
    free(pool->thread_handles);
    free(pool);
}

The shutdown sequence is: (1) drain the queue, (2) set the shutdown flag, (3) broadcast on cond_work so all sleeping workers wake up and check the flag, (4) wait on cond_done until max_threads reaches zero (each exiting worker decrements it), (5) destroy synchronization primitives and free memory.

Multi-Kernel Parallel Compilation

The compilation driver (sub_446240) decides between serial and parallel execution based on the thread count at offset +668:

Serial Path (default)

for each kernel in compile_unit:
    sub_43A400(kernel)          // target configuration
    sub_43CC70(kernel)          // DAGgen → OCG → ELF → DebugInfo

Parallel Path (--split-compile / --allow-expensive-optimizations)

pool = sub_1CB18B0(thread_count)

for each kernel in compile_unit:
    sub_43A400(kernel)          // target config (still serial)
    buf = allocate 360-byte work buffer from pool
    snapshot 15 config vectors into buf
    deep-copy hash maps for thread isolation
    sub_1CB1A50(pool, sub_436DF0, buf)

sub_1CB1AE0(pool)               // block until all kernels done
sub_1CB1970(pool)               // destroy pool

Each task runs sub_436DF0, which performs the per-kernel backend pipeline:

  1. Set thread-local program name via sub_430590
  2. Acquire jobserver token (if --jobserver active): sub_1CC6EC0()
  3. Record start time
  4. sub_432500 -- run the full DAGgen+OCG pipeline
  5. Record end time, write to timing array at a1->timing[112 * cu_index]
  6. Update peak wall-clock counter (under lock via sub_607D70 / sub_607D90)
  7. Release jobserver token: sub_1CC7040()
  8. Free the 360-byte work buffer

Thread Isolation Strategy

Each worker thread operates on an isolated copy of compilation state:

| Resource | Isolation Mechanism |
|---|---|
| Memory pool | Per-thread pool pointer at TLS offset +24. Each thread's allocations go to a separate arena, eliminating heap contention. |
| Error state | Per-thread flags at TLS offsets 0--1 (error bytes), 8 (longjmp target), 16 (error descriptor), 48--52 (diagnostic control). Each thread tracks its own errors independently. |
| Hash maps | Deep-copied from the master compilation context before task submission. Workers never share mutable lookup tables. |
| Config vectors | Snapshot of 15 configuration vectors into a 360-byte per-task buffer. Workers read their own copy. |
| Timing data | Per-kernel slots in a pre-allocated timing array (112 bytes per kernel). Each worker writes only to its own kernel's slot. |

The only shared mutable state during parallel compilation is the peak wall-clock counter at offset +224 in the compilation driver's state block, protected by a global lock (lock index 6, via sub_607D70/sub_607D90). This lock is acquired briefly at the end of each per-kernel task to compare-and-update the maximum observed wall-clock time.

GNU Make Jobserver Integration

When both --jobserver and --split-compile are active, ptxas participates in GNU Make's parallel job token protocol. The compilation driver creates the jobserver client object before spawning the thread pool, and each per-kernel worker task must acquire a token before starting and release it when done. This throttles ptxas to never exceed the make -j N budget, even when --split-compile would otherwise use more threads.

Jobserver Object (296 bytes)

The jobserver state is a 296-byte heap object allocated once per compilation, stored at global qword_29FE128. The constructor (sub_1CC7AF0) is called from the compilation driver (sub_4428E0) when *(_BYTE*)(context + 993) is set (the --jobserver CLI flag).

| Offset | Size | Type | Field |
|---|---|---|---|
| 0 | 4 | int32 (atomic) | State code (0=OK; see state table below) |
| 4 | 4 | int32 | Saved errno from last failed syscall |
| 8 | 1 | byte | Implicit token available (1=unconsumed) |
| 16 | 8 | int64 | Pending waiters (threads blocked in acquire) |
| 24 | 8 | int64 | Active count (tokens currently held) |
| 32 | 8 | void* | Token buffer base (std::vector<char> data) |
| 40 | 8 | void* | Token buffer cursor (stack top) |
| 48 | 8 | void* | Token buffer capacity end |
| 56 | 40 | pthread_mutex_t | Inner mutex (guards token accounting) |
| 96 | 40 | pthread_mutex_t | Write mutex (guards write() to Make pipe) |
| 136 | 48 | pthread_cond_t | Condition variable (wakes acquire waiters and reader thread) |
| 184 | 1 | byte | Token-ready flag (set by reader thread / release handoff) |
| 185 | 1 | byte | Last byte read from Make pipe |
| 188 | 4 | int32 | Read fd (Make pipe/FIFO read end) |
| 192 | 4 | int32 | Write fd (Make pipe/FIFO write end) |
| 196 | 4 | int32 | Internal pipe read fd (shutdown wakeup) |
| 200 | 4 | int32 | Internal pipe write fd (shutdown wakeup) |
| 204 | 1 | byte | Opened-fds flag (1=ptxas opened the Make fds itself) |
| 205 | 1 | byte | Shutdown flag |
| 208 | 8 | void* | Reader thread handle (std::thread) |
| 216 | 80 | bytes | Outer mutexes (serializing full acquire/release operations) |

MAKEFLAGS Parser: sub_1CC7300

Called during object construction to detect the Make jobserver channel:

// sub_1CC7300 -- parse MAKEFLAGS, open pipe/FIFO
void sub_1CC7300(JobserverObject *obj) {
    char *flags = getenv("MAKEFLAGS");
    if (!flags) {
        CAS(&obj->state, 5, 0);       // no MAKEFLAGS
        return;
    }
    std::string s(flags);
    size_t pos = s.find("--jobserver-auth=");
    if (pos == npos) {
        CAS(&obj->state, 6, 0);       // no --jobserver-auth=
        return;
    }
    size_t val = pos + 17;             // skip "--jobserver-auth="

    if (s.substr(val, 5) == "fifo:") {
        // --- FIFO mode ---
        std::string path = s.substr(val + 5, next_space);
        int fd = open(path.c_str(), O_RDWR | O_NONBLOCK);  // 0x802
        if (fd == -1) { CAS(&obj->state, 7, 0); return; }
        obj->read_fd  = fd;            // same fd for both
        obj->write_fd = fd;
        obj->opened_fds = 1;
    } else {
        // --- Pipe mode ---
        // parse "R,W" -- e.g. "3,4"
        std::string r_str = s.substr(val, comma_pos - val);
        std::string w_str = s.substr(comma_pos + 1, ...);
        // validate: digits only
        if (r_str.find_first_not_of("0123456789") != npos ||
            w_str.find_first_not_of("0123456789") != npos) {
            CAS(&obj->state, 7, 0); return;
        }
        int rd = dup(stoi(r_str));     // private copy
        if (fcntl(rd, F_SETFD, FD_CLOEXEC) == -1) {
            CAS(&obj->state, 7, 0); return;
        }
        int wd = dup(stoi(w_str));
        if (fcntl(wd, F_SETFD, FD_CLOEXEC) == -1) {
            close(rd);
            CAS(&obj->state, 7, 0); return;
        }
        obj->read_fd  = rd;
        obj->write_fd = wd;
        obj->opened_fds = 1;
    }
}

| Protocol | --jobserver-auth= value | Detection | fd Setup |
|---|---|---|---|
| FIFO | fifo:/path/to/fifo | Prefix match on fifo: | open(path, O_RDWR\|O_NONBLOCK) -- single fd for both read and write |
| Pipe | R,W (e.g. 3,4) | Comma-separated integers after auth= | dup() each fd + fcntl(F_SETFD, FD_CLOEXEC) -- prevents fd leak to children |

Object Construction: sub_1CC7AF0

After sub_1CC7300 succeeds (state == 0), the constructor continues:

  1. Creates an internal wakeup pipe via pipe() -- fds stored at +196/+200
  2. Spawns the reader thread (sub_1CC6720) -- passed as a std::thread functor via off_2406838
  3. Pre-allocates the token buffer vector to hold thread_count bytes

If state is 5 or 6 (no MAKEFLAGS, no auth string), the caller (sub_4428E0) emits: "GNU Jobserver support requested, but no compatible jobserver found. Ignoring '--jobserver'" and proceeds without throttling.

Reader Thread: sub_1CC6720

A dedicated background thread that reads tokens from the Make pipe/FIFO and buffers them for the acquire function:

loop:
    if state != 0 or shutdown → exit

    lock(mutex_inner)
    while pending_waiters == 0 and not shutdown:
        cond_wait(cond, mutex_inner)     // sleep until someone needs a token
    unlock(mutex_inner)

    fd_set = { read_fd, internal_pipe_read }
    select(max_fd + 1, &fd_set, NULL, NULL, NULL)    // block

    if shutdown → exit

    n = read(read_fd, &byte, 1)
    if n == 1:
        if pending_waiters > 0:
            lock(mutex_inner + mutex_write)
            push byte onto token_buffer
            token_ready = 1
            unlock(mutex_write)
            cond_signal(cond)            // wake one acquire waiter
        else:
            write(write_fd, &byte, 1)    // no waiter → return token immediately
    else if errno == EAGAIN:
        continue                         // expected for non-blocking fd
    else:
        state = 11; exit                 // I/O error

The select() monitors two fds simultaneously: the Make pipe (for incoming tokens) and the internal wakeup pipe (for shutdown notification). The internal pipe avoids a race between checking the shutdown flag and blocking in select().

Token Acquire: sub_1CC6EC0

Called by each per-kernel worker before compilation begins. Returns 0 on success.

int sub_1CC6EC0(JobserverObject *obj) {
    if (!obj) return 4;
    lock(outer_mutex_0);
    if (obj->state) { unlock; return obj->state; }

    lock(mutex_inner);
    if (obj->implicit_token_available) {
        // Fast path: consume the implicit token (no pipe I/O)
        obj->implicit_token_available = 0;
        obj->active_count++;
        unlock_all;
        return 0;
    }
    // Slow path: wait for reader thread to supply a token
    obj->pending_waiters++;
    if (obj->pending_waiters == 1)
        cond_signal(cond);               // wake reader thread
    while (!obj->token_ready && !obj->shutdown)
        cond_wait(cond, mutex_inner);
    if (obj->shutdown) { unlock_all; return 3; }
    obj->token_ready = 0;
    obj->pending_waiters--;
    obj->active_count++;
    unlock_all;
    return 0;
}

The implicit token is the standard GNU Make convention: the parent Make gives the first child an implicit token (no byte in the pipe). The first worker to call acquire consumes it for free; subsequent workers must wait for bytes to be read from the pipe.

Token Release: sub_1CC7040

Called by each per-kernel worker after compilation completes. Returns 0 on success.

int sub_1CC7040(JobserverObject *obj) {
    if (!obj) return 4;
    lock(outer_mutex_1);
    if (obj->state) { unlock; return obj->state; }

    lock(mutex_inner);
    lock(mutex_write);
    if (token_buffer not empty) {
        // Path A: write a buffered byte back to Make pipe
        byte = pop(token_buffer);
        if (write(obj->write_fd, &byte, 1) == 1) {
            obj->active_count--;
            unlock_all;
            return 0;
        }
        // write error → set state 11 or 2
    }
    unlock(mutex_write);

    if (obj->pending_waiters > 0) {
        // Path B: hand off directly to a waiting acquirer
        obj->token_ready = 1;
        obj->active_count--;
        cond_signal(cond);
        unlock_all;
        return 0;
    }
    if (!obj->implicit_token_available && obj->active_count == 1) {
        // Path C: return the implicit token
        obj->implicit_token_available = 1;
        obj->active_count = 0;
        unlock_all;
        return 0;
    }
    // Protocol error (double-free)
    CAS(&obj->state, 12, 0);
    unlock_all;
    return 12;
}

Release has three paths, in priority order:

| Path | Condition | Action |
|---|---|---|
| A | Token buffer non-empty | Pop byte, write() back to Make pipe |
| B | No buffered token but waiters exist | Set token_ready, signal condvar (avoids pipe round-trip) |
| C | No buffered token, no waiters, last active | Restore implicit token flag |

Per-Kernel Worker Integration

In sub_436DF0 (the per-kernel compilation task submitted to the thread pool):

void sub_436DF0(int64_t *task) {
    sub_430590("ptxas", kernel_name);     // set TLS program name
    if (task[5] && sub_1CC6EC0(task[5]))  // acquire token if jobserver active
        sub_42F590(FATAL);                // acquire failed → fatal error
    // ... sub_432500(): full DAGgen + OCG pipeline ...
    if (!task[5] || !sub_1CC7040(task[5]))  // release token
        return;                             // normal return
    sub_42F590(FATAL);                    // release failed → fatal error
}

task[5] is populated from qword_29FE128 during task dispatch in sub_4428E0. When --jobserver is not active, task[5] == 0 and both acquire/release calls are skipped.

Destroy: sub_1CC6C20

Called after sub_1CB1AE0 (wait-all) and sub_1CB1970 (pool destroy) complete:

  1. Set shutdown flag (+205) via _InterlockedCompareExchange8
  2. Lock inner mutex, signal condvar (wake reader thread), unlock
  3. Write 1 byte to internal pipe write end (+200) -- unblocks select() in reader thread
  4. Join reader thread
  5. Lock inner mutex, drain all buffered tokens by writing each byte back to write_fd
  6. Unlock inner mutex
  7. Close Make fds if opened_fds is set (for FIFO: close once since read_fd == write_fd; for pipe: close both if different)
  8. Close internal pipe fds (+196, +200)
  9. Destroy condvar, free token buffer, free 296-byte object

State Machine

All state transitions use _InterlockedCompareExchange(state, new_value, 0) -- only the first error sticks; subsequent errors are silently dropped.

| State | Meaning | Set by |
|-------|---------|--------|
| 0 | OK (operational) | Constructor |
| 2 | Unexpected I/O (short write/read) | Release, reader thread |
| 5 | No MAKEFLAGS environment variable | sub_1CC7300 |
| 6 | No --jobserver-auth= in MAKEFLAGS | sub_1CC7300 |
| 7 | open()/dup()/fcntl() failed | sub_1CC7300 |
| 11 | I/O error (errno recorded at +4) | Reader thread, release, constructor |
| 12 | Protocol error (double-free of token) | Release |

Throttling Semantics

With make -jN and --split-compile M where M > N:

  • ptxas creates M worker threads in the pool
  • but only N-1 pipe tokens exist + 1 implicit token = N total
  • workers that cannot acquire a token block in cond_wait
  • → at most N kernels compile simultaneously, matching Make's budget
  • → as each kernel finishes and releases its token, a blocked worker wakes

Without --jobserver, all M workers run freely with no external throttling.

Pool Allocator Thread Safety

The pool allocator (sub_424070) achieves thread safety through a combination of per-thread arenas and a global mutex:

  1. Per-thread arena: The TLS context at offset +24 holds a pointer to the current thread's memory pool. sub_424070 reads sub_4280C0()[3] to obtain this pointer. When non-NULL, allocations come from the thread-local slab without any locking.

  2. Global pool mutex: The pool struct contains a pthread_mutex_t at offset +7128 (within the ~7,200-byte pool descriptor). This mutex is acquired only for operations that modify the global pool state: slab acquisition from the parent pool, large-block allocation via mmap/malloc, and pool destruction.

  3. Lock-free slab path: Within a thread-local slab (56-byte descriptor), bump-pointer allocation requires no synchronization. The allocator advances a pointer and returns; only when the slab is exhausted does it acquire the global lock to request a new slab.

Recursive Mutex Pattern

All mutexes created by sub_428620 (the mutex factory used throughout ptxas) are PTHREAD_MUTEX_RECURSIVE:

bool sub_428620(pthread_mutex_t *mutex) {
    pthread_mutexattr_t attr;
    if (pthread_mutexattr_init(&attr) != 0)
        return false;
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    bool ok = pthread_mutex_init(mutex, &attr) == 0;
    pthread_mutexattr_destroy(&attr);
    return ok;
}

This is necessary because ptxas functions may re-enter locking code paths through the diagnostic emitter (sub_42FBA0) or pool allocator (sub_424070), both of which are called from virtually everywhere.

Global Synchronization Objects

Global TLS Mutex (mutex at 0x29FE0xx)

Protects the global doubly-linked list of TLS contexts. Acquired during:

  • TLS context allocation (sub_4280C0)
  • TLS context destruction (destr_function)
  • sub_4286A0 (explicit lock for cross-thread operations)

This is a recursive mutex (initialized in ctor_001).

Global Lock Array (sub_607D70 / sub_607D90)

A global array of locks accessed by index. Lock index 6 is used to protect the peak wall-clock counter during parallel compilation. The total number of lock indices and their complete purpose map is not fully recovered; index 6 is the only one observed in the threading hot path.

sub_607D70(6);    // acquire lock 6
// update peak wall-clock
sub_607D90(6);    // release lock 6

Function Map

| Address | Size | Callers | Identity |
|---------|------|---------|----------|
| 0x4094C0 | 204 B | 0 | ctor_001 -- TLS key + global mutex init (.init_array) |
| 0x427F10 | 376 B | 0 | destr_function -- TLS destructor (via pthread_key_create) |
| 0x4280C0 | 597 B | 3,928 | TLS context accessor (280-byte struct, lazy alloc) |
| 0x428600 | 27 B | -- | Mutex destroy + free wrapper |
| 0x428620 | 62 B | -- | Recursive mutex init factory |
| 0x428670 | 6 B | -- | pthread_mutex_destroy PLT thunk |
| 0x428680 | 6 B | -- | pthread_mutex_lock PLT thunk |
| 0x428690 | 6 B | -- | pthread_mutex_unlock PLT thunk |
| 0x4286A0 | 163 B | -- | Global mutex lazy-init + lock |
| 0x1CB1770 | 8 B | 1 | Priority comparator (always returns 1 = FIFO) |
| 0x1CB1780 | 202 B | 0 | start_routine -- worker thread main loop |
| 0x1CB1890 | 11 B | -- | CPU count via sysconf(_SC_NPROCESSORS_CONF) |
| 0x1CB18B0 | 159 B | -- | Thread pool constructor (184-byte struct) |
| 0x1CB1970 | 168 B | -- | Thread pool graceful destroy |
| 0x1CB1A50 | 90 B | -- | Task submit (24-byte task node, heap push, broadcast) |
| 0x1CB1AE0 | 109 B | -- | Wait-all (block until pending=0, active=0) |
| 0x1CBEBF0 | -- | 1 | Heap drain (free all queued elements) |
| 0x1CBEC10 | -- | 1 | Priority heap constructor (32-byte struct) |
| 0x1CBECC0 | -- | -- | Priority heap push (sift-up) |
| 0x1CBEDD0 | -- | -- | Priority heap pop (sift-down) |
| 0x1CC6720 | ~700 B | 1 | Jobserver reader thread (select loop, pushes tokens to buffer) |
| 0x1CC6C20 | ~300 B | 1 | Jobserver destroy (drain tokens, close fds, free 296-byte object) |
| 0x1CC6EC0 | 384 B | 1 | Jobserver token acquire (consume implicit or wait for pipe token) |
| 0x1CC7040 | ~280 B | 1 | Jobserver token release (write byte back or hand off to waiter) |
| 0x1CC7300 | 2,027 B | 1 | Jobserver MAKEFLAGS parser (FIFO vs pipe detection, fd setup) |
| 0x1CC7AF0 | ~700 B | 1 | Jobserver constructor (alloc 296B, spawn reader thread) |

Cross-References