
SASS Code Generation

All addresses on this page refer to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

This page covers the top-level compilation orchestration layer of ptxas: the code that sits between the PTX front-end (parsing, directive handling) and the 159-phase optimization pipeline. It is responsible for validating the parsed PTX, selecting a compilation strategy, computing register constraints, dispatching per-kernel compilation (either sequentially or via a thread pool), and collecting per-kernel outputs for finalization. The orchestrator is the single largest function in the front-end region at 2,141 decompiled lines.

Key Facts

  • Core orchestrator: sub_4428E0 (13,774 bytes, 2,141 decompiled lines)
  • Per-kernel worker: sub_43A400 (4,696 bytes, 647 lines)
  • Per-kernel DAGgen+OCG: sub_64BAF0 (~30 KB, 1,006 decompiled lines)
  • Per-entry output: sub_43CC70 (5,425 bytes, 1,077 decompiled lines)
  • Thread pool worker: sub_436DF0 (485 bytes, 59 decompiled lines)
  • Thread pool constructor: sub_1CB18B0 (184-byte pool struct, calls pthread_create)
  • Finalization: sub_432500 (461 bytes, 47 decompiled lines)
  • Regalloc finalize: sub_4370F0 (522 bytes, 64 decompiled lines)
  • Compilation strategies: 4 (normal, compile-only, debug, non-ABI)
  • Error recovery: setjmp/longjmp (non-local, no C++ exceptions)

Architecture

sub_446240 (top-level driver)
  |
  v
sub_4428E0 (core orchestrator, 2141 lines)
  |
  |-- Option validation: .version/.target, --compile-only, --compile-as-tools-patch
  |-- Cache config: def-load-cache, force-load-cache, def-store-cache, force-store-cache
  |-- Strategy selection: 4 function-pointer pairs (see below)
  |-- Register constraints: sub_43B660 per kernel (via strategy function)
  |-- Compile-unit table: 48-byte per-CU entry at a1+336
  |-- Timing array: 112-byte per-kernel entry at a1+256
  |
  +-- IF single-threaded (thread_count == 0):
  |     |
  |     FOR EACH compile unit:
  |       |
  |       +-- sub_43A400 (per-kernel setup, 647 lines)
  |       |     |-- Target-specific defaults ("ptxocg.0.0", cache, texmode)
  |       |     |-- ABI configuration, fast-compile shortcuts
  |       |     +-- Error recovery via setjmp
  |       |
  |       +-- sub_432500 (finalization wrapper, 47 lines)
  |       |     |-- setjmp error guard
  |       |     +-- vtable call at a1+96: invokes the actual OCG pipeline
  |       |
  |       +-- sub_4370F0 (timing finalization, 64 lines)
  |       |     +-- Accumulates per-kernel timing into 112-byte records
  |       |
  |       +-- sub_43CC70 (per-entry output, 1077 lines)
  |             |-- Skip __cuda_dummy_entry__
  |             |-- Generate .sass and .ucode sections
  |             +-- Emit "# ============== entry %s ==============" header
  |
  +-- IF multi-threaded (thread_count > 0):
        |
        |-- sub_1CB18B0(thread_count) --> create thread pool
        |
        FOR EACH compile unit:
          |
          +-- sub_43A400 (per-kernel setup, same as above)
          +-- Snapshot 15 config vectors (v158[3]..v158[17])
          +-- Copy hash maps for thread isolation
          +-- sub_1CB1A50(pool, sub_436DF0, task_struct) --> enqueue
          |
          v
        sub_436DF0 (thread pool worker, 59 lines)
          |-- sub_430590("ptxas", ...) -- set thread-local program name
          |-- Jobserver slot check (sub_1CC6EC0)
          |-- sub_432500(...) -- finalize via vtable call (DAGgen+OCG+SASS)
          |-- Timing: wall-clock and phase timers into per-CU record
          |-- sub_693630 (release compiled output to downstream)
          +-- sub_4248B0 (free task struct)
        |
        sub_1CB1AE0(pool)  --> wait for all tasks
        sub_1CB1970(pool)  --> destroy pool
        sub_4370F0(a1, -1) --> finalize aggregate timing

The Core Orchestrator: sub_4428E0

This is a 2,141-line monolith that drives the entire compilation after the PTX has been parsed. Its responsibilities, in execution order:

1. Cache and Option Validation

The first 200+ lines read four cache-mode knobs from the options store at a1+904:

def_load_cache   = get_knob(a1->options, "def-load-cache");
force_load_cache = get_knob(a1->options, "force-load-cache");
def_store_cache  = get_knob(a1->options, "def-store-cache");
force_store_cache = get_knob(a1->options, "force-store-cache");

It then validates a long series of option combinations:

  • --compile-as-tools-patch (a1+727) incompatibility checks against shared memory, textures, surfaces, samplers, constants
  • --assyscall (a1+627) resource allocation checks
  • --compile-only (a1+726) vs unified functions
  • Non-ABI mode (a2+218, a2+235): disables --fast-compile, --extensible-whole-program
  • --position-independent-code vs --extensible-whole-program mutual exclusion
  • Architecture version checks: .target SM version vs --gpu-name SM version
  • -noFwdPrg forward-progress flag against SM version
  • --legacy-bar-warp-wide-behavior against SM >= 70
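Two of the rules above are simple flag interactions. A minimal sketch of that kind of check, with an illustrative struct and function (only the flag names mirror the real options; nothing here is ptxas's actual internals):

```c
#include <stdbool.h>

/* Hypothetical stand-in for the option-combination validation. */
typedef struct {
    bool position_independent_code;
    bool extensible_whole_program;
    bool non_abi_mode;
    bool fast_compile;
} Options;

/* Returns true if the combination is legal. Encodes two rules from the
 * list above: PIC and --extensible-whole-program are mutually exclusive,
 * and non-ABI mode disables --fast-compile. */
bool validate_options(Options *o) {
    if (o->position_independent_code && o->extensible_whole_program)
        return false;            /* mutually exclusive pair */
    if (o->non_abi_mode)
        o->fast_compile = false; /* forced off in non-ABI mode */
    return true;
}
```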

2. Strategy Selection

Four compilation strategies are expressed as pairs of function pointers (v314, v293), selected at lines 756-779 of the decompilation. Each strategy pair consists of a register-constraint calculator and a compile-unit builder:

  • Compile-only -- condition: --compile-only OR --assyscall OR --compile-as-tools-patch; register calculator sub_43C6F0 (225 lines); CU builder sub_4383B0 (177 lines)
  • Debug -- condition: --compile-as-tools-patch AND NOT debug mode; register calculator sub_43CAE0 (91 lines); CU builder sub_4378E0 (250 lines)
  • Non-ABI -- condition: --extensible-whole-program; register calculator sub_43C570 (77 lines); CU builder sub_438B50 (375 lines)
  • Normal -- condition: default; register calculator sub_43CA80 (24 lines); CU builder sub_438B50 (375 lines)

The selection logic:

if (compile_only || assyscall || tools_patch) {
    calc_regs = sub_43C6F0;
    build_cus = sub_4383B0;
} else if (tools_patch_mode) {
    calc_regs = debug_mode ? sub_43C6F0 : sub_43CAE0;
    build_cus = debug_mode ? sub_4383B0 : sub_4378E0;
} else {
    calc_regs = extensible_whole_program ? sub_43C570 : sub_43CA80;
    build_cus = sub_438B50;
}

3. Register Constraint Calculation

Each strategy's register calculator iterates the kernel list and calls sub_43B660 to compute per-kernel register limits. The result is stored in a hash map at a1+1192:

// sub_43CA80 (normal strategy, 24 lines) -- simplest form
for (node = kernel_list; node; node = node->next) {
    entry = node->data;
    name  = entry->func_ptr;    // at +16
    limit = sub_43B660(a1, name, a1->opt_level, entry->thread_count);
    map_put(a1->reg_limit_map, name, limit);  // a1+1192
}

sub_43B660 (3,843 bytes) balances register pressure against occupancy by considering .maxnreg, --maxrregcount, .minnctapersm, and .maxntid. String evidence: ".minnctapersm and .maxntid", "threads per SM", "computed using thread count", "of .maxnreg", "of maxrregcount option", "global register limit specified".
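The occupancy side of that balance is essentially integer arithmetic. An illustrative sketch of one plausible derivation, assuming a 65,536-entry register file per SM and a 255-register hardware cap per thread (the real sub_43B660 considers more inputs than this):

```c
/* Derive a per-thread register cap from .maxntid and .minnctapersm,
 * then clamp it by .maxnreg and --maxrregcount. Illustrative only. */
int occupancy_reg_limit(int regfile_per_sm, int max_ntid, int min_ctas_per_sm,
                        int maxnreg, int maxrregcount) {
    int limit = 255;                       /* assumed per-thread hardware cap */
    if (max_ntid > 0 && min_ctas_per_sm > 0) {
        /* registers available per thread at the requested occupancy */
        int per_thread = regfile_per_sm / (max_ntid * min_ctas_per_sm);
        if (per_thread < limit) limit = per_thread;
    }
    if (maxnreg > 0 && maxnreg < limit)
        limit = maxnreg;                   /* .maxnreg directive */
    if (maxrregcount > 0 && maxrregcount < limit)
        limit = maxrregcount;              /* --maxrregcount option */
    return limit;
}
```

For example, a 256-thread kernel requesting 4 CTAs per SM gets at most 65536 / (256 * 4) = 64 registers per thread.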

4. Compile-Unit Table Construction

The CU builder (v293) constructs a linked list of 72-byte compile-unit descriptors. Each descriptor contains:

struct CompileUnitDescriptor {  // 72 bytes
    int32   index;          // +0:  CU ordinal
    void*   dep_list;       // +8:  dependency tracking set
    void*   entry_ptr;      // +16: pointer to entry function symbol
    bool    is_entry;       // +25: true if .entry, false if .func
    int32   regalloc[2];    // +28: register allocation mode pair
    bool    flags[4];       // +36: has_shared_mem, has_surfaces, has_textures, has_samplers
    int16   cap_flags;      // +40: capability flags from func_attr+240
    int32   min_regs;       // +44: minimum register count (from profile check)
    // +48..55: additional attribute OR bits from func_attr+208..236
    // +56..63: reserved
    void*   smem_info;      // +48: 24-byte sub-struct for shared memory
};

The builder allocates this via sub_424070(pool, 72), populates it from the function attribute struct at entry_ptr+80, and enqueues it into the output list via sub_42CA60.
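A minimal sketch of that builder step, with calloc standing in for the pool allocator sub_424070 (field offsets follow the struct above; the function is illustrative, not the decompiled code):

```c
#include <stdlib.h>
#include <string.h>

/* Allocate a 72-byte descriptor and populate the documented offsets. */
unsigned char *build_descriptor(int index, void *entry_ptr, int is_entry) {
    unsigned char *d = calloc(1, 72);
    if (!d) return NULL;
    memcpy(d + 0,  &index,     sizeof index);      /* +0:  CU ordinal */
    memcpy(d + 16, &entry_ptr, sizeof entry_ptr);  /* +16: entry symbol */
    d[25] = (unsigned char)(is_entry != 0);        /* +25: .entry vs .func */
    return d;
}
```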

5. Per-Kernel Dispatch

After building the CU list, the orchestrator enters one of two dispatch modes based on the thread count at a1+668:

Single-Threaded Path (thread_count == 0)

The loop at decompilation lines 1607-1686 iterates each CU sequentially:

for (node = cu_list; node; node = node->next) {
    cu_desc = node->data;
    // Record start time in 112-byte timing record
    timing_record[cu_desc->index].start = get_time();
    
    // Allocate 360-byte work buffer
    work = pool_alloc(pool, 360);
    memset(work, 0, 360);
    
    // Per-kernel setup: target config, cache defaults, ABI
    sub_43A400(a1, parser_state, cu_desc, elf_builder, work);
    
    // Finalization: runs the actual DAGgen + OCG pipeline
    sub_432500(a1, cu_desc + 16, work[0], work[1]);
    
    // Timing finalization for this kernel
    sub_4370F0(a1, cu_desc->index);
    
    // Per-entry output: .sass/.ucode sections
    sub_43CC70(a1, parser_state, cu_desc, work);
    
    pool_free(work);
}

Multi-Threaded Path (thread_count > 0)

The thread pool path (decompilation lines 1709-1929) uses the pthread-based thread pool:

// Phase 1: prepare all tasks
pool_obj = sub_1CB18B0(thread_count);  // create pool with N threads

for (node = cu_list; node; node = node->next) {
    cu_desc = node->data;
    
    // Allocate 360-byte work buffer (same as single-threaded)
    work = pool_alloc(pool, 360);
    
    // Extra per-thread state: 3 hash maps for thread isolation
    work[288] = hashmap_new(8);   // per-thread reg constraints
    work[296] = hashmap_new(8);   // per-thread symbol copies
    work[304] = hashmap_new(8);   // per-thread attribute copies
    
    sub_43A400(a1, parser_state, cu_desc, elf_builder, work);
    
    // Snapshot 15 config vectors from global state (a1+1072..a1+1296)
    // into work[48]..work[288] for thread-safe access
    for (i = 0; i < 15; i++)
        work[48 + 16*i] = load_128bit(a1 + 1072 + 16*i);
    
    // Copy hash maps from shared state into per-thread copies
    // (reg constraints, symbol tables, attribute maps)
    
    // Enqueue: sub_1CB1A50(pool, sub_436DF0, task_struct)
    task = malloc(48);
    task->pool_data = work;
    task->timing_base = ...;
    sub_1CB1A50(pool_obj, sub_436DF0, task);
}

// Phase 2: wait for all tasks
sub_1CB1AE0(pool_obj);   // wait-for-all
sub_1CB1970(pool_obj);   // destroy pool

// Phase 3: aggregate timing
sub_4370F0(a1, -1);      // -1 = aggregate all CUs

Jobserver Integration

When --split-compile is active with GNU Make, the thread pool integrates with Make's jobserver protocol. The worker function sub_436DF0 checks sub_1CC6EC0() (jobserver slot acquire) before starting compilation and calls sub_1CC7040() (jobserver slot release) after completion. This prevents ptxas from exceeding the -j slot limit during parallel builds.
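The classic jobserver protocol is a token pipe: a worker reads one byte before doing work and writes it back when done. A self-contained sketch, with the descriptors passed explicitly (real clients discover them from MAKEFLAGS; the function names here are illustrative, not ptxas's):

```c
#include <unistd.h>

/* Block until a slot token is available, then take it. */
int jobserver_acquire(int rfd, char *token) {
    return read(rfd, token, 1) == 1;
}

/* Return the slot token to the pool. */
int jobserver_release(int wfd, char token) {
    return write(wfd, &token, 1) == 1;
}

/* Demo: simulate a 1-slot jobserver with a local pipe. */
int jobserver_demo(void) {
    int fds[2];
    char tok;
    char seed = '+';
    if (pipe(fds) != 0) return 0;
    if (write(fds[1], &seed, 1) != 1) return 0;  /* one available slot */
    if (!jobserver_acquire(fds[0], &tok)) return 0;
    if (!jobserver_release(fds[1], tok)) return 0;
    close(fds[0]);
    close(fds[1]);
    return tok == '+';
}
```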

Per-Kernel Worker: sub_43A400

This 647-line function sets up target-specific defaults for each kernel before the OCG pipeline runs. Key responsibilities:

  1. Timing instrumentation -- records start timestamps, wall-clock time
  2. Target configuration -- reads "ptxocg.0.0" defaults, sets cache mode, texturing mode, "specified texturing mode" string evidence
  3. Fast-compile shortcuts -- when --fast-compile is active, reduces optimization effort
  4. ABI setup -- configures parameter passing, return address register, scratch registers
  5. Error recovery -- establishes setjmp point for fatal errors during kernel compilation

The function allocates a jmp_buf on the stack for error recovery. If any downstream phase calls the fatal diagnostic path (sub_42F590 with severity >= 6), execution longjmps back to sub_43A400's recovery handler, which cleans up the partially compiled kernel and continues to the next.
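The recovery pattern itself is plain C. A minimal sketch of the per-kernel setjmp/longjmp loop described above (names and the failure trigger are illustrative):

```c
#include <setjmp.h>

static jmp_buf recovery_point;

/* Fatal diagnostic path: non-local exit, no C++ unwinding. */
static void fatal_diagnostic(void) {
    longjmp(recovery_point, 1);
}

/* Compile kernel_count kernels; one simulated fatal error aborts only
 * that kernel. Returns the number that compiled successfully.
 * Locals are volatile so their values survive the longjmp. */
int compile_all(int kernel_count, int failing_kernel) {
    volatile int ok = 0;
    for (volatile int i = 0; i < kernel_count; i++) {
        if (setjmp(recovery_point)) {
            /* recovery handler: discard the partial kernel, move on */
            continue;
        }
        if (i == failing_kernel)
            fatal_diagnostic();   /* simulate a fatal error mid-kernel */
        ok++;
    }
    return ok;
}
```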

String evidence: "def-load-cache", "force-load-cache", "--sw4575628", "NVIDIA", "ptxocg.0.0", "specified texturing mode", "Indirect Functions or Extern Functions".

Finalization: sub_432500

This 47-line wrapper function is the bridge between the orchestrator and the actual DAGgen+OCG pipeline. It:

  1. Retrieves the thread-local context via sub_4280C0
  2. Saves and replaces the jmp_buf pointer in the TLS struct (for nested error recovery)
  3. Saves the current error/warning flags
  4. Clears the error flags to create a clean compilation context
  5. Calls through a vtable pointer at a1+96 to invoke the actual compilation:
// sub_432500 -- simplified
bool finalize(Context* ctx, CUDesc* cu, void* sass_out, void* ucode_out) {
    char* tls = get_tls();
    jmp_buf* old_jmp = tls->jmp_buf;
    tls->jmp_buf = &local_jmp;
    char saved_err = tls->error_flags;
    char saved_warn = tls->warning_flags;
    tls->error_flags = 0;
    tls->warning_flags = 0;
    
    if (setjmp(local_jmp)) {
        // Error path: restore state, cleanup, report ICE
        tls->jmp_buf = old_jmp;
        tls->error_flags = 1;  tls->warning_flags = 1;
        release_output(ucode_out);
        release_output(cu->output);
        report_internal_error();
        return false;
    }
    
    // Normal path: invoke the pipeline
    bool ok = ctx->vtable->compile(ctx->state, sass_out, ctx + 384);
    if (!ok) report_internal_error();
    
    // Merge error flags
    tls->jmp_buf = old_jmp;
    tls->error_flags = saved_err ? true : (tls->error_flags != 0);
    tls->warning_flags = saved_warn ? true : (tls->warning_flags != 0);
    return ok;
}

The vtable call at a1+96 is the entry point into sub_64BAF0 (the 1,006-line function that runs DAGgen, the 159-phase OCG pipeline, and Mercury SASS encoding for a single kernel).

Regalloc Finalization: sub_4370F0

This 64-line function accumulates per-kernel timing results into the master timing array at a1+256. Each entry in this array is a 112-byte record:

struct KernelTimingRecord {  // 112 bytes, at a1->timing_array + 112*index
    char*   kernel_name;     // +0
    float   ocg_time;        // +20
    float   total_time;      // +36
    float   cumulative;      // +40
    double  wall_clock;      // +72
    // ... other timing fields
};

When called with index == -1 (aggregate mode after multi-threaded compilation), it sums all per-kernel records into the global timing counters at a1+176 (total parse time), a1+184 (total OCG time), and a1+208 (peak wall-clock).
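The aggregate path reduces to a sum plus a max. A sketch with the struct collapsed to the two fields of interest (offsets and padding from the 112-byte record are omitted; this is illustrative, not the decompiled code):

```c
typedef struct {
    float  ocg_time;
    double wall_clock;
} KernelTiming;

typedef struct {
    float  total_ocg_time;
    double peak_wall_clock;
} GlobalTiming;

/* index >= 0: per-kernel mode, nothing to aggregate here.
 * index == -1: sum all per-kernel records into the global counters. */
void finalize_timing(GlobalTiming *g, const KernelTiming *recs, int count,
                     int index) {
    if (index >= 0)
        return;
    for (int i = 0; i < count; i++) {
        g->total_ocg_time += recs[i].ocg_time;
        if (recs[i].wall_clock > g->peak_wall_clock)
            g->peak_wall_clock = recs[i].wall_clock;
    }
}
```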

Per-Entry Output: sub_43CC70

This 1,077-line function produces the final per-kernel output artifacts. Key behaviors:

  1. Skip dummy entries -- checks for "__cuda_dummy_entry__" and returns immediately
  2. Section generation -- creates .sass and .ucode ELF sections for each kernel
  3. Entry banner -- emits "\n# ============== entry %s ==============\n" to the SASS text output
  4. Register map -- calls "reg-fatpoint" to annotate the register allocation
  5. Verbose SASS output -- when --verbose is active, formats and writes human-readable SASS text
  6. Multiple output paths -- supports mercury, capmerc, and direct SASS output modes

Thread Pool Worker: sub_436DF0

The worker function dispatched to each thread pool thread is compact (59 lines) but carefully structured for thread safety:

void thread_worker(Context* a1, TaskStruct* task) {
    set_thread_program_name("ptxas", task);
    
    // Acquire jobserver slot if applicable
    if (a1->jobserver_enabled && !jobserver_acquire())
        report_fatal_error();  // Cannot get build slot
    
    float time_before = get_pool_time(a1->timer);
    double wall_before = get_wall_time();
    
    // Run the actual compilation
    sub_432500(a1->state, task + 64, task->sass_output, task->ucode_output);
    
    float time_after = get_pool_time(a1->timer);
    double wall_after = get_wall_time();
    
    // Record timing in per-CU record
    int cu_index = *(int*)task->cu_desc;
    TimingRecord* rec = &a1->timing_array[cu_index];
    rec->ocg_time = time_after - time_before;
    rec->cumulative += (time_after - time_before);
    if (wall_after - wall_before > 0)
        rec->wall_clock = wall_after - wall_before;
    
    // Peak wall-clock tracking (under lock)
    if (a1->compiler_stats && a1->per_kernel_stats) {
        lock_timing(6);
        double peak = a1->peak_wall_clock;
        if (get_wall_time() - a1->start_time > peak)
            a1->peak_wall_clock = get_wall_time() - a1->start_time;
        unlock_timing(6);
    }
    
    // Emit compiled output downstream
    if (a1->dump_sass)
        dump_sass(task->ucode_output);
    release_output(task->ucode_output);
    
    // Transfer kernel name to output
    task->output->name = **(task->cu_desc->entry_ptr + 88);
    
    // Release jobserver slot
    if (a1->jobserver_enabled && !jobserver_release())
        report_fatal_error();
    
    pool_free(task);
}

The timing lock at index 6 (sub_607D70(6) / sub_607D90(6)) serializes access to the peak wall-clock counter across threads. This is the only shared mutable state in the multi-threaded path -- all other per-kernel state is isolated in the 360-byte work buffer and per-thread hash map copies.
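The lock-protected peak update can be sketched with a plain pthread mutex (the real code uses a numbered timing lock, index 6; the names here are illustrative):

```c
#include <pthread.h>

static pthread_mutex_t timing_lock = PTHREAD_MUTEX_INITIALIZER;
static double peak_wall_clock = 0.0;

/* Serialize the read-compare-write on the shared peak across workers. */
void update_peak(double elapsed) {
    pthread_mutex_lock(&timing_lock);
    if (elapsed > peak_wall_clock)
        peak_wall_clock = elapsed;
    pthread_mutex_unlock(&timing_lock);
}

double get_peak(void) {
    pthread_mutex_lock(&timing_lock);
    double p = peak_wall_clock;
    pthread_mutex_unlock(&timing_lock);
    return p;
}
```

Without the lock, two workers could both read a stale peak and one update would be lost; the mutex makes the compare-and-store atomic.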

Data Flow Summary

PTX text
  |
  v (parsed by sub_451730 into AST at parser_state+88)
  |
sub_4428E0: strategy_calc(kernel_list)  --> reg_limit_map (a1+1192)
sub_4428E0: strategy_build(kernel_list) --> cu_descriptor_list (72-byte nodes)
  |
  v (for each CU descriptor)
  |
sub_43A400: target_config(cu_desc) --> 360-byte work buffer
  |
sub_432500: vtable->compile()
  |  invokes sub_64BAF0 (DAGgen + 159-phase OCG + Mercury)
  |    |
  |    +-- Ori IR construction (DAGgen phase)
  |    +-- 159 phases via PhaseManager (sub_C62720 / sub_C64F70)
  |    +-- Mercury SASS encoding (phases 113-122)
  |    |
  v    v
work[0] = .sass output    work[1] = .ucode output
  |
sub_4370F0: record timing
  |
sub_43CC70: emit .sass/.ucode sections, entry banner, verbose output
  |
  v
ELF builder (sub_612DE0)

Cross-References

Function Map

Address       Size       Lines   Identity                                        Confidence
sub_4428E0    13,774 B   2,141   Core compilation orchestrator                   HIGH
sub_43CA80    192 B      24      Normal strategy: register calculator            HIGH
sub_438B50    2,419 B    375     Normal/non-ABI strategy: CU builder             HIGH
sub_43C6F0    1,600 B    225     Compile-only strategy: register calculator      HIGH
sub_4383B0    1,320 B    177     Compile-only/debug strategy: CU builder         HIGH
sub_43CAE0    648 B      91      Debug strategy: register calculator             HIGH
sub_4378E0    2,010 B    250     Debug strategy: CU builder                      HIGH
sub_43C570    577 B      77      Non-ABI strategy: register calculator           HIGH
sub_43A400    4,696 B    647     Per-kernel worker (target config + setup)       HIGH
sub_64BAF0    ~30 KB     1,006   DAGgen + OCG + SASS (per-kernel pipeline)       MEDIUM
sub_43CC70    5,425 B    1,077   Per-entry output (.sass/.ucode sections)        HIGH
sub_436DF0    485 B      59      Thread pool worker function                     HIGH
sub_432500    461 B      47      Finalization wrapper (setjmp + vtable call)     HIGH
sub_4370F0    522 B      64      Timing finalization (per-kernel + aggregate)    HIGH
sub_43B660    3,843 B    ~300    Register/resource constraint calculator         HIGH
sub_1CB18B0   ~200 B     33      Thread pool constructor (184-byte struct)       HIGH
sub_1CB1A50   ~200 B     21      Thread pool task submit                         HIGH
sub_1CB1AE0   --         --      Thread pool wait-for-all                        HIGH
sub_1CB1970   --         --      Thread pool destructor                          HIGH
sub_1CC7300   2,027 B    --      GNU Make jobserver client                       HIGH

Diagnostic Strings

String                              Location      Purpose
"def-load-cache"                    sub_4428E0    Cache mode knob read
"force-load-cache"                  sub_4428E0    Cache mode knob read
"def-store-cache"                   sub_4428E0    Cache mode knob read
"force-store-cache"                 sub_4428E0    Cache mode knob read
"--compile-only"                    sub_4428E0    Option validation
"--compile-as-tools-patch"          sub_4428E0    Option validation
"--extensible-whole-program"        sub_4428E0    Option validation
"calls without ABI"                 sub_4428E0    Non-ABI mode diagnostic
"compilation without ABI"           sub_4428E0    Non-ABI mode diagnostic
"unified Functions"                 sub_4428E0    Compile-only restriction
"suppress-debug-info"               sub_4428E0    Debug info suppression warning
"position-independent-code"         sub_4428E0    PIC mode configuration
"__cuda_dummy_entry__"              sub_43CC70    Dummy entry skip check
"reg-fatpoint"                      sub_43CC70    Register map annotation
".sass", ".ucode"                   sub_43CC70    Output section names
"# ============== entry %s =="      sub_43CC70    Per-entry SASS banner
"ptxocg.0.0"                        sub_43A400    Target config identifier
"specified texturing mode"          sub_43A400    Texturing mode diagnostic
".local_maxnreg"                    sub_438B50    Per-function register limit
"device functions"                  sub_438B50    Compile-only device function handling
"--compile-only option"             sub_438B50    Compile-only restriction
"ptxas"                             sub_436DF0    Thread-local program name