Entry Point & CLI
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The ptxas binary has a deceptively simple entry point. The exported main at 0x409460 is an 84-byte wrapper that sets up unbuffered I/O and immediately tail-calls sub_446240 -- the real compilation driver. This driver is a monolithic 11 KB function that allocates a 1,352-byte master options block on the stack, establishes setjmp-based error recovery, parses all command-line options through a generic framework, reads PTX input, and then loops over compile units running the full Parse -> CompileUnitSetup -> DAGgen -> OCG -> ELF -> DebugInfo pipeline for each. The entire error-handling strategy is non-local: any of the 2,350 call sites to the central diagnostic emitter sub_42FBA0 can trigger a longjmp back to the driver's recovery point on fatal errors.
The same binary doubles as an in-process library. When nvcc loads ptxas as a shared object rather than spawning a subprocess, three extra arguments to the driver carry an output buffer pointer, an extra option count, and an extra options array. Callback function pointers at fixed offsets in the options block allow the host process to receive diagnostics and progress notifications without going through stderr.
| Item | Details |
|---|---|
| main() | 0x409460 (84 bytes) -- setvbuf + tail-call to sub_446240 |
| Real main | sub_446240 (11,064 bytes, ~900 lines) |
| Options block | 1,352 bytes on stack |
| Error recovery | setjmp / longjmp (no C++ exceptions) |
| Option registration | sub_432A00 (6,427 bytes, ~100 options via sub_1C97210) |
| Option parser | sub_434320 (10,289 bytes, ~800 lines) |
| Diagnostic emitter | sub_42FBA0 (2,388 bytes, 2,350 callers, 7 severity levels) |
| TLS context | sub_4280C0 (597 bytes, 3,928 callers, 280-byte per-thread struct) |
| Pipeline phases | Parse, CompileUnitSetup, DAGgen, OCG, ELF, DebugInfo |
| Library mode | sub_446240(argc, argv, output_buf, extra_opt_count, extra_opts) |
Architecture
main (0x409460, 84B)
│
├─ nullsub_1(*argv) // store program name (no-op)
├─ setvbuf(stdout, _IONBF)
├─ setvbuf(stderr, _IONBF)
│
└─ sub_446240(argc, argv, 0, 0, 0) // REAL MAIN
   │
   ├─ setjmp(jmp_buf) // fatal error recovery point
   │
   ├─ sub_434320(opts_block, ...) // OPTION PARSER
   │   └─ sub_432A00(...) // register ~100 options via sub_1C97210
   │
   ├─ sub_4428E0(...) // PTX INPUT SETUP
   │   ├─ validate .version / .target
   │   ├─ handle --input-as-string
   │   └─ generate __cuda_dummy_entry__ if --compile-only
   │
   ├─ sub_43A400(...) // TARGET CONFIGURATION
   │   └─ set cache defaults, texmode, arch-specific flags
   │
   ├─ FOR EACH compile unit:
   │   ├─ sub_451730(...) // parser/lexer init + special regs
   │   ├─ sub_43B660(...) // register constraint calculator
   │   ├─ sub_43F400(...) // function/ABI setup
   │   └─ sub_43CC70(...) // per-entry: DAGgen → OCG → ELF → DebugInfo
   │
   ├─ timing / memory stats output (--compiler-stats)
   │
   └─ cleanup + return exit code
Pre-main Static Constructors
Before main executes, four static constructors run as part of the ELF .init_array. Three of them populate ROT13-obfuscated lookup tables that are foundational to the rest of the binary. This obfuscation is deliberate -- it prevents trivial string searching for internal opcode names and tuning knob identifiers in the stripped binary.
ctor_001 -- Thread Infrastructure (0x4094C0, 204 bytes)
Initializes the POSIX threading foundation used throughout ptxas:
pthread_key_create(&key, destr_function); // TLS key for sub_4280C0
pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
pthread_mutex_init(&mutex, &attr);
dword_29FE0F4 = sched_get_priority_max(SCHED_RR);
dword_29FE0F0 = sched_get_priority_min(SCHED_RR);
__cxa_atexit(cleanup_func, ...); // registered destruction
The TLS key created here is the one used by sub_4280C0 (3,928 callers), making it the single most important piece of global state in the binary.
ctor_003 -- PTX Opcode Name Table (0x4095D0, 17,007 bytes)
Populates a table at 0x29FE300+ with approximately 900 ROT13-encoded PTX opcode mnemonic strings. Each entry is a (string_ptr, length) pair. The ROT13 encoding maps A-Z to N-Z,A-M and a-z to n-z,a-m, leaving digits and punctuation unchanged.
| Encoded | Decoded | Instruction |
|---|---|---|
| NPDOHYX | ACQBULK | Bulk acquire |
| NPDFUZVAVG | ACQSHMINIT | Shared memory acquire init |
| OFLAP | BSYNC | Barrier sync |
| PPGY.P | CCTL.C | Cache control |
| SZN | FMA | Fused multiply-add |
| FRGC | SETP | Set predicate |
| ERGHEA | RETURN | Return |
| RKVG | EXIT | Thread exit |
These decoded names are the canonical PTX opcode mnemonics used during parsing and validation. The table is consumed by the PTX lexer initialization at sub_451730 and the opcode-to-handler dispatch table at sub_46E000 (93 KB, the largest function in the front-end range).
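The mapping is simple enough to reproduce when inspecting the table by hand. The following is a minimal sketch, not code recovered from the binary: the function name `rot13_decode` is hypothetical, and it takes a (pointer, length) pair only because that matches the entry layout described above.

```c
#include <stddef.h>

/* Decode `len` bytes of ROT13 text from `src` into `dst` and NUL-terminate.
 * `dst` must have room for len + 1 bytes. ROT13 is its own inverse, so the
 * same routine also encodes. Digits and punctuation pass through unchanged. */
void rot13_decode(const char *src, size_t len, char *dst)
{
    for (size_t i = 0; i < len; ++i) {
        char c = src[i];
        if (c >= 'a' && c <= 'z')
            c = (char)('a' + (c - 'a' + 13) % 26);
        else if (c >= 'A' && c <= 'Z')
            c = (char)('A' + (c - 'A' + 13) % 26);
        dst[i] = c;
    }
    dst[len] = '\0';
}
```

Calling `rot13_decode("PPGY.P", 6, buf)` leaves `CCTL.C` in `buf`, matching the table row above.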
ctor_005 -- Mercury Tuning Knob Registry (0x40D860, 80,397 bytes)
At 80,397 bytes, this is one of the largest functions in the front-end address range, exceeded only by the 93 KB dispatch table builder sub_46E000. It registers 2,000+ ROT13-encoded internal tuning knob names, each paired with a hexadecimal default value string. These are the "Mercury" (OCG) backend tuning parameters that control every aspect of code generation, scheduling, and register allocation.
| Encoded Name | Decoded Name | Default |
|---|---|---|
| ZrephelHfrNpgvirGuernqPbyyrpgvirVafgf | MercuryUseActiveThreadCollectiveInsts | 0x3e40 |
| ZrephelGenpxZhygvErnqfJneYngrapl | MercuryTrackMultiReadsWarLatency | — |
| ZrephelCerfhzrKoybpxJnvgOrarsvpvny | MercuryPresumeXblockWaitBeneficial | — |
| ZrephelZreteCebybthrOybpxf | MercuryMergePrologueBlocks | — |
| ZrephelTraFnffHPbqr | MercuryGenSassUCode | — |
The knob system is documented in detail in the Knobs System page. The ROT13 encoding applies identically to the strings registered by all three table-building constructors (ctor_003, ctor_005, and ctor_007); ctor_001 registers no strings.
ctor_007 -- Scheduler Knob Registry (0x421290, 7,921 bytes)
A smaller companion to ctor_005 that registers 98 scheduler-specific knobs. These control the instruction scheduler (Mercury/OCG) behavior at a finer granularity than the general knobs:
Decoded examples: XBlockWaitOut, XBlockWaitInOut, XBlockWaitInOnTarget, WarDeploySyncsFlush_SW4397903, WaitToForceCTASwitch, VoltageWar_SW4981360PredicateOffDummies, TrackMultiReadsWarLatency, ScavInlineExpansion, ScavDisableSpilling.
Knob names containing _SW followed by a number (e.g., _SW4397903) indicate workarounds for specific hardware bugs identified by NVIDIA's internal bug tracking system.
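When triaging decoded knob names, it can help to pull the tracker number out mechanically. This is a small illustrative sketch; the helper name `knob_workaround_id` is hypothetical, and it assumes the digits immediately follow the first `_SW` in the name, as they do in the examples above.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Return the decimal number that follows the first "_SW" in `name`,
 * or 0 if the name has no "_SW<digits>" marker. */
unsigned long knob_workaround_id(const char *name)
{
    const char *p = strstr(name, "_SW");
    if (p == NULL || !isdigit((unsigned char)p[3]))
        return 0;
    /* strtoul stops at the first non-digit, so trailing text is ignored. */
    return strtoul(p + 3, NULL, 10);
}
```

For `VoltageWar_SW4981360PredicateOffDummies` this yields 4981360; names without the marker, such as `ScavDisableSpilling`, yield 0.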
Real Main -- sub_446240
The exported main() tail-calls sub_446240 with three zero arguments appended. This function is the complete compilation orchestrator: it owns the options block, the error recovery, the compilation loop, and the statistics output.
| Field | Value |
|---|---|
| Address | 0x446240 |
| Size | 11,064 bytes |
| Stack frame | 1,352+ bytes (master options block + locals) |
| Callers | 1 (main) |
| Error recovery | setjmp at function entry |
Signature and Library Mode
int sub_446240(int argc, char **argv,
               void *output_buf,     // a3: cubin output buffer (NULL for standalone)
               int extra_opt_count,  // a4: count of extra options from nvcc
               char **extra_opts);   // a5: array of extra option strings
When main calls this, a3/a4/a5 are all zero -- standalone mode. When nvcc loads ptxas as a shared library and calls the entry point directly, these arguments carry non-null values:
- a3 (output_buf): Pointer to a memory buffer where the compiled cubin is written. Eliminates the need for temporary files and filesystem round-trips, which matters for large CUDA compilations where nvcc may invoke ptxas hundreds of times.
- a4 (extra_opt_count): Number of additional option strings injected by nvcc beyond what appears on the command line.
- a5 (extra_opts): Array of those extra option strings.
Additionally, callback function pointers in slots 37--39 of the 1,352-byte options block (counted in 8-byte units, which puts them at byte offsets of roughly 296, 304, and 312) allow the host process to receive progress notifications and diagnostic messages in-process rather than through stderr.
Error Recovery with setjmp/longjmp
The first significant action in sub_446240 is establishing a setjmp recovery point:
if (setjmp(jmp_buf) != 0) {
    // Fatal error occurred somewhere in the pipeline.
    // Clean up and return non-zero exit code.
    goto cleanup;
}
This is the only error recovery mechanism in ptxas: there are no C++ exceptions (the binary is compiled as C, not C++). Any function in the call tree that hits an unrecoverable error calls sub_42FBA0 with a severity-6 (fatal) descriptor, which internally calls longjmp(jmp_buf, 1) to transfer control directly back to this point. The approach is simple but has a critical implication: resources allocated between the setjmp and the fatal error are leaked unless they are explicitly tracked and cleaned up at the recovery site.
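The control flow can be illustrated with a self-contained sketch. Every name here (`g_recovery`, `emit_diag`, `compile_stage`, `run_driver`) is a hypothetical stand-in, not a symbol from ptxas; the point is only to show how a fatal diagnostic deep in the call tree lands back in the driver, and why a resource that must be released at the recovery site has to be reachable from there.

```c
#include <setjmp.h>
#include <stdio.h>
#include <stdlib.h>

static jmp_buf g_recovery;          /* hypothetical: the driver's recovery point */

enum { SEV_FATAL = 6 };

/* Stand-in for the diagnostic emitter: a fatal severity never returns. */
void emit_diag(int severity, const char *msg)
{
    fprintf(stderr, "%s\n", msg);
    if (severity == SEV_FATAL)
        longjmp(g_recovery, 1);
}

/* Stand-in for a pipeline stage somewhere below the driver. */
void compile_stage(int should_fail)
{
    if (should_fail)
        emit_diag(SEV_FATAL, "fatal: simulated unrecoverable error");
}

/* Stand-in for the driver: returns 0 on success, 1 after a fatal error. */
int run_driver(int should_fail)
{
    /* volatile because it is modified after setjmp and read after longjmp */
    void *volatile scratch = NULL;

    if (setjmp(g_recovery) != 0) {
        free(scratch);              /* only resources tracked here get released */
        return 1;
    }

    scratch = malloc(64);
    compile_stage(should_fail);     /* may longjmp back to the setjmp above */
    free(scratch);
    return 0;
}
```

Anything `compile_stage` itself allocated before failing would be lost in this sketch, which is the leak hazard described above.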
The 1,352-Byte Options Block
The master options block lives on the stack and accumulates all compilation state during option parsing. It is passed by pointer to virtually every subsystem. Key fields (approximate offsets based on access patterns):
| Offset Range | Purpose |
|---|---|
| 0--63 | Input/output file paths, PTX version, target SM |
| 64--127 | Optimization level, debug flags, cache modes |
| 128--255 | Register limits, occupancy constraints |
| 256--295 | Warning/error control flags |
| 296--319 | Library-mode callback function pointers (8-byte slots 37--39) |
| 320--1351 | Per-pass configuration, knob overrides, feature flags |
Compilation Loop
After option parsing and PTX input setup, the driver enters a loop over compile units. Each unit corresponds to one entry function (or device function in --compile-only mode). The per-entry processing is handled by sub_43CC70, which prints a separator:
printf("\n# ============== entry %s ==============\n", entry_name);
and then sequences: DAGgen (PTX-to-Ori lowering), OCG (optimization and code generation), ELF (binary emission), and DebugInfo (DWARF generation). The special entry __cuda_dummy_entry__ is silently skipped.
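As a rough sketch of the loop shape, assuming the entry names are already available as an array (the real driver's data structures are not described here), the skip-and-separate behavior looks like this. `process_entries` is a hypothetical name, and the backend phases are represented only by a comment.

```c
#include <stdio.h>
#include <string.h>

/* Walk the entry names, skip the dummy entry, print the separator for the
 * rest, and return how many entries were actually processed. */
int process_entries(const char *const *names, int count, FILE *out)
{
    int processed = 0;
    for (int i = 0; i < count; ++i) {
        if (strcmp(names[i], "__cuda_dummy_entry__") == 0)
            continue;   /* silently skipped, no separator printed */
        fprintf(out, "\n# ============== entry %s ==============\n", names[i]);
        /* DAGgen, OCG, ELF, and DebugInfo would run here, in that order. */
        ++processed;
    }
    return processed;
}
```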
Timing and Memory Statistics
When --compiler-stats is active, sub_446240 prints per-phase timing and peak memory after all compile units complete:
CompileTime = 42.3 ms (100%)
Parse-time : 12.1 ms (28.61%)
CompileUnitSetup-time : 1.4 ms ( 3.31%)
DAGgen-time : 8.7 ms (20.57%)
OCG-time : 15.2 ms (35.93%)
ELF-time : 3.8 ms ( 8.98%)
DebugInfo-time : 1.1 ms ( 2.60%)
PeakMemoryUsage = 2048.000 KB
When --compiler-stats-file is specified, the same data is written in JSON format using the shared JSON builder (sub_1CBA950). When --fdevice-time-trace is active, sub_439880 parses Chrome DevTools trace format JSON and merges ptxas timing events into the trace.
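The percentages in the sample output are each phase's share of CompileTime. A minimal sketch of that arithmetic follows; the function names and the exact format string are assumptions chosen to reproduce the per-phase lines shown above, not strings recovered from the binary.

```c
#include <stdio.h>

/* Share of the total, in percent; 0 when the total is not positive. */
double phase_percent(double phase_ms, double total_ms)
{
    return total_ms > 0.0 ? 100.0 * phase_ms / total_ms : 0.0;
}

/* Format one per-phase line such as "Parse-time : 12.1 ms (28.61%)".
 * Returns the snprintf result (characters needed, excluding the NUL). */
int format_phase_line(char *buf, size_t cap, const char *label,
                      double phase_ms, double total_ms)
{
    return snprintf(buf, cap, "%s : %.1f ms (%5.2f%%)",
                    label, phase_ms, phase_percent(phase_ms, total_ms));
}
```

With a total of 42.3 ms, a 12.1 ms phase formats as `(28.61%)` and a 1.4 ms phase as `( 3.31%)`, consistent with the sample.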
Option Parser -- sub_434320 and sub_432A00
Option parsing is split into two phases: registration and processing.
Option Registration -- sub_432A00
This 6,427-byte function calls sub_1C97210 approximately 100 times, once per recognized option. Each call provides the option's long name, short name, value type, help text, and default value to the generic option framework (implemented in the 0x1C96xxx--0x1C97xxx range, shared with other NVIDIA tools).
| Option | Short | Type | Help Text |
|---|---|---|---|
| --arch | -arch | string | "Specify the 'sm_' name of the target architecture" |
| --output-file | -o | string | "Specify name and location of the output file" |
| --opt-level | -O | int | "Specify optimization level" |
| --maxrregcount | — | int | "Specify the maximum number of registers" |
| --register-usage-level | — | enum(0..10) | Register usage reporting level |
| --verbose | -v | bool | Verbose output |
| --version | -V | — | Print version and exit |
| --compile-only | — | bool | Compile without linking |
| --compile-functions | — | string | "Entry function name" |
| --input-as-string | — | string | "PTX string" (compile from memory) |
| --fast-compile | — | bool | Reduce compile time at cost of code quality |
| --suppress-stack-size-warning | — | bool | Suppress stack size warnings |
| --warn-on-local-memory-usage | — | bool | Warn when local memory is used |
| --warn-on-spills | — | bool | Warn on register spills |
| --warn-on-double-precision-use | — | bool | Warn on FP64 usage |
| --compiler-stats | — | bool | Print compilation timing |
| --compiler-stats-file | — | string | "/path/to/file" (JSON output) |
| --fdevice-time-trace | — | string | Chrome trace JSON output |
| --def-load-cache | — | enum | Default load cache operation |
| --force-load-cache | — | enum | Force load cache operation |
| --position-independent-code | — | bool | Generate PIC |
| --compile-as-tools-patch | — | bool | CUDA sanitizer/tools patch mode |
| --extensible-whole-program | — | bool | Whole-program compilation |
| --cloning | — | enum(yes/no) | Inline cloning control |
| --ptxlen | — | — | PTX length statistics |
| --list-version / --version-ls | — | — | List supported PTX versions |
| --disable-smem-reservation | — | bool | Disable shared memory reservation |
| --generate-relocatable-object | -c | bool | Generate relocatable object |
Option Processing -- sub_434320
The 10,289-byte parser iterates over argv (and any extra options from library mode), matches each argument against registered options via the framework, and populates fields in the 1,352-byte options block. Special handling exists for:
- --version: Prints the identification string "Ptx optimizing assembler" followed by the version (e.g., "Cuda compilation tools, release 13.0, V13.0.88") and exits.
- --help: Delegates to sub_403588, which prints "Usage : %s [options] <ptx file>,...\n" followed by all registered options, then exits.
- --fast-compile: Validated against conflicting optimization options.
- -cloning=yes / -cloning=no: Inline cloning control parsed as an equality option.
Generic Option Framework
The option parsing library lives in the 0x1C96000--0x1C97FFF range and is shared with other NVIDIA tools (nvlink, fatbinary, etc.):
| Address | Identity | Role |
|---|---|---|
| sub_1C960C0 | Option parser constructor | Creates the option parser state |
| sub_1C96680 | Argv processor | Matches argv entries against registered options |
| sub_1C97210 | Option registrar | Registers one option with name, type, help |
| sub_1C97640 | Help printer | Iterates all registered options, prints help text |
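The register-then-match split can be pictured with a toy registry. This is a sketch under stated assumptions: the types and functions below (`opt_desc`, `opt_registry`, `opt_register`, `opt_find`) are hypothetical, the real framework's data layout is not known from this page, and value parsing (for example attached values or `name=value` forms) is not modeled.

```c
#include <stddef.h>
#include <string.h>

typedef enum { OPT_BOOL, OPT_INT, OPT_STRING } opt_type;

typedef struct {
    const char *long_name;    /* e.g. "--opt-level" */
    const char *short_name;   /* e.g. "-O", or NULL when there is none */
    opt_type    type;
    const char *help;
} opt_desc;

#define MAX_OPTS 128

typedef struct {
    opt_desc entries[MAX_OPTS];
    size_t   count;
} opt_registry;

/* Registration phase: append one descriptor. Returns 0, or -1 if full. */
int opt_register(opt_registry *r, const char *long_name,
                 const char *short_name, opt_type type, const char *help)
{
    if (r->count >= MAX_OPTS)
        return -1;
    opt_desc d = { long_name, short_name, type, help };
    r->entries[r->count++] = d;
    return 0;
}

/* Processing phase: exact match of one argv token against either name. */
const opt_desc *opt_find(const opt_registry *r, const char *arg)
{
    for (size_t i = 0; i < r->count; ++i) {
        const opt_desc *d = &r->entries[i];
        if (strcmp(arg, d->long_name) == 0)
            return d;
        if (d->short_name != NULL && strcmp(arg, d->short_name) == 0)
            return d;
    }
    return NULL;
}
```

Keeping registration separate from matching is what lets a help printer walk the same table that the argv processor searches.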
Diagnostic System -- sub_42FBA0
The central diagnostic emitter is the most important error-reporting function in ptxas. With 2,350 call sites, it handles every warning, error, and fatal message in the entire binary.
Signature
void sub_42FBA0(
    int *descriptor,  // a1: points to severity level at *a1
    void *location,   // a2: source location context
    ...               // variadic: printf-style format args
);
Severity Levels
| Level | Prefix | Tag | Behavior |
|---|---|---|---|
| 0 | (none) | — | Suppressed -- message is silently discarded |
| 1 | "info " | @I@ | Informational |
| 2 | "info " | @I@ | Informational (alternate) |
| 3 | "warning " / "error " | @W@ / @E@ | Warning, promoted to error if TLS[50] set |
| 4 | "error* " | @O@ | Non-fatal error with special marker |
| 5 | "error " | @E@ | Non-fatal error |
| 6 | "fatal " | @E@ | Fatal -- triggers longjmp(jmp_buf, 1) |
The machine-readable tags (@E@, @W@, @O@, @I@) allow nvcc and other tools to parse ptxas output programmatically, extracting severity without parsing the human-readable text.
Warning-to-Error Promotion
Severity level 3 has context-dependent behavior controlled by two flags in the thread-local storage:
v5 = *a1;                        // severity
if (v5 == 3) {
    if (sub_4280C0()[49])        // TLS offset 49: suppression flag
        return;                  // silently discard
    if (sub_4280C0()[50])        // TLS offset 50: Werror flag
        prefix = "error ";
    else
        prefix = "warning ";
}
This implements the --Werror equivalent: when the Werror flag is active in the TLS context, all warnings become errors.
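Combining the severity table with the promotion rule gives a single lookup from (severity, flags) to the printed prefix and machine-readable tag. The sketch below is a restatement of the table, not decompiled code; `diag_style` and `diag_style_for` are hypothetical names, and the two flag parameters stand in for TLS bytes 49 and 50.

```c
#include <stddef.h>

typedef struct {
    const char *prefix;   /* NULL means the message is discarded */
    const char *tag;      /* machine-readable tag, NULL when discarded */
    int         fatal;    /* nonzero when the emitter would longjmp */
} diag_style;

/* Map a severity level and the two warning flags to prefix/tag/fatality. */
diag_style diag_style_for(int severity, int suppress_warnings, int werror)
{
    diag_style s = { NULL, NULL, 0 };
    switch (severity) {
    case 1:
    case 2:
        s.prefix = "info ";    s.tag = "@I@";
        break;
    case 3:
        if (suppress_warnings)
            break;             /* suppression is checked before promotion */
        if (werror) { s.prefix = "error ";   s.tag = "@E@"; }
        else        { s.prefix = "warning "; s.tag = "@W@"; }
        break;
    case 4:
        s.prefix = "error* ";  s.tag = "@O@";
        break;
    case 5:
        s.prefix = "error ";   s.tag = "@E@";
        break;
    case 6:
        s.prefix = "fatal ";   s.tag = "@E@";  s.fatal = 1;
        break;
    default:
        break;                 /* level 0 (and anything unknown): discarded */
    }
    return s;
}
```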
Output Format
<filename>, line <N>; <severity>: <message>
When source is available, the diagnostic emitter reads the PTX input file, seeks to line N, and prints the source line prefixed with "# ". To avoid O(n) seeking through large files on every diagnostic, it maintains a hash map (sub_426150/sub_426D60) that caches file byte offsets every 10 lines for fast random access to arbitrary line numbers.
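The checkpoint idea can be sketched independently of the hash map. The version below makes two simplifying assumptions that differ from the description above: it indexes an in-memory buffer instead of a file, and it stores checkpoints in a flat array rather than a hash map. All names (`line_index`, `line_index_build`, `line_index_seek`) are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

#define CHECKPOINT_STRIDE 10

typedef struct {
    size_t *offsets;   /* offsets[k] = byte offset of line 1 + k * STRIDE */
    size_t  count;
} line_index;

/* Record the start offset of lines 1, 11, 21, ... Returns 0, or -1 on OOM. */
int line_index_build(line_index *idx, const char *text, size_t len)
{
    size_t lines = 1;
    for (size_t i = 0; i < len; ++i)
        if (text[i] == '\n')
            ++lines;

    size_t cap = (lines + CHECKPOINT_STRIDE - 1) / CHECKPOINT_STRIDE;
    idx->offsets = malloc(cap * sizeof *idx->offsets);
    if (idx->offsets == NULL)
        return -1;

    idx->count = 0;
    idx->offsets[idx->count++] = 0;            /* line 1 starts at offset 0 */
    size_t line = 1;
    for (size_t i = 0; i < len; ++i) {
        if (text[i] != '\n')
            continue;
        ++line;                                /* `line` starts at i + 1 */
        if ((line - 1) % CHECKPOINT_STRIDE == 0 && idx->count < cap)
            idx->offsets[idx->count++] = i + 1;
    }
    return 0;
}

/* Offset of the start of 1-based `line`, or (size_t)-1 if out of range.
 * Jumps to the nearest earlier checkpoint, then scans at most 9 lines. */
size_t line_index_seek(const line_index *idx, const char *text,
                       size_t len, size_t line)
{
    if (line == 0)
        return (size_t)-1;
    size_t slot = (line - 1) / CHECKPOINT_STRIDE;
    if (slot >= idx->count)
        return (size_t)-1;

    size_t pos = idx->offsets[slot];
    for (size_t cur = slot * CHECKPOINT_STRIDE + 1; cur < line; ++cur) {
        while (pos < len && text[pos] != '\n')
            ++pos;
        if (pos >= len)
            return (size_t)-1;
        ++pos;                                 /* step past the newline */
    }
    return pos;
}
```

The stride trades memory for scan length: one stored offset per 10 lines bounds each lookup to a scan of fewer than 10 lines, whatever the file size.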
Fatal Error Handler -- sub_42BDB0
A 14-byte wrapper called from 3,825 sites (nearly every allocation in ptxas). It fires whenever the pool allocator sub_424070 returns NULL:
void sub_42BDB0(...) {
    sub_42F590(&unk_29FA530, ...);  // tail call with the internal error descriptor; does not return
}
The descriptor at unk_29FA530 has severity 6 (fatal), so this always triggers longjmp back to the driver's recovery point.
Thread-Local Storage -- sub_4280C0
The most-called function in the entire binary (3,928 callers). Returns a pointer to a 280-byte per-thread context struct, allocating and initializing it on first access.
void *sub_4280C0(void) {
    void *ctx = pthread_getspecific(key);
    if (ctx) return ctx;
    ctx = malloc(0x118);                  // 280 bytes
    memset(ctx, 0, 0x118);
    pthread_cond_init(ctx + 128, NULL);
    pthread_mutex_init(ctx + 176, NULL);
    sem_init(ctx + 216, 0, 0);
    pthread_setspecific(key, ctx);
    return ctx;
}
TLS Context Layout (280 bytes)
| Offset | Size | Type | Purpose |
|---|---|---|---|
| 0 | 8 | int/flags | Error/warning state flags |
| 8 | 8 | int | has_error flag |
| 49 | 1 | byte | Diagnostic suppression flag |
| 50 | 1 | byte | Werror promotion flag |
| 128 | 48 | pthread_cond_t | Condition variable |
| 176 | 40 | pthread_mutex_t | Per-thread mutex |
| 216 | 32 | sem_t | Semaphore for synchronization |
The TLS key is created by ctor_001 before main runs, and a destructor function registered via pthread_key_create frees the 280-byte struct when a thread terminates. This per-thread context enables concurrent compilation of multiple compile units (when the thread pool is active), with each thread maintaining independent error state and diagnostic suppression flags.
PTX Input Setup -- sub_4428E0
After options are parsed, this 13,774-byte function reads and preprocesses the PTX input:
- Version and target validation. Checks .version and .target directives in the input. Emits synthetic headers ("\t.version %s\n", "\t.target %s\n") when needed.
- Compile-only mode. When --compile-only is active and no real entries exist, generates a dummy entry: "\t.entry %s { ret; }\n" with name __cuda_dummy_entry__.
- Input-as-string mode. When --input-as-string is active, PTX is read from memory (passed as a string argument) rather than from a file. This is used by the library-mode interface.
- Whole-program mode. --extensible-whole-program enables inter-function optimization across all entries in the compilation unit.
- Cache and debug configuration. Applies --def-load-cache, --def-store-cache, --force-load-cache, --force-store-cache, and suppress-debug-info settings.
- Tools-patch mode. --compile-as-tools-patch activates the CUDA sanitizer compilation path, checking for __cuda_sanitizer symbols.
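To make the three format strings concrete, the sketch below expands them into one buffer. The function name `build_dummy_ptx` is hypothetical, and emitting all three pieces together in this order is an assumption for illustration; the page only establishes that the strings exist and that the dummy entry is named `__cuda_dummy_entry__`.

```c
#include <stdio.h>

/* Expand the synthetic header and dummy-entry format strings into `buf`.
 * `version` and `target` are caller-supplied. Returns the snprintf result
 * (characters needed, excluding the terminating NUL). */
int build_dummy_ptx(char *buf, size_t cap,
                    const char *version, const char *target)
{
    return snprintf(buf, cap,
                    "\t.version %s\n"
                    "\t.target %s\n"
                    "\t.entry %s { ret; }\n",
                    version, target, "__cuda_dummy_entry__");
}
```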
Key diagnostic strings from this function:
- "'--fast-compile'"
- "calls without ABI"
- "compilation without ABI"
- "device-debug or lineinfo"
- "unified Functions"
Target Configuration -- sub_43A400
A 4,696-byte function that configures target-specific defaults after option parsing completes. It reads the SM architecture number from the options block and sets:
- Texturing mode: texmode_unified vs raw texture mode.
- Cache defaults: Based on architecture capabilities.
- Feature flags: Hardware-specific workaround flags (e.g., --sw4575628).
- Indirect function support: "Indirect Functions or Extern Functions" validation.
The function references "NVIDIA" and "ptxocg.0.0" (the internal name for the OCG optimization pass), suggesting it also initializes the pass pipeline configuration for the target architecture.
Register Constraint Calculator -- sub_43B660
A 3,843-byte function that resolves potentially conflicting register limit specifications into a single register budget per function. Register constraints come from four sources with different priorities:
| Source | Directive/Option | Priority |
|---|---|---|
| PTX directive | .maxnreg N | Per-function, highest priority |
| CLI option | --maxrregcount N | Global, overridden by .maxnreg |
| PTX directive | .minnctapersm N | Occupancy target, derived limit |
| PTX directive | .maxntid Nx,Ny,Nz | Thread block size, derived limit |
The occupancy-derived limit is computed from .minnctapersm and .maxntid: given a minimum number of CTAs per SM and a maximum thread count per CTA, the function calculates the maximum register count that allows the requested occupancy level, accounting for per-SM register file size.
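The arithmetic behind the occupancy-derived limit can be sketched as a plain division. This is an idealized model, not the function's recovered logic: the register file size and per-thread cap are passed in as parameters rather than taken from ptxas, register allocation granularity is ignored, and the name `occupancy_reg_limit` is hypothetical.

```c
#include <stdint.h>

/* Largest per-thread register count that still lets `min_ctas_per_sm` CTAs
 * of ntid_x * ntid_y * ntid_z threads fit in a register file of
 * `regs_per_sm` registers, clamped to `per_thread_cap`. */
uint32_t occupancy_reg_limit(uint32_t regs_per_sm, uint32_t min_ctas_per_sm,
                             uint32_t ntid_x, uint32_t ntid_y, uint32_t ntid_z,
                             uint32_t per_thread_cap)
{
    uint64_t threads = (uint64_t)ntid_x * ntid_y * ntid_z * min_ctas_per_sm;
    if (threads == 0)
        return per_thread_cap;      /* no occupancy constraint was given */
    uint64_t limit = regs_per_sm / threads;
    return limit < per_thread_cap ? (uint32_t)limit : per_thread_cap;
}
```

For example, with an assumed 65,536-register file, `.minnctapersm 2` and `.maxntid 1024,1,1` give 65536 / 2048 = 32 registers per thread.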
Diagnostic strings indicate the resolution process:
- "computed using thread count" -- derived from .maxntid
- "of .maxnreg" -- explicit per-function limit
- "of maxrregcount option" -- CLI override
- "global register limit specified" -- global cap applied
Per-Entry Compilation -- sub_43CC70
A 5,425-byte function that processes each entry function through the complete backend pipeline. For each entry:
- Skips __cuda_dummy_entry__ (generated by compile-only mode).
- Prints the entry separator: "\n# ============== entry %s ==============\n".
- Runs DAGgen (PTX-to-Ori lowering).
- Runs OCG (the 159-phase optimization pipeline + SASS code generation).
- Generates .sass and .ucode ELF sections.
- Generates DWARF debug information if requested.
The function also handles reg-fatpoint configuration (the register allocation algorithm, documented in the Fatpoint Algorithm page).
Function/ABI Setup -- sub_43F400
A 9,078-byte function that configures the calling convention for each function before compilation. This includes:
| Resource | Diagnostic String |
|---|---|
| Parameter passing registers | "number of registers used for parameter passing" |
| First parameter register | "first parameter register" |
| Return address register | "return address register" |
| Scratch data registers | "scratch data registers" |
| Scratch control barriers | "scratch control barriers" |
| Call prototype | "callprotoype" (sic -- misspelled in binary) |
| Call target | "calltarget" |
The function handles both entry functions (kernels launched from the host) and device functions (callable from other device code), with different ABI requirements for each. Entry functions use a simplified ABI where parameters come from constant memory, while device functions use register-based parameter passing.
The --compile-as-tools-patch and --sw200428197 flags activate a special ABI variant for CUDA sanitizer instrumentation, which inserts additional scratch registers for sanitizer state.
Function Map
| Address | Size | Callers | Identity |
|---|---|---|---|
| 0x409460 | 84 B | — | main (entry point thunk) |
| 0x4094C0 | 204 B | — | ctor_001 (thread infrastructure init) |
| 0x4095D0 | 17 KB | — | ctor_003 (ROT13 opcode table, ~900 entries) |
| 0x40D860 | 80 KB | — | ctor_005 (ROT13 knob registry, 2000+ entries) |
| 0x421290 | 8 KB | — | ctor_007 (scheduler knob registry, 98 entries) |
| 0x403588 | 75 B | 1 | Usage printer (--help) |
| 0x4280C0 | 597 B | 3,928 | TLS context accessor (280-byte struct) |
| 0x42BDB0 | 14 B | 3,825 | OOM fatal error handler |
| 0x42FBA0 | 2.4 KB | 2,350 | Central diagnostic emitter |
| 0x42F590 | — | 1 | Internal fatal error handler |
| 0x430570 | — | 2 | Program name getter |
| 0x432A00 | 6.4 KB | 1 | Option registration (~100 options) |
| 0x434320 | 10 KB | 1 | Option parser and validator |
| 0x439880 | 2.9 KB | 1 | Chrome trace JSON parser |
| 0x43A400 | 4.7 KB | 1 | Target configuration |
| 0x43B660 | 3.8 KB | 1 | Register constraint calculator |
| 0x43CC70 | 5.4 KB | 1 | Per-entry compilation processor |
| 0x43F400 | 9 KB | 1 | Function/ABI setup |
| 0x4428E0 | 13.8 KB | 1 | PTX input setup and preprocessing |
| 0x446240 | 11 KB | 1 | Compilation driver (real main) |
| 0x451730 | 14 KB | 1 | Parser/lexer init + special registers |
| 0x46E000 | 93 KB | 1 | Opcode-to-handler dispatch table builder |
| 0x1C960C0 | — | — | Option parser constructor |
| 0x1C96680 | — | — | Argv processor |
| 0x1C97210 | — | ~100 | Option registrar (per-option) |
| 0x1C97640 | — | 1 | Help text printer |
| 0x1CBA950 | — | — | JSON context constructor |
| 0x1CBAC20 | 2.9 KB | 3 | JSON recursive descent parser |
Cross-References
- Pipeline Overview -- full PTX-to-SASS compilation flow
- CLI Options -- complete option catalog
- Knobs System -- the 2,000+ Mercury tuning knobs registered in ctor_005
- Memory Pool Allocator -- the allocator (sub_424070) that calls sub_42BDB0 on OOM
- Hash Tables & Bitvectors -- the hash map used by diagnostics for line offset caching
- Thread Pool & Concurrency -- thread pool that creates the TLS contexts
- PTX Parser -- the parser initialized by sub_451730
- Optimization Pipeline -- the 159-phase pipeline invoked per compile unit
- Fatpoint Algorithm -- register allocation referenced in per-entry compilation