Optimization Levels

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

The --opt-level (-O) flag controls how aggressively ptxas optimizes during the 159-phase pipeline. The option is parsed into a 32-bit integer at options block offset +148 by sub_434320 (line 216: sub_1C96470(v10, "opt-level", a3 + 148, 4)). The default value is 3. The documented range is 0--4, but the internal NvOpt recipe system supports levels 0--5, and the scheduler and rematerialization passes distinguish level 5 from lower values.

The optimization level propagates through the compilation context and is read by individual passes via sub_7DDB50 (232 bytes at 0x7DDB50), which combines the opt-level check with a knob 499 guard. Passes that call sub_7DDB50 receive the opt-level value (stored at compilation context offset +2104) only if knob 499 is enabled; otherwise, the function returns 1 (effectively clamping the level to O1 behavior).

| Property | Value |
|---|---|
| CLI option | --opt-level / -O |
| Options block offset | +148 (int32) |
| Default | 3 |
| Documented range | 0--4 |
| Internal range | 0--5 (NvOpt levels) |
| Accessor | sub_7DDB50 (0x7DDB50, 232 bytes) |
| Knob guard | 499 (if disabled, accessor returns 1) |
| Parse location | sub_434320 line 216 |
| Debug override | sub_431A40 forces level to 0 when -g is set |
| Ofast-compile override | sub_434320 lines 635--679 |

Level Summary

| Level | Name | Use Case |
|---|---|---|
| 0 | No optimization | Debug builds (-g), maximum source fidelity |
| 1 | Minimal optimization | Fast compile, basic folding/DCE |
| 2 | Standard optimization | Balanced speed/compile-time (previous default) |
| 3 | Full optimization | Default; all standard passes enabled |
| 4 | Aggressive optimization | Extra loop peeling, speculative hoisting |
| 5 | Maximum optimization | Full SinkRemat+Cutlass, highest compile time |

Options Block Fields Affected by Opt Level

The option parser (sub_434320) and debug handler (sub_431A40) modify several options block fields based on the optimization level. Key interactions discovered from decompiled code:

| Offset | Field | O0 | O1 | O2+ | Source |
|---|---|---|---|---|---|
| +148 | opt_level | 0 | 1 | 2, 3, 4 | Direct from -O |
| +160 | register_usage_level | Forced to 5 if set with -O0 (warning issued) | 5 (default) | 5 (default) | sub_434320 lines 359--363 |
| +235 | cloning_disabled | 0 (disabled) | per-CLI | per-CLI | sub_434320 line 776 |
| +288 | device_debug | (from -g) | (from -g) | (from -g) | CLI only |
| +292 | sp_bounds_check | 1 (auto-enabled) | per-CLI | per-CLI | sub_434320 line 775 |
| +326 | no_cloning | 1 (when -g) | per-CLI | per-CLI | sub_431A40 line 42 |
| +408 | allow_expensive_optimizations | false | false | true | sub_434320 line 768 |
| +477 | fast_compile | forced 0 with -g | per-CLI | per-CLI | sub_431A40 line 28 |

The critical assignment is at sub_434320 line 768:

// allow-expensive-optimizations defaults to (opt_level > 1)
if (!user_set_allow_expensive)
    options->allow_expensive_optimizations = (options->opt_level > 1);

Debug Mode Override (-g)

When --device-debug (-g) is active, sub_431A40 (at 0x431A40) forces the optimization level to 0 and disables most optimization features:

// sub_431A40 pseudocode
void ApplyDebugOverrides(options_block* opts, bool suppress_warning) {
    if (suppress_warning) {
        opts->device_debug = 1;
        opts->sp_bounds_check_pair = 0x0101;  // +16 = {1, 1}
    }
    opts->sp_bounds_check = 1;                // +292

    // Warn if user explicitly set opt-level with -g
    if (was_set("opt-level") && opts->opt_level != 0)
        warn("ignoring -O with -g");

    // Warn about incompatible options
    if (was_set("register-usage-level"))
        warn("'--device-debug' overrides '--register-usage-level'");

    // Force register_usage_level to {5, 5} (pair at +160)
    *(int64*)(opts + 160) = 0x500000005LL;

    if (opts->fast_compile)
        warn("'--device-debug' overrides '--fast-compile'");
    opts->fast_compile = 0;

    if (opts->ofast_compile is "max" or "mid" or "min")
        warn("'--device-debug' overrides '--Ofast-compile'");
    opts->ofast_compile = "0";
    opts->opt_level = 0;                      // +148

    // Handle cloning
    if (was_set("cloning") && opts->device_debug && !opts->no_cloning)
        warn("-cloning=yes incompatible with -g");
    opts->cloning_disabled = 0;               // +235
}

The 0x500000005LL write to offset +160 sets both the 32-bit register-usage-level and the adjacent 32-bit field to 5, resetting any user override.
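A minimal sketch of why one 64-bit store resets both fields. The struct and field names here (`reg_usage_pair`, `adjacent_field`) are hypothetical stand-ins for the two 32-bit slots at +160 and +164; only the constant comes from the decompiled code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the two adjacent int32 fields at +160 and +164. */
typedef struct {
    int32_t register_usage_level; /* options block +160 */
    int32_t adjacent_field;       /* options block +164, purpose not identified */
} reg_usage_pair;

/* Mimics the store `*(int64*)(opts + 160) = 0x500000005LL`. */
void reset_register_usage(reg_usage_pair *p) {
    const int64_t packed = 0x500000005LL; /* upper 32 bits = 5, lower 32 bits = 5 */
    assert(sizeof *p == sizeof packed);   /* two int32 fields, no padding */
    memcpy(p, &packed, sizeof packed);    /* memcpy sidesteps aliasing concerns */
}
```

Because both 32-bit halves of the constant are 5, the result is the same on little-endian and big-endian hosts.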

Ofast-Compile Interaction

The --Ofast-compile (-Ofc) option provides a compile-time vs code-quality tradeoff orthogonal to -O. It has four settings: 0 (disabled), min, mid, max. Each setting overrides the opt-level and related flags:

| Ofast-compile | Forces opt_level to | Cloning | Split-compile | Expensive opts | Notes |
|---|---|---|---|---|---|
| 0 | (no change) | (no change) | (no change) | (no change) | Default |
| min | 1 | disabled | (no change) | true (forced) | Warns if -O is explicitly set to a value other than 1 |
| mid | 1 | disabled | (no change) | (no change) | Disables cloning when no split-compile |
| max | 0 | disabled | (no change) | (no change) | Most aggressive compile-time reduction |

From sub_434320 lines 635--679:

if (ofast_compile == "max") {
    if (was_set("cloning") && !no_cloning)
        warn("-cloning=yes incompatible with --Ofast-compile=max");
    no_cloning = 1;
    if (was_set("opt-level") && opt_level != 0)
        warn("-opt-level=<1,2,3> incompatible with --Ofast-compile=max");
    opt_level = 0;
}
if (ofast_compile == "mid") {
    no_cloning = 1;
    if (!split_compile) cloning_disabled = 1;
    if (was_set("opt-level") && opt_level != 1)
        warn("-opt-level=<0,2,3> incompatible with --Ofast-compile=mid");
    opt_level = 1;
    fast_compile_mode = 1;
}
if (ofast_compile == "min") {
    no_cloning = 1;
    if (!split_compile) cloning_disabled = 1;
    if (was_set("opt-level") && opt_level != 1)
        warn("-opt-level=<0,2,3> incompatible with --Ofast-compile=min");
    opt_level = 1;
    fast_compile_mode = 1;
    if (was_set("allow-expensive-optimizations") && !allow_expensive)
        warn("-allow-expensive-optimizations=false incompatible with --Ofast-compile=min");
    allow_expensive_optimizations = 1;
}
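The net effect of the three branches can be condensed into a small resolver. This is only a sketch: the enum `ofast_mode`, the struct `resolved_opts`, and the function name are hypothetical, and the warnings, the split-compile handling, and `fast_compile_mode` are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical enum for the four --Ofast-compile settings. */
typedef enum { OFC_OFF, OFC_MIN, OFC_MID, OFC_MAX } ofast_mode;

/* Hypothetical subset of the options block touched by the override. */
typedef struct {
    int  opt_level;
    bool no_cloning;
    bool allow_expensive;
} resolved_opts;

/* Applies the overrides described above and returns the adjusted options. */
resolved_opts apply_ofast_compile(resolved_opts o, ofast_mode mode) {
    assert(mode >= OFC_OFF && mode <= OFC_MAX);
    switch (mode) {
    case OFC_MAX: o.no_cloning = true; o.opt_level = 0; break;
    case OFC_MID: o.no_cloning = true; o.opt_level = 1; break;
    case OFC_MIN: o.no_cloning = true; o.opt_level = 1;
                  o.allow_expensive = true; break;
    case OFC_OFF: break; /* leave everything as the user set it */
    }
    return o;
}
```

For example, starting from the default -O3, `OFC_MAX` yields level 0 and `OFC_MIN` yields level 1 with expensive optimizations forced on.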

Per-Phase Gating by Optimization Level

Optimization levels control the pipeline through two mechanisms:

  1. Static isNoOp() overrides -- The AdvancedPhase gate vtables are overridden at pipeline construction time based on the target architecture and opt-level.
  2. Runtime opt-level checks -- Individual pass execute functions call sub_7DDB50 (the opt-level accessor) and early-return when the level is below their threshold.

Gate Accessor: sub_7DDB50

// sub_7DDB50 pseudocode (232 bytes at 0x7DDB50)
int getOptLevel(compilation_context* ctx) {
    knob_vtable* kv = ctx->knob_state;             // ctx + 1664
    query_func qf = kv->vtable[19];                // vtable + 152

    if (qf == sub_67EB60) {                         // fast-path: known vtable
        check_func cf = kv->vtable[9];              // vtable + 72
        bool knob_499;
        if (cf == sub_6614A0)
            knob_499 = (kv->state[35928] != 0);     // direct field read
        else
            knob_499 = cf(ctx->knob_state, 499);     // indirect query
        if (!knob_499)
            return ctx->opt_level;                   // ctx + 2104
        int64 state = kv->state;
        int iteration = state[35940];
        if (state[35936] > iteration) {
            state[35940] = iteration + 1;            // increment pass counter
            return ctx->opt_level;
        }
    } else if (qf(ctx->knob_state, 499, 1)) {
        return ctx->opt_level;
    }
    return 1;                                        // fallback: treat as O1
}

When knob 499 is disabled (or its iteration budget is exhausted), sub_7DDB50 returns 1 regardless of the actual opt-level. This provides a master kill-switch: setting knob 499 to false effectively caps all opt-level-gated behavior at O1.
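How the fallback interacts with a pass guard can be illustrated with a simplified model. Everything here is an assumption made for illustration: `ctx_model` collapses the knob lookup and iteration budget into a single boolean `gate_open`, and neither function name exists in the binary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for the compilation context. */
typedef struct {
    int  opt_level; /* value stored at ctx + 2104 */
    bool gate_open; /* stand-in for "knob 499 check passed, budget not exhausted" */
} ctx_model;

/* Model of the accessor: the real level if the gate is open, otherwise 1. */
int get_opt_level_model(const ctx_model *ctx) {
    assert(ctx != NULL);
    return ctx->gate_open ? ctx->opt_level : 1;
}

/* A pass guard of the form "requires opt_level > threshold". */
bool pass_should_run(const ctx_model *ctx, int threshold) {
    return get_opt_level_model(ctx) > threshold;
}
```

With the gate closed, a pass gated on `> 1` is skipped even at -O3, while a pass gated on `> 0` still runs, which is the O1 cap described above.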

Phase Activation Table

The following table lists every phase where the optimization level has been confirmed to affect behavior, based on decompiled isNoOp() methods and execute-function guard checks.

Threshold notation: > N means the phase requires opt_level > N (i.e., level N+1 and above).

| Phase | Name | Threshold | Effect at threshold | Source |
|---|---|---|---|---|
| 14 | DoSwitchOptFirst | > 0 | Branch/switch optimization enabled | isNoOp returns true at O0 |
| 15 | OriBranchOpt | > 0 | Branch folding enabled | isNoOp returns true at O0 |
| 18 | OriLoopSimplification | 4--5 | Aggressive loop peeling enabled at O4+ | sub_78B430 checks opt_level |
| 22 | OriLoopUnrolling | > 1 | Loop unrolling requires at least O2 | Execute guard via sub_7DDB50 |
| 24 | OriPipelining | > 1 | Software pipelining requires at least O2 | Execute guard |
| 26 | OriRemoveRedundantBarriers | > 1 | Barrier optimization at O2+ | Gating: opt_level > 1 |
| 28 | SinkRemat | > 1 / > 4 | O2+: basic path; O5: full cutlass mode | Two-tier guard in sub_913A30 |
| 30 | DoSwitchOptSecond | > 0 | Second switch pass at O1+ | isNoOp returns true at O0 |
| 38 | OptimizeNestedCondBranches | > 0 | Nested branch simplification at O1+ | isNoOp returns true at O0 |
| 49 | GvnCse | > 1 | GVN-CSE requires at least O2 | Execute guard |
| 54 | OriDoRematEarly | > 1 | Early rematerialization at O2+ | sub_7DDB50 check |
| 63 | OriDoPredication | > 1 | If-conversion at O2+ | Execute guard |
| 69 | OriDoRemat | > 1 | Late rematerialization at O2+ | sub_7DDB50 check |
| 71 | OptimizeSyncInstructions | > 1 | Sync optimization at O2+ | Gating: opt_level > 1 |
| 72 | LateExpandSyncInstructions | > 2 | Late sync expansion at O3+ | Gating: opt_level > 2 |
| 95 | SetAfterLegalization | > 1 | Post-legalization flag at O2+ | sub_7DDB50 check |
| 99 | OriDoSyncronization | > 1 | Sync insertion at O2+ | Gating: opt_level > 1 |
| 100 | ApplyPostSyncronizationWars | > 1 | WAR fixup at O2+ | Gating: opt_level > 1 |
| 110 | PostSchedule | > 0 | Full post-schedule at O1+ | Mode selection |
| 115 | AdvancedScoreboardsAndOpexes | > 0 | Full scoreboard generation at O1+ | Hook activated at O1+ |
| 116 | ProcessO0WaitsAndSBs | == 0 | Conservative scoreboards at O0 only | Active only at O0 |
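The threshold column can be read as an inclusive level range per phase. The sketch below encodes a few rows of the table that way. The `phase_gate` struct, the `GATES` array, and `phase_active` are hypothetical illustration names, and the real binary spreads these checks across isNoOp() overrides and execute guards rather than one table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical encoding of a threshold as an inclusive [min, max] level range. */
typedef struct { int phase; int min_level; int max_level; } phase_gate;

static const phase_gate GATES[] = {
    {  14, 1, 5 }, /* DoSwitchOptFirst: > 0 */
    {  22, 2, 5 }, /* OriLoopUnrolling: > 1 */
    {  72, 3, 5 }, /* LateExpandSyncInstructions: > 2 */
    { 115, 1, 5 }, /* AdvancedScoreboardsAndOpexes: > 0 */
    { 116, 0, 0 }, /* ProcessO0WaitsAndSBs: == 0 */
};

/* Returns whether a phase is active at the given level (0..5). */
bool phase_active(int phase, int level) {
    assert(level >= 0 && level <= 5);
    for (size_t i = 0; i < sizeof GATES / sizeof GATES[0]; ++i)
        if (GATES[i].phase == phase)
            return level >= GATES[i].min_level && level <= GATES[i].max_level;
    return true; /* phases not listed are treated as ungated in this sketch */
}
```

Under this encoding, phases 115 and 116 are never active at the same level, matching the mutually exclusive scoreboard paths.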

O-Level Feature Matrix

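| Feature | O0 | O1 | O2 | O3 | O4 | O5 |
|---|---|---|---|---|---|---|
| Basic block merging | off | on | on | on | on | on |
| Branch/switch optimization | off | on | on | on | on | on |
| Copy propagation + const folding | off | on | on | on | on | on |
| Dead code elimination | partial | on | on | on | on | on |
| Loop canonicalization | basic | basic | full | full | aggressive | aggressive |
| Loop unrolling | off | off | on | on | on | on |
| Software pipelining | off | off | on | on | on | on |
| Strength reduction | off | on | on | on | on | on |
| GVN-CSE | off | off | on | on | on | on |
| Predication (if-conversion) | off | off | on | on | on | on |
| Rematerialization (early) | off | off | on | on | on | on |
| Rematerialization (late) | off | off | on | on | on | on |
| SinkRemat (full) | off | off | off | off | off | on |
| Cutlass iterative remat | off | off | off | off | off | on |
| Loop peeling (aggressive) | off | off | off | off | on | on |
| Barrier optimization | off | off | on | on | on | on |
| Sync instruction optimization | off | off | on | on | on | on |
| Late sync expansion | off | off | off | on | on | on |
| Post-legalization mark | off | off | on | on | on | on |
| Allow expensive optimizations | off | off | on | on | on | on |
| Speculative hoisting | off | off | on | on | on | on |
| Hot/cold partitioning | off | on | on | on | on | on |
| Full scoreboard analysis (115) | off | on | on | on | on | on |
| Conservative scoreboards (116) | on | off | off | off | off | off |
| Stack-pointer bounds check | auto-on | off | off | off | off | off |
| Cloning | disabled | on | on | on | on | on |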

Notes:

  • "partial" DCE at O0: EarlyOriSimpleLiveDead (phase 10) still runs for basic cleanup even at O0.
  • O4 and O5 are not documented in --help but are accepted internally. O4 is equivalent to O3 plus aggressive loop peeling. O5 adds the full SinkRemat pass with cutlass iteration support.

Scoreboard Path Selection

The scoreboard generation subsystem has two mutually exclusive paths, selected by optimization level:

O0: Conservative Path (Phase 116)

Phase 116 (ProcessO0WaitsAndSBs) inserts maximum-safety scheduling metadata:

For every instruction:
    stall_count = 15            // maximum stall (15 cycles)
    wait_mask   = 0x3F          // wait on all 6 barriers
    write_barrier = 7           // no barrier assignment (7 = none)
    read_mask   = 0             // no read barriers
    yield       = 1             // yield after every instruction

This eliminates all instruction-level parallelism. Every instruction waits the maximum time and clears all dependency barriers before executing. The result is correct but extremely slow code -- suitable only for debugging.
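As a rough illustration, the conservative values can be held in an unpacked record. The `sched_info` struct and both helper names are hypothetical. The actual bit layout of the control word is not modeled here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical unpacked view of the per-instruction scheduling fields. */
typedef struct {
    uint8_t stall_count;   /* cycles to stall */
    uint8_t wait_mask;     /* one bit per dependency barrier */
    uint8_t write_barrier; /* 7 means no barrier assigned */
    uint8_t read_mask;
    uint8_t yield;
} sched_info;

/* The O0 values listed above, applied uniformly to every instruction. */
sched_info conservative_sched_info(void) {
    sched_info s = { 15, 0x3F, 7, 0, 1 };
    return s;
}

/* Counts how many barriers a wait mask covers (only the low 6 bits are valid). */
int barriers_waited(uint8_t wait_mask) {
    int n = 0;
    assert((wait_mask & ~0x3F) == 0);
    for (; wait_mask; wait_mask &= (uint8_t)(wait_mask - 1))
        ++n; /* clear the lowest set bit each iteration */
    return n;
}
```

Calling `barriers_waited` on the conservative mask gives 6, confirming that 0x3F covers every barrier.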

O1+: Full Analysis Path (Phase 115)

Phase 115 (AdvancedScoreboardsAndOpexes) runs the complete dependency analysis pipeline:

  1. sub_A36360 (52 KB) -- Master control word generator with per-opcode dispatch
  2. sub_A23CF0 (54 KB) -- DAG list scheduler heuristic for barrier assignment
  3. Per-field encoders for stall, yield, barrier, and scoreboard dependency fields

The full path computes precise stall counts based on actual instruction latencies from the hardware profile, assigns the minimum necessary dependency barriers (6 available per warp), and inserts yield hints only where the warp scheduler benefits from switching.

Scheduling Direction

The scheduling infrastructure (sub_8D0640) selects scheduling direction based on opt-level:

| Condition | Direction | Strategy |
|---|---|---|
| opt_level <= 2 | Forward-pass | Register-pressure-reducing: prioritizes freeing registers |
| opt_level > 2 | Reverse-pass | Latency-hiding: prioritizes ILP and memory latency overlap |

At the default O3, the scheduler uses the reverse-pass strategy, which hides memory latencies at the cost of potentially higher register pressure. At O1--O2, the forward-pass strategy minimizes peak register usage.

The direction selection happens in PreScheduleSetup (sub_8CBAD0), called from the scheduling orchestrator with the boolean opt_level > 2:

PreScheduleSetup(sched, opt_level > 2);    // sub_8CBAD0

Additionally, the ScheduleInstructionsReduceReg phase (mode 0x39) is enabled by default at O3 and above, providing a dedicated register-pressure-reduction scheduling pass before the ILP pass.
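The two level-dependent scheduling decisions reduce to simple comparisons. In this sketch the enum and function names are hypothetical; only the `> 2` threshold and the "O3 and above" default come from the analysis above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the two scheduling directions. */
typedef enum { SCHED_FORWARD, SCHED_REVERSE } sched_direction;

/* Mirrors the boolean `opt_level > 2` passed to PreScheduleSetup. */
sched_direction select_direction(int opt_level) {
    assert(opt_level >= 0 && opt_level <= 5);
    return opt_level > 2 ? SCHED_REVERSE : SCHED_FORWARD;
}

/* ScheduleInstructionsReduceReg is described as on by default at O3 and above. */
bool reduce_reg_pass_default(int opt_level) {
    return opt_level >= 3;
}
```

At the default level 3 this selects the reverse pass with the register-pressure-reduction pass enabled, and at level 2 it selects the forward pass without it.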

Register Allocation Differences

The register allocator itself (fat-point greedy at sub_957160) does not directly branch on the optimization level. However, the opt-level affects register allocation indirectly through:

  1. --register-usage-level (offset +160, range 0--10, default 5): At O0 with -g, this is forced to 5 regardless of user setting. The value modulates the per-class register budget at alloc + 32*class + 884.

  2. allow-expensive-optimizations (offset +408): Defaults to true when opt_level > 1. When true, the allocator and related passes are permitted to spend more compile time on better solutions (e.g., more spill-retry iterations, more aggressive coalescing).

  3. Phase gating: At O0, passes that reduce register pressure (rematerialization, predication, loop optimizations) are disabled, so the allocator receives un-optimized IR with higher register demand. This typically results in more spills at O0.

NvOpt Recipe System

The NvOpt recipe system (Phase 1: ApplyNvOptRecipes, option 391) provides an additional optimization-level axis. When enabled, the PhaseManager allocates a 440-byte NvOptRecipe sub-manager that configures per-phase aggressiveness:

| NvOpt level | Behavior |
|---|---|
| 0 | Minimal optimization (fast-compile path, many phases set to isNoOp()) |
| 1--2 | Standard optimization |
| 3--4 | Aggressive optimization (loop unrolling, speculative hoisting enabled) |
| 5 | Maximum optimization (may significantly increase compile time) |

The NvOpt level is validated in sub_C173E0 via the string "Invalid nvopt level : %d.", confirming the range 0--5. Recipe data lives at NvOptRecipe+312 with per-phase records at stride 584 bytes.
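A range check of this shape would produce that diagnostic. This is a sketch only: the function name and the buffer-based reporting are assumptions, and how sub_C173E0 actually emits the message is not shown in the decompiled evidence.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical validator: accepts 0..5, otherwise formats the diagnostic
 * string found in the binary into the caller's buffer. */
bool validate_nvopt_level(int level, char *msg, size_t msg_len) {
    assert(msg != NULL && msg_len > 0);
    if (level >= 0 && level <= 5) {
        msg[0] = '\0'; /* no diagnostic for valid levels */
        return true;
    }
    snprintf(msg, msg_len, "Invalid nvopt level : %d.", level);
    return false;
}
```

Levels 0 through 5 pass, while 6 or -1 fails and fills the buffer with the diagnostic.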

The NvOpt level is distinct from the -O CLI level. The -O level controls which phases run at all (via isNoOp() and sub_7DDB50 guards); the NvOpt level controls how aggressively the phases that do run behave (via recipe parameters).

Knob Defaults That Change Per Level

Several knobs have default values that vary by optimization level. The most significant:

| Knob | O0 Default | O1 Default | O2+ Default | Effect |
|---|---|---|---|---|
| 487 | enabled | enabled | enabled | Master optimization enable (checked by many passes) |
| 499 | (varies) | (varies) | (varies) | Guard knob for sub_7DDB50 opt-level accessor |
| 595 | true | true | true | Scheduling enable (but O0 uses conservative path) |
| 419 | -- | -- | -- | Forward scheduling mode (bit 3 in scheduler flags) |

Knob 487 is the most pervasive: it is checked by loop simplification, barrier optimization, sync optimization, predication, rematerialization, and scheduling passes. Disabling it overrides the opt-level and turns off the corresponding pass regardless of -O setting.

Key Decompiled Evidence

Options block opt_level field (offset +148)

// sub_434320 line 216: parse opt-level from CLI
sub_1C96470(v10, "opt-level", a3 + 148, 4);

allow-expensive-optimizations defaults to (opt_level > 1)

// sub_434320 line 768
if (!user_set_allow_expensive)
    *(a3 + 408) = *(a3 + 148) > 1;

O0 forces sp-bounds-check and disables cloning

// sub_434320 lines 773-776
if (opt_level == 0) {
    *(a3 + 292) = 1;       // sp_bounds_check = true
    *(a3 + 235) = 0;       // cloning_disabled = false (misleading name: 0 = no cloning)
}

Debug mode forces O0

// sub_431A40 line 33
*(a3 + 148) = 0;           // opt_level = 0

Ofast-compile=max forces O0

// sub_434320 line 646
*(a3 + 148) = 0;           // opt_level = 0 for Ofast=max

Ofast-compile=mid/min forces O1

// sub_434320 lines 659, 674
*(a3 + 148) = 1;           // opt_level = 1 for Ofast=mid and min
*(a3 + 572) = 1;           // fast_compile_mode = 1

Scoreboard mutually exclusive paths

// Phase 115 (AdvancedScoreboardsAndOpexes): isNoOp() returns true at O0
// Phase 116 (ProcessO0WaitsAndSBs): isNoOp() returns true at O1+

SinkRemat two-tier gating

// sub_913A30 (SinkRemat core, phase 28)
if (getOptLevel(ctx) <= 1) return;      // O0/O1: skip entirely
if (getOptLevel(ctx) <= 4) return;      // O2-O4: skip full sink+remat
// O5: proceed to cutlass iterative mode
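The two guards partition the levels into three bands, which can be written as a small classifier. The enum and function names are hypothetical, and the level passed in stands for the value the accessor returns.

```c
#include <assert.h>

/* Hypothetical names for the three outcomes of the two-tier guard. */
typedef enum { SINKREMAT_SKIP, SINKREMAT_BASIC, SINKREMAT_FULL } sinkremat_mode;

/* Maps the level returned by the accessor to the SinkRemat behavior. */
sinkremat_mode sinkremat_mode_for(int level) {
    assert(level >= 0 && level <= 5);
    if (level <= 1) return SINKREMAT_SKIP;  /* O0/O1: pass does nothing */
    if (level <= 4) return SINKREMAT_BASIC; /* O2-O4: stops before full sink+remat */
    return SINKREMAT_FULL;                  /* O5: cutlass iterative mode */
}
```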

Sync barrier gating

// Phase 99 (OriDoSyncronization), Phase 71 (OptimizeSyncInstructions):
call   sub_7DDB50              ; get opt_level
cmp    eax, 1
jle    return                  ; skip if opt_level <= 1

// Phase 72 (LateExpandSyncInstructions):
// Requires opt_level > 2

Cross-References

Key Functions

| Address | Size | Role | Confidence |
|---|---|---|---|
| sub_434320 | -- | CLI option parser; parses --opt-level at line 216, handles --Ofast-compile at lines 635--679, sets allow-expensive-optimizations default at line 768 | 0.95 |
| sub_431A40 | -- | Debug mode override; forces opt-level to 0, disables cloning, resets register-usage-level when -g is active | 0.95 |
| sub_7DDB50 | 232 B | Opt-level accessor; returns ctx+2104 opt-level if knob 499 is enabled, otherwise returns 1 (O1 fallback). Called by 20+ passes as the runtime opt-level gate | 0.95 |
| sub_1C96470 | -- | Generic CLI argument reader; called by sub_434320 to read --opt-level into options block offset +148 | 0.85 |
| sub_67EB60 | -- | Fast-path knob query vtable function; identified inside sub_7DDB50 for knob 499 check | 0.80 |
| sub_6614A0 | -- | Knob state direct-read function; used by sub_7DDB50 to read knob 499 via direct field access at offset 35928 | 0.80 |
| sub_78B430 | -- | OriLoopSimplification execute function; checks opt_level for aggressive loop peeling (O4+) | 0.85 |
| sub_913A30 | -- | SinkRemat core (phase 28); two-tier opt-level guard: skips at O0--O1, limited at O2--O4, full cutlass mode at O5 | 0.90 |
| sub_8D0640 | -- | Scheduling infrastructure; selects forward (O1--O2) vs reverse (O3+) scheduling direction | 0.85 |
| sub_8CBAD0 | -- | PreScheduleSetup; called with opt_level > 2 boolean to configure scheduling direction | 0.85 |
| sub_957160 | -- | Fat-point greedy register allocator; does not directly branch on opt-level but is affected indirectly | 0.90 |
| sub_A36360 | 52 KB | Master scoreboard control word generator (phase 115, O1+ path) | 0.90 |
| sub_A23CF0 | 54 KB | DAG list scheduler heuristic for barrier assignment (phase 115, O1+ path) | 0.90 |
| sub_C173E0 | -- | NvOpt level validator; emits "Invalid nvopt level : %d." for out-of-range values | 0.85 |