Knobs System

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

The knobs system is ptxas's internal configuration mechanism -- a separate layer beneath the public CLI flags that exposes 1,294 tuning parameters to NVIDIA developers. Every significant compiler heuristic (register allocation thresholds, scheduling priorities, pass enable/disable, peephole rules) has a corresponding knob. The system is shared with cicc via a common header (generic_knobs_impl.h) but ptxas instantiates it twice: once for the DAG scheduler pipeline (99 knobs) and once for the OCG (Optimizing Code Generator) backend (1,195 knobs). All knob names are stored ROT13-encoded in the binary, a lightweight obfuscation that prevents casual strings discovery while being trivially reversible.

The knobs infrastructure lives primarily in two address regions: 0x6F0000--0x6F8000 (DAG knob instantiation, shared with the Mercury SASS pipeline) and 0x797000--0x7A2000 (OCG knob instantiation, the larger set). Both regions are compiled from the same template in generic_knobs_impl.h.

Total knobs          1,294 (99 DAG + 1,195 OCG)
Source header        /dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/common/utils/generic/impl/generic_knobs_impl.h
DAG GetKnobIndex     sub_6F0820 (2,782 bytes)
OCG GetKnobIndex     sub_79B240 (518 bytes)
ParseKnobValue       sub_6F7360 / sub_79F540 (DAG: 18KB, OCG: 18KB)
ReadKnobsFile        sub_79D070 (9,879 bytes)
KnobsInit (master)   sub_79D990 (40,817 bytes)
KnobInit (per-knob)  sub_7A0C10 (13,874 bytes)
Knob descriptor      64 bytes per entry
Knob runtime value   72 bytes per slot
Name obfuscation     ROT13 with case-insensitive comparison
Setting mechanisms   -knob NAME=VALUE, knobs file ([knobs] header), PTX pragma, env var
Debug dump           DUMP_KNOBS_TO_FILE environment variable

Architecture

                   ┌───────────────────────────────────────────┐
                   │          KnobsInit (sub_79D990)           │
                   │  Called once from global init sub_662920  │
                   └────┬────────────────┬────────────────┬────┘
                        │                │                │
                ┌───────▼───────┐  ┌─────▼─────┐  ┌───────▼────────┐
                │ ReadKnobsFile │  │ -knob CLI │  │ PTX pragma     │
                │ sub_79D070    │  │ parsing   │  │ (unless        │
                │ [knobs] fmt   │  │           │  │ DisablePragma) │
                └───────┬───────┘  └─────┬─────┘  └───────┬────────┘
                        │                │                │
                        ▼                ▼                ▼
              ┌─────────────────────────────────────────────┐
              │       ParseKnobsString (sub_79B530)         │
              │  Handles WHEN=, INJECTSTRING, ~-delimited   │
              └──────────────────┬──────────────────────────┘
                                 │
                    ┌────────────▼─────────────┐
                    │   GetKnobIndex           │
                    │   sub_6F0820 (DAG)       │
                    │   sub_79B240 (OCG)       │
                    │   ROT13 decode + lookup  │
                    └────────────┬─────────────┘
                                 │
                    ┌────────────▼─────────────┐
                    │   ParseKnobValue         │
                    │   sub_6F7360 (DAG)       │
                    │   sub_79F540 (OCG)       │
                    │   Type-specific parsing  │
                    └────────────┬─────────────┘
                                 │
                    ┌────────────▼─────────────┐
                    │   Runtime knob array     │
                    │   72 bytes per slot      │
                    │   Accessed by index      │
                    └──────────────────────────┘

ROT13 Name Obfuscation

Every knob name in the binary is stored as a ROT13-encoded string. The GetKnobIndex function decodes each character inline during comparison, without ever materializing the cleartext name in memory. The decode is combined with a case-insensitive tolower() comparison against the user-supplied query.

The inline ROT13 decode from sub_6F0820:

// For each character in the stored ROT13 name:
char c = stored_name[i];
if ((unsigned char)((c & 0xDF) - 65) <= 12)
    c += 13;                   // A-M (or a-m) -> N-Z (or n-z)
else if ((unsigned char)((c & 0xDF) - 78) < 13)
    c -= 13;                   // N-Z (or n-z) -> A-M (or a-m)
// Then compare case-insensitively:
if (tolower(query_char) != tolower(c))
    goto mismatch;

The & 0xDF trick converts lowercase to uppercase before range-checking, so both 'a'-'m' and 'A'-'M' hit the first branch. Non-alphabetic characters pass through unchanged. This means knob names like SchedNumBB_Limit with underscores and digits are handled correctly -- only the alphabetic portion rotates.

To reverse-engineer knob names from the binary: extract the ROT13 strings from the knob definition table (64-byte stride at the table base pointer), apply ROT13, and you get the cleartext name.
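
The comparison described above can be sketched as a standalone routine. This is a minimal illustration, not the decompiled function: the names rot13_char and knob_name_matches are hypothetical, and the length handling is an assumption based on the descriptor carrying an explicit name length.

```c
#include <ctype.h>
#include <stddef.h>

/* Decode one ROT13 character; non-letters pass through unchanged. */
char rot13_char(char c)
{
    unsigned char u = (unsigned char)c;
    if ((unsigned char)((u & 0xDF) - 'A') <= 12)
        return (char)(u + 13);      /* A-M / a-m -> N-Z / n-z */
    if ((unsigned char)((u & 0xDF) - 'N') < 13)
        return (char)(u - 13);      /* N-Z / n-z -> A-M / a-m */
    return c;
}

/* Return 1 if the ROT13-encoded stored name matches the cleartext query,
 * ignoring case. The decoded name is never written to a buffer. */
int knob_name_matches(const char *stored, size_t stored_len, const char *query)
{
    size_t i;
    for (i = 0; i < stored_len; i++) {
        if (query[i] == '\0')
            return 0;               /* query is shorter than the stored name */
        if (tolower((unsigned char)query[i]) !=
            tolower((unsigned char)rot13_char(stored[i])))
            return 0;
    }
    return query[i] == '\0';        /* both strings must end together */
}
```

With this sketch, the stored form of SchedNumBB_Limit is FpurqAhzOO_Yvzvg, and a query of schednumbb_limit matches it because the underscore passes through and case is folded on both sides.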

Knob Descriptor Layout

Each knob is described by a 64-byte entry in the knob definition table. The table is an array at (knob_state + 16) with count at (knob_state + 24).

Offset  Size  Field
──────  ────  ─────────────────────────────────────
+0      8     name_ptr          Pointer to ROT13-encoded primary name
+8      8     name_len          Length of primary name
+16     1     type_tag          Knob type (OKT_* enum, 1-12)
+17     7     (padding)
+24     16    (reserved)
+40     8     alias_ptr         Pointer to ROT13-encoded alias name
+48     8     alias_len         Length of alias name
+56     8     (reserved)
──────  ────
        64    Total

Both primary and alias names are checked during lookup. A knob matches if either its primary name or alias decodes to the query string (case-insensitive). The alias mechanism allows backward-compatible renaming of knobs across toolkit versions.
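
For tooling that walks the definition table, the layout above can be mirrored as a C struct. This is a sketch under stated assumptions: the type name KnobDescriptor and the field names for the reserved regions are hypothetical, and pointers are held as uint64_t so that the 64-byte stride holds on any host.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of one 64-byte knob descriptor. */
typedef struct {
    uint64_t name_ptr;       /* +0   ROT13-encoded primary name */
    uint64_t name_len;       /* +8   length of primary name */
    uint8_t  type_tag;       /* +16  definition type, 1-12 */
    uint8_t  pad[7];         /* +17  padding */
    uint8_t  reserved0[16];  /* +24  reserved */
    uint64_t alias_ptr;      /* +40  ROT13-encoded alias name */
    uint64_t alias_len;      /* +48  length of alias name */
    uint64_t reserved1;      /* +56  reserved */
} KnobDescriptor;

_Static_assert(sizeof(KnobDescriptor) == 64, "descriptor stride");
_Static_assert(offsetof(KnobDescriptor, type_tag) == 16, "type_tag offset");
_Static_assert(offsetof(KnobDescriptor, alias_ptr) == 40, "alias_ptr offset");

/* Byte offset of descriptor `index` from the table base. */
size_t descriptor_offset(size_t index)
{
    return index * sizeof(KnobDescriptor);
}
```

The static assertions fail at compile time if a field is misplaced, which is useful when the struct is adjusted for another ptxas version.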

Knob Value Layout

Runtime knob values are stored in a flat array of 72-byte slots. The array base pointer is held at knob_state + 72 (knob_state[9] in the decompilation), so slot i sits at base + 72 * i. The slot layout depends on the type:

Offset  Size  Field
──────  ────  ─────────────────────────────────────
+0      1     type_tag          Runtime type (0=unset, 1-10)
+1      7     (padding)
+8      8     value / pointer   Primary value (int32, int64, float, double, or pointer)
+16     8     list_begin        For list types: first element pointer
+24     8     list_sentinel     For list types: sentinel node
+32     4     aux_value         Secondary value (e.g., int-range high bound)
+36     4     (padding)
+40     8     list_tail         For list types: last element pointer
+48     8     list_head         For list types: head pointer
+56     4     element_count     For list types: number of elements
+60     4     (padding)
+64     8     allocator         Arena allocator pointer for list/range types
──────  ────
        72    Total

The type tag at runtime differs from the definition-table type tag. The definition type drives parsing; the runtime type reflects what was actually stored:

Runtime Type  Meaning               Payload
────────────  ────────────────────  ──────────────────────────────────────────────────────
0             Unset / invalid       None
1             int32                 *(int32*)(slot + 8)
2             float                 *(float*)(slot + 8)
3             double / int64        *(int64*)(slot + 8)
4             boolean (true)        No payload; presence = true
5             string                *(char**)(slot + 8)
6             when-condition list   Doubly-linked list at +16..+48, count at +56
7             int32 with secondary  *(int32*)(slot + 8), *(int32*)(slot + 12)
8             int-range             *(int32*)(slot + 8) = low, *(int32*)(slot + 12) = high
9             opcode-string-list    Doubly-linked list (same structure as type 6)
10            int-list (dynamic)    Growable array at +16, count at +24

Per-Type Slot Usage (confirmed from decompilation)

Types 1, 2, 3, 4, 5, 7, 8 -- scalar types using only bytes +0 through +15:

Type 1 (int32):      +0 = 0x01, +8 = int32 value (4 bytes)
Type 2 (float):      +0 = 0x02, +8 = float value (4 bytes, upper 4 undefined)
Type 3 (double):     +0 = 0x03, +8 = double value (8 bytes)
Type 4 (boolean):    +0 = 0x04  (no payload -- presence = true)
Type 5 (string):     +0 = 0x05, +8 = char* pointer (8 bytes, NOT owned)
Type 7 (budget):     +0 = 0x07, +8 = int32 primary, +12 = int32 secondary
Type 8 (int-range):  +0 = 0x08, +8 = int32 low, +12 = int32 high

Types 6 and 9 -- doubly-linked list types using the full 72 bytes:

+0:   byte   type tag (6 or 9)
+8:   ptr    next pointer (initially 0)
+16:  ptr    → slot+24 (sentinel backward link)
+24:  ptr    → slot+8 (sentinel forward link)
+32:  int64  (unused, set to 0)
+40:  ptr    tail of list
+48:  ptr    head of list
+56:  int32  element count (starts at 2 for sentinel nodes)
+64:  ptr    arena allocator (for node allocation)

Each list node is 24 bytes, allocated from the arena at +64:

Type 6 node: [next(8), prev(8), string_ptr(8)]
Type 9 node: [next(8), prev(8), opcode_id(4) | int_value(4)]

Type 10 -- dynamic growable array:

+0:   byte   = 0x0A
+8:   ptr    arena allocator
+16:  ptr    array base (int32 elements, grown via sub_6EFD20)
+24:  int32  element count (initialized to 0xFFFFFFFF = -1; first insert sets to 0)

The array grows by calling sub_6EFD20(slot+8, count+2) before each insertion, which reallocates if capacity is exceeded. Elements are 4-byte int32 values stored contiguously starting at the base pointer.

Knob Type System

The definition-table type tag (at descriptor offset +16) determines how ParseKnobValue interprets the value string. There are 10 logical knob types with 1,294 total registrations:

Type Tag  Name                 Count  Parse Rule
────────  ───────────────────  ─────  ──────────────────────────────────────────────────────────────────
1         OKT_NONE             139    Boolean flag -- presence = true, no value needed
2         OKT_INT              616    strtol(value, NULL, 0) -- accepts decimal, hex (0x), octal (0)
3         OKT_BDGT             88     Same as INT but stores with secondary field zeroed (budget type)
4         OKT_IRNG             8      "lo..hi" range -- two integers separated by ..
5         OKT_ILIST            3      Comma-separated integers: "1,2,3,4"
6         OKT_FLOAT            12     sscanf(value, "%f", &result)
7         OKT_DBL              100    sscanf(value, "%lf", &result)
8         OKT_STR              28     Direct string assignment (pointer copy)
9         OKT_WHEN             2      When-condition string; parsed into linked list of condition nodes
10        OKT_OPCODE_STR_LIST  4      Opcode-name,integer pairs: "FADD,3,FMUL,2"
11        OKT_STR (variant)           Same as type 8 (alternate string slot)
12        OKT_ILIST (variant)         Int-list with pre-initialized allocator

The INT type (616 knobs, 47.6%) dominates. These control thresholds, limits, and numeric heuristic parameters across the entire compiler. BDGT (budget) knobs (88) are semantically similar to INT but carry a secondary field used for budget-tracking in cost models. The 100 DBL knobs control floating-point heuristic weights (scheduling priorities, cost ratios, etc.).

Definition-Type to Runtime-Type Mapping

The definition-table type tag drives parsing; ParseKnobValue writes a different runtime type tag into the 72-byte slot. The mapping is not 1:1 -- several definition types collapse into the same runtime type, and compound types undergo a pre-initialization phase before the main parse:

Def Type  Definition Name      Runtime Type  Runtime Name                 Pre-init?
────────  ───────────────────  ────────────  ───────────────────────────  ─────────
1         OKT_NONE             4             boolean (true)               No
2         OKT_INT              1             int32                        No
3         OKT_BDGT             7             int32 + secondary            No
4         OKT_IRNG             8             int-range (low, high)        No
5         OKT_ILIST            10            int-list (dynamic array)     No
6         OKT_FLOAT            2             float (single precision)     No
7         OKT_DBL              3             double (8-byte)              No
8         OKT_STR              5             string (pointer)             No
9         OKT_WHEN             6             linked list (when-condition) Yes
10        OKT_OPCODE_STR_LIST  9             linked list (opcode-string)  Yes
11        OKT_STR (variant)    5             string (pointer)             No
12        OKT_ILIST (variant)  10            int-list (dynamic array)     Yes

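Types 11 and 12 are variants of earlier types. Type 11 shares the exact handler with type 8 (both produce runtime type 5). Type 12 produces the same runtime representation as type 5 (runtime type 10, a growable int32 array), but its elements are opcode names resolved to IDs, and its pre-switch initializes the allocator from the knob state object instead of inline.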
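
The mapping table can be restated as a lookup function, which is convenient when annotating decompiled code. This is a sketch transcribed from the table; the function name runtime_type_for is hypothetical.

```c
/* Map a definition-table type tag (1-12) to the runtime type tag written
 * into the 72-byte slot. Returns 0 (unset / invalid) for any other tag. */
int runtime_type_for(int def_type)
{
    switch (def_type) {
    case 1:  return 4;   /* OKT_NONE            -> boolean (true) */
    case 2:  return 1;   /* OKT_INT             -> int32 */
    case 3:  return 7;   /* OKT_BDGT            -> int32 + secondary */
    case 4:  return 8;   /* OKT_IRNG            -> int-range */
    case 5:  return 10;  /* OKT_ILIST           -> int-list */
    case 6:  return 2;   /* OKT_FLOAT           -> float */
    case 7:  return 3;   /* OKT_DBL             -> double */
    case 8:  return 5;   /* OKT_STR             -> string */
    case 9:  return 6;   /* OKT_WHEN            -> when-condition list */
    case 10: return 9;   /* OKT_OPCODE_STR_LIST -> opcode-string list */
    case 11: return 5;   /* OKT_STR variant     -> string */
    case 12: return 10;  /* OKT_ILIST variant   -> int-list */
    default: return 0;
    }
}
```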

ParseKnobValue Dispatch Algorithm

ParseKnobValue (sub_79F540, source lines 435--551 of generic_knobs_impl.h) implements a two-phase dispatch. The first switch pre-initializes compound types; the second switch parses the value string.

Phase 1 -- Pre-initialization (compound types only):

// v15 = definition type tag at (knob_descriptor + 16)
// v14 = runtime slot at (knob_state[9] + 72 * index)
switch (v15) {
case 9:   // OKT_WHEN -> runtime type 6
    KnobValueReset(v14);
    v14[0] = 6;
    // Initialize doubly-linked list with two sentinel nodes:
    //   +8  = 0 (next), +16 -> +24, +24 -> +8 (circular sentinels)
    //   +40 = tail, +48 = head, +56 = count (starts at 2)
    //   +64 = allocator from knob_state[1]
    break;

case 10:  // OKT_OPCODE_STR_LIST -> runtime type 9
    KnobValueReset(v14);
    v14[0] = 9;
    // Same linked-list initialization as case 9
    break;

case 12:  // OKT_ILIST variant -> runtime type 10
    KnobValueReset(v14);
    v14[0] = 10;
    *(ptr*)(v14 + 16) = NULL;           // growable array base
    *(ptr*)(v14 + 8)  = allocator;      // from knob_state[1]
    *(int32*)(v14 + 24) = 0xFFFFFFFF;   // sentinel count (-1)
    break;
}

Phase 2 -- Value parsing (all types):

Type 1 (OKT_NONE, boolean): No value string needed. Stores runtime type 4 (boolean true). Presence alone indicates the knob is set.

Type 2 (OKT_INT, integer): Calls sub_6F71D0(value, NULL) -- a strtol wrapper with base 0, which auto-detects decimal, hex (0x prefix), and octal (0 prefix). Stores runtime type 1, value at slot+8 as int32.

Type 3 (OKT_BDGT, budget): Same integer parsing as type 2. Stores runtime type 7 with the primary value at slot+8 and the secondary (budget counter) at slot+12 zeroed. Cost models decrement the secondary field as optimization budget is consumed.

Type 4 (OKT_IRNG, integer range): Parses "low..high" format with these edge cases:

"100..200"    -> low=100,  high=200        Standard range
"100.."       -> low=100,  high=0x7FFFFFFF  Open upper bound
"..200"       -> low=0x80000000, high=200   Open lower bound
".."          -> low=0x80000000, high=0x7FFFFFFF  Full range
"42"          -> low=42, high=42            Degenerate (single value)
""            -> error "Empty integer range value"

The .. separator is detected by checking *endptr == '.' && endptr[1] == '.'. Default bounds are INT_MIN (0x80000000) and INT_MAX (0x7FFFFFFF). Stores runtime type 8 with low at slot+8, high at slot+12.
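
The edge cases listed above can be reproduced with a short parser. This is a sketch of the described behavior, not the decompiled routine: the name parse_int_range is hypothetical, and rejecting input that has neither a number nor a ".." separator is an assumption, since the source documents only the empty-value error.

```c
#include <stdint.h>
#include <stdlib.h>

/* Parse "lo..hi", "lo..", "..hi", "..", or a single integer.
 * Returns 0 on success and -1 on error. */
int parse_int_range(const char *s, int32_t *lo, int32_t *hi)
{
    char *end;
    long v;
    int have_lo;

    if (s == NULL || *s == '\0')
        return -1;                      /* "Empty integer range value" */

    *lo = INT32_MIN;                    /* defaults for open bounds */
    *hi = INT32_MAX;

    v = strtol(s, &end, 0);             /* base 0: decimal, 0x hex, 0 octal */
    have_lo = (end != s);
    if (have_lo)
        *lo = (int32_t)v;

    if (end[0] == '.' && end[1] == '.') {
        const char *p = end + 2;
        if (*p != '\0') {               /* explicit upper bound */
            v = strtol(p, &end, 0);
            if (end == p)
                return -1;              /* assumption: garbage after ".." */
            *hi = (int32_t)v;
        }
        return 0;
    }

    if (!have_lo)
        return -1;                      /* assumption: no number, no ".." */
    *hi = *lo;                          /* degenerate single-value range */
    return 0;
}
```

A missing lower bound works because strtol consumes nothing when the string begins with "..", leaving the end pointer on the separator.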

Type 5 (OKT_ILIST, integer list): Parses comma-separated integers. Validation requires each element to start with a digit or -. Uses a growable array (runtime type 10) at slot+16, grown via sub_6EFD20(slot+8, count+2) before each insertion. Elements are 4-byte int32 values stored contiguously. Example: "1,2,3,4" produces a 4-element array.

Type 6 (OKT_FLOAT, float): Calls sscanf(value, "%f", &result). Stores runtime type 2, value at slot+8 as a 4-byte IEEE 754 single. Returns error "Invalid floating point value" if sscanf does not return 1.

Type 7 (OKT_DBL, double): Calls sscanf(value, "%lf", &result). Stores runtime type 3, value at slot+8 as an 8-byte IEEE 754 double. Returns error "Invalid double value" if sscanf does not return 1.

Type 8/11 (OKT_STR, string): Both handled identically. Stores runtime type 5 with a direct pointer copy: *(char**)(slot+8) = value. The string is NOT duplicated -- the pointer references the original buffer, so the caller must ensure the string's lifetime exceeds the knob's.

Type 9 (OKT_WHEN, when-condition): Pre-switch already initialized the linked list (runtime type 6). Allocates a 24-byte node via the allocator's vtable (allocator_vtable[3](allocator, 24)). Node layout: [next_ptr(8), prev_ptr(8), string_ptr(8)]. The condition string pointer is stored at node+16. Nodes are inserted at the tail of the doubly-linked list. Error if value is NULL; empty string is permitted.

Type 10 (OKT_OPCODE_STR_LIST, value-pair list): Pre-switch already initialized the linked list (runtime type 9). Parsing loop:

  1. Call vtable+40 to split the next comma-delimited token into opcode name and integer value strings
  2. If opcode name is NULL: error "Empty opcode string" (line 520)
  3. If integer value is NULL: error "Empty integer value" (line 522)
  4. Parse integer via strtol(nptr, 0, 10) (base 10 only, unlike OKT_INT)
  5. Resolve opcode name to internal ID via vtable+56 (SASS opcode table lookup)
  6. Allocate 24-byte node: [next(8), prev(8), opcode_id(4) | int_value(4)]
  7. Insert into linked list; loop until input exhausted

Format: "FADD,3,FMUL,2" produces two nodes: (FADD_id, 3) and (FMUL_id, 2). The opcode resolution uses the same 11,240-byte opcode recognition table as the peephole optimizer.
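
The parsing loop above can be approximated as follows. This sketch makes several assumptions that are not in the binary: lookup_opcode stands in for the vtable+56 opcode resolution with invented IDs, the result goes into a caller-supplied array instead of the arena-allocated linked list, and unknown opcode names are rejected.

```c
#include <stdlib.h>
#include <string.h>

typedef struct { int opcode_id; int value; } OpcodePair;

/* Hypothetical stand-in for the SASS opcode lookup; IDs are invented. */
int lookup_opcode(const char *name, size_t len)
{
    static const char *const names[] = { "FADD", "FMUL", "IADD3" };
    size_t i;
    for (i = 0; i < sizeof names / sizeof names[0]; i++)
        if (strlen(names[i]) == len && strncmp(names[i], name, len) == 0)
            return (int)i + 1;
    return -1;
}

/* Parse "OP,int,OP,int,..." into out[]. Returns the number of pairs stored
 * (at most max), or -1 on an empty opcode, empty integer, or unknown name. */
int parse_opcode_pairs(const char *s, OpcodePair *out, int max)
{
    int n = 0;
    while (*s != '\0' && n < max) {
        const char *comma = strchr(s, ',');
        char *end;
        long v;
        int id;

        if (comma == NULL || comma == s)
            return -1;                  /* missing value or empty opcode */
        id = lookup_opcode(s, (size_t)(comma - s));
        if (id < 0)
            return -1;
        v = strtol(comma + 1, &end, 10);    /* base 10 only */
        if (end == comma + 1)
            return -1;                  /* empty integer value */
        out[n].opcode_id = id;
        out[n].value = (int)v;
        n++;
        if (*end == ',')
            end++;                      /* step over the pair separator */
        else if (*end != '\0')
            return -1;
        s = end;
    }
    return n;
}
```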

Type 12 (OKT_ILIST variant, opcode list): Pre-switch already initialized the growable array (runtime type 10). Parsing loop:

  1. Call vtable+64 to extract the next comma-delimited opcode name
  2. Resolve to internal ID via vtable+56
  3. Grow array via sub_6EFD20(slot+8, count+2)
  4. Store opcode ID as int32 in the array

Format: "FADD,FMUL,IADD3" -- opcode names only, no integers. Each is resolved to its internal opcode ID.

Default: Error "Invalid knob type" (line 551).

Parse Error Messages

ParseKnobValue (sub_79F540 / sub_6F7360) produces these diagnostic strings on parse failure:

Error String                                               Source Line  Def Type  Condition
─────────────────────────────────────────────────────────  ───────────  ────────  ─────────────────────────────────────────
"Empty when-string"                                        435          9         WHEN knob with NULL value
"Empty integer range value"                                445          4         IRNG knob with NULL or empty value
"Empty integer list value"                                 451          5         ILIST knob with NULL or empty value
"Integer list value is not an integer"                     453          5         First char not digit or -
"End of integer range value is not ',' or null character"  457          5         ILIST terminator not , or \0
"Empty integer value"                                      470          2         INT knob with NULL or empty value
"Empty integer value"                                      478          3         BDGT knob with NULL or empty value
"Empty floating point value"                               491          6         FLOAT knob with NULL or empty value
"Invalid floating point value"                             496          6         sscanf returns != 1
"Empty double value"                                       502          7         DBL knob with NULL or empty value
"Invalid double value"                                     506          7         sscanf returns != 1
"Empty value pair list"                                    515          10        OPCODE_STR_LIST with NULL value
"Empty opcode string"                                      520          10        Opcode name resolves to NULL
"Empty integer value"                                      522          10        Integer after opcode resolves to NULL
"Empty opcode list"                                        536          12        Opcode-list variant with NULL value
"Invalid knob type"                                        551                    Unrecognized type tag in definition table
"Invalid knob identifier"                                  395                    GetKnobIndex -- name not found

All errors carry source attribution: generic_knobs_impl.h with a line number and function name ("GetKnobIndex", "ParseKnobValue", "ReadKnobsFile"). Error constructors: sub_79CDB0 (simple format string) and sub_79AED0 (format with knob name and value context).

Setting Knobs

Method 1: -knob CLI Flag

ptxas -knob SchedNumBB_Limit=100 -knob DisableCSE=1 input.ptx -o output.cubin

Multiple -knob flags accumulate. Each is parsed by KnobsInit (sub_79D990) during startup. The knob name is looked up via GetKnobIndex, then the value is parsed according to the knob's type.

Method 2: Knobs File

A knobs file is a plain-text file with a required [knobs] section header:

; Comments or metadata can appear before the header.
; ReadKnobsFile ignores everything until [knobs] is found.
[knobs]
SchedNumBB_Limit=100
DisableCSE=1
RegAllocBudget=5000
; WHEN= syntax is also supported inside the file:
WHEN=SH=0xDEADBEEF;SchedNumBB_Limit=200

ReadKnobsFile (sub_79D070, source lines 1060--1090 of generic_knobs_impl.h) processes the file:

1. fopen(path, "r")                               line ~1060
2. fseek(file, 0, SEEK_END)                        line 1075
3. size = ftell(file)                               line 1075
4. fseek(file, 0, SEEK_SET)                         line 1075
5. buffer = allocator->vtable[2](allocator, size+1) (heap alloc)
6. bytes = fread(buffer, 1, size, file)             line 1070
7. buffer[bytes] = '\0'                             (null-terminate)
8. marker = strstr(buffer, "[knobs]")               line 1065
9. if (!marker) error "Knobs header not found"
10. content = marker + 7                            (skip "[knobs]")
11. vtable[4](result, knob_state, content, 0)       (parse callback)
12. fclose(file)                                    line 1085

Key implementation details:

  • Entire file read at once. The file is fseek/ftell-measured, then fread into a single buffer of size + 1 bytes. No line-by-line streaming.
  • strstr-based header detection. The [knobs] marker is located via strstr, so it can appear anywhere in the file -- not necessarily on the first line. Everything before it (comments, version metadata, other INI sections) is silently ignored.
  • Parsing starts at marker+7. Exactly 7 characters ([knobs]) are skipped. The parse callback is ParseKnobsString (sub_79B530), which processes newline-delimited key=value pairs. The ~ separator and WHEN= conditional syntax are supported.
  • Result/Expected monad. Every I/O operation has a corresponding error path. Errors are accumulated via sub_79A3D0 (ErrorChainAppend) and propagated through a tagged result object. Multiple errors from a single file are chained, not short-circuited.
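
The read-then-search sequence can be sketched with the standard C library. This is an illustration only: read_whole_file and knobs_section are hypothetical names, malloc replaces the arena allocator, and the error chaining described above is reduced to a NULL return.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Return a pointer just past the "[knobs]" marker, or NULL if it is absent.
 * Everything before the marker is ignored. */
const char *knobs_section(const char *buffer)
{
    const char *marker = strstr(buffer, "[knobs]");
    return marker ? marker + 7 : NULL;      /* strlen("[knobs]") == 7 */
}

/* Read an open file fully into a NUL-terminated heap buffer.
 * The caller frees the result; NULL signals an I/O or allocation error. */
char *read_whole_file(FILE *f)
{
    long size;
    size_t got;
    char *buf;

    if (fseek(f, 0, SEEK_END) != 0)
        return NULL;
    size = ftell(f);
    if (size < 0 || fseek(f, 0, SEEK_SET) != 0)
        return NULL;
    buf = malloc((size_t)size + 1);         /* +1 for the terminator */
    if (buf == NULL)
        return NULL;
    got = fread(buf, 1, (size_t)size, f);
    buf[got] = '\0';
    return buf;
}
```

Because the search is a plain substring match, a comment that happens to contain the text [knobs] before the real header would be taken as the header.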

Error strings with source line numbers:

Error String                       Source Line  Condition
─────────────────────────────────  ───────────  ────────────────────────────────────────
"fseek() error knobsfile %s"       1075         fseek(SEEK_END) or fseek(SEEK_SET) fails
"fseek() error for knobsfile %s"   1080         fseek(SEEK_END) fails (alternate path)
"fread() error knobsfile %s"       1070         fread returns <= 0
"Knobs header not found in %s"     1065         strstr(buffer, "[knobs]") returns NULL
"fclose() error for knobsfile %s"  1085         fclose returns non-zero

Method 3: PTX Pragma

Knobs can be set from PTX source via .pragma directives, unless the DisablePragmaKnobs knob is set. The pragma string is copied into a temporary buffer and parsed by ParseKnobsString (sub_79B530), following the same key=value syntax.

Method 4: WHEN= Conditional Overrides

The most powerful mechanism allows setting knobs conditionally, based on shader hash or instruction hash. The override string uses ~ (tilde) as a record separator:

WHEN=SH=0xDEADBEEF;SchedNumBB_Limit=200~WHEN=IH=0x12345;DisableCSE=1

ParseKnobsString (sub_79B530) recognizes these prefixes (case-insensitive):

  • WHEN= -- conditional knob application
  • SH= -- match by shader hash (decimal, hex with 0x, or range with ..)
  • IH= -- match by instruction hash
  • K= -- direct knob setting (no condition)
  • INJECTSTRING -- special directive terminated by ;; (double semicolon)

The full conditional override system is parsed by ParseKnobOverrides (sub_79C210), which iterates a linked list of override entries at knob_state + 68904. Each entry carries the condition (hash match criterion) and the knob assignment to apply when matched.
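
The record structure of an override string can be illustrated with a small splitter. This is a sketch, not the logic of sub_79B530 or sub_79C210: the names OverrideRecord, has_prefix_ci, and split_overrides are hypothetical, INJECTSTRING and K= are not handled, and treating a WHEN= record without a semicolon as a plain assignment is an assumption.

```c
#include <ctype.h>
#include <string.h>

typedef struct {
    const char *cond;   size_t cond_len;    /* "SH=0xDEADBEEF"; empty if none */
    const char *assign; size_t assign_len;  /* "SchedNumBB_Limit=200" */
} OverrideRecord;

/* Case-insensitive test that s begins with prefix. */
int has_prefix_ci(const char *s, const char *prefix)
{
    while (*prefix != '\0') {
        if (tolower((unsigned char)*s) != tolower((unsigned char)*prefix))
            return 0;
        s++;
        prefix++;
    }
    return 1;
}

/* Split s on '~' into at most max records without modifying s.
 * The records point into s, so s must outlive them. */
int split_overrides(const char *s, OverrideRecord *out, int max)
{
    int n = 0;
    while (*s != '\0' && n < max) {
        size_t len = strcspn(s, "~");           /* length of this record */
        OverrideRecord r = { s, 0, s, len };    /* default: unconditional */
        if (has_prefix_ci(s, "WHEN=")) {
            const char *semi = memchr(s, ';', len);
            if (semi != NULL) {
                r.cond = s + 5;                 /* skip "WHEN=" */
                r.cond_len = (size_t)(semi - r.cond);
                r.assign = semi + 1;
                r.assign_len = len - (size_t)(r.assign - s);
            }
        }
        out[n++] = r;
        s += len;
        if (*s == '~')
            s++;                                /* step over the separator */
    }
    return n;
}
```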

Hash matching uses FNV-1a (magic 0x811C9DC5, prime 16777619) for the per-function override table lookup at ctx+120 → +1128. See IsPassDisabledFull (sub_7992A0).

Priority Order

When the same knob is set by multiple mechanisms, the last write wins. KnobsInit (sub_79D990) processes sources in this order:

  1. Environment variable overrides (getenv)
  2. Knobs file (if specified via -knobs-file or equivalent)
  3. -knob CLI flags
  4. PTX pragma knobs (applied per-function at compile time)
  5. WHEN= conditional overrides (applied per-function when hash matches)

Later sources override earlier ones for the same knob index.

Two Instantiations: DAG and OCG

The knob system is a C++ template instantiated twice with different knob definition tables:

DAG Knobs (sub_6F0820)

The DAG (Directed Acyclic Graph) scheduler knob table contains 99 entries. These control the Mercury SASS pipeline: instruction expansion, WAR hazard handling, scoreboard configuration, and the decode/expand/opex pipeline stages.

Property         Value
───────────────  ───────────────────────────────────────────────────────
GetKnobIndex     sub_6F0820
ParseKnobValue   sub_6F7360
InitializeKnobs  sub_6F68C0 (9KB, 24 references to generic_knobs_impl.h)
Table size       99 entries x 64 bytes = 6,336 bytes

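DAG knobs referenced in the binary include knob indices 8 and 17 (pipeline options in sub_6F52F0), 16 (WAR generation options in sub_6FBC20), and 743/747 (expansion options in sub_6FFDC0). Indices 743 and 747 fall outside a 99-entry table, so those two lookups presumably go through the OCG table even though the referencing code sits in the DAG region.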

OCG Knobs (sub_79B240)

The OCG (Optimizing Code Generator) knob table contains 1,195 entries -- the vast majority of all knobs. These control the optimization passes, register allocation, instruction scheduling, and code generation.

Property        Value
──────────────  ─────────────────────────────────────────────
GetKnobIndex    sub_79B240
ParseKnobValue  sub_79F540
KnobsInit       sub_79D990 (40,817 bytes, master initializer)
KnobInit        sub_7A0C10 (per-knob state constructor)
Table size      1,195 entries x 64 bytes = 76,480 bytes
Runtime values  1,195 entries x 72 bytes = 86,040 bytes

OCG knob indices referenced across the codebase include: 185 (pass-disable string, offset 13320), 294 (epilogue instruction count, used in tepid scheduling), 487 (LoopMakeSingleEntry enablement), 956-957 (shader hint settings at offsets 68832/68904).
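
The offsets quoted above are the knob index multiplied by the 72-byte slot size, which gives a quick way to turn a raw offset seen in decompiled code back into a knob index. A small sketch, with hypothetical helper names:

```c
#include <stddef.h>

enum { KNOB_SLOT_SIZE = 72 };

/* Byte offset of a knob's runtime slot from the value array base. */
size_t knob_slot_offset(size_t index)
{
    return index * KNOB_SLOT_SIZE;
}

/* Recover the knob index from an offset that points anywhere inside a slot. */
size_t knob_index_for_offset(size_t offset)
{
    return offset / KNOB_SLOT_SIZE;
}
```

For example, index 185 gives 13320 and index 957 gives 68904, and an access at offset 13328 (the payload at slot + 8) maps back to index 185.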

Knob State Object

The master knob state object is constructed by KnobInit (sub_7A0C10):

Offset    Size    Field
────────  ──────  ──────────────────────────────
+0        8       vtable pointer (off_21C0738)
+8        8       arena allocator
+16       8       knob definition table pointer
+24       8       knob count
+32       40      (zero-initialized control fields)
+72       8       knob value array pointer (array is 72 * count bytes)
+80       4       max knob index (initially 0xFFFFFFFF)
+88       16      DUMP_KNOBS_TO_FILE path (growable string)

The vtable at off_21C0738 provides virtual methods for knob access:

  • vtable+72: IsKnobSet(index) -- check if a knob has a value
  • vtable+152: GetKnobIntValue(index) -- retrieve int32 value
  • And others for bool, string, double retrieval

Knob Access Helpers

Throughout the codebase, knobs are accessed by index via small helper functions:

Function            Address     Purpose
──────────────────  ──────────  ─────────────────────────────────────────────
GetKnobIntValue     sub_7A1B80  Returns *(int32*)(state + 72*idx + 8)
GetKnobBoolValue    sub_7A1CC0  Checks type == 4, returns presence
GetKnobStringValue  sub_7A1E10  Returns string pointer (type 5/8)
SetKnobValue        sub_7A2860  Writes value with optional WHEN=SH= condition
IsKnobSet           (inlined)   Checks *(byte*)(state + 72*idx) != 0

Access is O(1) by index -- no hash lookup or name comparison at runtime. The GetKnobIndex name-to-index translation happens only during initialization.
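
The index-based access pattern can be sketched over a raw byte array. The function names below are hypothetical and the array is a plain buffer, not the real knob state; only the 72-byte stride, the type tag at +0, and the int32 payload at +8 come from the layout described earlier.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { SLOT_SIZE = 72, RT_UNSET = 0, RT_INT32 = 1 };

/* A knob counts as set when its runtime type tag (byte +0) is non-zero. */
int is_knob_set(const uint8_t *values, size_t idx)
{
    return values[SLOT_SIZE * idx] != RT_UNSET;
}

/* Read the int32 payload at slot + 8; memcpy avoids alignment assumptions. */
int32_t get_knob_int(const uint8_t *values, size_t idx)
{
    int32_t v;
    memcpy(&v, values + SLOT_SIZE * idx + 8, sizeof v);
    return v;
}

/* Store an int32 and mark the slot as runtime type 1. */
void set_knob_int(uint8_t *values, size_t idx, int32_t v)
{
    values[SLOT_SIZE * idx] = RT_INT32;
    memcpy(values + SLOT_SIZE * idx + 8, &v, sizeof v);
}
```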

Pass Disable Mechanism

The knobs system provides a string-based pass disable mechanism through knob index 185 (OCG offset 13320). The string contains +-delimited pass names:

-knob DisablePhases=LoopMakeSingleEntry+SinkCodeIntoBlock

Two check functions consult this string:

IsPassDisabled (sub_799250)

Simple version. Reads the disable flag byte at ctx+13320:

  • If byte == 0: no pass-disable configured, returns false
  • If byte == 5: string pointer at ctx+13328, performs substring match via sub_6E1520 (strcasestr-like)

Called from 16+ sites across the codebase: sub_78B430 (LoopMakeSingleEntry), sub_78DB70 (SinkCodeIntoBlock), sub_8236B0, sub_8D0640, sub_8F45E0, and others.

IsPassDisabledFull (sub_7992A0)

Full version with per-function overrides. First checks a per-function hash table at ctx+120 → +1128 using FNV-1a on the function identifier. If the function has a specific override entry, reads the disable string from there. Otherwise falls back to the global disable string at ctx+72 → +13320.

// FNV-1a hash for per-function lookup
uint32_t hash = 0x811C9DC5;
for (each byte b in function_id)
    hash = 16777619 * (hash ^ b);
uint32_t bucket = hash & (table_size - 1);

The + character is used as a delimiter between alternative phase names in the disable string, allowing "phaseA+phaseB" to match either name.
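
The hash fragment above can be written as a complete function for checking values offline. This is a sketch: fnv1a32 and bucket_for are hypothetical names, and the assumption that the table size is a power of two follows from the mask in the fragment.

```c
#include <stddef.h>
#include <stdint.h>

/* 32-bit FNV-1a: XOR each byte into the hash, then multiply by the prime. */
uint32_t fnv1a32(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t hash = 0x811C9DC5u;            /* offset basis */
    size_t i;
    for (i = 0; i < len; i++)
        hash = (hash ^ p[i]) * 16777619u;   /* wraps modulo 2^32 */
    return hash;
}

/* The mask acts as a modulo only when table_size is a power of two. */
uint32_t bucket_for(uint32_t hash, uint32_t table_size)
{
    return hash & (table_size - 1);
}
```

An empty input returns the offset basis 0x811C9DC5 unchanged, which is a convenient sanity check against other FNV-1a implementations.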

NamedPhases Parser (sub_798B60)

Parses a comma-separated list of name=value pairs into parallel arrays (max 256 entries). Used by KnobsInitFromEnv (sub_79C9D0) to process environment variable-based knob overrides.

Input:  "knob1=value1,knob2=value2,knob3=value3"
Output: names[256], values[256], full_strings[256]
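
A parser with this input and output shape can be sketched as below. This is not the decompiled routine: NamedPairs and parse_named_pairs are hypothetical names, the input is split in place (so the full_strings array is omitted), and giving an entry without '=' an empty value is an assumption.

```c
#include <string.h>

enum { MAX_NAMED = 256 };

typedef struct {
    char *names[MAX_NAMED];
    char *values[MAX_NAMED];
    int   count;
} NamedPairs;

/* Split "k1=v1,k2=v2" in place: each ',' and the first '=' of every entry
 * are overwritten with NUL. Entries beyond MAX_NAMED are ignored. */
void parse_named_pairs(char *input, NamedPairs *out)
{
    char *p = input;
    out->count = 0;
    while (p != NULL && *p != '\0' && out->count < MAX_NAMED) {
        char *next = strchr(p, ',');
        char *eq;
        if (next != NULL)
            *next++ = '\0';                 /* terminate this entry */
        eq = strchr(p, '=');
        if (eq != NULL)
            *eq++ = '\0';                   /* terminate the name */
        out->names[out->count] = p;
        out->values[out->count] = (eq != NULL) ? eq : p + strlen(p);
        out->count++;
        p = next;
    }
}
```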

Knob Categories

The 1,294 knobs cluster into functional categories. Prefix analysis of decoded knob names reveals these major groups:

Prefix                        Count  Domain
────────────────────────────  ─────  ─────────────────────────────────────────────────────────────────
Sched* / PostSched* / Sb*     89     Instruction scheduling heuristics and thresholds
RegAlloc* / Reg*              87     Register allocation parameters, spill cost model, target selection
Disable*                      75     Pass/feature disable switches (boolean)
Remat* / SinkRemat*           35     Rematerialization cost model, enable switches, placement control
Mercury* / Merc*              21     Mercury encoder configuration
URF*                          24     Uniform Register File optimization
Enable*                       19     Pass/feature enable switches (boolean)
Dump*                         15     Debug dump controls (DUMPIR, DumpSched, etc.)
Peephole*                     ~20    Peephole optimization rules
Loop*                         ~15    Loop optimization parameters
Sync* / Barrier*              ~12    Synchronization and barrier handling
WAR*                          ~8     Write-after-read hazard parameters
GMMA* / MMA*                  ~10    Matrix multiply-accumulate configuration
Spill*                        ~8     Spill code generation parameters
Budget*                       ~10    Cost model budgets (BDGT type knobs)
Copy* / CSE*                  ~8     Copy propagation and CSE parameters
(other)                       ~577   Miscellaneous per-pass tuning knobs

Notable Individual Knobs

Selected knobs referenced by address in the binary:

Index  Name (decoded)         Type  Referenced At           Purpose
─────  ─────────────────────  ────  ──────────────────────  ──────────────────────────────────────
8      (DAG pipeline)         INT   sub_6F52F0              Pipeline option flag
16     (WAR generation)       INT   sub_6FBC20              WAR pass behavior
17     (DAG pipeline)         INT   sub_6F52F0              Pipeline option flag
185    (pass-disable string)  STR   sub_799250, sub_7992A0  DisablePhases string
294    (epilogue count)       INT   sub_7A46E0              Tepid scheduling divisor
487    (loop single-entry)    BOOL  sub_78B430              LoopMakeSingleEntry enable
743    (expansion option)     INT   sub_6FFDC0              Mercury expansion control
747    (expansion option)     INT   sub_6FFDC0              Mercury expansion control
956    (shader hint)                sub_79C210              Shader hint knob (offset 68832)
957    (shader hint)                sub_79C210              Shader hint linked list (offset 68904)

Register Allocation Knobs (87 knobs, indices 613--699)

The register allocator is the most heavily parameterized subsystem in ptxas. Its 87 knobs span indices 613 through 699 in the OCG knob table, registered in ctor_005 at addresses 0x4197F0--0x41B2E0. The knobs cluster into seven functional sub-categories. All names decoded from ROT13 strings at 0x21B9730--0x21BA6C0.

A. Spill Cost Model (26 knobs)

The spill guidance engine (sub_96D940, 84 KB) uses these knobs to compute per-candidate spill costs. The model multiplies hardware-specific latency and resource metrics by configurable scale factors, then applies threshold-based activation logic.

Index  Name                                    Type  Purpose
─────  ──────────────────────────────────────  ────  ────────────────────────────────────────────────────────────
658    RegAllocSpillBarriersAcrossSuspend      NONE  Enable spill barriers across suspend points
659    RegAllocSpillBit                        INT   Master spill-bit mode selector
660    RegAllocSpillBitHighRegCountHeur        INT   High register count heuristic for spill-bit decisions
661    RegAllocSpillBitHighRegScale            DBL   Scale factor for high-register-count spill cost
662    RegAllocSpillBitInfPerRegThreshold      INT   Interference-per-register threshold for spill-bit activation
663    RegAllocSpillBitLowRegCountHeur         INT   Low register count heuristic for spill-bit decisions
664    RegAllocSpillBitLowRegScale             DBL   Scale factor for low-register-count spill cost
665    RegAllocSpillBitMediumRegScale          DBL   Scale factor for medium-register-count spill cost
666    RegAllocSpillBitNonRematSpillThreshold  INT   Threshold for non-rematerializable spill-bit activation
667    RegAllocSpillBitRLivePerRegThreshold    INT   Live-per-register threshold for R-type spill decisions
668    RegAllocSpillBitRLiveThreshold          INT   Global R-live threshold for spill activation
669    RegAllocSpillForceXBlockHoistRefill     INT   Force cross-block hoisting of refill instructions
670    RegAllocSpillLatencyScale               DBL   Scale factor for latency in spill cost model
671    RegAllocSpillLatencyScale2              DBL   Secondary latency scale (nested loops)
672    RegAllocSpillMemResScale                DBL   Scale factor for memory resource pressure in spill cost
673    RegAllocSpillMioHeavyThreshold          DBL   Threshold for MIO-heavy (memory-intensive) spill classification
674    RegAllocSpillOptBudget                  BDGT  Budget for spill optimization passes
675    RegAllocSpillResourceScale              DBL   Scale factor for resource usage in spill cost
676    RegAllocSpillResCostsScale              DBL   Scale factor for resource costs (secondary weighting)
677    RegAllocSpillReturnRegister             INT   Spill handling mode for return-value registers
678    RegAllocSpillSmemFlatMode               INT   Shared memory spill: flat addressing mode selector
679    RegAllocSpillSmemLatencyScale           DBL   Scale factor for shared-memory spill latency
680    RegAllocSpillTexDepScale                DBL   Scale factor for texture dependency in spill cost
681    RegAllocSpillValidateDebug              INT   Debug: validate spill correctness (0=off, >0=level)
682    RegAllocSpillXBlock                     INT   Cross-block spill mode (hoist/refill strategy)
683    RegAllocSpillXBlock2                    INT   Secondary cross-block spill mode

The cost model uses three register-count tiers (low/medium/high), each with independent scale factors (664, 665, 661). The tier boundaries are set by the heuristic knobs (663, 660). Latency scales (670, 671) multiply the estimated stall cycles, while resource scales (672, 675, 676) multiply memory bandwidth consumption. The MIO-heavy threshold (673) triggers a separate cost path when the basic block is already saturated with memory operations.
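The tier selection described above can be sketched as follows. This is a minimal illustration, not recovered code: the struct, its field names, and the exact boundary comparisons (`<` and `>=`) are assumptions; only the knob indices come from the table.

```c
/* Hypothetical container for the tier knobs; field names are illustrative. */
struct spill_tier_knobs {
    int    low_reg_count;   /* knob 663: RegAllocSpillBitLowRegCountHeur  */
    int    high_reg_count;  /* knob 660: RegAllocSpillBitHighRegCountHeur */
    double low_scale;       /* knob 664: RegAllocSpillBitLowRegScale      */
    double medium_scale;    /* knob 665: RegAllocSpillBitMediumRegScale   */
    double high_scale;      /* knob 661: RegAllocSpillBitHighRegScale     */
};

/* Pick the scale factor of the tier that reg_count falls into.
 * The boundary handling is an assumption. */
double spill_tier_scale(const struct spill_tier_knobs *k, int reg_count)
{
    if (reg_count < k->low_reg_count)
        return k->low_scale;
    if (reg_count >= k->high_reg_count)
        return k->high_scale;
    return k->medium_scale;
}
```

With boundaries of 32 and 128, a function using 64 registers would land in the medium tier and pick up the medium scale.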

B. Rematerialization (12 knobs)

Rematerialization recomputes values instead of spilling them. The allocator treats remat as a first-class spill alternative with its own budget and candidate ordering.

| Index | Name | Type | Purpose |
|---|---|---|---|
| 619 | RegAllocCtxSensitiveRemat | INT | Enable context-sensitive rematerialization |
| 622 | RegAllocEnableOptimizedRemat | INT | Enable optimized rematerialization pass |
| 627 | RegAllocLiveRemat | INT | Enable live-range-aware rematerialization |
| 632 | RegAllocMaxRematHeight | INT | Max expression DAG height for remat candidates |
| 633 | RegAllocMaxRematInst | INT | Max instructions in a remat sequence |
| 635 | RegAllocMultiRegclassRemat | INT | Enable remat across multiple register classes |
| 636 | RegAllocMultiRegRemat | INT | Enable multi-register rematerialization |
| 637 | RegAllocMultiRegRematBudget | BDGT | Budget for multi-register remat attempts |
| 650 | RegAllocRematDisableRange | IRNG | Disable remat for instruction index range lo..hi |
| 651 | RegAllocRematEnable | INT | Master enable for rematerialization (0=off) |
| 652 | RegAllocRematReuseBudget | BDGT | Budget for remat-reuse optimization attempts |
| 654 | RegAllocOrderRematCandHeuristic | INT | Heuristic for ordering remat candidates |

Knob 650 (RegAllocRematDisableRange) is unique as the only IRNG-type knob in the set, accepting "lo..hi" to disable rematerialization for a range of instruction indices -- a debugging aid for bisecting remat-related miscompiles.
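A small sketch of how a "lo..hi" value might be parsed and tested against an instruction index. The function names are hypothetical, and treating the range as inclusive and rejecting lo > hi are assumptions rather than observed ptxas behavior.

```c
#include <stdio.h>

/* Parse "lo..hi" into *lo and *hi.
 * Returns 1 on success, 0 on malformed input or trailing characters. */
int parse_index_range(const char *text, int *lo, int *hi)
{
    int a, b;
    char trailing;
    /* Exactly two conversions must succeed; a third means trailing junk. */
    if (sscanf(text, "%d..%d%c", &a, &b, &trailing) != 2)
        return 0;
    if (a > b)
        return 0;
    *lo = a;
    *hi = b;
    return 1;
}

/* True when instruction index idx lies inside the (assumed inclusive) range. */
int index_in_range(int idx, int lo, int hi)
{
    return idx >= lo && idx <= hi;
}
```

When bisecting a miscompile, one would narrow the range step by step until the offending remat site is isolated.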

C. Pre-Assignment / MAC (8 knobs)

MAC (Machine-level Allocation with Constraints) pre-assigns physical registers to high-priority operands before the main Fatpoint allocator runs. Entry: sub_94A020 (331 lines).

| Index | Name | Type | Purpose |
|---|---|---|---|
| 613 | RegAllocAvoidBankConflictMac | INT | Enable bank-conflict-aware MAC pre-assignment |
| 614 | RegAllocAvoidBankConflictMacPenalty | INT | Penalty weight for bank conflicts during MAC pre-assignment |
| 615 | RegAllocAvoidBankConflictMacWindowSize | INT | Instruction window size for bank conflict analysis |
| 628 | RegAllocMacForce | NONE | Force MAC-level pre-allocation path |
| 629 | RegAllocMacVregAllocOrder | INT | Vreg processing order during MAC allocation |
| 630 | RegAllocMacVregAllocOrderCompileTime | INT | Compile-time variant of MAC vreg allocation order |
| 646 | RegAllocPrefMacOperands | INT | MAC operand preference level (1=read, 2=write, 3=both) |
| 647 | RegAllocPrefMacOperandsMaxDepth | INT | Max operand chain depth for MAC preference propagation |

D. Coalescing (3 knobs)

Register coalescing eliminates unnecessary register-to-register copies by merging live ranges.

| Index | Name | Type | Purpose |
|---|---|---|---|
| 617 | RegAllocCoalesceBudget | BDGT | Budget limit for coalescing iterations |
| 618 | RegAllocCoalescing | NONE | Enable register coalescing |
| 634 | RegAllocMmaCoalescing | NONE | Enable MMA-specific coalescing |

E. Performance-Difference Backoff (5 knobs)

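These knobs implement progressive constraint relaxation across allocation retries: when the measured performance difference exceeds RegAllocPerfDiffLimit (645), constraints are relaxed step by step, starting at the RegAllocPerfDiffBackoffBegin (642) iteration and reaching full relaxation at the RegAllocPerfDiffBackoffEnd (643) iteration.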

| Index | Name | Type | Purpose |
|---|---|---|---|
| 641 | RegAllocPerfDiffBackoff | NONE | Enable perf-diff based constraint backoff |
| 642 | RegAllocPerfDiffBackoffBegin | INT | Iteration at which backoff begins |
| 643 | RegAllocPerfDiffBackoffEnd | INT | Iteration at which full relaxation is reached |
| 644 | RegAllocPerfDiffConflictWeight | INT | Weight factor for conflicts in perf-diff calculation |
| 645 | RegAllocPerfDiffLimit | INT | Performance difference limit triggering relaxation |
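One way to picture the backoff is as a ramp between the begin and end iterations. The sketch below assumes a linear ramp and a hypothetical function name; the actual relaxation schedule in ptxas is not documented here.

```c
/* Fraction of constraint relaxation on a retry iteration:
 * 0.0 = constraints fully enforced, 1.0 = fully relaxed.
 * The linear interpolation is an assumption made for illustration. */
double backoff_fraction(int iteration, int begin, int end,
                        int perf_diff, int limit)
{
    /* No relaxation unless the perf difference exceeds the limit (knob 645). */
    if (perf_diff <= limit || iteration < begin)
        return 0.0;
    if (end <= begin || iteration >= end)
        return 1.0;
    return (double)(iteration - begin) / (double)(end - begin);
}
```

For begin=2 and end=6, iteration 5 would be three quarters of the way to full relaxation under this model.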

F. Register Target Selection (13 knobs)

The target selection phase determines how many physical registers to aim for -- the occupancy/performance tradeoff. More registers per thread means fewer warps can execute concurrently.

| Index | Name | Type | Purpose |
|---|---|---|---|
| 687 | RegTargetList | ILIST | Comma-separated list of target register counts to try |
| 688 | RegTgtLowerLimitMMASlack | INT | Slack added to MMA lower register limit |
| 689 | RegTgtLowerLimitTCGENSlack | INT | Slack added to TCGEN lower register limit |
| 690 | RegTgtLowerLimitSPARSIFYSlack | INT | Slack added to SPARSIFY lower register limit |
| 691 | RegTgtLowerLimitDECOMPRESSSlack | INT | Slack added to DECOMPRESS lower register limit |
| 692 | RegTgtSelHigherWarpCntHeur | INT | Heuristic mode for higher-warp-count target selection |
| 693 | RegTgtSelHigherWarpCntHeurValue | DBL | Weight value for higher-warp-count heuristic |
| 694 | RegTgtSelHighLiveRangeHeurValue | DBL | Weight for high-live-range target selection heuristic |
| 695 | RegTgtSelLowerWarpCntHeur | INT | Heuristic mode for lower-warp-count target selection |
| 696 | RegTgtSelLowerWarpCntHeurValue | DBL | Weight value for lower-warp-count heuristic |
| 697 | RegTgtSelLowLiveRangeHeurValue | DBL | Weight for low-live-range target selection heuristic |
| 698 | RegTgtSelWithSMemSpillHeur | INT | Heuristic mode when shared-memory spilling is active |
| 699 | RegUsageLevel | INT | Register usage reporting level |

The four "Slack" knobs (688--691) fine-tune lower register limits for specific architectural features that have minimum register requirements: MMA (matrix multiply), TCGEN (tensor core generation), SPARSIFY (structured sparsity), DECOMPRESS (decompression).
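The occupancy side of the tradeoff that target selection navigates can be estimated with simple arithmetic. This sketch is an approximation under stated assumptions: it ignores register allocation granularity and every other occupancy limiter, and the function name is hypothetical.

```c
/* Upper bound on resident warps per SM imposed by register usage alone.
 * regfile_regs and hw_max_warps are architecture-dependent inputs. */
int max_warps_by_registers(int regs_per_thread, int regfile_regs,
                           int hw_max_warps)
{
    const int threads_per_warp = 32;
    if (regs_per_thread <= 0)
        return hw_max_warps;
    int warps = regfile_regs / (regs_per_thread * threads_per_warp);
    return warps < hw_max_warps ? warps : hw_max_warps;
}
```

Taking a 65536-entry register file and a 64-warp limit as example inputs (both vary by architecture), 64 registers per thread allows at most 32 warps, and 128 registers per thread halves that to 16.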

G. General Allocation Control (12 knobs)

| Index | Name | Type | Purpose |
|---|---|---|---|
| 616 | RegAllocCacheSize | INT | Cache size parameter for interference graph |
| 620 | RegAllocDebugConflictDetails | INT | Debug: print conflict graph details (verbosity level) |
| 621 | RegAllocDepDistanceThresholdForHighConflicts | INT | Dep-distance threshold above which high-conflict registers are deprioritized |
| 624 | RegAllocIndexAbiScratchRegs | INT | Index into ABI scratch register set |
| 639 | RegAllocNumNonSpillTrials | INT | Non-spill allocation trials before allowing spills |
| 640 | RegAllocOptLevel | INT | Regalloc optimization level (controls aggressiveness) |
| 648 | RegAllocPrintDetails | NONE | Enable detailed regalloc diagnostic printing |
| 649 | RegAllocRefineInf | INT | Refine interference graph iteration limit |
| 653 | RegAllocOptimizeABI | INT | Enable ABI-aware register optimization (setmaxnreg handling) |
| 655 | RegAllocReportMaxRegsAllowed | INT | Report maximum registers allowed per thread (diagnostic) |
| 656 | RegAllocCudaSmemSpillEnable | INT | Enable CUDA shared memory spill path |
| 685 | RegAllocUserSmemBytesPerCTA | INT | User-specified shared memory bytes per CTA (overrides computed) |

H. Miscellaneous (8 knobs)

| Index | Name | Type | Purpose |
|---|---|---|---|
| 623 | RegAllocEstimatedLoopIterations | STR | String hint providing estimated loop iteration counts for spill cost weighting |
| 625 | RegAllocL1SpillRegThres | INT | Register count threshold for L1 spill mode activation |
| 626 | RegAllocL1SpillScale | DBL | Scale factor for L1 cache spill cost |
| 631 | RegAllocMaxGmmaDisallowedReg | INT | Max registers disallowed during GMMA (warp group MMA) allocation |
| 638 | RegAllocNoRetargetPrefs | NONE | Disable retarget-preference optimization |
| 657 | RegAllocSortRegs | INT | Sorting order for register candidates during allocation |
| 684 | RegAllocThresholdForDiscardConflicts | INT | Interference count above which conflicts are discarded (default 50) |
| 686 | RegAttrReuseVectorBudget | BDGT | Budget for register-attribute vector reuse optimization |

Scheduling Knobs (89 knobs, indices 229--978)

The instruction scheduler is the second most heavily parameterized subsystem after register allocation. Its 89 knobs span two contiguous blocks (indices 738--811 for the core Sched* set, and 569--574 for the PostSched* set) plus 11 scattered entries for scheduling-adjacent features. All names decoded from ROT13 strings at 0x21B6CB0--0x21BE100, registered in ctor_005 at code addresses 0x411FF0--0x420A00.

The knobs control every aspect of the list scheduler: how latencies are modeled, which functional units are treated as busy, how aggressively cross-block motion is attempted, and how register pressure feedback loops interact with the priority function. Three Blackwell-era SchedResBusy* knobs (QMMA at 964, OMMA at 977, MXQMMA at 978) sit outside the main block because they were appended in a later toolkit version for new MMA unit types.

A. Resource Busy Overrides (28 knobs)

The SchedResBusy* knobs override the hardware-profile resource busy times for individual functional units. Each knob sets the number of cycles the named unit is considered occupied after issuing an instruction to it. When unset, the scheduler uses the value from the latency model's per-SM hardware profile. Setting a SchedResBusy* knob to 0 effectively makes the unit appear always free to the scheduler.
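The override rule reads naturally as a two-way selection. In this sketch the struct and function names are hypothetical, and the "is set" flag stands in for whatever mechanism the knob runtime slot uses to record that a value was supplied.

```c
/* Hypothetical view of one SchedResBusy* knob after parsing. */
struct busy_override {
    int is_set;   /* nonzero when the knob was given a value */
    int cycles;   /* busy time requested by the knob         */
};

/* Busy time the scheduler would use for a functional unit:
 * the knob value when set, otherwise the per-SM hardware profile value. */
int effective_busy_cycles(struct busy_override knob, int profile_cycles)
{
    return knob.is_set ? knob.cycles : profile_cycles;
}
```

Note that an explicit 0 is different from "unset": it is a real override that yields zero busy cycles, which is why the unit then looks permanently free.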

Two knobs accept string values instead of integers: SchedResBusyOp and SchedResBusyMachineOpcode take a string identifying a specific opcode or machine opcode to override, enabling per-instruction busy-time tuning.

| Index | Name | Type | Functional Unit |
|---|---|---|---|
| 781 | SchedResBusyADU | INT | Address divergence unit |
| 782 | SchedResBusyALU | INT | Arithmetic logic unit |
| 783 | SchedResBusyCBU | INT | Convergence barrier unit |
| 784 | SchedResBusyDMMA | INT | Double-precision MMA unit |
| 785 | SchedResBusyFMA | INT | Fused multiply-add unit |
| 786 | SchedResBusyFMAWide | INT | Wide FMA unit (multi-cycle) |
| 787 | SchedResBusyFP16 | INT | Half-precision FP unit |
| 788 | SchedResBusyFP64 | INT | Double-precision FP unit |
| 789 | SchedResBusyGMMA | INT | Warp group MMA (WGMMA) unit |
| 790 | SchedResBusyHMMA16 | INT | Half-precision MMA, 16-wide |
| 791 | SchedResBusyHMMA16816 | INT | Half-precision MMA, 16x8x16 shape |
| 792 | SchedResBusyHMMA1688 | INT | Half-precision MMA, 16x8x8 shape |
| 793 | SchedResBusyHMMA32 | INT | Half-precision MMA, 32-wide |
| 794 | SchedResBusyIMMA | INT | Integer MMA unit |
| 795 | SchedResBusyLSU | INT | Load/store unit |
| 796 | SchedResBusyLSUL1 | INT | Load/store unit (L1 path) |
| 797 | SchedResBusyOp | STR | Per-opcode override (string: opcode name) |
| 798 | SchedResBusyMachineOpcode | STR | Per-machine-opcode override (string) |
| 799 | SchedResBusyUDP | INT | Uniform datapath unit |
| 800 | SchedResBusyXU64 | INT | Extended-precision (64-bit) unit |
| 964 | SchedResBusyQMMA | INT | Quarter-precision MMA unit (Blackwell) |
| 977 | SchedResBusyOMMA | INT | Octal MMA unit (Blackwell) |
| 978 | SchedResBusyMXQMMA | INT | MX-quantized MMA unit (Blackwell) |

The four HMMA variants (790--793) correspond to different tensor core shapes: HMMA16 for 16-wide half-precision, HMMA1688 for the 16x8x8 tile used on Volta/Turing, HMMA16816 for the 16x8x16 tile used on Ampere+, and HMMA32 for 32-wide half-precision operations. IMMA (794) handles integer tensor operations (INT8/INT4).

B. Latency Overrides (12 knobs)

These override the default latency values the scheduler uses for dependency edges. The SchedRead* prefix indicates read-after-write latencies; the SchedTex* and SchedLDS* variants target texture and shared-memory operations specifically.

| Index | Name | Type | Purpose |
|---|---|---|---|
| 757 | SchedLDSLatency | INT | Shared memory (LDS) load latency in cycles |
| 770 | SchedReadAvailTarget | INT | Target availability delay for read operands |
| 771 | SchedReadLatency | INT | Default read-after-write latency |
| 772 | SchedReadSBBaseLatency | INT | Scoreboard base read latency |
| 773 | SchedReadSBBaseUseLSULat | BOOL | Use LSU latency as scoreboard base |
| 774 | SchedReadSbDmmaLatency | INT | Scoreboard read latency for DMMA operations |
| 775 | SchedReadSbLdgstsLatency | INT | Scoreboard read latency for LDGSTS (async copy) operations |
| 802 | SchedSyncsLatency | INT | Synchronization barrier latency |
| 803 | SchedSyncsPhasechkLatency | INT | Phase-check synchronization latency |
| 804 | SchedTex2TexIssueRate | INT | Minimum cycles between back-to-back texture issues |
| 808 | SchedTexLatency | INT | Texture fetch latency in cycles |
| 811 | SchedXU64Latency | INT | Extended 64-bit unit latency |

C. Register Pressure Feedback (8 knobs)

The scheduler's priority function incorporates register pressure awareness through these knobs. They control how aggressively the scheduler tries to reduce live register count: SchedMaxRTarget sets the target register count, while the SchedMaxRLive* knobs define slack bands around that target. SchedReduceIncLimit* throttles how quickly the scheduler increases its pressure-reduction efforts.

| Index | Name | Type | Purpose |
|---|---|---|---|
| 758 | SchedLocalRefRatio | DBL | Local reference ratio weight in priority function |
| 760 | SchedMaxRLiveCarefulSlack | INT | Slack before aggressive register pressure reduction |
| 761 | SchedMaxRLiveOKslack | INT | Slack band where register pressure is acceptable |
| 762 | SchedMaxRLiveOKslackColdBlocks | INT | OK-slack for cold (infrequently executed) blocks |
| 763 | SchedMaxRTarget | INT | Target maximum register count for scheduling |
| 776 | SchedReduceIncLimit | INT | Limit on incremental register pressure reduction steps |
| 778 | SchedReduceIncLimitHigh | INT | Upper bound on incremental reduction |
| 779 | SchedReduceRegBudget | BDGT | Budget for register-pressure-reduction iterations |
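The slack bands around SchedMaxRTarget can be pictured as a three-way classification of the current live register count. This is an illustrative model only: the enum, the function name, and the assumption that both slacks are added on top of the target (with the careful slack at least as large as the OK slack) are not taken from the binary.

```c
enum pressure_band { PRESSURE_OK, PRESSURE_CAREFUL, PRESSURE_AGGRESSIVE };

/* Classify live register count against the target (knob 763) and the
 * OK / careful slack values (knobs 761 and 760). Band semantics are assumed. */
enum pressure_band classify_pressure(int live_regs, int target,
                                     int ok_slack, int careful_slack)
{
    if (live_regs <= target + ok_slack)
        return PRESSURE_OK;
    if (live_regs <= target + careful_slack)
        return PRESSURE_CAREFUL;
    return PRESSURE_AGGRESSIVE;
}
```

Under this model, a cold block would simply use SchedMaxRLiveOKslackColdBlocks (762) in place of the regular OK slack.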

D. Cross-Block Scheduling (8 knobs)

Cross-block motion allows the scheduler to move instructions across basic block boundaries for better latency hiding. These knobs control the scope and cost limits of cross-block speculation.

| Index | Name | Type | Purpose |
|---|---|---|---|
| 742 | SchedCrossBlock | INT | Master cross-block scheduling mode selector |
| 743 | SchedCrossBlockInstsToSpeculate | INT | Max instructions to speculate across block boundary |
| 744 | SchedCrossBlockLimit | INT | Overall cross-block motion limit |
| 745 | SchedCrossBlockSpeculate | INT | Speculation mode for cross-block motion |
| 746 | SchedCrossBlockSpeculateBudget | BDGT | Budget for cross-block speculation attempts |
| 747 | SchedCrossBlockTexToSpeculate | INT | Max texture instructions to speculate across blocks |
| 288 | EnableXBlockSchedInMultiBlockInMMALoop | INT | Enable cross-block scheduling within multi-block MMA loops |
| 738 | SbXBlock | INT | Cross-block scoreboard tracking mode |

E. Texture Batching (7 knobs)

Texture operations have high latency, so the scheduler groups them into batches to maximize memory-level parallelism. These knobs control batch formation and target selection.

| Index | Name | Type | Purpose |
|---|---|---|---|
| 741 | SchedCountLoadsPerTex | INT | Max loads to count per texture operation |
| 755 | SchedLastHybridInBBWithIssueRate | INT | Last hybrid scheduler position in BB with issue rate |
| 756 | SchedLDGBatchDelayBias | INT | Delay bias for global load batching |
| 805 | SchedTexBatchTargetSelectRegisterTarget | INT | Batch formation: prefer register-target-aware grouping |
| 806 | SchedTexBatchTargetSelectSchedulerTarget | INT | Batch formation: prefer scheduler-target grouping |
| 807 | SchedTexBatchTargetTexReadTogether | INT | Batch formation: prefer grouping tex reads together |
| 931 | UseGroupOpexesForResourceScheduling | INT | Use grouped opexes for resource scheduling decisions |

F. Dependency Modeling (6 knobs)

These control how the scheduler builds and refines the dependency graph between instructions.

| Index | Name | Type | Purpose |
|---|---|---|---|
| 753 | SchedAddDepFromGlobalMembarToCB | INT | Add dependency edge from global membar to CB unit |
| 759 | SchedMaxMemDep | INT | Max memory dependencies per instruction |
| 764 | SchedMemNoAlias | NONE | Assume no memory aliasing (aggressive scheduling) |
| 777 | SchedReduceRefPsuedoDepLimit | INT | Limit on reducing reference pseudo-dependencies |
| 780 | SchedRefineMemDepBudget | BDGT | Budget for memory dependency refinement iterations |
| 801 | SchedSymmetricAntiDepConflictWindow | BOOL | Enable symmetric anti-dependency conflict window |

G. Post-Scheduler (6 knobs)

The post-scheduler runs after register allocation (phase 103) and adjusts the schedule to account for actual register assignments. It primarily inserts stall cycles and adjusts issue delays.

| Index | Name | Type | Purpose |
|---|---|---|---|
| 569 | PostSchedAdvLatencyHiding | BOOL | Enable advanced latency hiding in post-scheduler |
| 570 | PostSchedBudget | BDGT | Budget for post-scheduler iterations |
| 571 | PostSchedEarlyStall | INT | Early stall insertion mode |
| 572 | PostSchedForceReverseOrder | INT | Force reverse traversal order in post-scheduler |
| 573 | PostSchedIssueDelay | BOOL | Enable issue delay computation |
| 574 | PostSchedIssueDelayForNoWBStalls | BOOL | Compute issue delays for no-writeback stalls |

H. Ordering and Preservation (5 knobs)

These control whether the scheduler preserves the original instruction order (from the optimizer or PTX source) versus reordering freely.

| Index | Name | Type | Purpose |
|---|---|---|---|
| 229 | ForcePreserveSchedOrderSameNvOpt | INT | Force preserve scheduling order from NvOpt pass |
| 594 | PreserveSchedOrder | NONE | Preserve source scheduling order (boolean) |
| 595 | PreserveSchedOrderSame | BOOL | Preserve scheduling order for same-priority instructions |
| 751 | SchedForceReverseOrder | INT | Force reverse scheduling order (bottom-up) |
| 769 | SchedPrefFurthestDep | BOOL | Prefer instructions with furthest dependency |

I. Scoreboard (4 knobs)

The hardware scoreboard tracks instruction completion. These knobs tune how the scheduler predicts scoreboard occupancy to avoid stalls.

| Index | Name | Type | Purpose |
|---|---|---|---|
| 738 | SbXBlock | INT | Cross-block scoreboard tracking mode |
| 739 | SbXBlockLLSB | INT | Cross-block long-latency scoreboard tracking |
| 772 | SchedReadSBBaseLatency | INT | Scoreboard base read latency |
| 773 | SchedReadSBBaseUseLSULat | BOOL | Use LSU latency as scoreboard base |

Note: SbXBlock appears in both cross-block (D) and scoreboard (I) categories because it serves both purposes -- it controls whether the scoreboard state propagates across block boundaries, which is a prerequisite for cross-block scheduling correctness.

J. MMA Coupling (3 knobs)

Matrix multiply-accumulate instructions on certain architectures share functional unit resources. These knobs control how the scheduler models coupled execution.

| Index | Name | Type | Purpose |
|---|---|---|---|
| 752 | SchedFP16CoupledMaxellPascal | INT | FP16 coupled execution mode on Maxwell/Pascal |
| 754 | SchedHmmaImmaBmmaCoupledAmperePlus | INT | HMMA/IMMA/BMMA coupled execution on Ampere+ |
| 366 | GroupOpexesForResourceSchedulingThreshold | DBL | Threshold for grouping opexes in resource scheduling |

K. Scheduler Model (4 knobs)

These control how the scheduler models the hardware pipeline and instruction movement costs.

| Index | Name | Type | Purpose |
|---|---|---|---|
| 765 | SchedModelIdentityMove | INT | Model identity moves as zero-latency |
| 766 | SchedModelSharedPhysicalPipe | INT | Model shared physical pipe contention |
| 767 | SchedMultiRefDeltaLive | INT | Delta-live threshold for multi-reference instructions |
| 768 | SchedMultiRefDeltaLiveMinRefs | INT | Minimum reference count for delta-live calculation |

L. Budget, Scale, and Control (7 knobs)

General scheduling control knobs covering budgets, loop iteration estimates, the master disable switch, and validation.

| Index | Name | Type | Purpose |
|---|---|---|---|
| 740 | SchedBumpScaleAugmentFactor | DBL | Augment factor for priority bump scaling |
| 748 | SchedDisableAll | INT | Master disable for all scheduling passes |
| 749 | SchedDynBatchBudget | BDGT | Budget for dynamic batching iterations |
| 750 | SchedEstimatedLoopIterations | STR | Estimated loop iterations (string: per-loop hints) |
| 809 | ScheduleKILs | INT | Schedule KIL (kill/discard) instructions |
| 810 | SchedValidateLiveness | INT | Enable liveness validation after scheduling |
| 811 | SchedXU64Latency | INT | XU64 unit latency override |

Disable Switches (75 knobs)

The disable switches are boolean knobs that turn off specific passes, optimizations, or workarounds. All 75 knobs containing "Disable" were decoded from ROT13 strings at 0x21BDE30--0x21BFA10. Nearly all are OKT_NONE (boolean) type -- setting them with no value or any value disables the corresponding feature. The single exception is RegAllocRematDisableRange, which is OKT_IRNG and accepts a "lo..hi" instruction index range.

The bare Disable knob at 0x21BE860 appears to be a master pass-disable switch. SchedDisableAll is the master scheduler disable. DisablePragmaKnobs prevents PTX .pragma directives from setting knobs -- a meta-level control that protects the knob system itself.
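The ROT13 decoding used to recover these names is simple enough to reproduce. The sketch below decodes character by character and compares case-insensitively, matching the comparison style noted in the summary table at the top of this page; the function names are hypothetical, and whether ptxas decodes the stored name or encodes the query is not established here.

```c
#include <ctype.h>

/* ROT13 for a single character; non-letters pass through unchanged. */
char rot13_char(char ch)
{
    if (ch >= 'a' && ch <= 'z')
        return (char)('a' + (ch - 'a' + 13) % 26);
    if (ch >= 'A' && ch <= 'Z')
        return (char)('A' + (ch - 'A' + 13) % 26);
    return ch;
}

/* Returns 1 when the ROT13-encoded name decodes to `plain`, ignoring case. */
int encoded_name_matches(const char *encoded, const char *plain)
{
    while (*encoded && *plain) {
        char decoded = rot13_char(*encoded);
        if (tolower((unsigned char)decoded) != tolower((unsigned char)*plain))
            return 0;
        ++encoded;
        ++plain;
    }
    return *encoded == '\0' && *plain == '\0';
}
```

For example, the string `FpurqQvfnoyrNyy` as it would appear in the binary decodes to `SchedDisableAll`.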

A. Workaround (WAR) Switches (9 knobs)

These disable hardware or compiler bug workarounds. Each War_SW* knob corresponds to an NVIDIA internal bug tracker ID. Disabling a WAR reverts to the unpatched behavior -- useful for bisecting whether a WAR is causing a regression.

| Name | Feature Disabled |
|---|---|
| DisableWar_SW200655588 | Workaround for bug SW-200655588 |
| DisableWar_SW2549067 | Workaround for bug SW-2549067 |
| DisableWar_SW2789503 | Workaround for bug SW-2789503 |
| DisableWar_SW2965144 | Workaround for bug SW-2965144 |
| DisableWar_SW3093632 | Workaround for bug SW-3093632 |
| DisableForwardProgressWar1842954 | Forward-progress guarantee workaround (bug 1842954) |
| DisableForwardProgressWar1842954ForDeferBlocking | Same WAR, variant for defer-blocking scheduling |
| DisableHMMARegAllocWar | HMMA (half-precision MMA) register allocation workaround |
| DisableMultiViewPerfWAR | Multi-view rendering performance workaround |

B. Memory and Addressing (11 knobs)

These control address computation, memory access conversion, and shared-memory optimizations.

| Name | Feature Disabled |
|---|---|
| DisableCvtaForGenmemToSmem | Generic-to-shared address space conversion via cvta |
| DisableDoubleIndexedAddress | Double-indexed addressing mode optimization |
| DisableErrbarAfterMembar | Error barrier (BAR.SYNC 15) insertion after membar.sys |
| DisableForceLDCTOLDCUConv | LDC to LDCU (constant uniform load) conversion |
| DisableImplicitMemDesc | Implicit memory descriptor inference |
| DisableLDCU256 | LDCU.256 -- 256-bit constant uniform load |
| DisableLDCUWithURb | LDCU with uniform register base addressing |
| DisableLongIntArithAddressFolding | Long integer arithmetic folding into address computation |
| DisableRemoveSmemLea | Shared memory LEA (load effective address) removal |
| DisableSmemSizePerCTACheck | Shared memory size per CTA validation check |
| DisableStrideOnAddr | Stride-on-address optimization (base+stride*index folding) |

C. Register Allocation and Uniform Registers (9 knobs)

These control uniform register (UR) file usage, live range management, and remat-related disable ranges.

| Name | Type | Feature Disabled |
|---|---|---|
| DisableConvergentWriteUR | NONE | Convergent write-to-UR optimization |
| DisableExtendedLiveRange | NONE | Extended live range optimization |
| DisableU128 | NONE | 128-bit uniform register support |
| DisableURLiveAcrossConvBound | NONE | UR liveness across convergence boundaries |
| DisableURLivenessTradeOff | NONE | UR liveness trade-off heuristic |
| DisableUreg | NONE | Uniform register file usage entirely |
| MercuryDisableLegalizationOfTexToURBound | NONE | Mercury tex-to-UR-bound legalization |
| RegAllocRematDisableRange | IRNG | Rematerialization for instruction index range lo..hi |
| RematDisableTexThrottleRegTgt | NONE | Texture throttle register target during remat |

D. Loop Optimization (6 knobs)

| Name | Feature Disabled |
|---|---|
| DisableAlignHotLoops | Hot loop alignment (NOP padding for fetch efficiency) |
| DisableDeadLoopElimination | Dead loop elimination pass |
| DisableLoopLevelVaryingAnalysis | Loop-level varying/invariant analysis |
| DisableLoopPrecheckForYields | Loop pre-check insertion for yield points (cooperative groups) |
| DisableMeshVCTALoop | Mesh shader virtual CTA loop optimization |
| DisablePartialUnrollOverflowCheck | Overflow check during partial loop unrolling |

E. Code Motion and Scheduling (6 knobs)

| Name | Feature Disabled |
|---|---|
| DisableLatTransitivity | Latency transitivity in scheduling dependency chains |
| DisableMoveCommoning | MOV-based equivalence propagation (commoning walker) |
| DisableNestedHoist | Nested code hoisting (loop-invariant-like motion) |
| DisableOffDeck | Off-deck scheduling (prefetch to off-deck buffer) |
| DisableSourceOrder | Source-order scheduling constraint |
| SchedDisableAll | Master switch: all scheduling passes |

F. Vectorization (4 knobs)

| Name | Feature Disabled |
|---|---|
| DisableFastvecEnhancement | Fast vectorization enhancement pass |
| DisableHalfPartialVectorWrites | Half-precision partial vector write coalescing |
| DisableReadVectorization | Load vectorization (coalescing scalar reads into vector loads) |
| DisableWriteVectorization | Store vectorization (coalescing scalar writes into vector stores) |

G. Predication and Branching (4 knobs)

| Name | Feature Disabled |
|---|---|
| CmpToMovPredCrossBlockDisable | CMP-to-MOV predicate propagation across basic blocks |
| DisableBranchPredInput | Branch predicate input optimization |
| DisableCmpToPred | CMP-to-predicate conversion |
| DisablePredication | Predication pass (phase 63, OriDoPredication) |

H. Synchronization and Barriers (2 knobs)

| Name | Feature Disabled |
|---|---|
| DisableRedundantBarrierRemoval | Redundant barrier removal pass |
| DisableStageAndFence | Stage-and-fence synchronization insertion |

I. Dead Code and Store Elimination (2 knobs)

| Name | Feature Disabled |
|---|---|
| DisableDeadStoreElimination | Dead store elimination pass |
| DisableStraightenInSimpleLiveDead | Straightening within simple live/dead analysis |

J. Control Flow Merging (5 knobs)

| Name | Feature Disabled |
|---|---|
| DisableEarlyExtractBCO | Early extraction of BCO (branch code optimization objects) |
| DisableMergeEquivalentConditionalFlow | Phase 133: tail merging of equivalent conditional branches |
| DisableMergeFp16MovPhi | FP16 MOV-PHI merge optimization |
| DisableMergeSamRamBlocks | SAM/RAM block merging (surface/texture access coalescing) |
| DisableOptimizeHotColdFlow | Hot/cold flow optimization (code layout splitting) |

K. Pass Control (2 knobs)

| Name | Feature Disabled |
|---|---|
| Disable | Master disable switch (bare name) |
| DisablePragmaKnobs | PTX .pragma-based knob overrides |

L. Sanitizer (3 knobs)

These control the address sanitizer instrumentation for different memory spaces. When the sanitizer is active, these knobs can selectively disable checking for one space while keeping the others.

| Name | Feature Disabled |
|---|---|
| SanitizeDisableGlobal | Address sanitizer for global memory accesses |
| SanitizeDisableLocal | Address sanitizer for local memory accesses |
| SanitizeDisableShared | Address sanitizer for shared memory accesses |

M. Floating Point (2 knobs)

| Name | Feature Disabled |
|---|---|
| FPFoldDisable | Floating-point constant folding |
| FPRefactoringDisable | Floating-point expression refactoring |

N. Miscellaneous (10 knobs)

| Name | Feature Disabled |
|---|---|
| DisableBW225LongIntArith | BW225 (Blackwell) long integer arithmetic optimization |
| DisableBptTrapNoReturn | BPT.TRAP no-return semantics (debugger breakpoint trap) |
| DisableDependentConstExpr | Dependent constant expression optimization |
| DisableISBESharing | ISBE (indexed set buffer entry) sharing for bindless textures |
| DisableMarkF2FPackbTo16Bit | Marking F2F.PACKB as 16-bit operation |
| DisableNonUniformQuadDerivatives | Non-uniform quad derivative computation |
| DisablePadding | NOP padding insertion (alignment and scheduling) |
| DisablePicCodeGen | Position-independent code generation |
| DisableSopSr | SOP (scalar operation) on special registers (SR) |
| DisableSuperUdp | Super-UDP (enhanced uniform datapath) optimization |

Rematerialization Knobs (35 knobs)

Rematerialization knobs control the three dedicated remat pipeline phases (Phase 28: SinkRemat, Phase 54: early remat mode activation, Phase 69: OriDoRemat) and the cost model that decides whether recomputing a value is cheaper than keeping it live in a register. These are separate from the 12 RegAlloc*Remat* knobs documented above in section B, which control allocator-integrated rematerialization. The distinction matters: allocator-integrated remat fires during register allocation itself (sub_93AC90), while these knobs tune the standalone pre-allocation and post-predication remat passes.

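The 35 Remat* and SinkRemat* knobs form two contiguous blocks in the descriptor table. One related outlier, MovWeightForRemat (index 475), sits outside both and is covered at the end of this section. The two blocks are: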

  • Remat* (27 knobs, indices 702--728): Late rematerialization (Phase 69) and shared cost model
  • SinkRemat* (8 knobs, indices 824--831): Early sink+remat (Phase 28)

A. Remat Enable/Disable (5 knobs)

| Index | Name | Type | Purpose |
|---|---|---|---|
| 709 | RematDisableTexThrottleRegTgt | INT | Disable texture-throttle register targeting during remat |
| 710 | RematEarlyEnable | INT | Enable Phase 54 early remat mode activation |
| 711 | RematEnable | INT | Master enable for Phase 69 late rematerialization |
| 712 | RematEnablePReg | NONE | Enable predicate register rematerialization (boolean flag) |
| 726 | RematStressTest | NONE | Force all remat candidates to be rematerialized (debug, boolean flag) |

Knob 711 (RematEnable) is the master switch. When zeroed via -knob RematEnable=0, Phase 69 skips its core loop entirely. Knob 710 (RematEarlyEnable) independently controls Phase 54's mode flag write (ctx+1552 = 4). Knob 726 (RematStressTest) is a debug-only boolean that forces every candidate to be rematerialized regardless of profitability -- useful for stress-testing correctness.

B. Remat Cost Model (10 knobs)

| Index | Name | Type | Purpose |
|---|---|---|---|
| 702 | RematAbsCostFactor | DBL | Absolute cost scaling factor for remat profitability |
| 703 | RematBackOffRegTargetFactor | DBL | Back-off factor for register pressure target during remat |
| 705 | RematColdBlockRatio | DBL | Cost discount ratio for cold (rarely executed) blocks |
| 713 | RematGlobalCostFactor | DBL | Global cost multiplier for cross-block rematerialization |
| 714 | RematGlobalLowCostFactor | DBL | Cost factor for low-cost (cheap ALU: MOV, IADD, LOP3) remat |
| 716 | RematLdcCost | DBL | Cost weight assigned to LDC (load-from-constant-bank) remat |
| 719 | RematMemCost | DBL | Cost weight for memory-sourced (LD/ST) rematerialization |
| 722 | RematReadUAsLdc | INT | Treat uniform address reads as LDC for cost classification |
| 727 | RematTexInstRatioThreshold | DBL | Texture instruction ratio threshold for throttle activation |
| 728 | RematTexThrottleRegTgtScale | DBL | Scale factor for register target when texture throttle is active |

These 10 knobs parameterize the remat profitability function (sub_90B790). The cost model computes remat_cost = instruction_cost * factor and compares it against the register savings. The DBL-typed knobs (9 of 10; only RematReadUAsLdc is INT) are floating-point multipliers that allow fine-grained tuning. The texture-specific knobs (727, 728) implement a throttle: when the ratio of texture instructions exceeds the threshold, the register target is scaled to avoid excessive register use that would harm texture unit throughput.
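The comparison and the texture throttle can be sketched in a few lines. Both function names are hypothetical, and the strict inequalities are assumptions; the sketch only mirrors the formula and the threshold rule stated in the paragraph, not the body of sub_90B790.

```c
/* remat_cost = instruction_cost * factor, compared against register savings.
 * Returns 1 when recomputing is cheaper than keeping the value live. */
int remat_is_profitable(double instruction_cost, double factor,
                        double register_savings)
{
    double remat_cost = instruction_cost * factor;
    return remat_cost < register_savings;
}

/* Texture throttle (knobs 727, 728): once the texture instruction ratio
 * exceeds the threshold, the register target is multiplied by the scale. */
double throttled_reg_target(double reg_target, double tex_inst_ratio,
                            double ratio_threshold, double target_scale)
{
    if (tex_inst_ratio > ratio_threshold)
        return reg_target * target_scale;
    return reg_target;
}
```

Raising a cost factor makes the corresponding class of candidates less likely to pass the comparison, which is how these multipliers bias the pass without changing its structure.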

C. Register Pressure Control (5 knobs)

| Index | Name | Type | Purpose |
|---|---|---|---|
| 706 | RematConservativeRegSlack | INT | Extra registers to reserve beyond target (conservative mode) |
| 708 | RematCostRegLimit | INT | Max register count considered during cost analysis |
| 718 | RematMaxRegCount | INT | Absolute ceiling on registers for remat decisions |
| 723 | RematRegTargetFactor | DBL | Scaling factor for computing the register pressure target |
| 724 | RematRegTargetTrialLimit | INT | Max iterations when searching for optimal register target |

The register target is the pressure level the pass aims for: rematerialization becomes profitable once live pressure rises above it. RematRegTargetFactor (723) scales the occupancy-derived target. RematRegTargetTrialLimit (724) caps the binary-search iterations in the target-finding loop. RematMaxRegCount (718) is a hard ceiling -- if current pressure exceeds this value, the remat pass operates in aggressive mode.

D. Instruction and Code Limits (2 knobs)

| Index | Name | Type | Purpose |
|---|---|---|---|
| 707 | RematCostInstLimit | INT | Max instruction count for inclusion in cost model |
| 715 | RematInflationSlack | INT | Allowed code-size inflation slack (extra instructions from remat) |

RematCostInstLimit (707) prevents the cost model from analyzing extremely large remat sequences. RematInflationSlack (715) limits how many extra instructions rematerialization may introduce before the pass backs off.

E. Placement Control (4 knobs)

| Index | Name | Type | Purpose |
|---|---|---|---|
| 717 | RematLowCostPlacementLimit | DBL | Max placement distance for low-cost remat candidates |
| 720 | RematMinDistance | INT | Minimum def-to-remat distance (instructions) before remat is attempted |
| 721 | RematPlacementLookback | INT | Lookback window size for placement-site search |
| 725 | RematSortRematChain | INT | Sort remat chain by priority before placement (0=off, 1=on) |

These knobs control where rematerialized instructions are placed relative to their uses. RematMinDistance (720) ensures remat is not attempted for short live ranges where the original definition is close enough. RematPlacementLookback (721) limits how far back the placement algorithm scans when searching for a profitable insertion point.

F. Remat Budget (1 knob)

| Index | Name | Type | Purpose |
|---|---|---|---|
| 704 | RematBudget | BDGT | Optimization budget for the late remat pass (phase 69) |

BDGT-typed knobs carry a primary value and a secondary counter. The budget is decremented as each remat decision is committed. When exhausted (secondary reaches zero), the pass stops processing further candidates. This provides a deterministic cap on compile-time cost.
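The budget mechanism amounts to a countdown that gates further work. In the sketch below the struct layout and function names are hypothetical; only the "primary value plus secondary counter" shape and the stop-at-zero rule come from the description above.

```c
/* Hypothetical model of a BDGT knob value. */
struct budget_knob {
    int primary;    /* configured budget value                        */
    int remaining;  /* secondary counter, decremented per commitment */
};

/* Reload the counter from the configured value. */
void budget_reset(struct budget_knob *b)
{
    b->remaining = b->primary;
}

/* Returns 1 and consumes one unit while budget remains, 0 once exhausted. */
int budget_try_consume(struct budget_knob *b)
{
    if (b->remaining <= 0)
        return 0;
    b->remaining--;
    return 1;
}
```

A pass would call the consume function before committing each decision and stop iterating on the first 0, which also makes a budget a convenient bisection tool: lowering it by one removes exactly the last committed transformation.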

G. SinkRemat (Phase 28) Knobs (8 knobs, indices 824--831)

| Index | Name | Type | Purpose |
|---|---|---|---|
| 824 | SinkRematAbsCostLimit | DBL | Absolute cost ceiling for sinking+remat decisions |
| 825 | SinkRematBudget | BDGT | Optimization budget for the sink+remat pass |
| 826 | SinkRematDeltaRegsRatio | DBL | Register pressure delta ratio threshold for sink profitability |
| 827 | SinkRematEnable | INT | Master enable for Phase 28 SinkRemat |
| 828 | SinkRematMinDefPlaceDist | INT | Minimum definition-to-placement distance for sinking |
| 829 | SinkRematMinPlaceRefDist | INT | Minimum placement-to-reference distance for sinking |
| 830 | SinkRematMultiRefXBlkUsesPenaltyFactor | DBL | Penalty multiplier for multi-reference cross-block uses |
| 831 | SinkRematPredPenaltyFactor | DBL | Penalty multiplier for sinking predicated instructions |

Phase 28's SinkRemat pass (entry: sub_913A30, core: sub_A0F020) sinks instructions closer to their uses and marks remat candidates. Knob 827 (SinkRematEnable) is the master switch. The distance knobs (828, 829) prevent unprofitable micro-sinks. The penalty factors (830, 831) make the cost model more conservative for predicated instructions and for instructions with multiple cross-block uses, where sinking may duplicate code along multiple paths.
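How the distance and penalty knobs might combine is sketched below. Everything here is illustrative: the struct, the function names, the `>=` comparisons, and the choice to apply penalties multiplicatively to a base cost are assumptions, not a reconstruction of sub_A0F020.

```c
/* Hypothetical snapshot of the four SinkRemat gating knobs. */
struct sink_knobs {
    int    min_def_place_dist;      /* knob 828 */
    int    min_place_ref_dist;      /* knob 829 */
    double multi_ref_xblk_penalty;  /* knob 830 */
    double pred_penalty;            /* knob 831 */
};

/* Reject micro-sinks: both distances must reach their minimums. */
int sink_distance_ok(const struct sink_knobs *k,
                     int def_to_place, int place_to_ref)
{
    return def_to_place >= k->min_def_place_dist &&
           place_to_ref >= k->min_place_ref_dist;
}

/* Inflate the base cost for predicated or multi-use cross-block candidates. */
double sink_penalized_cost(const struct sink_knobs *k, double base_cost,
                           int is_predicated, int has_multi_xblk_uses)
{
    double cost = base_cost;
    if (is_predicated)
        cost *= k->pred_penalty;
    if (has_multi_xblk_uses)
        cost *= k->multi_ref_xblk_penalty;
    return cost;
}
```

The penalized cost would then be compared against a ceiling such as SinkRematAbsCostLimit (824) before the sink is accepted.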

| Index | Name | Type | Purpose |
|---|---|---|---|
| 475 | MovWeightForRemat | DBL | MOV instruction weight in remat profitability scoring |

This knob sits in the general MOV-weight family (indices 474--476) rather than the Remat block. It tunes how MOV instructions contribute to the scheduling cost model's remat profitability calculation. When the remat candidate is a MOV chain, this weight determines the per-MOV cost used to decide whether rematerialization beats keeping the value live.

DUMP_KNOBS_TO_FILE

The DUMP_KNOBS_TO_FILE environment variable triggers a full dump of all knob values to a file. KnobInit (sub_7A0C10) checks for it via getenv("DUMP_KNOBS_TO_FILE"):

char* dump_path = getenv("DUMP_KNOBS_TO_FILE");
if (dump_path) {
    size_t len = strlen(dump_path);
    // Store into SSO string at knob_state+88..104
}

The path is stored in a small-string-optimized (SSO) buffer at knob_state offsets +88 through +104:

Offset  Size  Field
──────  ────  ─────────────────────────────────────
+88     8     data pointer (or first 8 inline bytes if len <= 15)
+96     8     string length
+104    8     capacity (or remaining inline bytes)

Paths of 15 bytes or fewer are stored inline without heap allocation. Longer paths allocate via the arena allocator at knob_state+8. The dump is produced later during compilation -- KnobInit only stores the path; the actual file write happens after all knobs are resolved.

This is the primary mechanism for discovering which knobs exist and what their current values are. Setting it produces a text file with all 1,294 knob names and their resolved values.

Error Handling

The knob system uses structured error descriptors (96 bytes each) allocated from an arena:

Offset  Size  Field
──────  ────  ─────────────────────────────────────
+0      8     formatted message string pointer
+8      8     message length
+16     8     source file path pointer
+24     8     source file path length
+32     8     line number
+40     8     function name pointer
+48     48    (additional context fields)
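The layout above maps directly onto a C struct. This is a sketch for checking the arithmetic rather than a recovered definition: the type name KnobErrorDesc and all field names are invented, the 48-byte tail is kept opaque because its fields are not listed, and the size checks assume an LP64 target with 8-byte pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Field names are illustrative; only offsets and sizes come from the table. */
typedef struct {
    const char   *message;      /* +0  formatted message string */
    uint64_t      message_len;  /* +8  message length */
    const char   *file;         /* +16 source file path */
    uint64_t      file_len;     /* +24 source file path length */
    uint64_t      line;         /* +32 line number */
    const char   *function;     /* +40 function name */
    unsigned char context[48];  /* +48 additional context, kept opaque */
} KnobErrorDesc;

/* These checks hold on LP64 targets (8-byte pointers). */
static_assert(sizeof(KnobErrorDesc) == 96, "descriptor must be 96 bytes");
static_assert(offsetof(KnobErrorDesc, line) == 32, "line number at +32");
static_assert(offsetof(KnobErrorDesc, context) == 48, "context at +48");
```

The six named fields occupy 48 bytes, so the remaining 48 bytes of context bring the descriptor to the stated 96.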

Two error constructor functions and one merge helper:

Function                    Address     Purpose
──────────────────────────  ──────────  ────────────────────────────────────────────────
FormatKnobError             sub_79CDB0  General knob error with vsnprintf formatting
FormatKnobErrorWithContext  sub_79AED0  Error with additional context (knob name, value)
KnobError::Merge            sub_79A780  Chains multiple errors for accumulated reporting

Errors propagate through a tagged result: bit 0 of *(result + 16) is set on error, cleared on success. The GetKnobIndex return protocol:

// Success:
*(byte*)(result + 16) &= ~1;    // clear error bit
*(int32*)(result) = knob_index;  // store index

// Failure:
*(byte*)(result + 16) |= 1;     // set error bit
*(result + 0..15) = error_desc;  // store error descriptor
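The same protocol written as typed C, as a sketch only. The type name KnobIndexResult and the helper names are hypothetical, and the 16-byte error payload is treated as opaque bytes because its internal structure is not described here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative type: 16 payload bytes followed by a flag byte at +16. */
typedef struct {
    union {
        int32_t       index;      /* valid when the error bit is clear */
        unsigned char error[16];  /* valid when the error bit is set */
    } payload;
    uint8_t flags;                /* bit 0 = error */
} KnobIndexResult;

static_assert(offsetof(KnobIndexResult, flags) == 16, "flag byte at +16");

/* Success: clear bit 0 and store the knob index at offset 0. */
static void result_set_ok(KnobIndexResult *r, int32_t knob_index) {
    r->flags &= (uint8_t)~1u;
    r->payload.index = knob_index;
}

/* Failure: set bit 0 and copy 16 bytes of error payload into offsets 0..15. */
static void result_set_error(KnobIndexResult *r, const void *error16) {
    r->flags |= 1u;
    memcpy(r->payload.error, error16, sizeof r->payload.error);
}

/* Callers test bit 0 before reading the payload. */
static bool result_is_error(const KnobIndexResult *r) {
    return (r->flags & 1u) != 0;
}
```

Keeping the index and the error payload in one union makes the rule explicit: the flag byte decides which interpretation of the first 16 bytes is valid.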

KnobValue Lifecycle

Destruction

KnobValue::Destroy (sub_797790) resets a 72-byte value slot before writing a new value. It switches on the type tag:

Type                   Destruction Action
─────────────────────  ────────────────────────────────────────────────────────
0-5, 7, 8              No-op (POD types, no heap allocation)
6 (int-list)           Walk doubly-linked list, free each node via allocator+32
9 (opcode-list)        Walk doubly-linked list, free each node via allocator+32
10 (int-list dynamic)  Free the growable array block
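A simplified sketch of the type switch in the table above. The struct layouts and names (KnobValue, ListNode, knob_value_destroy, list_push) are hypothetical stand-ins for the 72-byte slot, and free() replaces the call made through the allocator; only the type tags and the per-type actions come from the table.

```c
#include <assert.h>
#include <stdlib.h>

/* Type tags named in the table; every other tag is plain data. */
enum { KT_INT_LIST = 6, KT_OPCODE_LIST = 9, KT_INT_LIST_DYNAMIC = 10 };

typedef struct ListNode {
    struct ListNode *prev, *next;
    int value;
} ListNode;

/* Simplified stand-in for the 72-byte value slot. */
typedef struct {
    int       type;
    ListNode *head;   /* used by types 6 and 9 */
    int      *array;  /* used by type 10 */
} KnobValue;

/* Release whatever the slot owns; returns the number of blocks freed. */
static int knob_value_destroy(KnobValue *v) {
    int freed = 0;
    switch (v->type) {
    case KT_INT_LIST:
    case KT_OPCODE_LIST:
        for (ListNode *n = v->head; n != NULL; ) {
            ListNode *next = n->next;  /* read the link before freeing */
            free(n);
            freed++;
            n = next;
        }
        v->head = NULL;
        break;
    case KT_INT_LIST_DYNAMIC:
        if (v->array != NULL) {
            free(v->array);
            freed++;
        }
        v->array = NULL;
        break;
    default:
        break;  /* plain-data types own nothing */
    }
    return freed;
}

/* Helper for the example: push one node at the front of the list. */
static void list_push(KnobValue *v, int value) {
    ListNode *n = malloc(sizeof *n);
    assert(n != NULL);
    n->value = value;
    n->prev = NULL;
    n->next = v->head;
    if (v->head != NULL)
        v->head->prev = n;
    v->head = n;
}
```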

Deep Copy

KnobValue::CopyFrom (sub_7978F0) handles deep copy of value slots, switching on type to properly duplicate linked lists and allocated buffers.

KnobInit (sub_7A0C10) constructs a new knob state object by allocating 72 * count bytes for the value array, then deep-copying each slot from a source state if one exists.
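The allocate-then-deep-copy pattern can be sketched as follows. This is an illustration under stated simplifications: the slot is reduced to a few fields instead of 72 bytes, only the growable-array type (10) is modelled as heap-owning, calloc/malloc stand in for the arena allocator, and the names knob_value_copy and knob_state_init are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum { KT_INT_LIST_DYNAMIC = 10 };  /* the only heap-owning type modelled */

/* Reduced stand-in for a value slot. */
typedef struct {
    int    type;
    int   *items;   /* owned array for type 10, NULL otherwise */
    size_t count;
    long   scalar;  /* stands in for plain-data payloads */
} KnobValue;

/* Copy one slot, duplicating the owned array so the copies never alias. */
static int knob_value_copy(KnobValue *dst, const KnobValue *src) {
    assert(dst != NULL && src != NULL);
    *dst = *src;        /* enough for plain-data types */
    dst->items = NULL;  /* never share the source's heap block */
    if (src->type == KT_INT_LIST_DYNAMIC && src->count > 0) {
        dst->items = malloc(src->count * sizeof *dst->items);
        if (dst->items == NULL)
            return -1;
        memcpy(dst->items, src->items, src->count * sizeof *dst->items);
    }
    return 0;
}

/* Allocate count slots; deep-copy from src when one is supplied. */
static KnobValue *knob_state_init(size_t count, const KnobValue *src) {
    KnobValue *slots = calloc(count, sizeof *slots);
    if (slots == NULL || src == NULL)
        return slots;  /* allocation failed, or nothing to copy */
    for (size_t i = 0; i < count; i++) {
        if (knob_value_copy(&slots[i], &src[i]) != 0) {
            while (i-- > 0)  /* undo the copies made so far */
                free(slots[i].items);
            free(slots);
            return NULL;
        }
    }
    return slots;
}
```

Because each array is duplicated, later writes to the source state do not show through in the copy.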

Function Map

Address     Size    Function                     Confidence
──────────  ──────  ───────────────────────────  ──────────
sub_6F04B0  6,824   ReportKnobError (DAG)        HIGH
sub_6F0820  2,782   GetKnobIndex (DAG)           CERTAIN
sub_6F0A30  8,700   RegisterKnob (DAG)           HIGH
sub_6F0FF0  13,000  GetKnobValue (DAG)           HIGH
sub_6F1B10  13,000  BuildKnobTable (DAG)         HIGH
sub_6F2380  14,000  ParseKnobString (DAG)        HIGH
sub_6F68C0  9,000   InitializeKnobs (DAG)        HIGH
sub_6F7360  18,306  ParseKnobValue (DAG)         CERTAIN
sub_6F83C0          ParseWhenShorthand (DAG)     MEDIUM
sub_797790  385     KnobValue::Destroy           HIGH
sub_7978F0  240     KnobValue::CopyFrom          MEDIUM
sub_7973E0  400     KnobType::GetSize            MEDIUM
sub_798280  900     ParsePhaseNameFragment       MEDIUM
sub_798B60  1,776   NamedPhases::ParsePhaseList  CERTAIN
sub_799250  68      IsPassDisabled               HIGH
sub_7992A0  894     IsPassDisabledFull           HIGH
sub_79A490  600     KnobError::AppendContext     MEDIUM
sub_79A5D0  800     KnobError::Format            MEDIUM
sub_79A780  2,200   KnobError::Merge             MEDIUM
sub_79AED0  1,000   FormatKnobErrorWithContext   HIGH
sub_79B240  518     GetKnobIndex (OCG)           CERTAIN
sub_79B450  200     GetKnobIndexWithValidation   HIGH
sub_79B530  3,296   ParseKnobsString             HIGH
sub_79C210  2,200   ParseKnobOverrides           HIGH
sub_79C9D0  1,600   KnobsInitFromEnv             HIGH
sub_79CDB0  1,400   FormatKnobError              HIGH
sub_79D070  2,312   ReadKnobsFile                CERTAIN
sub_79D990  7,073   KnobsInit (master)           HIGH
sub_79F540  3,640   ParseKnobValue (OCG)         CERTAIN
sub_7A0A90  350     KnobValue::CopyListValue     MEDIUM
sub_7A0C10  1,745   KnobInit (per-knob)          HIGH
sub_7A1B80  400     GetKnobIntValue              MEDIUM
sub_7A1CC0  350     GetKnobBoolValue             MEDIUM
sub_7A1E10  400     GetKnobStringValue           MEDIUM
sub_7A2860  2,100   SetKnobValue                 MEDIUM
sub_7ACEA0  3,700   OCGKnobSetup                 MEDIUM

Reimplementation Notes

To reimplement the knobs system:

  1. Define the knob table as a compile-time array of descriptors (name, alias, type). No need for ROT13 -- that is purely obfuscation. Use an enum for knob indices so call sites reference KNOB_SchedNumBB_Limit instead of magic index 294.

  2. Parse order matters. Process sources in the documented priority order (env, file, CLI, pragma, WHEN). Semantics are last-write-wins, so a value set by a later source overrides one set by an earlier source.

  3. The WHEN= system is the complex part. You need FNV-1a hashing of function identifiers and a per-function override table. The hash table at ctx+120 → +1128 uses open addressing with linear probing.

  4. Budget knobs (OKT_BDGT) are just integers with a secondary tracking field. The secondary starts at 0 and is used by cost models to track how much "budget" remains during optimization.

  5. Int-range knobs (OKT_IRNG) use .. as the range separator: "100..200" means [100, 200]. Missing bounds default to INT_MIN (0x80000000) / INT_MAX (0x7FFFFFFF).

  6. The opcode-string-list type (OKT_OPCODE_STR_LIST) carries pairs of (opcode_name, integer). The opcode name is resolved to an internal opcode ID via the SASS opcode table. Used for per-instruction tuning overrides.
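Two of the notes above lend themselves to a short sketch: the int-range syntax from note 5 and the FNV-1a hash from note 3. The names IntRange, parse_int_range, and fnv1a32 are hypothetical. The hash uses the standard 32-bit FNV-1a constants; the width used by ptxas is not stated above, so 32 bits is an assumption. How ptxas treats malformed ranges is also not documented here, so the parser simply rejects them.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct { int lo, hi; } IntRange;

/* Parse "lo..hi"; a missing bound defaults to INT_MIN or INT_MAX. */
static bool parse_int_range(const char *s, IntRange *out) {
    assert(s != NULL && out != NULL);
    const char *sep = strstr(s, "..");
    if (sep == NULL)
        return false;  /* the separator is required */
    out->lo = INT_MIN;
    out->hi = INT_MAX;
    char *end;
    if (sep != s) {  /* lower bound present */
        long v = strtol(s, &end, 10);
        if (end != sep || v < INT_MIN || v > INT_MAX)
            return false;
        out->lo = (int)v;
    }
    const char *rhs = sep + 2;
    if (*rhs != '\0') {  /* upper bound present */
        long v = strtol(rhs, &end, 10);
        if (end == rhs || *end != '\0' || v < INT_MIN || v > INT_MAX)
            return false;
        out->hi = (int)v;
    }
    return true;
}

/* Standard 32-bit FNV-1a over a NUL-terminated string. */
static uint32_t fnv1a32(const char *s) {
    uint32_t h = 2166136261u;    /* FNV offset basis */
    for (; *s != '\0'; s++) {
        h ^= (unsigned char)*s;  /* xor the byte first ... */
        h *= 16777619u;          /* ... then multiply by the FNV prime */
    }
    return h;
}
```

For example, parse_int_range("100..200", &r) yields lo = 100 and hi = 200, while "..50" leaves lo at INT_MIN. A per-function override table would then index on a hash such as fnv1a32(function_name).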

Cross-References