Latency Model & Functional Units

All addresses on this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

The ptxas instruction scheduler uses a static hardware performance model to estimate instruction latencies, functional unit occupancy, and pipeline conflicts. The model is architecture-parameterized: a family of 15+ profile-builder functions at 0x8E7300--0x8E9DC0 construct per-SM latency/throughput tables consumed by the scheduling engine. A separate 85 KB function (sub_89FBA0, SetOpcodeLatencies) assigns per-opcode scheduling classes that index into these tables. The combination produces a cost model that drives stall-count computation, priority scoring, and dual-issue pairing decisions.

| Component | Function |
|---|---|
| Per-opcode classifier | sub_89FBA0 (85 KB) -- assigns scheduling class per Ori opcode |
| HW profile builder | sub_8E5CA0 (20 KB) -- assembles scheduling control word tables |
| Warp profile | sub_8E4400 (3.3 KB) -- maps SM ID to warp/dispatch parameters |
| SM-specific tables | sub_8E7300--sub_8E97B0 -- 15 architecture-specific builders |
| Latency query | sub_693BC0 (22 lines) -- memory space classification |
| Long-latency check | sub_8CCF80 (2.3 KB) -- returns true if latency > 19 |
| Resource model | sub_A08A00 (345 lines) -- per-instruction FU cost computation |
| Register query | sub_A08910 (39 lines) -- operand register count/cost |
| Stall update | sub_A09530 (91 lines) -- per-instruction stall cycle accumulation |
| FU class mapper | sub_8F0CD0 -- maps (opcode, unit_name) to scheduling class |
| FU unit query | sub_704D30 (14 KB) -- maps SASS opcodes to functional unit IDs |
| Cutlass detector | sub_8F47E0 -- detects cutlass kernels for tuned scheduling |
| Pipe class assigner | sub_13710B0 (7.1 KB) -- SASS-level execution pipe assignment |

Architecture of the Latency Model

The model has three layers:

Layer 1: Per-Opcode Classification
  sub_89FBA0 reads each instruction's Ori opcode (field at instr+72,
  masked with 0xFFFFCFFF) and assigns:
    - Scheduling class ID (stored at descriptor+4, range 1..772+)
    - 9-bit latency index (low 9 bits of descriptor+196)
    - Execution pipe mask (bits 15..19 of descriptor+196..200)
    - Throughput class (bits in descriptor+198..199)

Layer 2: Architecture-Specific HW Tables
  sub_8E7300..sub_8E97B0 build per-SM latency/throughput tables as
  96-byte records in a growable array. Each record maps a scheduling
  class to its pipeline latency, scoreboard wait count, barrier stall
  cycles, and dual-issue compatibility flags.

Layer 3: Runtime Query
  The scheduling engine queries the model via:
    - sub_A08A00 for per-instruction resource costs (3 modes)
    - sub_A08910 for register operand latency
    - sub_693BC0 for memory space classification
    - sub_8CCF80 for long-latency detection (threshold: 19 cycles)

Scheduling Class Assignment (sub_89FBA0)

sub_89FBA0 (85 KB, 2938 lines decompiled) is the largest function in the scheduling subsystem. It assigns each instruction a scheduling class -- an integer that indexes into the per-architecture latency tables. The function operates as a massive switch on *(instr+72) & 0xFFFFCFFF (the Ori opcode with modifier bits masked out).
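
As a small illustration of the masking step, the sketch below strips the two bits cleared by 0xFFFFCFFF (bits 12 and 13) from a raw opcode word. The function name and the sample words are hypothetical; only the mask value comes from the text above.

```python
OPCODE_MASK = 0xFFFFCFFF  # clears bits 12 and 13 of the 32-bit word at instr+72

def ori_opcode(raw_word: int) -> int:
    """Return the Ori opcode with the two modifier bits masked out."""
    return raw_word & OPCODE_MASK

# Hypothetical raw words: the same opcode with and without modifier bits.
plain = ori_opcode(0x0141)
modified = ori_opcode(0x2141)
print(hex(plain), hex(modified))  # both reduce to 0x141
```

Because both words reduce to the same value, a single switch case can cover every modifier variant of an opcode.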

Scheduling Descriptor Layout

Each instruction carries a scheduling descriptor at offsets 196--200 within the 296-byte Ori instruction object (not the SchedNode). The descriptor is a packed bit-field:

Descriptor at a3+196 (DWORD, 32 bits):
  [8:0]   9-bit latency index -- indexes into HW latency table
  [14:9]  reserved
  [19:15] 5-bit execution pipe mask -- identifies functional unit
          0x08000 = pipe A (ALU)
          0x10000 = pipe B (FP/tensor)
          0x18000 = pipe C (memory/texture)
          0xF8000 = all pipes (default sentinel)

Descriptor at a3+198 (WORD, 16 bits):
  [7:4]   pipe sub-class within the execution pipe
          0x10 = sub-class 1 (control flow)
          0x20 = sub-class 2 (integer ALU)
          0x30 = sub-class 3 (FP32)
          0x40 = sub-class 4 (FP64 / wide ops)
  [8:4]   throughput class (5 bits; shares bits 4--7 with the sub-class values)
          0x1F0 = maximum throughput (sentinel)

Descriptor at a3+199 (BYTE, high bits):
  [5:1]   additional pipe flags
          0x3E = all flags set (default)
          Specific values: 0x04 (ALU), 0x08 (SFU), 0x0A (FP64), 0x0C (tensor)

Descriptor at a3+200 (WORD, 16 bits):
  [4:0]   read barrier mask (5 bits, 0x1F)
  [9:5]   write barrier mask (5 bits, 0x3E0)
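
The layout can be exercised with a short decoder. This is a minimal sketch, assuming a little-endian instruction image and using the mask values listed above (sub-class values 0x10--0x40 are read with mask 0xF0); the function name, dictionary keys, and sample image are hypothetical.

```python
import struct

def decode_descriptor(instr: bytes) -> dict:
    """Decode the packed scheduling fields at offsets 196..201."""
    (dword196,) = struct.unpack_from("<I", instr, 196)
    (word198,) = struct.unpack_from("<H", instr, 198)
    (word200,) = struct.unpack_from("<H", instr, 200)
    return {
        "latency_index": dword196 & 0x1FF,        # bits [8:0] of the DWORD
        "pipe": dword196 & 0xF8000,               # bits [19:15], left in place
        "sub_class": word198 & 0xF0,              # values 0x10..0x40
        "pipe_flags": instr[199] & 0x3E,          # bits [5:1] of the byte
        "read_barrier": word200 & 0x1F,           # bits [4:0]
        "write_barrier": (word200 & 0x3E0) >> 5,  # bits [9:5]
    }

# Hypothetical image: pipe A, latency index 0xF1, sub-class 4, FP64 flags.
image = bytearray(296)
struct.pack_into("<I", image, 196, 0x0A4080F1)
struct.pack_into("<H", image, 200, (5 << 5) | 3)
fields = decode_descriptor(bytes(image))
```

The fields overlap in memory (the WORD at +198 is the upper half of the DWORD at +196), which is why the decoder reads each view separately rather than slicing one integer.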

Opcode-to-Class Mapping

The switch statement maps Ori opcodes to scheduling class IDs. These IDs are stored at *(v8+4) where v8 is a pointer to the instruction's extended scheduling record. Representative mappings:

| Ori opcode | Scheduling class | Execution pipe | Description |
|---|---|---|---|
| 1 | 130 | sub-class 1 (0x10) | Control flow (BRA, JMP) |
| 2--7 (wide) | 683 | sub-class 4 (0x40), pipe 0xA | Wide FP64 operations |
| 2--7 (narrow, type 19) | 52 | sub-class 2 (0x20) | Integer ALU (narrow) |
| 2--7 (narrow, other) | 72 | sub-class 2 (0x20) | Integer ALU (standard) |
| 3, 5 (medium) | 140 | sub-class 3 (0x30) | FP32 operations |
| 4 (medium) | 131 | sub-class 2 (0x20) | Integer MAD |
| 6 (wide) | 140 | sub-class 4 (0x40), pipe 0xA | FP64 pair operations |
| 8 (flag bit set) | 3 | default | Predicate operations (true) |
| 8 (flag clear) | 2 | default | Predicate operations (false) |
| 0xA, 0xB, 0x6C, 0x95 | 200 | sub-class 2 (0x20) | Integer compare/logic |
| 0xA (extended) | 551 | default | Extended integer (wide encoding) |
| 0xA (extended, Mercury) | 694/700 | default | Mercury-era extended integer |
| 0xE | 5 | default | Conversion operations |
| 0x10 (atomic) | 575 | default | Atomic with flag |
| 0x10 (global) | varies | sub-class 4 (0x40) | Global memory load/store |
| 0x141 | 745 | latency 0xF1 | WGMMA (warpgroup MMA) |
| 0x142 (variant 3) | 744 | latency 0xF0 | WGMMA variant |
| 0x143 | 765--767 | latency 0xFB | BGMMA/QMMA tensor variants |
| 0x144 | 600 | latency 0xE6 | Tensor fence |
| 0x145, 0x146 | 759 | sub-class 4, pipe 0xC | Tensor core (HMMA/BMMA) |
| 0x147, 0x148 (wide) | 761 | latency 0xFA | Double-precision tensor (wide) |
| 0x147, 0x148 (narrow) | 757 | latency 0xF6 | Double-precision tensor (narrow) |
| 0x149 | 604 | latency 0xE7 | Tensor synchronization |
| 0x13E | 749 | latency 0xF4 | Bulk copy (ACQBULK) |
| 0x13F | 748 | latency 0xF3 | Bulk release (RELBULK) |
| 0x13D (variant) | 747/750 | latency 0xF2/0xF5 | Collective operations |

The scheduling class IDs span a wide range (2--772+). Classes below 256 correspond to legacy instruction categories; higher classes (551, 575, 600, 683, 694, 700, 744--767) represent newer instruction types added for Hopper and Blackwell architectures.

Latency Index Encoding

The low 9 bits of the descriptor at a3+196 encode a latency index that maps directly into the per-architecture HW table. The index is extracted by masking the low word of the descriptor down to 9 bits:

latency_index = *(WORD*)(a3+196) & 0x1FF

Observed latency index values and their instruction classes:

| Index (hex) | Index (dec) | Instruction class |
|---|---|---|
| 0xE6 | 230 | Tensor fence / sync |
| 0xE7 | 231 | Tensor synchronization |
| 0xF0 | 240 | WGMMA variant |
| 0xF1 | 241 | WGMMA primary |
| 0xF2 | 242 | Collective op (variant A) |
| 0xF3 | 243 | Bulk release |
| 0xF4 | 244 | Bulk copy |
| 0xF5 | 245 | Collective op (variant B) |
| 0xF6 | 246 | DP tensor (narrow) |
| 0xF8 | 248 | Tensor core (HMMA/BMMA) |
| 0xFA | 250 | DP tensor (wide) |
| 0xFB | 251 | BGMMA/QMMA |

The highest index values (0xE6--0xFB) correspond to tensor and collective operations -- the most complex instructions with the longest and most architecture-variable latencies.

Functional Unit Categories

The scheduler tracks 10 functional unit resource counters per basic block. Each counter corresponds to a hardware execution pipe on the SM.

10-Element Resource Vector

Resource tracking uses an 84-byte per-BB slot at *(scheduler+672) + 84 * slot_index:

| Index | Pipe name | Typical SASS instructions | Throughput (IPC) |
|---|---|---|---|
| 0 | Integer ALU (ALU) | IADD3, IMAD, ISETP, LOP3, SHF, IABS, POPC | 1 (full rate) |
| 1 | FP32 (FMA) | FADD, FFMA, FMUL, FSETP, FMNMX, FCHK | 1 (full rate) |
| 2 | FP64 (DFMA) | DADD, DFMA, DMUL, DSETP, DMNMX | 1/2 to 1/64 (SM-dependent) |
| 3 | Tensor core (MMA) | HMMA, IMMA, BMMA, BGMMA, WGMMA, QMMA | varies |
| 4 | Load/store (LSU) | LDG, STG, LDL, STL, LDS, STS, LDGSTS | 1 (full rate) |
| 5 | Texture (TEX) | TEX, TLD, TXQ, TLD4, TEXS | 1/2 to 1/4 |
| 6 | Control flow (BRA) | BRA, JMP, EXIT, RET, CALL, BRK, CONT | 1 |
| 7 | Shared memory (SMEM) | ATOMS, REDS, LDS, STS (atomic/reduce variants) | 1 |
| 8 | Special function (SFU) | MUFU (RCP, RSQ, SIN, COS, EX2, LG2) | 1/4 |
| 9 | Uniform datapath (UDP) | UPLOP3, UISETP, UIMAD, uniform operations | 1 |

The resource vector layout within each 84-byte slot:

Offset  Size       Content
 0..39  10 x int32  Current resource usage per FU (pipe 0..9)
40..79  10 x int32  Resource pressure delta (change from scheduling)
80..83  1 x int32   BB-entered flag and auxiliary state bits
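
To make the slot layout concrete, the following sketch reads one 84-byte slot from a raw table. It assumes little-endian int32 fields, and the short pipe names are taken from the table above; the function name and the sample table are hypothetical.

```python
import struct

SLOT_SIZE = 84
SLOT_FORMAT = "<10i10ii"  # usage[10], delta[10], flags
PIPE_NAMES = ["ALU", "FMA", "DFMA", "MMA", "LSU", "TEX", "BRA", "SMEM", "SFU", "UDP"]

def read_slot(table: bytes, slot_index: int) -> dict:
    """Decode the per-BB resource slot at table + 84 * slot_index."""
    fields = struct.unpack_from(SLOT_FORMAT, table, SLOT_SIZE * slot_index)
    return {
        "usage": dict(zip(PIPE_NAMES, fields[0:10])),
        "delta": dict(zip(PIPE_NAMES, fields[10:20])),
        "flags": fields[20],
    }

# Hypothetical two-slot table; slot 1 has some LSU usage and an FMA delta.
table = bytearray(2 * SLOT_SIZE)
struct.pack_into("<i", table, SLOT_SIZE + 4 * 4, 3)          # usage[LSU] = 3
struct.pack_into("<i", table, SLOT_SIZE + 40 + 4 * 1, -2)    # delta[FMA] = -2
struct.pack_into("<i", table, SLOT_SIZE + 80, 1)             # flags
slot = read_slot(bytes(table), 1)
```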

Functional Unit Class Mapping (sub_8F0CD0)

A secondary mapper at sub_8F0CD0 translates (opcode, unit-name-string) pairs to numeric scheduling class IDs for the stall/barrier encoding stage:

| Opcode | Unit string | Class ID | Meaning |
|---|---|---|---|
| 40 | "LSU_T" | 15 | Texture load/store unit |
| 40 | "XU64" | 35 | Extended unit (64-bit ops) |
| 39 | "DMMA" | 118 | Double-precision matrix multiply |
| 53 | "DMMA" | 118 | DMMA (alternate opcode) |
| default | -- | 35 | Fallback to extended unit |

The "LSU_T" and "XU64" string tags appear in the Mercury-era post-scheduling pipeline where the SASS encoder needs to distinguish sub-pipes within the load/store and extended-precision units.

Functional Unit Query (sub_704D30)

sub_704D30 (14 KB) maps SASS opcode character codes to functional unit IDs for the Mercury encoder's latency model. The mapping uses single-character opcode identifiers:

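| Char code | Decimal | FU ID | Unit |
|---|---|---|---|
| 'D' (68) | 68 | 40 | FP64 unit |
| 'E' (69) | 69 | 44 | Extended unit |
| 'F' (70) | 70 | 48 | FP32 unit |
| 'J' (74) | 74 | 52 | Integer unit |
| 'K' (75) | 75 | 56 | Conversion unit |
| 'L' (76) | 76 | 60 | Load/store unit |
| 'N' (78) | 78 | 32 | Tensor unit |
| 'S' (83) | 83 | 36 | Special function unit |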

The function dispatches on *(config+372) >> 12 (the SM architecture selector) to handle architecture-specific unit mapping variations (e.g., Kepler vs Volta).

Per-Architecture HW Latency Tables

Table Construction Pipeline

The HW latency tables are built during scheduler initialization by a chain of constructors:

sub_8E4400(profile, sm_id, sched_mode)     // Warp-level parameters
  |
  v
sub_8E5CA0(profile, table_ptr, table_size) // Assemble output array
  |
  +-- sub_8E6760()  // Group boundary markers
  +-- sub_8E6950()  // Barrier entries
  +-- sub_8E6B40()  // Standard scheduling entries
  +-- sub_8E6F20()  // Wait dependency entries
  +-- sub_8E7110()  // Scoreboard entries
  |
  v
sub_8E7300..sub_8E97B0(profile, ...)       // SM-specific table population
  |
  v
sub_8E3AD0(output, count, entries, ...)    // Copy into final profile

Each SM-specific function populates entries in the 96-byte-per-record output array. Records encode latency, throughput, pipe assignment, and barrier compatibility for each scheduling class.

96-Byte Schedule Record Format

Each record in the HW table occupies 96 bytes (6 x 16-byte XMM slots). Records are stored in a growable array at *(context+56) with count at *(context+64) and capacity at *(context+68). The array grows by 1.5x when full. Records are copied using three _mm_loadu_si128 operations (offsets 0, 16, 32) plus manual field-by-field copy for offsets 48--95; the string at +48 is reference-cloned via sub_714160 when the string-backed flag is set.

Offset  Size   Field               Content
------  ----   -----               -------
 0..1   WORD   type_code           Record type (see type table below)
 2..3   WORD   (padding)           Zero
 4..7   DWORD  aux_size            Type-dependent:
                                     root (type 1): table_size
                                     barrier ('M'): 128 (fixed)
                                     wait/scoreboard ('5'/'6'): 36
                                     sched entry (23): 0
 8..15  8B     (reserved)          Zero
16..19  DWORD  cost_product        Scheduling cost (latency x throughput product)
                                     - Standard entry (23): a2 * a3
                                     - Category header ('!'): entry_count from config+528
                                     - Wait/scoreboard: 280 (fixed sentinel)
                                     - SM-specific (','): 4 * class_count
20..21  WORD   base_latency        Base latency in cycles (standard entries only)
22..23  WORD   dual_issue_flags    Dual-issue compatibility mask (standard entries only)
24..31  8B     (reserved)          Zero
32..39  QWORD  data_ptr            Pointer to type-specific data block:
                                     - Root: parent profile object
                                     - Wait/scoreboard: dependency tracking table
                                     - Barrier: barrier data array
                                     - Category headers: 0
40..47  QWORD  data_size           Byte count of data block at data_ptr:
                                     - Root: table_size; barrier: 128
                                     - Wait/scoreboard: 36; headers: 0
48      BYTE   inline_flag         0 = data_ptr/data_size carry raw data
                                   1 = this record uses the inline string buffer
49..63  15B    inline_str_buf      Inline NUL-terminated string (max 15 chars)
64..71  QWORD  parent_ptr          Back-pointer: SM-specific entries point to table
                                   root; category headers point to profile object
72..79  8B     (reserved)          Zero
80..87  QWORD  string_buf_ptr      Pointer to growable string buffer (32-byte header:
                                   data_ptr, size, capacity, allocator) for variable-
                                   length sub-records; self-references +48 when inline
88      BYTE   string_backed_flag  1 = record owns allocated string data at +80
                                   0 = no allocated string (uses inline or none)
89..95  7B     (padding)           Zero
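
The field offsets above translate directly into a fixed-size binary layout. The sketch below parses one record with Python's struct module, assuming little-endian storage; the field names follow the layout table, while the function name and the sample record (a type-23 entry with latency 4 and throughput 2) are hypothetical.

```python
import struct

# x = skipped reserved/padding bytes; 15s = inline string buffer.
RECORD_FORMAT = "<H2xI8xIHH8xQQB15sQ8xQB7x"
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)
FIELDS = (
    "type_code", "aux_size", "cost_product", "base_latency",
    "dual_issue_flags", "data_ptr", "data_size", "inline_flag",
    "inline_str", "parent_ptr", "string_buf_ptr", "string_backed_flag",
)

def parse_record(buf: bytes, index: int = 0) -> dict:
    """Decode the 96-byte schedule record at position `index` in `buf`."""
    values = struct.unpack_from(RECORD_FORMAT, buf, index * RECORD_SIZE)
    record = dict(zip(FIELDS, values))
    # The inline buffer is NUL-terminated; keep only the text before the NUL.
    record["inline_str"] = record["inline_str"].split(b"\0", 1)[0].decode("ascii")
    return record

# Hypothetical standard entry: cost_product = latency * throughput = 4 * 2.
sample = struct.pack(RECORD_FORMAT, 23, 0, 4 * 2, 4, 0, 0, 0, 0, b"", 0, 0, 0)
entry = parse_record(sample)
```

Computing the size from the format string gives a quick consistency check on the offsets: the layout sums to exactly 96 bytes.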

Record Type Codes

Records are polymorphic -- the type code at offset +0 selects the interpretation of fields +16..+31, +32..+47, and the sub-record format stored in the growable buffer at +80.

| Type | ASCII | Creator | Role |
|---|---|---|---|
| 1 | -- | sub_8E5CA0 | Root container (wraps entire HW table) |
| 23 | -- | sub_8E6B40 | Standard scheduling entry (latency + throughput + dual-issue) |
| 33 | '!' | sub_8E5740 | Category header (begins a named section with string list) |
| 44 | ',' | sub_8E8480 et al. | SM-specific table entry (per-architecture class data) |
| 45 | '-' | sub_8E5CA0 | Barrier section header (links 128-byte barrier table) |
| 49 | '1' | sub_8E5530 | Dimension entries (contains 12-byte sub-records) |
| 53 | '5' | sub_8E7110 | Scoreboard entry (dependency tracking, data_size=36) |
| 54 | '6' | sub_8E6F20 | Wait dependency entry (dependency table, data_size=36) |
| 57 | '9' | sub_8E5740 | Category footer (closes the section opened by type 33) |
| 59 | ';' | sub_8E5310 | Variant section (contains 20-byte sub-records) |
| 60 | '<' | sub_8E6760 | Group boundary marker (separates scheduling groups) |
| 69 | 'E' | sub_8E6950 | Barrier entry (a2 = stall count in cost_product field) |
| 77 | 'M' | sub_8E6D40 | Barrier/sync data entry (data_ptr = barrier array, 128B) |
| 87 | 'W' | sub_8E4F20 | Supplementary weight entry (variable-length string data) |

Sub-Record Formats in the Growable Buffer (+80)

Records with string_backed_flag=1 carry variable-length sub-records in the growable buffer. The buffer header at *(record+80) is a 32-byte object: {data_ptr, size (DWORD), capacity (DWORD), allocator_ptr}.

Type 59 (';') -- Variant sub-records (20 bytes each):

Created by sub_8E5310 iterating the variant list at config+536:

Sub-record layout (20 bytes):
  +0   DWORD   source_data       Variant source identifier
  +4   WORD    flags             Variant flags
  +6   WORD    zero              Reserved
  +8   DWORD   throughput_value  Throughput for this variant
  +12  DWORD   aux_value         Auxiliary parameter
  +16  DWORD   zero              Reserved

The main record additionally stores: +16 = start_index (from config+544), +20 = record_index, +24 = back_ref to previous category.

Type 49 ('1') -- Dimension sub-records (12 bytes each):

Created by sub_8E5530 traversing the BST at config+592:

Sub-record layout (12 bytes):
  +0   WORD    node_flags        BST node flags (from node+38)
  +2   WORD    zero              Reserved
  +4   DWORD   node_value        BST node value (from node+32)
  +8   DWORD   node_child        BST node child pointer (low 32 bits of node+24)

Type 44 (',') -- SM-specific class descriptor (16 bytes + packed bitmasks):

Created by sub_8E8480 and other SM-specific builders, followed by a call to sub_8E3AD0 which appends packed bitmask DWORDs:

Initial 16-byte descriptor:
  +0   DWORD   class_flags = 2   Fixed flag value
  +4   WORD    zero              Reserved
  +8   QWORD   mask              Latency mask (0xFFFFFFFF00000000)

Followed by bitmask DWORDs (4 bytes each, one per 8 scheduling classes):
  Each DWORD encodes 4 bits per entry (8 entries x 4 properties):
    bit 4*i+0:  entry[i].field_0 != 1
    bit 4*i+1:  entry[i].field_4 != 1
    bit 4*i+2:  entry[i].field_8 != 1
    bit 4*i+3:  entry[i].field_12 != 1
  Source entries are 20 bytes apart in the input array.
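
The bit rule above can be sketched as a small packer. This is an illustration only, assuming eight entries per 32-bit word (4 bits each) and representing each 20-byte source entry by its four tested fields; the function name and sample entries are hypothetical.

```python
def pack_class_bitmasks(entries):
    """Pack (field_0, field_4, field_8, field_12) tuples into 32-bit words.

    Bit 4*i + k of a word is set when field k of the i-th entry in that
    group of eight is not equal to 1.
    """
    words = []
    for base in range(0, len(entries), 8):
        word = 0
        for i, fields in enumerate(entries[base:base + 8]):
            for k, value in enumerate(fields):
                if value != 1:
                    word |= 1 << (4 * i + k)
        words.append(word)
    return words

# Hypothetical input: entry 0 is all ones, entry 1 differs in fields 4 and 12.
masks = pack_class_bitmasks([(1, 1, 1, 1), (1, 2, 1, 0)])
```

An entry whose four fields are all 1 contributes no bits, so a word of zero means all eight classes in that group use the default value in every tested field.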

Assembly Sequence

sub_8E5CA0 orchestrates the complete table by emitting records in this order:

  1. Barrier header (type '-', conditional on config+336): links the 128-byte barrier data table at config+272.
  2. Root container (type 1): data_ptr = profile_object, data_size = table_size.
  3. Category header + footer (types '!' / '9'): emitted by sub_8E5740, which enumerates named sections from config+520..528.
  4. Variant section (type ';'): emitted by sub_8E5310 if config+544 != 0.
  5. Supplementary weights (type 'W'): emitted by sub_8E4F20 if config+640 != -1.
  6. Dimension entries (type '1'): emitted by sub_8E5530 if config+608 > 0.

After all records are appended, the function computes the total serialized size (with 16-byte alignment padding per data block), allocates the output buffer, and writes a 32-byte header per record into the linear output at context+104.

Architecture Dispatch Table

| Address | SM | Architecture | Table size | Notes |
|---|---|---|---|---|
| sub_8E7300 | sm_70 | Volta | 3.3 KB | First Turing-era table format |
| sub_8E7540 | sm_72 | Xavier | 2.9 KB | Automotive Volta variant |
| sub_8E7720 | sm_75 | Turing | 3.5 KB | Added TensorFloat-32 |
| sub_8E7940 | sm_80 (base) | Ampere base | 2.9 KB | Shared base for sm_80/86/87 |
| sub_8E7B40 | sm_80 | Ampere | 3.3 KB | Full Ampere with async copy |
| sub_8E7D80 | sm_86 | GA10x | 4.4 KB | Consumer Ampere |
| sub_8E8070 | sm_87 | Orin | 3.5 KB | Automotive Ampere |
| sub_8E8280 | sm_89 | Ada Lovelace | 3.1 KB | Added FP8 tensor ops |
| sub_8E8480 | sm_90 | Hopper | 5.2 KB | DPX, WGMMA, TMA |
| sub_8E8780 | sm_90a | Hopper accel. | 4.6 KB | WGMMA async extensions |
| sub_8E8A90 | sm_100 | Blackwell DC | 3.0 KB | 5th-gen tensor, TCGEN05 |
| sub_8E8CB0 | sm_100 (short) | Blackwell DC | 949 B | Supplementary table |
| sub_8E8DB0 | sm_103 | Blackwell Ultra | 1.7 KB | GB300 extensions |
| sub_8E8F60 | sm_103 (short) | Blackwell Ultra | 618 B | Supplementary table |
| sub_8E9000 | sm_120 | RTX 50xx | 2.9 KB | Consumer Blackwell |
| sub_8E92E0 | sm_120 (ext) | RTX 50xx | 5.5 KB | Extended consumer table |
| sub_8E97B0 | universal | Fallback | 8.8 KB | Default for unknown SM |

The two Hopper builders together (sm_90 at 5.2 KB plus sm_90a at 4.6 KB, 9.8 KB combined) form the largest SM-specific footprint, reflecting the complexity of WGMMA, DPX, and TMA scheduling. sm_120 extended (5.5 KB) is the largest single SM-specific builder, accommodating the consumer Blackwell feature set; only the universal fallback (8.8 KB) is larger.

The "short" supplementary tables (sub_8E8CB0 for sm_100, sub_8E8F60 for sm_103) add entries for architecture-specific instructions not covered by the base table -- typically new tensor core variants and collective operations.

Warp-Level Hardware Profile (sub_8E4400)

sub_8E4400 maps the SM architecture ID (a2) to warp-level dispatch parameters stored in a 36-byte structure:

Architecture-to-Warp Mapping

| SM ID range | Warps per SM | Dispatch slots | Architecture era |
|---|---|---|---|
| <= 20479 | 4 | 96 | sm_50 (Maxwell) |
| 20480--24575 | 6 | 176 | sm_60 (Pascal) |
| 24576--28672 | 7 | 192 | sm_70 (Volta) |
| 28673--32767 | 7 | 208 | sm_75 (Turing) |
| 32768--36863 | 8 | 224 | sm_80 (Ampere) |
| > 36863 | 16 | 240 | sm_90+ (Hopper, Blackwell) |

The packed DWORD at offset +18 encodes (warps, sub-warp-count) as a 32-bit value. For example, 983055 (0x000F000F) = 15 warps in the low half and 15 in the high half, while 1048592 (0x00100010) = 16 warps for sm_90+.
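
A minimal sketch of both lookups follows: splitting the packed DWORD into its 16-bit halves, and mapping an SM ID to the warp and dispatch-slot values from the range table above. The function names and the tuple layout are hypothetical, and the range boundaries are copied from the table rather than from the binary.

```python
def unpack_warp_dword(value: int) -> tuple:
    """Split the packed DWORD into its (low half, high half) 16-bit counts."""
    return value & 0xFFFF, (value >> 16) & 0xFFFF

# (inclusive upper SM ID, warps per SM, dispatch slots, era) from the table.
WARP_TABLE = [
    (20479, 4, 96, "sm_50"),
    (24575, 6, 176, "sm_60"),
    (28672, 7, 192, "sm_70"),
    (32767, 7, 208, "sm_75"),
    (36863, 8, 224, "sm_80"),
]

def warp_profile(sm_id: int) -> tuple:
    """Return (warps, dispatch slots, era) for an SM architecture ID."""
    for upper, warps, slots, era in WARP_TABLE:
        if sm_id <= upper:
            return warps, slots, era
    return 16, 240, "sm_90+"

print(unpack_warp_dword(983055))   # (15, 15)
print(unpack_warp_dword(1048592))  # (16, 16)
```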

Sub-Architecture Variants

Specific SM version IDs map to sub-architecture variant codes stored at offset +26:

| SM ID | Hex | Variant | Architecture |
|---|---|---|---|
| 8193 | 0x2001 | 2 | sm_50 (Maxwell Titan X) |
| 20481 | 0x5001 | 2 | sm_60 variant |
| 24576 | 0x6000 | 0 | sm_70 (Volta base) |
| 28674 | 0x7002 | 2 | sm_75 variant A |
| 28675 | 0x7003 | 3 | sm_75 variant B |
| 28676 | 0x7004 | 4 | sm_75 variant C |
| 28677 | 0x7005 | 5 | sm_75 variant D |
| 32768 | 0x8000 | 0 | sm_80 (Ampere base) |
| 36864 | 0x9000 | 0 | sm_90 (Hopper base) |
| 36867 | 0x9003 | 3 | sm_90 variant A |
| 36868 | 0x9004 | 4 | sm_90 variant B (sm_90a) |
| 36869 | 0x9005 | 5 | sm_90 variant C |

Pipeline Width (offset +24)

The scheduling mode parameter (a3) selects the pipeline width stored at offset +24. This value controls how many instructions the scheduler models as issuing per cycle:

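| Mode | Value at +24 | Meaning |
|---|---|---|
| 1, 8, 9 | 1 | Single-issue |
| 3 | 4 | Quad-issue (tensor) |
| 4 | 5 | Penta-issue |
| 5 | 6 | Hexa-issue |
| 6 | 7 | Hepta-issue |
| 7 | 8 | Octa-issue |
| 10 | 9 | Nona-issue |
| 11 | 10 | Deca-issue |
| default | 2 | Dual-issue |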

These values model the effective issue width for different scheduling contexts. The wider modes (3--7, 10, and 11 in the table above) reflect warpgroup-level cooperative execution where multiple warp slots issue tensor instructions simultaneously.

Memory Space Classification (sub_693BC0)

sub_693BC0 (22 lines) classifies the memory space of load/store instructions. It extracts the last source operand from the instruction, looks up the register descriptor, and calls sub_91C840 to determine the memory space type. The function returns an integer code:

| Return value | Memory space | Typical latency range |
|---|---|---|
| 1 | Generic (resolved at runtime) | 20--200+ cycles |
| 2 | Local memory (per-thread stack) | 20--200 cycles |
| 3 | Shared memory | 20--30 cycles |
| 4 | Constant memory (cached) | 4--8 cycles |
| 7 | Constant bank (indexed) | 4--8 cycles |
| 11 | Surface memory | 200--500 cycles |
| 16 | Global memory (DRAM) | 200--500 cycles |

The scheduler uses these values in the priority function (sub_8C9320) to distinguish "hot" (long-latency) memory operations from "cold" (short-latency) ones: sub_A9CDE0 classifies hot (global/texture) memory and sub_A9CF90 classifies cold (constant/shared) memory.

Long-Latency Detection (sub_8CCF80)

sub_8CCF80 checks if an instruction qualifies as "long-latency" for scheduling priority purposes. The function:

  1. Verifies the target architecture supports dual-issue via sub_7DC0E0.
  2. For opcode 183 (LD/ST variant): checks memory space via sub_693BC0. Memory spaces 4, 16, 2, 11, 3, 1, and 7 all qualify for long-latency classification.
  3. For opcode 130 (HSET2 in the ROT13 name table; used as a generic internal marker): queries via vtable+640 whether the instruction is recognized as long-latency.
  4. Queries the scheduling oracle (sub_8BF3A0) for the instruction's estimated latency.
  5. Returns true if the estimated latency exceeds 19 cycles.

The threshold of 19 cycles is the boundary between "short-latency" instructions (ALU, FP32, shared memory) and "long-latency" instructions (global memory, texture, tensor core) that benefit from latency hiding through instruction reordering.

Resource Cost Model (sub_A08A00)

sub_A08A00 (345 lines) computes per-instruction resource costs for the 10-element functional unit vector. It operates in three modes selected by parameter a6:

Mode 0/1: Instruction Cost Initialization

Resets the instruction's resource tracking state:

  • a1[0] = 0 (accumulated cost)
  • a1[1045] = 0 (accumulated delta)
  • a1[2071] = 0 (accumulated pressure)
  • Byte at offset 8280 = 0 (flags)

Then computes per-operand resource contributions by iterating source operands (count at a3+80, starting at a3+84), as described under Per-Operand Cost Computation below.

Mode 2: Differential Cost

Computes the differential cost (new minus old):

v55 = a1[0]       // previous instruction cost
v56 = a1[1045]    // previous delta cost

Then runs the same operand iteration as mode 1 and subtracts the previous values.

Mode 3: Pressure Accumulation

Adds the instruction's previously computed pressure a1[2071] into the running total at *(a5+24).

Per-Operand Cost Computation

For each source operand, the function:

  1. Checks operand type: ((operand >> 28) & 7) == 1 means register operand.
  2. Skips operands with values 41--44 (special sentinel registers).
  3. Looks up the register descriptor via *(a1+88) + 8 * (operand & 0xFFFFFF).
  4. Checks if register class *(descriptor+64) is <= 6 (physical register file).
  5. Calls sub_A08910 to get the register's latency and count:
    • Returns the starting register index
    • Outputs count (*a4) and cost-per-register (*a5)
  6. Iterates over the register range, accumulating costs for registers not in the "already-consumed" bitmask at *(a1+832).

The cost accumulation uses a 9-bit field in the instruction's scheduling word at offset +12, masked as & 0x1FF.

Register Latency Query (sub_A08910)

sub_A08910 (39 lines) returns the register index and cost for a single operand:

function GetRegisterLatency(context, reg_desc, operand, out_count, out_cost):
    pipeline_bits = (reg_desc.field_48 >> 20) & 3
    count = 1
    cost = (pipeline_bits == 3) ? 2 : 1
    *out_count = count
    *out_cost = cost

    if context.flags & 0x10:    // dual-register tracking mode
        return 2 * reg_desc.field_12     // doubled register index
    else:
        if context.flags & 0x08 and pipeline_bits != 1 and reg_desc.class == 6:
            *out_cost = 2 * cost          // double cost for wide registers
        return reg_desc.field_12          // register index

The pipeline bits, extracted as (*(reg_desc+48) >> 20) & 3, encode the register's pipeline affinity:

  • Bits == 1: standard pipeline register
  • Bits == 3: double-width register (costs 2 instead of 1)
  • Other values: architecture-specific pipeline assignment

When dual-register tracking is active (context flag 0x10, controlled by knob 420), register indices are doubled to provide separate tracking for even/odd register halves.
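
The pseudocode above can be turned into a runnable sketch. The dataclass, its field names, and the flag constants are hypothetical stand-ins for the decompiled structure accesses, and `reg_class` is assumed to be the register class field at descriptor+64; the function returns the outputs as a tuple instead of writing through pointers.

```python
from dataclasses import dataclass

FLAG_WIDE_COST = 0x08       # context flag tested in the non-dual path
FLAG_DUAL_TRACKING = 0x10   # dual-register tracking mode

@dataclass
class RegDesc:
    field_12: int   # register index
    field_48: int   # packed word; bits 20..21 hold the pipeline bits
    reg_class: int  # assumed to be the class value at descriptor+64

def get_register_latency(context_flags: int, reg: RegDesc) -> tuple:
    """Return (register index, count, cost) for one register operand."""
    pipeline_bits = (reg.field_48 >> 20) & 3
    count = 1
    cost = 2 if pipeline_bits == 3 else 1
    if context_flags & FLAG_DUAL_TRACKING:
        return 2 * reg.field_12, count, cost   # doubled register index
    if (context_flags & FLAG_WIDE_COST) and pipeline_bits != 1 and reg.reg_class == 6:
        cost *= 2                              # double cost for wide registers
    return reg.field_12, count, cost

wide = RegDesc(field_12=5, field_48=3 << 20, reg_class=6)
print(get_register_latency(0x00, wide))  # (5, 1, 2)
print(get_register_latency(0x10, wide))  # (10, 1, 2)
```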

Latency Hiding Statistics

The post-scheduling analysis pass (sub_73B360, MacLoopSchedulingAnalytics, 28.7 KB) computes and reports latency hiding effectiveness for four categories of long-latency operations:

| Category | String identifier | Stat function | Typical latency |
|---|---|---|---|
| Shared memory loads | "LDS latency hiding" | sub_73A1D0 | 20--30 cycles |
| Global memory loads | "LDG latency hiding" | sub_73A7F0 | 200--500 cycles |
| Extended 64-bit ops | "Xu64 latency hiding" | sub_73ADF0 | 15--30 cycles |
| Anti-dependencies | "Antidep latency hiding" | (inline) | varies |

Each category reports: Num (count of operations), Min (minimum hidden cycles), Max (maximum hidden cycles), Avg (average hidden cycles). The pass also tracks MAC instruction utilization ("MacInsts", "MacReuses", "TepidMacUtil") and resource busy time ("LsuResBusy", "Time", "TepidTime").

This analysis runs after scheduling is complete and drives feedback for the Mac Loop scheduler, which handles fused multiply-accumulate loop bodies. Knob 443 gates the MAC instruction classification.

Dual-Issue Rules

Dual-issue scheduling is controlled by sub_8CF5D0 (CheckDualIssueEligibility, 3.5 KB) and implemented by sub_8B77C0 (DualIssueScheduler, 15 KB) with pairing logic in sub_8BDC40 (7.9 KB).

Eligibility Check

sub_8CF5D0 returns 0 (no dual-issue) if:

  • The target architecture does not support dual-issue (sub_7DC0E0 returns false).
  • Function flag bit 2 at func+1368 is set (incompatible function).

When eligible, the function iterates basic blocks checking instruction pairs:

  • sub_A9CDE0(instr): returns true if instruction is dual-issuable (hot = global/texture).
  • sub_A9CF90(instr): returns true if instruction can pair with the next (cold = constant/shared).

The dual-issue benefit score is stored at scheduler+328 and used by the priority function to bias toward instruction pairs that can co-issue.

Dual-Issue Constraints

Dual-issue pairs must satisfy:

  1. Pipe compatibility: the two instructions must target different functional units (e.g., ALU + FP32, or ALU + load/store). Same-pipe pairs cannot dual-issue.
  2. Register conflict: the pair must not have RAW dependencies on the same register within the same cycle.
  3. Barrier compatibility: neither instruction may be waiting on a scoreboard barrier.
  4. Architecture support: dual-issue is primarily an sm_50 (Maxwell) feature. Newer architectures (sm_70+) use wider warp schedulers instead.

For sm_50, a special register budget function adjusts the register allocation target to account for the reduced register pressure from dual-issue execution.

Stall Count Computation

The stall count determines how many cycles an instruction must wait before it can issue. Stalls are computed by sub_8D3E20 (2.1 KB) and encoded by sub_8F3130 (1.0 KB).

Stall Encoding in Control Words

Each SASS instruction carries a stall count in its control word:

  • Maximum stall: 16 cycles (capped by knobs 805 and 806).
  • Minimum stall: 1 cycle (no zero-stall encoding exists).
  • Default stall when no dependency: determined by the HW profile's pipeline depth.

The stall/barrier encoding pipeline (sub_8D7760, 41 KB) computes stalls by walking the dependency DAG backward from each instruction:

function ComputeStallCycles(sched, instr):
    max_wait = 0
    for each predecessor of instr:
        distance = instr.cycle - pred.cycle
        latency = LookupLatency(sched, pred, instr)
        wait = latency - distance
        max_wait = max(max_wait, wait)
    return min(max_wait, MaxStallFromKnob(sched))
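
A runnable version of the same walk is sketched below. Predecessors are passed as plain (cycle, latency) pairs instead of DAG nodes, the default cap of 16 stands in for the knob lookup, and the clamp to a minimum of 1 is an assumption added here to match the "no zero-stall encoding" rule stated above; none of these names come from the binary.

```python
def compute_stall_cycles(instr_cycle, predecessors, max_stall=16, min_stall=1):
    """Return the stall count for an instruction scheduled at `instr_cycle`.

    predecessors: iterable of (pred_cycle, latency) pairs, one per
    dependency edge into the instruction.
    """
    max_wait = 0
    for pred_cycle, latency in predecessors:
        distance = instr_cycle - pred_cycle      # cycles already elapsed
        max_wait = max(max_wait, latency - distance)
    # Cap at the maximum stall, then apply the assumed minimum of 1.
    return max(min_stall, min(max_wait, max_stall))

print(compute_stall_cycles(10, [(8, 6), (9, 4)]))  # 4
print(compute_stall_cycles(10, [(0, 4)]))          # 1
print(compute_stall_cycles(5, [(4, 300)]))         # 16
```

The third call shows why long-latency producers cannot rely on stall counts alone: the wait is clipped to the cap, so the remaining latency has to be covered by a scoreboard barrier.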

The encoding function sub_8F4140 packs the complete control word:

| Field | Encoder | Bits | Range |
|---|---|---|---|
| Stall count | sub_8F3130 | 4 | 1--16 cycles |
| Yield hint | sub_8F3650 | 1 | 0/1 |
| Read barrier | sub_8F31F0 | 6 | 0--5 (barrier ID) |
| Write barrier | sub_8F31F0 | 6 | 0--5 (barrier ID) |
| Scoreboard wait | sub_8F3860 | 6 | barrier wait mask |
| Reuse flags | (separate) | 4 | register reuse hints |

Sentinel Values

The scheduling system uses several sentinel values:

| Value | Meaning |
|---|---|
| -1 (0xFFFFFFFF) | Unscheduled instruction position |
| 0x1869F (99999) | Infinite latency sentinel |
| 0xFFFFFFFF | Batch window sentinel (DynBatch) |

Resource Cost Accumulation

sub_8C67A0 (ComputeResourceCost, 3.7 KB) drives the per-instruction resource accounting. It calls the resource model sub_A08A00 three times per instruction:

function ComputeResourceCost(sched, instr):
    slot = GetResourceSlot(sched, instr)
    slot.bb_entered |= 1

    // Phase 1: Instruction's own execution cost
    sub_A08A00(sched, instr, instr_data, output, slot, mode=1)
    // Accumulate: slot[0..9] += output[0..9]  (SSE _mm_add_epi32)

    // Phase 2: Operand release costs (for last-use operands)
    sub_A08A00(sched, instr, instr_data, output, slot, mode=2)
    // Accumulate delta: slot[10..19] += output[0..9]

    // Phase 3: Combined instruction + BB-level impact
    sub_A08A00(sched, instr, instr_data, output, slot, mode=3)
    // Accumulate pressure into slot[20]

The SSE-optimized accumulation uses _mm_add_epi32 to add 4 resource counters at a time, processing the full 10-element vector in 3 SSE iterations (4 + 4 + 2).
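
The chunked accumulation can be mimicked in scalar code. In this sketch the three cost vectors are supplied by hand rather than produced by sub_A08A00, the slot is modeled as 21 integers (usage, delta, and one trailing word), and the helper name is hypothetical; the inner loop stands in for one 4-lane _mm_add_epi32.

```python
NUM_PIPES = 10

def accumulate(slot, output, base, lanes=4):
    """Add a 10-element vector into slot[base:base+10], `lanes` counters per step.

    Returns the number of steps taken, which corresponds to the number of
    SIMD iterations described above.
    """
    steps = 0
    for start in range(0, NUM_PIPES, lanes):
        for i in range(start, min(start + lanes, NUM_PIPES)):
            slot[base + i] += output[i]
        steps += 1
    return steps

slot = [0] * 21                                   # 84 bytes as int32 words
own_cost = [1, 0, 0, 0, 2, 0, 0, 0, 0, 0]         # hypothetical mode-1 output
release_delta = [0, 0, 0, 0, -1, 0, 0, 0, 0, 0]   # hypothetical mode-2 output
steps = accumulate(slot, own_cost, base=0)        # phase 1 -> slot[0..9]
accumulate(slot, release_delta, base=10)          # phase 2 -> slot[10..19]
slot[20] += 3                                     # phase 3, hypothetical pressure
```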

Cutlass-Specific Scheduling

sub_8F47E0 detects NVIDIA cutlass GEMM kernels by calling strstr(function_name, "cutlass"). When detected, the scheduler activates hand-tuned scheduling parameters for matrix multiplication inner loops. This includes:

  • Modified stall counts for the HMMA/WGMMA instruction sequences.
  • Adjusted register pressure targets.
  • Specific barrier placement patterns for double-buffered shared memory.

This reflects NVIDIA's investment in hand-tuning their cutlass library's scheduling behavior within ptxas itself.

Execution Pipe Assignment (sub_13710B0)

sub_13710B0 (7.1 KB, 1,088 lines decompiled) is the SASS-backend execution pipe class assigner. It runs in the SASS encoding pipeline (address range 0x1370--0x139F) after instruction selection, register allocation, and the main scheduling pass are complete. Where sub_89FBA0 assigns IR-level scheduling class IDs (2--772+) consumed by the priority and stall-computation passes, sub_13710B0 writes SASS-level pipe class IDs (0x00--0x141) that drive control-word encoding: stall counts, barrier assignments, and dual-issue pairing in the final binary.

Descriptor Initialization

Before dispatching on the opcode, the function initializes the scheduling descriptor at a3+196..202 to the "all-pipes" default:

*(DWORD*)(a3+196) |= 0xF8000     // pipe mask = all (bits 15..19)
*(BYTE*)(a3+200)  |= 0x1F        // read barrier mask = all
*(WORD*)(a3+198)   = HIWORD | 0x1F0  // throughput class = max
*(WORD*)(a3+200)  |= 0x3E0       // write barrier mask = all
*(BYTE*)(a3+199)   = ... | 0x3E  // pipe flags = all set

Then it switches on *(a2+72) & 0xFFFFCFFF (the Ori opcode with modifier bits masked), writing a 9-bit pipe class into the low bits of *(WORD*)(a3+196) and optionally overriding the pipe mask, sub-class, and pipe flags.
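The overlapping byte, word, and dword writes are easier to follow on a concrete buffer. The sketch below models the descriptor as a zeroed bytearray and assumes little-endian layout, and it treats the two `= ... |` assignments as plain OR-updates; the helper names (`or_field`, `init_descriptor`, `set_pipe_class`) are hypothetical.

```python
import struct

OPCODE_MASK = 0xFFFFCFFF  # clears bits 12-13 of the opcode word

def or_field(desc, fmt, offset, value):
    """OR `value` into the little-endian field of format `fmt` at `offset`."""
    (current,) = struct.unpack_from(fmt, desc, offset)
    struct.pack_into(fmt, desc, offset, current | value)

def init_descriptor(desc):
    or_field(desc, "<I", 196, 0xF8000)  # pipe mask = all (bits 15..19)
    or_field(desc, "<B", 200, 0x1F)     # read barrier mask = all
    or_field(desc, "<H", 198, 0x1F0)    # throughput class = max (assumed OR-update)
    or_field(desc, "<H", 200, 0x3E0)    # write barrier mask = all
    or_field(desc, "<B", 199, 0x3E)     # pipe flags = all set (assumed OR-update)

def set_pipe_class(desc, pipe_class):
    """Write a 9-bit pipe class into the low bits of the word at +196."""
    (word,) = struct.unpack_from("<H", desc, 196)
    struct.pack_into("<H", desc, 196, (word & ~0x1FF & 0xFFFF) | (pipe_class & 0x1FF))

desc = bytearray(204)            # zeroed stand-in for the a3 object
init_descriptor(desc)
default_dword = struct.unpack_from("<I", desc, 196)[0]
set_pipe_class(desc, 0x04)       # e.g. the class listed for opcode 0x18
final_dword = struct.unpack_from("<I", desc, 196)[0]
masked_opcode = 0x3018 & OPCODE_MASK
```

Starting from zeros, the five writes leave 0x3FFF8000 in the dword at +196, and writing class 0x04 changes only the low nine bits. A real descriptor may hold other bits before initialization, so these values describe the sketch, not ptxas.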

Pipe Mask Encoding

Bits 15--19 of *(DWORD*)(a3+196) select the execution pipe:

| Value | Pipe | Functional units | Resource vector indices |
|---|---|---|---|
| 0x08000 | Pipe A | ALU, integer, FP64, conversion | 0 (ALU), 2 (DFMA) |
| 0x10000 | Pipe B | FP32, tensor, SFU, half-precision | 1 (FMA), 3 (MMA), 8 (SFU) |
| 0x18000 | Pipe C | Memory, texture, wide FP64 | 4 (LSU), 5 (TEX) |
| 0xF8000 | All | Default sentinel (no constraint) | -- |
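A decoder for this field is a mask followed by a lookup. The sketch below uses only the four values listed above; `decode_pipe` and the "unknown" fallback are hypothetical additions for illustration.

```python
PIPE_MASK = 0xF8000  # bits 15..19 of the dword at a3+196

PIPES = {
    0x08000: "A",
    0x10000: "B",
    0x18000: "C",
    0xF8000: "all",
}

def decode_pipe(dword):
    """Return the pipe name for the pipe-mask bits of a descriptor dword."""
    return PIPES.get(dword & PIPE_MASK, "unknown")

# The mask really is bits 15 through 19
mask_from_bits = sum(1 << bit for bit in range(15, 20))
```

For example, `decode_pipe(0x00008004)` gives "A" regardless of the pipe class stored in the low bits, because those bits fall outside the mask.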

Sub-Class Encoding

Bits 4--7 of *(WORD*)(a3+198) encode the sub-class within the pipe:

| Value | Sub-class | Instruction category |
|---|---|---|
| 0x10 | Control flow | Branch, predicate, miscellaneous |
| 0x20 | Integer ALU | Conversion, barrier, integer ops |
| 0x30 | FP32 / SFU | Single-precision, half-precision |
| 0x40 | FP64 / Tensor | Double-precision wide, tensor core |

Pipe Flags Encoding

Bits 1--5 of *(BYTE*)(a3+199) encode sub-unit affinity:

| Value | Meaning |
|---|---|
| 0x02 | Narrow ALU sub-unit |
| 0x04 | ALU (integer / conversion) |
| 0x06 | Load/store or wide ALU |
| 0x08 | SFU / half-precision pipe |
| 0x0A | FP64 wide (double-precision) |
| 0x0C | Tensor core pipe |
| 0x3E | All flags set (default) |
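The sub-class and pipe-flags fields decode the same way as the pipe mask. The sketch below takes the word at +198 and the byte at +199 as separate inputs and uses only the values from the two tables above; the function names and the "other" fallback are hypothetical.

```python
SUBCLASS_MASK = 0xF0  # bits 4..7 of the word at a3+198
FLAGS_MASK = 0x3E     # bits 1..5 of the byte at a3+199

SUBCLASSES = {
    0x10: "control flow",
    0x20: "integer ALU",
    0x30: "FP32 / SFU",
    0x40: "FP64 / tensor",
}

PIPE_FLAGS = {
    0x02: "narrow ALU sub-unit",
    0x04: "ALU (integer / conversion)",
    0x06: "load/store or wide ALU",
    0x08: "SFU / half-precision pipe",
    0x0A: "FP64 wide",
    0x0C: "tensor core pipe",
    0x3E: "all flags set (default)",
}

def decode_subclass(word_198):
    return SUBCLASSES.get(word_198 & SUBCLASS_MASK, "other")

def decode_pipe_flags(byte_199):
    return PIPE_FLAGS.get(byte_199 & FLAGS_MASK, "other")
```

A word of 0x0020 decodes as "integer ALU" and a byte of 0x0C as "tensor core pipe". Values that the tables do not list fall into "other" here, which says nothing about how ptxas itself treats them.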

Opcode-to-Pipe-Class Mapping

The complete switch covers 80+ Ori opcodes. Representative mappings:

| Ori opcode | Pipe class | Pipe | Sub-class | SASS instruction | Decision logic |
|---|---|---|---|---|---|
| 1 | 0x08 | -- | 0x10 | IMAD | Always |
| 2--7 (wide) | 0x03 | B (0x10000) | 0x30 | IMAD_WIDE, IADD3, etc. | sub_7D6780 = true |
| 2--7 (wide, v6=6) | 0x03 | C (0x18000) | 0x40 | LOP3 (wide, FP64) | Opcode 6, wide |
| 2--7 (narrow) | 0x0C | A (0x08000) | -- | IMAD, IADD3, etc. | Narrow, type != 19 |
| 2--7 (narrow, t=19) | 0x7B | -- | -- | IMAD (BF16/FP8 type) | Type 19 path |
| 8 (flag clear) | 0x33 | -- | -- | IABS (no guard) | Operand flag bit 0 |
| 8 (flag set) | 0x34 | -- | -- | IABS (guarded) | Operand flag bit 0 |
| 0x10 (flagged) | 0x68 | -- | -- | ATOM (flagged) | Operand bit 2 |
| 0x10 (mem=3) | 0x67 | -- | -- | ATOM (shared) | sub_7DFFC0 = 3 |
| 0x10 (mem=4) | 0x69 | -- | -- | ATOM (constant) | sub_7DFFC0 = 4 |
| 0x10 (other) | 0x66 | -- | -- | ATOM (global) | Default |
| 0x12 (no 0x400) | 0x3D | -- | -- | FADD (standard) | Operand bit 10 clear |
| 0x12 (0x400 set) | 0x78 | -- | -- | FADD (const-bank) | Operand bit 10 set |
| 0x17 (op1 reg6) | 0x37 | -- | -- | S2R (tensor reg, op1) | *(desc+64) = 6 |
| 0x17 (op2 reg6) | 0x36 | -- | -- | S2R (tensor reg, op2) | *(desc+64) = 6 |
| 0x17 (other) | 0x38 | -- | -- | S2R (standard) | Neither operand reg6 |
| 0x18 | 0x04 | A (0x08000) | 0x20 | FSETP | Always |
| 0x24 (wide) | 0x14 | B (0x10000) | 0x30 | PRMT (FP width) | sub_7D6780 = true |
| 0x24 (narrow) | 0x11 | B (0x10000) | 0x30 | PRMT (integer) | sub_7D6780 = false |
| 0x33 | 0x21 | A (0x08000) | 0x20 | IDP | Always; flags 0x06 |
| 0x3C (mem ops) | 0x2B--0x32 | -- | -- | STG variants | 6-way split on flags |
| 0x3E (mem ops) | 0x2D--0x2E | -- | -- | LDL variants | Flag / no-flag split |
| 0x42 | 0x5D | -- | -- | MUFU (SFU) | Always |
| 0x4D | 0x84--0x85 | B (0x10000) | 0x40 | WGMMA-class | Extended tensor fields |
| 0x4E (mem ops) | 0x2F--0x30 | -- | -- | LD (generic) | Flag / no-flag split |
| 0x66 | 0x09 | B (0x10000) | 0x30 | DEPBAR | Always; flags 0x08 |
| 0x82 / 130 (ext) | 0x17 | -- | -- | NANOTRAP (extended); HSET2 in ROT13 | sub_A9AB10 = true |
| 0x82 / 130 (ctrl) | 0x13 | all (0xF8000) | 0x10 | NANOTRAP (control); HSET2 in ROT13 | vtable+640 |
| 0xC9--0xCA (wide) | 0x07 | A (0x08000) | -- | DFMA, DADD (wide) | sub_7D6780 = true |
| 0xD1 | 0x05 | A (0x08000) | 0x20 | DFMA | Always |
| 0xD2 | 0x0A | A (0x08000) | 0x30 | DFMA variant | Sub-class 0x30, flag 0x04 |
| 0xF0 | 0x0F | A (0x08000) | -- | F2F | Flags 0x04 |
| 0x10E | 0x7E | B (0x10000) | -- | HMMA_16 | Flags 0x08 |
| 0x117 | 0x80 | B (0x10000) | 0x40 | HMMA_32 | Tensor pipe; flags 0x0C |
| 0x11A | 0x81 | B (0x10000) | 0x40 | IMMA | Tensor pipe |
| default | 0x88 | -- | -- | (unrecognized) | Sentinel |
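The rows marked "Always" behave like a plain table lookup on the masked opcode. The sketch below covers only those rows; `ALWAYS_CLASSES` and `lookup_always` are hypothetical names, and returning None for everything else is a choice made here because the sketch cannot tell a conditional opcode from one that would reach the 0x88 default.

```python
OPCODE_MASK = 0xFFFFCFFF  # modifier bits 12-13 are cleared before dispatch

# Only the rows whose decision logic reads "Always"
ALWAYS_CLASSES = {
    0x01: 0x08,  # IMAD
    0x18: 0x04,  # FSETP
    0x33: 0x21,  # IDP
    0x42: 0x5D,  # MUFU
    0x66: 0x09,  # DEPBAR
    0xD1: 0x05,  # DFMA
}

def lookup_always(opcode_word):
    """Return the pipe class for an unconditional row, or None if not covered."""
    return ALWAYS_CLASSES.get(opcode_word & OPCODE_MASK)
```

With the modifier bits set, `lookup_always(0x1018)` still resolves to 0x04, since the mask reduces the word to 0x18 first.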

Decision Axes

The function dispatches on three axes beyond the opcode:

  1. Data type width: sub_7D6780(*(a2+76)) returns true for wide types (FP64). Wide types route to pipe A or C with sub-class 0x30 or 0x40; narrow types route to pipe A with sub-class 0x20.

  2. Memory access classification: sub_7DFFC0(a2, code_obj) returns a memory space code (3 = shared, 4 = constant). Used for ATOM (case 0x10) to split into 4 pipe classes by memory space.

  3. Operand register class: *(descriptor+64) from the register descriptor. Class 6 (tensor/accumulator register file) triggers distinct pipe classes for S2R (case 0x17) and DFMA/DADD variants.

Additionally, two architectural gates control tensor instruction classes:

  • *(a1+25) flag and sub_1370F40 gate tensor-extended pipe classes. When disabled, tensor instructions fall through to class 0x141 (a sentinel).
  • vtable+3184 on the code object checks a feature gate for CALL instruction classification.
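Two of these axes can be sketched from the ATOM and S2R rows of the mapping table. The function names are hypothetical, the inputs stand in for the results of sub_7DFFC0 and the *(descriptor+64) reads, and the order in which the conditions are tested is an assumption taken from the row order, not from the decompiled code.

```python
SHARED, CONSTANT = 3, 4   # memory space codes returned by sub_7DFFC0
TENSOR_REG_CLASS = 6      # register class read from *(descriptor+64)

def atom_pipe_class(flagged, mem_space):
    """ATOM (case 0x10): four classes selected by a flag bit and the memory space."""
    if flagged:                  # operand bit 2 set
        return 0x68
    if mem_space == SHARED:
        return 0x67
    if mem_space == CONSTANT:
        return 0x69
    return 0x66                  # listed as the global / default row

def s2r_pipe_class(op1_reg_class, op2_reg_class):
    """S2R (case 0x17): three classes selected by operand register class."""
    if op1_reg_class == TENSOR_REG_CLASS:
        return 0x37
    if op2_reg_class == TENSOR_REG_CLASS:
        return 0x36
    return 0x38
```

If the real code tests the memory space before the flag bit, a flagged shared-memory ATOM would land in a different class than this sketch returns, so the precedence should be checked against the binary.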

Memory Instruction Pipe Variants

Load/store instructions (cases 0x3C, 0x3E, 0x4E) receive a 6-way pipe class split based on two properties:

| Property | Test method |
|---|---|
| Same-source vs different-source | sub_91E7A0(a2, 0) vs sub_91E7A0(a2, 1) |
| Has flag operand | sub_91E860(code_obj, a2, i) returns 8 |

| Variant | STG (0x3C) | LDL (0x3E) | LD (0x4E) |
|---|---|---|---|
| Same-src, no flag | 0x31 | (n/a) | (n/a) |
| Same-src, flagged | 0x32 | (n/a) | (n/a) |
| Diff-src, no flag | 0x2B | 0x2D | 0x2F |
| Diff-src, flagged | 0x2C | 0x2E | 0x30 |

This fine-grained split allows the SASS encoder to select different stall counts and barrier patterns depending on whether the load/store has a predicate guard and whether the source address register is shared with another operand.
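The variant table can be written as a lookup keyed on the opcode and the two properties. This is a sketch of the table only: `MEM_CLASSES` and `mem_pipe_class` are hypothetical names, the two booleans stand in for the sub_91E7A0 and sub_91E860 tests, and the (n/a) cells are modeled as None.

```python
STG, LDL, LD = 0x3C, 0x3E, 0x4E

# (opcode, same_source, flagged) -> pipe class
MEM_CLASSES = {
    (STG, True, False): 0x31,
    (STG, True, True): 0x32,
    (STG, False, False): 0x2B,
    (STG, False, True): 0x2C,
    (LDL, False, False): 0x2D,
    (LDL, False, True): 0x2E,
    (LD, False, False): 0x2F,
    (LD, False, True): 0x30,
}

def mem_pipe_class(opcode, same_source, flagged):
    """Return the listed pipe class, or None for an (n/a) combination."""
    return MEM_CLASSES.get((opcode, same_source, flagged))
```

For instance, `mem_pipe_class(STG, True, True)` yields 0x32, while a same-source LDL has no entry and returns None.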

Type-19 Special Path

When sub_7D6780 returns false (not wide) and *(a2+76) == 19, several instruction groups receive distinct pipe classes in the 0x7A--0x7D range:

| Ori opcode group | Standard class | Type-19 class | Likely type |
|---|---|---|---|
| 2--7 (narrow) | 0x0C | 0x7B | BF16 / FP8 |
| 0x6E--0x72 (narrow) | 0x0B | 0x7A | BF16 / FP8 |
| 0x8B--0x8C (narrow) | 0x0D | 0x7C | BF16 / FP8 |
| 0xC9--0xCA | 0x10/0x12 | 0x7D | BF16 / FP8 |

Type 19 likely corresponds to BF16 or FP8, which require different pipeline routing than standard FP16/FP32/FP64 types on Hopper and Blackwell architectures.
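The type-19 override for the three narrow groups can be sketched as a two-way choice per group. The group labels, `NARROW_CLASSES`, and `narrow_pipe_class` are hypothetical; the 0xC9--0xCA row is left out because its standard class is listed as two alternatives.

```python
TYPE_19 = 19  # the value compared against *(a2+76)

# group label -> (standard class, type-19 class)
NARROW_CLASSES = {
    "2-7": (0x0C, 0x7B),
    "0x6E-0x72": (0x0B, 0x7A),
    "0x8B-0x8C": (0x0D, 0x7C),
}

def narrow_pipe_class(group, is_wide, type_code):
    """Pick the pipe class for a narrow instruction in one of the listed groups."""
    if is_wide:
        # Wide types (sub_7D6780 = true) take a different path not modeled here
        raise ValueError("wide types are outside this sketch")
    standard, type19 = NARROW_CLASSES[group]
    return type19 if type_code == TYPE_19 else standard
```

Calling `narrow_pipe_class("2-7", False, 19)` selects 0x7B, and any other narrow type code in that group falls back to 0x0C.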

Function Map

| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_693BC0 | 22 lines | MemorySpaceClassify -- return memory space code | HIGH |
| sub_695530 | 606 lines | ComputeLatencies -- per-BB latency computation | HIGH |
| sub_704D30 | 14 KB | GetFunctionalUnit -- SASS opcode to FU mapping | HIGH |
| sub_73A1D0 | ~6 KB | LDSLatencyStats -- shared memory latency stats | HIGH |
| sub_73A7F0 | ~6 KB | LDGLatencyStats -- global memory latency stats | HIGH |
| sub_73ADF0 | 6.5 KB | XU64LatencyStats -- extended unit latency stats | HIGH |
| sub_73B360 | 28.7 KB | MacLoopSchedulingAnalytics -- latency hiding report | HIGH |
| sub_799860 | 2.9 KB | ClassifyInstructionLatency | HIGH |
| sub_89FBA0 | 85 KB | SetOpcodeLatencies -- per-opcode scheduling class | HIGH |
| sub_8B5400 | 14 KB | ScheduleForLatency -- latency-optimized scheduling | MEDIUM |
| sub_8B77C0 | 15 KB | DualIssueScheduler -- dual-issue scheduling engine | MEDIUM |
| sub_8BDC40 | 7.9 KB | DualIssuePairing -- instruction pair selection | MEDIUM |
| sub_8C67A0 | 3.7 KB | ComputeResourceCost -- per-instruction FU cost | HIGH |
| sub_8C7290 | 5.1 KB | GetResourceVector -- SSE-optimized copy | HIGH |
| sub_8CCF80 | 2.3 KB | IsLongLatencyOp -- latency > 19 check | HIGH |
| sub_8CF5D0 | 3.5 KB | CheckDualIssueEligibility | HIGH |
| sub_8D3E20 | 2.1 KB | ComputeStallCycles -- required stall count | HIGH |
| sub_8D7760 | 41 KB | StallAndBarrierInsertion -- encode stalls/barriers | HIGH |
| sub_8E3AD0 | -- | CopyProfileEntries -- finalize HW table | MEDIUM |
| sub_8E4400 | 3.3 KB | InitHWProfile_Warp -- warp dispatch params | HIGH |
| sub_8E4920 | 6.9 KB | BuildScoreboardEntries -- scoreboard BST | HIGH |
| sub_8E4D80 | 15 lines | StringRefCleanup -- decref string in record copy | HIGH |
| sub_8E4F20 | ~1.5 KB | EmitWeightEntry -- supplementary weight record (type 'W') | HIGH |
| sub_8E5310 | ~1.5 KB | EmitVariantSection -- variant sub-records (type ';') | HIGH |
| sub_8E5530 | ~1.5 KB | EmitDimensionEntries -- dimension sub-records (type '1') | HIGH |
| sub_8E5CA0 | 20 KB | EmitScheduleOutput -- scheduling control words | HIGH |
| sub_8E6760 | 2.9 KB | EmitGroupBoundary -- group boundary marker | HIGH |
| sub_8E6B40 | 2.9 KB | EmitSchedEntry -- standard scheduling entry | HIGH |
| sub_8E6D40 | 2.9 KB | EmitBarrierEntry -- barrier/sync entry | HIGH |
| sub_8E6F20 | 2.9 KB | EmitWaitEntry -- wait dependency entry | HIGH |
| sub_8E7110 | 2.9 KB | EmitScoreboardEntry -- scoreboard entry | HIGH |
| sub_8E7300 | 3.3 KB | HWTable_sm70 -- Volta latency table | CERTAIN |
| sub_8E7540 | 2.9 KB | HWTable_sm72 -- Xavier latency table | CERTAIN |
| sub_8E7720 | 3.5 KB | HWTable_sm75 -- Turing latency table | CERTAIN |
| sub_8E7940 | 2.9 KB | HWTable_sm80_base -- Ampere base table | CERTAIN |
| sub_8E7B40 | 3.3 KB | HWTable_sm80 -- Ampere full table | CERTAIN |
| sub_8E7D80 | 4.4 KB | HWTable_sm86 -- GA10x table | CERTAIN |
| sub_8E8070 | 3.5 KB | HWTable_sm87 -- Orin table | CERTAIN |
| sub_8E8280 | 3.1 KB | HWTable_sm89 -- Ada Lovelace table | CERTAIN |
| sub_8E8480 | 5.2 KB | HWTable_sm90 -- Hopper table | CERTAIN |
| sub_8E8780 | 4.6 KB | HWTable_sm90a -- Hopper accelerated table | CERTAIN |
| sub_8E8A90 | 3.0 KB | HWTable_sm100 -- Blackwell DC table | CERTAIN |
| sub_8E8CB0 | 949 B | HWTable_sm100_short -- Blackwell supplementary | CERTAIN |
| sub_8E8DB0 | 1.7 KB | HWTable_sm103 -- Blackwell Ultra table | CERTAIN |
| sub_8E8F60 | 618 B | HWTable_sm103_short -- BU supplementary | CERTAIN |
| sub_8E9000 | 2.9 KB | HWTable_sm120 -- RTX 50xx table | CERTAIN |
| sub_8E92E0 | 5.5 KB | HWTable_sm120_ext -- RTX 50xx extended | CERTAIN |
| sub_8E97B0 | 8.8 KB | HWTable_universal -- fallback table | CERTAIN |
| sub_8E9DC0 | 4.8 KB | EmitLatencyEntry -- HW table entry helper | HIGH |
| sub_8EFA10 | 18 KB | EmitScheduleReport -- statistics output | HIGH |
| sub_8F0CD0 | 24 B | MapFUClassID -- (opcode, name) to class | HIGH |
| sub_8F1EB0 | 15 KB | EncodeScheduleWords -- SASS control word output | HIGH |
| sub_8F3130 | 1.0 KB | EncodeStallField | HIGH |
| sub_8F31F0 | 6.1 KB | EncodeBarrierField | HIGH |
| sub_8F3650 | 2.7 KB | EncodeYieldField | HIGH |
| sub_8F3860 | 3.0 KB | EncodeScoreboardField | HIGH |
| sub_8F4140 | 5.6 KB | EncodeFullControlWord | HIGH |
| sub_8F47E0 | ~50 B | DetectCutlass -- strstr for "cutlass" | CERTAIN |
| sub_A08910 | 39 lines | GetRegisterLatency -- operand cost query | HIGH |
| sub_A08A00 | 345 lines | ResourceModel -- 3-mode FU cost computation | HIGH |
| sub_A09530 | 91 lines | UpdateStallCycles -- per-instruction stall update | HIGH |
| sub_A9CDE0 | -- | IsHotMemory -- global/texture classification | HIGH |
| sub_A9CF90 | -- | IsColdMemory -- constant/shared classification | HIGH |
| sub_13710B0 | 7.1 KB | AssignPipeClass -- SASS-level pipe assignment | HIGH |
| sub_1370F40 | ~500 B | CheckTensorFeature -- gates tensor pipe classes | HIGH |
| sub_7D6780 | ~100 B | IsWideType -- true for FP64/wide types | HIGH |
| sub_7DFFC0 | ~200 B | ClassifyMemAccess -- 3=shared, 4=constant | HIGH |
| sub_7E3640 | ~100 B | GetCustomPipe -- 5-bit pipe sub-class | MEDIUM |
| sub_91E7A0 | ~100 B | GetSrcEncoding -- source operand encoding query | MEDIUM |
| sub_91E860 | ~100 B | GetOperandType -- operand type code | MEDIUM |
| sub_A9AB10 | ~100 B | NeedsExtEncoding -- extended encoding check | MEDIUM |

Cross-References