Register Model (R / UR / P / UP)

All addresses on this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

ptxas models four hardware register files plus two auxiliary barrier register files. Every Ori instruction references registers from one or more of these files. During the optimization phases (0--158), registers carry virtual numbers; the fat-point register allocator (phase 159+) maps them to physical hardware slots. This page documents the register files, the virtual/physical register descriptor, the 7 allocator register classes, wide register conventions, special registers, the operand encoding format, pressure tracking, and SM-specific limits.

Four Register Files

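| File | Mnemonic | Width | Usable range | Zero/True | ABI type | Introduced |
| --- | --- | --- | --- | --- | --- | --- |
| R | General-purpose | 32 bits | R0 -- R254 | RZ (R255) | 2 | sm_30 |
| UR | Uniform | 32 bits | UR0 -- UR62 | URZ (UR63) | 3 | sm_75 |
| P | Predicate | 1 bit | P0 -- P6 | PT (P7) | 5 | sm_30 |
| UP | Uniform predicate | 1 bit | UP0 -- UP6 | UPT (UP7) | -- | sm_75 |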

R registers are per-thread 32-bit general-purpose registers. They hold integers, floating-point values, and addresses. 64-bit values occupy consecutive even/odd pairs (R4:R5); 128-bit values occupy aligned quads (R0:R1:R2:R3). The total R-register count for a function is field[159] + field[102] (reserved + allocated), stored in the Code Object at offsets +159 and +102. Maximum usable: 254 (R0--R254). R255 is the hardware zero register RZ -- reads return 0, writes are discarded.

UR registers (uniform general-purpose) are warp-uniform: every thread in a warp sees the same value. Available on sm_75 and later. Range: UR0--UR62 usable, UR63 is the uniform zero register URZ. The UR count is at Code Object +99. Attempting to use UR on pre-sm_75 targets triggers the diagnostic "Uniform registers were disallowed, but the compiler required (%d) uniform registers for correct code generation.".

P registers are 1-bit predicates used for conditional execution (@P0 FADD ...) and branch conditions. P0--P6 are usable; P7 is the hardwired always-true predicate PT. Writes to PT are discarded. The assembler uses PT as the default predicate for unconditional instructions. In the allocator, predicate registers support half-width packing: two virtual predicates can be packed into one physical predicate slot, with the hi/lo distinction stored in bit 23 (0x800000) of the virtual register flags.

UP registers are the uniform predicate variant. UP0--UP6 are usable; UP7 is UPT (always-true). Available on sm_75+.

Seven Allocator Register Classes

The fat-point allocator processes 7 register classes, indexed by the reg_type field at vreg+64. Class 0 is the cross-class constraint propagation channel and is skipped in the main allocation loop. Classes 1--6 are allocated independently, in order. The allocator distribution loop in sub_9721C0 (lines 520--549) reads *(int*)(vreg+64) and uses it directly as the class bucket index, guarded by reg_type <= 6:

| Class ID | Name | Width | HW limit | Description |
| --- | --- | --- | --- | --- |
| 0 | (unified) | -- | -- | Cross-class constraint propagation (skipped) |
| 1 | R | 32-bit | 255 | General-purpose registers (R0--R254) |
| 2 | R (alt) | 32-bit | 255 | GPR variant (used for RZ sentinel, stat collector alternate) |
| 3 | UR | 32-bit | 63 | Uniform general-purpose (UR0--UR62) |
| 4 | UR (ext) | 32-bit | 63 | Uniform GPR variant (triggers flag update at +1369 in constructor) |
| 5 | P / UP | 1-bit | 7 | Predicate registers (P0--P6, UP0--UP6) |
| 6 | Tensor/Acc | 32-bit | varies | Tensor/accumulator registers for MMA/WGMMA operations |

The class ID is the reg_type value stored at vreg+64. The allocator class at vreg+12 is a separate field used for instruction-level classification, not for the per-class allocation passes. The allocator's per-class linked lists at alloc[3*reg_type + 138] are populated directly from vreg+64.
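
As a rough illustration of the bucket selection described above, the sketch below computes the slot of the per-class list head from reg_type. The function name class_list_slot and the convention of returning -1 for out-of-range types are assumptions made for this example; only the 3*reg_type + 138 indexing and the reg_type <= 6 guard come from the decompiled loop.

```c
#include <assert.h>

enum { MAX_ALLOC_CLASS = 6 };   /* cutoff used by the distribution loop */

/* With the documented indexing, class 6 lands on slot 156. */
static_assert(3 * MAX_ALLOC_CLASS + 138 == 156, "class 6 maps to slot 156");

/* Returns the index of the per-class list head inside the allocator state
 * (viewed as an array of 8-byte slots), or -1 when reg_type lies outside
 * the 7-class system (for example barrier registers, reg_type == 9). */
long class_list_slot(int reg_type)
{
    if (reg_type < 0 || reg_type > MAX_ALLOC_CLASS)
        return -1;              /* hypothetical "not bucketed" marker */
    return 3L * reg_type + 138;
}
```

Under these assumptions, class 1 maps to slot 141, while a barrier register (reg_type 9) yields -1 and is left to its separate allocation mechanism.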

Per-class state is initialized via the target descriptor vtable call vtable[896](alloc_state, class_id), which populates per-class register file descriptors at alloc[114..156] (four 8-byte entries per class).

Barrier Registers

Barrier registers (B and UB) are a distinct register file used by the BAR, DEPBAR, BSSY, and BSYNC instructions for warp-level and CTA-level synchronization. B0--B15 are the non-uniform barrier registers; UB0--UB15 are the uniform variant. Barrier registers have reg_type = 9, which is above the <= 6 cutoff for the main allocator class buckets. They are handled by a separate allocation mechanism outside the 7-class system.

Tensor/Accumulator Registers (Class 6)

Class 6 registers are created during intrinsic lowering of tensor core operations (MMA, WGMMA, HMMA, DMMA). Over 30 intrinsic lowering functions in the 0x6B--0x6D address range call sub_91BF30(ptr, ctx, 6) to create these registers. The GMMA pipeline pass (sub_ADA740, sub_69E590) identifies accumulator operands by checking *(vreg+64) == 6. The accumulator counting function at sub_78C6B0 uses the pair-mode bits at vreg+48 (bits 20--21) to determine whether a type-6 register consumes 1 or 2 physical R slots.

Virtual Register Descriptor

Every virtual register in a function is represented by a 160-byte descriptor allocated from the per-function arena. The register file array is at Code Object +88, indexed as *(ctx+88) + 8*regId. The descriptor is created by sub_91BF30 (register creation function).

Descriptor Layout

| Offset | Size | Type | Field | Notes |
| --- | --- | --- | --- | --- |
| +0 | 8 | ptr | next | Linked list pointer (allocation worklist) |
| +8 | 4 | i32 | id | Unique register ID within function |
| +12 | 4 | i32 | class_index | Allocator register class (0--6) |
| +20 | 1 | u8 | flags_byte | Bit 0x20 = live |
| +24 | 4 | i32 | bb_index | Basic block of definition |
| +28 | 4 | i32 | epoch | Epoch counter for liveness tracking |
| +32 | 8 | ptr | alias_next | Next aliased register (coalescing chain) |
| +36 | 8 | ptr | alias_parent | Coalesced parent pointer |
| +40 | 4 | f32 | spill_cost | Accumulated spill cost |
| +48 | 8 | u64 | flags | Multi-purpose flag word (see below) |
| +56 | 8 | ptr | def_instr | Defining instruction pointer |
| +64 | 4 | i32 | reg_type | Register file type enum |
| +68 | 4 | i32 | physical_reg | Physical register number (-1 = unassigned) |
| +72 | 1 | u8 | size | 0 = scalar, nonzero = encoded width |
| +76 | 4 | f32 | secondary_cost | Secondary spill cost |
| +80 | 4 | i32 | spill_flag | 0 = not spilled, 1 = spilled |
| +97 | 2 | u16 | reserved | |
| +104 | 8 | ptr | use_chain | Use chain head (instruction pointer) |
| +112 | 8 | ptr | def_chain | Definition chain |
| +120 | 8 | ptr | regfile_next | Next in register file linked list |
| +128 | 8 | ptr | linked_next | Next in linked-register chain |
| +136 | 8 | ptr | reserved2 | |
| +144 | 8 | ptr | constraint_list | Constraint list head for allocator |
| +152 | 8 | ptr | reserved3 | |

Initial values set by the constructor (sub_91BF30):

vreg->next           = NULL;            // +0
vreg->id             = ctx->reg_count + 1;  // +8, auto-incrementing
vreg->class_index    = 0;               // +12
vreg->flags_byte     = 0;               // +20
vreg->alias_parent   = (ptr)-1;         // +20..27 (qword write)
vreg->physical_reg   = -1;              // +68 (unassigned)
vreg->reg_type       = a3;              // +64 (passed as argument)
vreg->size           = 0;               // +72
vreg->spill_flag     = 0;               // +80
vreg->use_chain      = NULL;            // +104
vreg->def_chain      = NULL;            // +112
vreg->constraint_list = NULL;           // +144

For predicate types (a3 == 2 or a3 == 3), the flags word at +48 is initialized to 0x1000 (4096). For all other types, it is initialized to 0x1018 (4120). If the type is 7 (alternate predicate classification), the physical register is initialized to 0 instead of -1.
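
The type-dependent part of the constructor can be summarized in a few lines. The following is a minimal sketch, not the decompiled code: the VRegInit struct and the function name vreg_initial_state are hypothetical and carry only the three fields discussed here rather than the full 160-byte layout.

```c
#include <assert.h>
#include <stdint.h>

/* The decimal values quoted in the text match the hex constants. */
static_assert(0x1000 == 4096 && 0x1018 == 4120, "flag constants");

typedef struct {
    uint64_t flags;         /* descriptor offset +48 */
    int32_t  reg_type;      /* descriptor offset +64 */
    int32_t  physical_reg;  /* descriptor offset +68, -1 = unassigned */
} VRegInit;                 /* hypothetical subset of the descriptor */

VRegInit vreg_initial_state(int32_t reg_type)
{
    VRegInit v;
    v.reg_type = reg_type;
    /* Types 2 and 3 start with only the base flag; all others get 0x1018. */
    v.flags = (reg_type == 2 || reg_type == 3) ? UINT64_C(0x1000)
                                               : UINT64_C(0x1018);
    /* Type 7 starts at physical register 0 instead of "unassigned". */
    v.physical_reg = (reg_type == 7) ? 0 : -1;
    return v;
}
```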

Flag Bits at +48

| Bit | Mask | Meaning |
| --- | --- | --- |
| 9 | 0x200 | Pre-assigned / fixed register |
| 10 | 0x400 | Coalesced source |
| 11 | 0x800 | Coalesced target |
| 12 | 0x1000 | Base flag (set for all types) |
| 14 | 0x4000 | Spill marker (already spilled) |
| 18 | 0x40000 | Needs-spill (allocator sets when over budget) |
| 20--21 | (pair mode) | 0 = single, 1 = lo-half of pair, 3 = double-width |
| 22 | 0x400000 | Constrained to architecture limit |
| 23 | 0x800000 | Hi-half of pair (predicate half-width packing) |
| 27 | 0x8000000 | Special handling flag |
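
To make the bit positions concrete, here is a small sketch of how the flag word might be queried. The macro and function names are hypothetical; the masks and the pair-mode position (bits 20--21) are taken from the table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Masks from the flag table; the names are illustrative only. */
#define VREG_FIXED        UINT64_C(0x200)
#define VREG_SPILLED      UINT64_C(0x4000)
#define VREG_NEEDS_SPILL  UINT64_C(0x40000)
#define VREG_HI_HALF      UINT64_C(0x800000)

static_assert(VREG_HI_HALF == (UINT64_C(1) << 23), "hi-half flag is bit 23");
static_assert(VREG_NEEDS_SPILL == (UINT64_C(1) << 18), "needs-spill is bit 18");

enum { PAIR_SINGLE = 0, PAIR_LO_HALF = 1, PAIR_DOUBLE = 3 };

/* Extract the 2-bit pair mode stored in bits 20--21. */
unsigned vreg_pair_mode(uint64_t flags)
{
    return (unsigned)((flags >> 20) & 3u);
}

bool vreg_is_hi_half(uint64_t flags)
{
    return (flags & VREG_HI_HALF) != 0;
}

/* Return a copy of the flag word with the needs-spill bit set. */
uint64_t vreg_mark_needs_spill(uint64_t flags)
{
    return flags | VREG_NEEDS_SPILL;
}
```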

Register File Type Enum (at +64)

This enum determines the register file a VR belongs to. It is used by the register class name table at off_21D2400 to map type values to printable strings ("R", "UR", "P", etc.) for diagnostic output such as "Referencing undefined register: %s%d".

| Value | File | Alloc class | Description |
| --- | --- | --- | --- |
| 1 | R | 1 | General-purpose register (32-bit) |
| 2 | R (alt) | 2 | GPR variant (RZ sentinel in sub_7D82E0, stat collector alternate) |
| 3 | UR | 3 | Uniform register (32-bit) |
| 4 | UR (ext) | 4 | Uniform GPR variant (triggers flag update at +1369 in constructor) |
| 5 | P / UP | 5 | Predicate register (1-bit); covers both P and UP |
| 6 | Tensor/Acc | 6 | Tensor/accumulator register for MMA/WGMMA operations |
| 7 | P (alt) | -- | Predicate variant (physical = 0 at init); above allocator cutoff |
| 8 | -- | -- | Extended type (created by sub_83EF00); above allocator cutoff |
| 9 | B / UB | -- | Barrier register; above allocator cutoff, separate allocation |
| 10 | R2 | -- | Extended register pair (64-bit, two consecutive R regs) |
| 11 | R4 | -- | Extended register quad (128-bit, four consecutive R regs) |

Values 0--6 are within the allocator's class system (the distribution loop in sub_9721C0 guards with reg_type <= 6). Values 7+ are handled by separate mechanisms. The off_21D2400 name table is indexed by reg_type and provides display strings for diagnostic output.

The stat collector at sub_A60B60 (24 KB) enumerates approximately 25 register sub-classes including R, P, B, UR, UP, UB, Tensor/Acc, SRZ, PT, RZ, and others by iterating vtable getter functions per register class.

Wide Registers

NVIDIA GPUs have only 32-bit physical registers. Wider values are composed from consecutive registers.

64-Bit Pairs (R2)

A 64-bit value occupies two consecutive registers where the base register has an even index: R0:R1, R2:R3, R4:R5, and so on. The low 32 bits reside in the even register; the high 32 bits in the odd register. In the Ori IR, a 64-bit pair is represented by a single virtual register with:

  • vreg+64 (type) = 10 (extended pair)
  • vreg+48 bits 20--21 (pair mode) = 3 (double-width)

The allocator selects even-numbered physical slots by scanning with stride 2 instead of 1. The register consumption function (sub_939CE0) computes slot + (1 << (pair_mode == 3)) - 1, consuming two physical slots.

128-Bit Quads (R4)

A 128-bit value occupies four consecutive registers aligned to a 4-register boundary: R0:R1:R2:R3, R4:R5:R6:R7, etc. Used by texture instructions, wide loads/stores, and tensor core operations. In the Ori IR:

  • vreg+64 (type) = 11 (extended quad)
  • Allocator scans with stride 4

Alignment Constraints

| Width | Base alignment | Stride | Example |
| --- | --- | --- | --- |
| 32-bit (scalar) | Any | 1 | R7 |
| 64-bit (pair) | Even | 2 | R4:R5 |
| 128-bit (quad) | 4-aligned | 4 | R8:R9:R10:R11 |
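
The alignment rules reduce to simple modular arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and the width is expressed as a register count (1, 2, or 4) rather than through the descriptor fields. The last-slot formula mirrors the expression attributed to sub_939CE0.

```c
#include <assert.h>
#include <stdbool.h>

/* True when `base` is a legal starting register for a value that spans
 * `width_regs` consecutive 32-bit registers (1, 2, or 4). */
bool base_is_aligned(unsigned base, unsigned width_regs)
{
    assert(width_regs == 1 || width_regs == 2 || width_regs == 4);
    return base % width_regs == 0;
}

/* Round `from` up to the next legal base; the scan stride equals the width. */
unsigned next_aligned_base(unsigned from, unsigned width_regs)
{
    assert(width_regs == 1 || width_regs == 2 || width_regs == 4);
    return (from + width_regs - 1) / width_regs * width_regs;
}

/* Last physical slot touched, following slot + (1 << (pair_mode == 3)) - 1. */
unsigned last_slot(unsigned slot, unsigned pair_mode)
{
    return slot + (1u << (pair_mode == 3)) - 1u;
}
```

For example, a pair may start at R4 but not at R5, and a quad scanning from R5 would next be tried at R8.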

The texture instruction decoder (sub_1170920) validates even-register alignment via a dedicated helper (sub_1170680) that checks if a register index falls within the set {34, 36, 38, ..., 78} and returns 0 if misaligned.

The SASS instruction encoder for register pairs (sub_112CDA0, 8.9 KB) maps 40 register pair combinations (0/1, 2/3, ..., 78/79) to packed 5-bit encoding values at 0x2000000 (33,554,432) intervals.

Special Registers

Zero and True Registers

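| Register | File | Index | Internal sentinel | Behavior |
| --- | --- | --- | --- | --- |
| RZ | R | 255 | 1023 | Reads return 0; writes discarded |
| URZ | UR | 63 | 1023 | Uniform zero; reads return 0 |
| PT | P | 7 | 31 | Always-true predicate; writes discarded |
| UPT | UP | 7 | 31 | Uniform always-true |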

The internal sentinel value 1023 (0x3FF) represents "don't care" or "zero register" throughout the Ori IR and allocator. During SASS encoding, hardware register index 255 is mapped to sentinel 1023 for R/UR files, and hardware index 7 is mapped to sentinel 31 for P/UP files. These sentinels are checked in encoders to substitute the default register value:

// Decoder: extract register operand (sub_9B3C20)
if (reg_idx == 255)
    internal_idx = 1023;   // RZ sentinel

// Decoder: extract predicate operand (sub_9B3D60)
if (pred_idx == 7)
    internal_idx = 31;     // PT sentinel

// Encoder: emit register field
if (reg == 1023)
    reg = *(a1 + 8);       // fall back to the default at a1+8 (encodes physical RZ)
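
The same mapping can be written as a pair of small helpers. This is a sketch under the assumption that only the two sentinel cases need translating and every other index passes through unchanged; the function names are hypothetical, and default_field stands in for the value the encoder reads from a1+8.

```c
#include <assert.h>

enum {
    HW_RZ = 255, SENTINEL_RZ = 1023,   /* zero register */
    HW_PT = 7,   SENTINEL_PT = 31      /* always-true predicate */
};

static_assert(SENTINEL_RZ == 0x3FF, "RZ sentinel is 0x3FF");

/* Decoder direction: hardware index to internal index. */
int decode_reg_index(int hw_index)
{
    return hw_index == HW_RZ ? SENTINEL_RZ : hw_index;
}

int decode_pred_index(int hw_index)
{
    return hw_index == HW_PT ? SENTINEL_PT : hw_index;
}

/* Encoder direction: the sentinel is replaced by the encoder's default. */
int encode_reg_index(int internal_index, int default_field)
{
    return internal_index == SENTINEL_RZ ? default_field : internal_index;
}
```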

Architectural Predicate Indices

The allocator skips architectural predicate registers by index number:

| Index | Register | Treatment |
| --- | --- | --- |
| 39 | (special) | Skipped during allocation (skip predicate sub_9446D0) |
| 41 | PT | Skipped -- hardwired true predicate |
| 42 | P0 | Skipped -- architectural predicate |
| 43 | P1 | Skipped -- architectural predicate |
| 44 | P2 | Skipped -- architectural predicate |

The skip check in sub_9446D0 returns true (skip) for register indices 41--44 and 39, regardless of register class. For other registers, it checks whether the instruction is a CSSA phi (opcode 195 with barrier type 9) or whether the register is in the exclusion set hash table at alloc+360.
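
The index test at the front of that check is small enough to sketch directly. The function name is hypothetical, and the CSSA phi and exclusion-set checks that follow it in sub_9446D0 are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

/* First stage of the skip check: architectural indices 39 and 41--44. */
bool is_architectural_index(int reg_index)
{
    return reg_index == 39 || (reg_index >= 41 && reg_index <= 44);
}

/* Self-check against the table above; note that index 40 is not skipped. */
void check_architectural_indices(void)
{
    assert(is_architectural_index(39));
    assert(!is_architectural_index(40));
    assert(is_architectural_index(41));
    assert(is_architectural_index(44));
    assert(!is_architectural_index(45));
}
```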

Special System Registers (S2R / CS2R)

Thread identity and hardware state are accessed through the S2R (Special Register to Register) and CS2R (Control/Status Register to Register) instructions. Both copy read-only hardware registers into R-file registers.

Common system register values (from PTX parser initialization at sub_451730):

| PTX name | Hardware | Description |
| --- | --- | --- |
| %tid / %ntid | SR_TID_X/Y/Z | Thread ID within CTA |
| %ctaid / %nctaid | SR_CTAID_X/Y/Z | CTA ID within grid |
| %laneid | SR_LANEID | Lane index within warp (0--31) |
| %warpid / %nwarpid | SR_WARPID | Warp index within CTA |
| %smid / %nsmid | SR_SMID | SM index |
| %gridid | SR_GRIDID | Grid identifier |
| %clock / %clock_hi / %clock64 | SR_CLOCK / SR_CLOCK_HI | Cycle counter |
| %lanemask_eq/lt/le/gt/ge | SR_LANEMASK_* | Lane bitmask variants |

The S2R register index must be between 0 and 255 inclusive; violations are reported with the diagnostic "S2R register must be between 0 and 255 inclusive". Special system register ranges are tracked at Code Object offsets +1712 (start) and +1716 (count).

Operand Encoding in Ori Instructions

Each instruction operand is encoded as a 32-bit packed value in the operand array starting at instruction offset +84. The operand at index i is at *(instr + 84 + 8*i).

Packed Operand Format (Ori IR)

 31   30      28  27            20  19                             0
+----+----------+----------------+---------------------------------+
|sign|   type   |  modifier (8)  |           index (20)            |
+----+----------+----------------+---------------------------------+
 bit 31: sign/direction flag          bits 0-19: register/symbol index
 bits 28-30: operand type (3 bits)    bit 24: pair extension flag

Extraction pattern (50+ call sites):

uint32_t operand = *(uint32_t*)(instr + 84 + 8 * i);
int type    = (operand >> 28) & 7;     // bits 28-30
int index   = operand & 0xFFFFF;       // bits 0-19
int mods    = (operand >> 20) & 0xFF;  // bits 20-27
bool is_neg = (operand >> 31) & 1;     // bit 31

| Type value | Meaning |
| --- | --- |
| 1 | Register operand (index into register file at *(ctx+88) + 8*index) |
| 5 | Symbol/constant operand (index into symbol table at *(ctx+152)) |
| 6 | Special operand (barrier, system register) |

For register operands (type 1), the index is masked as operand & 0xFFFFFF (24 bits) to extract the full register ID. Indices 41--44 are architectural predicates that are never allocated.

SASS Instruction Register Encoding

During final SASS encoding, the register operand encoder (sub_7BC030, 814 bytes, 6147 callers) packs register operands into the 128-bit instruction word:

Encoded register field (16 bits at variable bit offset):
  bit 0:      presence flag (1 = register present)
  bits 1-4:   register file type (4 bits, 12 values)
  bits 5-14:  register number (10 bits)

The 4-bit register file type field in the SASS encoding maps the internal operand type tag to hardware encoding:

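| Operand type tag | Encoded value | Register file |
| --- | --- | --- |
| 1 | 0 | R (32-bit) |
| 2 | 1 | R pair (64-bit) |
| 3 | 2 | UR (uniform 32-bit) |
| 4 | 3 | UR pair (uniform 64-bit) |
| 5 | 4 | P (predicate) |
| 6 | 5 | (reserved) |
| 7 | 6 | (reserved) |
| 8 | 7 | B (barrier) |
| 16 | 8 | (extended) |
| 32 | 9 | (extended) |
| 64 | 10 | (extended pair) |
| 128 | 11 | (extended quad) |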
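
Combining the field layout with the tag table gives a compact packing routine. The sketch below is an approximation for illustration: the function names are hypothetical, and returning 0 (presence bit clear) for an unknown tag is an assumption, not something observed in sub_7BC030.

```c
#include <assert.h>
#include <stdint.h>

/* Map the internal operand type tag to the 4-bit file code, or -1. */
int file_code_from_tag(unsigned tag)
{
    if (tag >= 1 && tag <= 8)
        return (int)tag - 1;        /* tags 1..8 map to codes 0..7 */
    switch (tag) {
    case 16:  return 8;
    case 32:  return 9;
    case 64:  return 10;
    case 128: return 11;
    default:  return -1;            /* not in the table */
    }
}

/* Pack presence (bit 0), file code (bits 1-4), and number (bits 5-14). */
uint16_t pack_reg_field(unsigned tag, unsigned reg_number)
{
    int code = file_code_from_tag(tag);
    assert(reg_number <= 0x3FFu);   /* the number field is 10 bits wide */
    if (code < 0)
        return 0;                   /* assumption: emit "no register" */
    return (uint16_t)(1u | ((unsigned)code << 1) | (reg_number << 5));
}
```

With this layout, R4 (tag 1) packs to 0x81 and UR5 (tag 3) packs to 0xA5.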

The predicate operand encoder (sub_7BCF00, 856 bytes, 1657 callers) uses a different format: 2-bit predicate type, 3-bit predicate condition, and 8-bit value. It checks for PT (operand byte[0] == 14) and handles the always-true case.

Register-Class-to-Hardware Encoding

The function sub_1B6B250 (2965 bytes, 254 callers) implements the mapping from the compiler's abstract (register_class, sub_index) pair to hardware register numbers:

hardware_reg = register_class * 32 + sub_index

For example: class 0, index 1 returns 1; class 1, index 1 returns 33; class 2, index 1 returns 65. The guard wrapper sub_1B73060 (483 callers) returns 0 for the no-register case (class=0, index=0).

The register field writer (sub_1B72F60, 483 callers) packs the encoded register number into the 128-bit instruction word with the encoding split across two bitfields:

*(v2 + 12) |= (encoded_reg << 9) & 0x3E00;       // bits [13:9]
*(v2 + 12) |= (encoded_reg << 21) & 0x1C000000;   // bits [28:26]
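
The two steps can be tried out in isolation. In this sketch the function names are hypothetical, the target DWORD is passed by value instead of through v2 + 12, and the read-back helper is an inverse derived from the two masks rather than a function found in the binary. The masks keep eight bits of encoded_reg in total: bits 0--4 go to word bits [13:9] and bits 5--7 go to word bits [28:26].

```c
#include <assert.h>
#include <stdint.h>

/* hardware_reg = register_class * 32 + sub_index */
unsigned reg_class_to_hardware(unsigned reg_class, unsigned sub_index)
{
    return reg_class * 32u + sub_index;
}

/* OR the split register field into a copy of the instruction DWORD. */
uint32_t write_reg_field(uint32_t word, uint32_t encoded_reg)
{
    assert(encoded_reg <= 0xFFu);   /* higher bits would be masked away */
    word |= (encoded_reg << 9)  & UINT32_C(0x3E00);      /* bits [13:9]  */
    word |= (encoded_reg << 21) & UINT32_C(0x1C000000);  /* bits [28:26] */
    return word;
}

/* Inverse of write_reg_field for a word whose field bits were clear. */
uint32_t read_reg_field(uint32_t word)
{
    uint32_t low5  = (word >> 9)  & 0x1Fu;
    uint32_t high3 = (word >> 26) & 0x7u;
    return low5 | (high3 << 5);
}
```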

Register Pressure Tracking

Scheduling Phase Pressure Counters

The scheduler maintains 10 per-block register pressure counters at offsets +4 through +40 of the per-BB scheduling record (72 bytes per basic block). At BB entry, these are copied into the scheduler context at context offsets +48 through +87. The counters track live register counts for each register class:

| BB record offset | Context offset (idx) | Register class |
| --- | --- | --- |
| +4 | +48 (idx 12) | R (general-purpose) |
| +8 | +52 (idx 13) | P (predicate) |
| +12 | +56 (idx 14) | UR (uniform) |
| +16 | +60 (idx 15) | UP (uniform predicate) |
| +20 | +64 (idx 16) | B (barrier) |
| +24 | +68 (idx 17) | (arch-specific class 0) |
| +28 | +72 (idx 18) | (arch-specific class 1) |
| +32 | +76 (idx 19) | (arch-specific class 2) |
| +36 | +80 (idx 20) | (arch-specific class 3) |
| +40 | +84 (idx 21) | (arch-specific class 4 / control total) |
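
The copy at BB entry amounts to moving ten 32-bit counters between two byte offsets. The following sketch models both records as raw byte buffers; the function and constant names are hypothetical, and treating the counters as int32_t is an assumption based on the 4-byte spacing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum {
    NUM_PRESSURE_COUNTERS = 10,
    BB_COUNTERS_OFFSET    = 4,    /* first counter in the per-BB record */
    CTX_COUNTERS_OFFSET   = 48    /* first counter in the scheduler context */
};

/* Ten 4-byte counters starting at +48 end at byte +87, as stated above. */
static_assert(CTX_COUNTERS_OFFSET + NUM_PRESSURE_COUNTERS * 4 - 1 == 87,
              "context counters span +48..+87");

/* Copy the per-block counters into the scheduler context at BB entry. */
void load_block_pressure(unsigned char *ctx, const unsigned char *bb_record)
{
    memcpy(ctx + CTX_COUNTERS_OFFSET, bb_record + BB_COUNTERS_OFFSET,
           NUM_PRESSURE_COUNTERS * sizeof(int32_t));
}

/* DWORD index of a context byte offset, e.g. +48 is idx 12. */
size_t context_dword_index(size_t byte_offset)
{
    return byte_offset / sizeof(int32_t);
}
```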

The spill cost analyzer (sub_682490, 14 KB) allocates two stack arrays (v94[511] and v95[538]) as per-register-class pressure delta arrays. For each instruction, it computes pressure increments and decrements based on the instruction's register operand definitions and uses.

The register pressure coefficient is controlled by knob 740 (double, default 0.045). The pressure curve function uses a piecewise linear model with parameters (4, 2, 6) via sub_8CE520.

Liveness Bitvectors

The Code Object maintains register liveness as bitvectors:

| Offset | Bitvector | Description |
| --- | --- | --- |
| +832 | Main register liveness | One bit per virtual register; tracks which registers are live at the current program point |
| +856 | Uniform register liveness | Separate bitvector for UR/UP registers |

These bitvectors are allocated via sub_BDBAD0 (bitvector allocation, with size = register count + 1 bits) and manipulated via the SSE2-optimized bitvector primitives at sub_BDBA60 / sub_BDC180 / sub_BDCDE0 / sub_BDC300.

For each basic block during dependency graph construction (sub_A0D800, 39 KB), the per-block liveness is computed by iterating instructions and checking operand types (((v >> 28) & 7) == 1 for register operands), then updating the bitvector at +832 with set/clear operations.
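
A scalar version of that update loop might look as follows. This is only a sketch: the real primitives are SSE2-optimized, the helper names here are hypothetical, and the operands are passed as a plain array of packed 32-bit words rather than read from the instruction at +84.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

void bv_set(uint32_t *words, size_t bit)
{
    words[bit / 32] |= UINT32_C(1) << (bit % 32);
}

void bv_clear(uint32_t *words, size_t bit)
{
    words[bit / 32] &= ~(UINT32_C(1) << (bit % 32));
}

int bv_test(const uint32_t *words, size_t bit)
{
    return (int)((words[bit / 32] >> (bit % 32)) & 1u);
}

/* Set the liveness bit of every register operand (type 1) in `operands`. */
void mark_register_operands(uint32_t *live, size_t nbits,
                            const uint32_t *operands, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        uint32_t v = operands[i];
        if (((v >> 28) & 7u) != 1u)
            continue;                   /* not a register operand */
        size_t reg = v & 0xFFFFFu;      /* bits 0-19: register index */
        assert(reg < nbits);
        bv_set(live, reg);
    }
}
```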

Allocator Pressure Arrays

The fat-point allocator (sub_957160) uses two 512-DWORD (2048-byte) arrays per allocation round:

| Array | Role |
| --- | --- |
| Primary (v12[512]) | Per-physical-register interference count |
| Secondary (v225[512]) | Tie-breaking cost metric |

Both are zeroed with SSE2 vectorized _mm_store_si128 loops at the start of each round. For each VR being allocated, the pressure builder (sub_957020) walks the VR's constraint list and increments the corresponding physical register slots. The threshold (knob 684, default 50) filters out congested slots.
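
One plausible way to consume the two arrays is sketched below. The selection rule (lowest primary count, secondary cost as tie-breaker, slots above the threshold rejected) is an assumption inferred from the roles listed above, not a transcription of sub_957160, and the function name pick_slot is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { NUM_SLOTS = 512 };

static_assert(NUM_SLOTS * sizeof(uint32_t) == 2048, "512 DWORDs = 2048 bytes");

/* Return the best physical slot below `limit`, or -1 if every candidate
 * exceeds `threshold` (knob 684 defaults to 50 according to the text). */
int pick_slot(const uint32_t primary[NUM_SLOTS],
              const uint32_t secondary[NUM_SLOTS],
              unsigned limit, uint32_t threshold)
{
    int best = -1;
    for (unsigned s = 0; s < limit && s < NUM_SLOTS; ++s) {
        if (primary[s] > threshold)
            continue;                            /* congested slot */
        if (best < 0 ||
            primary[s] < primary[best] ||
            (primary[s] == primary[best] && secondary[s] < secondary[best]))
            best = (int)s;
    }
    return best;
}
```

When two slots have the same interference count, the one with the lower secondary cost wins; if all slots are over the threshold the caller would have to fall back to spilling or retrying.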

ABI Register Reservations

Reserved Registers

Registers R0--R3 are unconditionally reserved by the ABI across all SM generations. The diagnostic "Registers 0-3 are reserved by ABI and cannot be used for %s" fires if they are targeted by parameter assignment or user directives.

Minimum Register Counts by SM Generation

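| SM generation | Value | SM targets | Minimum registers |
| --- | --- | --- | --- |
| 3 | (sm_target+372) >> 12 == 3 | sm_35, sm_37 | (no minimum) |
| 4 | == 4 | sm_50 -- sm_53 | 16 |
| 5 | == 5 | sm_60 -- sm_89 | 16 |
| 9 | == 9 | sm_90, sm_90a | 24 |
| >9 | > 9 | sm_100+ | 24 |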

Violating the minimum emits warning 7016: "regcount %d specified below abi_minimum of %d".
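
The table translates into a short lookup. In this sketch the function names are hypothetical, 0 stands for "no minimum", and -1 is returned for generation values the table does not list, since their treatment is not documented here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Generation number as derived from the field at sm_target+372. */
unsigned sm_generation(uint32_t field_372)
{
    return field_372 >> 12;
}

/* ABI minimum register count; 0 = no minimum, -1 = not listed. */
int abi_min_regs(unsigned generation)
{
    if (generation == 3)
        return 0;
    if (generation == 4 || generation == 5)
        return 16;
    if (generation >= 9)
        return 24;
    return -1;
}

/* True when a requested regcount would trigger warning 7016. */
bool regcount_below_abi_minimum(int regcount, unsigned generation)
{
    int minimum = abi_min_regs(generation);
    return minimum > 0 && regcount < minimum;
}

/* Self-check against the table rows. */
void check_abi_minimums(void)
{
    assert(abi_min_regs(5) == 16);
    assert(abi_min_regs(9) == 24);
}
```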

Per-Class Hardware Limits

| Class | Limit | Notes |
| --- | --- | --- |
| R | 255 | R0--R254 usable; controlled by --maxrregcount and --register-usage-level (0--10) |
| UR | 63 | UR0--UR62 usable; sm_75+ only |
| P | 7 | P0--P6 usable |
| UP | 7 | UP0--UP6 usable; sm_75+ only |
| B | 16 | B0--B15 |
| UB | 16 | UB0--UB15 |

The --maxrregcount CLI option sets a per-function hard ceiling for R registers. The --register-usage-level option (0--10, default 5) modulates the register allocation target: level 0 means no restriction, level 10 means minimize register usage as aggressively as possible. The per-class budget at alloc + 32*class + 884 reflects the interaction between the CLI limit and the optimization level.

The --device-function-maxrregcount option overrides the kernel-level limit for device functions when compiling with -c.

Dynamic Register Allocation (setmaxnreg)

sm_90+ (Hopper and later) supports dynamic register allocation through the setmaxnreg.inc and setmaxnreg.dec instructions, which raise or lower the per-thread register count at runtime. ptxas tracks these as the internal states setmaxreg.try_alloc, setmaxreg.alloc, and setmaxreg.dealloc. Multiple diagnostics guard correct usage:

  • "setmaxnreg.dec has register count (%d) which is larger than the largest temporal register count in the program (%d)"
  • "setmaxreg.dealloc/release has register count (%d) less than launch min target (%d) allowed"
  • "Potential Performance Loss: 'setmaxnreg' ignored to maintain minimum register requirements."

Pair Modes and Coalescing

The pair mode at vreg+48 bits 20--21 controls how the allocator handles wide registers:

| Pair mode | Value | Behavior |
| --- | --- | --- |
| Single | 0 | Occupies one physical register slot |
| Lo-half | 1 | Low half of a register pair |
| Double-width | 3 | Occupies two consecutive physical slots |

The allocator computes register consumption via sub_939CE0:

consumption = slot + (1 << (pair_mode == 3)) - 1;
// single:  slot + 0  = slot (1 slot)
// double:  slot + 1  = slot+1 (2 slots)

The coalescing pass (sub_9B1200, 800 lines) eliminates copy instructions by merging the source and destination VRs into the same physical register. The alias chain at vreg+36 (coalesced parent) is followed during assignment (sub_94FDD0) to propagate the physical register through all aliased VRs:

alias = vreg->alias_parent;     // vreg+36
while (alias != NULL) {
    alias->physical_reg = slot;  // alias+68
    alias = alias->alias_parent; // alias+36
}

Register Name Table

The register class name table at off_21D2400 is a pointer array indexed by the register file type enum (from vreg+64). Each entry points to a string: "R", "UR", "P", "UP", "B", "UB", etc. This table is used by diagnostic functions:

  • sub_A4B9F0 (StatsEmitter::emitUndefinedRegWarning): "Referencing undefined register: %s%d" where %s is off_21D2400[*(vreg+64)] and %d is *(vreg+68) (physical register number).
  • sub_A60B60 (RegisterStatCollector::collectStats, 24 KB): Enumerates ~25 register sub-classes by iterating vtable getters, one per register class. The enumerated classes include R, P, B, UR, UP, UB, SRZ, PT, RZ, and others.
  • "Fatpoint count for entry %s for regclass %s : %d": Prints per-function per-class allocation statistics.

Key Functions

| Address | Size | Function | Description |
| --- | --- | --- | --- |
| sub_91BF30 | 99 lines | createVirtualRegister | Allocates 160-byte VR descriptor, initializes fields, appends to register file array |
| sub_9446D0 | 29 lines | shouldSkipRegister | Returns true for indices 41--44, 39 (architectural specials); checks CSSA phi and exclusion set |
| sub_A4B8F0 | 248B | emitInstrRegStats | Emits "instr/R-regs: %d instructions, %d R-regs" |
| sub_A4B9F0 | 774B | emitUndefinedRegWarning | Walks operands backward, formats "Referencing undefined register: %s%d" |
| sub_A60B60 | 4560B | collectRegisterStats | Enumerates ~25 register sub-classes via vtable getters |
| sub_7BC030 | 814B | encodeRegOperand | Packs register into SASS instruction: 1-bit presence + 4-bit type + 10-bit number |
| sub_7BCF00 | 856B | encodePredOperand | Packs predicate into SASS: 2-bit type + 3-bit condition + 8-bit value |
| sub_9B3C20 | -- | decodeRegOperand | Decoder helper: extracts register, maps 255 to 1023 (RZ) |
| sub_9B3D60 | -- | decodePredOperand | Decoder helper: extracts predicate, maps 7 to 31 (PT) |
| sub_1B6B250 | 2965B | regClassToHardware | Maps (class, sub_index) to hardware number: class * 32 + sub_index |
| sub_1B73060 | 19B | regClassToHardwareGuard | Guard wrapper: returns 0 for no-register case |
| sub_1B72F60 | 32B | writeRegField | Packs encoded register into instruction word bits [13:9] and [28:26] |
| sub_112CDA0 | 8.9KB | encodeRegisterPair | Maps 40 register pair combinations to 5-bit packed encoding values |
| sub_939CE0 | 23 lines | computeConsumption | Pair-aware register slot consumption counter |
| sub_94FDD0 | 155 lines | assignRegister | Commits physical register assignment, propagates through alias chain |
| sub_A0D800 | 39KB | buildDependencyGraph | Per-block dependency graph with register-to-instruction mapping |
| sub_A06A60 | 15KB | scheduleWithPressure | Per-block scheduling loop tracking live register set bitvector |
| sub_682490 | 14KB | computeRegPressureDeltas | Per-instruction register pressure delta computation |
| sub_B28E00 | -- | getRegClass | Returns register class (1023 = wildcard, 1 = GPR) |
| sub_B28E10 | -- | isRegOperand | Predicate: is this a register operand? |
| sub_B28E20 | -- | isPredOperand | Predicate: is this a predicate operand? |
| sub_B28E90 | -- | isUReg | Predicate: is this a uniform register? |

Opcode Register Class Table

Every Ori opcode carries an implicit register class contract: which register files its operands may reference, what data widths are valid, and which addressing modes apply. The function sub_6575D0 (49 KB, buildEncodingDescriptor) is the central dispatch that translates each instruction's opcode into a packed encoding descriptor consumed by the SASS encoder.

Function Signature

// sub_6575D0 -- buildEncodingDescriptor
// a1 = compiler context
// a2 = Ori instruction node pointer
// a3 = output: 4-DWORD packed encoding descriptor
char buildEncodingDescriptor(Context *a1, Instruction *a2, uint32_t *a3);

Architecture

The function is a two-level dispatch:

  1. Outer switch on the Ori opcode at *(instr->info + 8) -- 168 unique case values spanning opcodes 3 (IADD3) through 0xF5 (PIXLD).

  2. Inner encoding per opcode (or group): assigns an encoding category ID to a3[0], then calls the bitfield packers to fill a3[1..2] with register class attributes.

Two helper functions pack fields into the descriptor:

| Function | Role | Call count | Field ID range |
| --- | --- | --- | --- |
| sub_917A60 (packRegClassField) | Bitfield encoder -- field IDs 91--340 map to specific bit positions in a3[1] and a3[2] | 112 | 91--340 |
| sub_A2FF00 (packOperandField) | Alternate encoder for operand-level slots (data type, memory space) | 28 | 3--71 |

Encoding Category Assignment

The encoding category at a3[0] selects which SASS instruction format template the downstream per-SM encoder uses. Key mappings (opcode index to encoding category):

| Opcode(s) | SASS mnemonic | Category | Register class summary |
| --- | --- | --- | --- |
| 3 | IADD3 | 489 | R dest, R/UR sources, P carry |
| 4 | BMSK | 106 | R only |
| 5--6 | SGXT / LOP3 | 490--491 | R dest, R/UR sources |
| 7 | ISETP | 59 | P dest, R/UR sources + memory ordering fields |
| 8 | IABS | 60 | R dest, R source + memory ordering fields |
| 0x0E--0x10 | FSET/FSEL/FSETP | 510 | R/P dest, FP operation variant |
| 0x11/0x12/0x18 | FSETP/MOV/PRMT | 517 | FP comparison, combine, data width (IDs 288--299) |
| 0x15--0x16 | P2R/R2P | 524--525 | P-to-R or R-to-P conversion |
| 0x19 | VOTE | 526 | R dest, optional memory class |
| 0x1A | CS2R variant | 527 | UR source width (494--496), data type from a2+92 |
| 0x1B | CS2R_32 | 497 | Source width (494/495/496), predicate flag (ID 270) |
| 0x1E | IPA | 494 | Interpolation mode (440--442), flat/smooth (443/444) |
| 0x1F | MUFU | 501 | Subfunction (445--447), precision (450--459) |
| 0x20 | SHF | 502 | Direction (461--463), source class (464--466), clamp, data type |
| 0x21 | SHFL | 503 | Mode (470/471), operand classes (472--482) |
| 0x22--0x23 | I2I/I2IP | 55/56 | Integer conversion type (23 entries in dword_2026B20) |
| 0x28--0x2A | IPA/MUFU ext | 512 | Extended encoding variants (428--430) |
| 0x2B--0x2C | F2F/F2F_X | 513 | Conversion direction (432/433), saturation (434/435) |
| 0x2D | FRND | 516 | Rounding variant (526), mode (528/529) |
| 0x51--0x53 | AL2P, AL2P_IDX | 437--438 | Bindless flag (ID 148), predicate (ID 147) |
| 0x54--0x56 | BMOV_B/BMOV_R/BMOV | 423--424 | B-register class |
| 0x64--0x67 | SETLMEMBASE/ATOM | 156/463 | Atom-vs-red (ID 178), data width (ID 181) |
| 0x68 | BRX | 468 | Target (ID 190), call convention (IDs 191--192) |
| 0x6A/0x6C/0x6D | JMP/JMX/CALL | 469 | Control flow target class (ID 176) |
| 0x77--0x79 | BSSY/BREAK/BSYNC | 528--530 | Sync mode (ID 324), variant (ID 325) |
| 0x82 | NANOTRAP | 487 | Trap operation class (ID 257), has-source (ID 256) |
| 0x9E--0x9F | Hopper+ instrs | 535--536 | Hopper class A/B (IDs 337--338) |
| 0xAF--0xB2 | LD/ST variants | 431--446 | Full modifier set: uniform (91), pair (92--102) |
| 0xB8--0xBE | LDG/STG/LDL/STL | 449--456 | Cache policy (131), float mode (134), width (131) |
| 0xC1 | Conditional | 10/13 | Branch type (ID 167), divergent (ID 168) |
| 0xC8 | PRMT | 24 | Permute selector (ID 65/66) |
| 0xC9--0xD3 | Texture/surface | 61/455 | Texture data type (IDs 17/18), surface (IDs 19--22) |
| 0xD6--0xD7 | DMMA/CVTA | 515 | Direction (304), predicate (305), data type (306) |
| 0xDA--0xDB | SUATOM | 521/533 | Data width (326--331), sync mode (328) |
| 0xDC | SURED | 534 | Data width (331), type (335--336), sync (333) |
| 0xE0 | WGMMA | 500 | Data type (198), enable (199), barrier (201) |
| 0xF5 | PIXLD | 532 | Mode from dword_2026AA0 (ID 323) |

Extended Opcode Path (Memory/Atomic Sub-dispatch)

When the opcode falls in the 0xF6--0x10C range (memory/atomic extended instructions), a separate sub-dispatch applies. The function sub_44AC80 gates entry; sub_44AC60 and sub_44AC70 select among three encoding categories:

| Category | Gate function | Meaning |
| --- | --- | --- |
| 441 | default | Base memory operation |
| 442 | sub_44AC60 true | Predicated memory variant |
| 443 | sub_44AC70 true | Extended memory variant |

Within each category, the sub-opcode selects register class fields:

| Sub-opcode | Register class (field 115) | Data width (field 113) |
| --- | --- | --- |
| 0xF6/0xFF/0x106 | 69 (class A) | 60 (standard) |
| 0xF7/0x100/0x107 | 71 (class B) | 60 (standard) |
| 0xF8/0x102/0x109 | 0 (default) | 63 (wide) |
| 0xF9/0x103/0x10A | 0 (default) | 61 (narrow) |
| 0xFA/0x104/0x10B | 0 (default) | 62 (medium) |
| 0xFB | 0 (default) | 65 (type A) |
| 0xFC | 0 (default) | 66 (type B) |
| 0xFD | 0 (default) | 68 (type C) |
| 0xFE/0x105/0x10C | 0 (from table) | 64 (from dword_2026C30) |
| 0x101/0x108 | 72 (class C) | 60 (standard) |

Packed Descriptor Layout

The output descriptor a3 is a 4-DWORD (16-byte) structure:

| DWORD | Content |
| --- | --- |
| a3[0] | Encoding category ID (0--542) -- selects SASS format template |
| a3[1] | Packed bitfield: memory space (bits 0--3), address type (bits 4--7) |
| a3[2] | Packed bitfield: register class attributes (data width, type, modifiers) |
| a3[3] | Auxiliary flags (bit 1 = texture scope, bit 29 = special) |
| a3[4] | Operand count override (set to 12 for KILL/extended mem ops) |

Register Class Field Groups

The 112 calls to packRegClassField (sub_917A60) use field IDs organized into functional groups. Each field ID maps to a specific bit range in the output descriptor via a mask-and-OR encoding:

// Example: field 113 (data width) -- bits 7-9 of a3[2]
case 113:
    val = dword_21DEB20[a3_value - 61];  // 8-entry lookup
    a3[2] = (val << 7) | (a3[2] & 0xFFFFF87F);
    break;

// Example: field 91 (uniform flag) -- bit 16 of a3[2]
case 91:
    a3[2] = ((value == 1) << 16) | (a3[2] & 0xFFFEFFFF);
    break;

| Field group | IDs | Bits written | Purpose |
| --- | --- | --- | --- |
| Core class | 91--102 | a3[2] bits 5--22 | Uniform, pair, predicate, data type, saturate, negate, abs, complement |
| Data width | 113--117 | a3[2] bits 0--9 | Width code, uniform-mem, source regclass, type specifier, write-back |
| Load/store | 118--134 | a3[1] + a3[2] | Memory space, address type, cache policy, atomic op, scope, float mode |
| Texture/surface | 135--165 | a3[2] bits 1--31 | Texture type, dimension, LOD mode, ordering, acquire, scope hint |
| Control flow | 167--202 | a3[2] bits 1--6 | Branch type, divergent, WGMMA data type/enable/barrier |
| FP/conversion | 230--264 | a3[2] various | FP operation, comparison, combine, interpolation, MUFU, SHF, SHFL |
| Extended | 269--299 | a3[2] various | CS2R, FSETP, rounding, data type wide, destination regclass |
| Hopper/Blackwell | 304--340 | a3[2] various | DMMA, WGMMA, TMA hints, surface sync, Hopper-specific classes |
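
Every case in the packer follows the same mask-and-OR shape, which can be expressed as one generic helper. The sketch below is a restatement for illustration, not code from sub_917A60: set_bits and pack_uniform_flag are hypothetical names, and the helper works on a copy of the DWORD rather than on a3 in place.

```c
#include <assert.h>
#include <stdint.h>

/* Replace the `width`-bit field at `shift` in `word` with `value`. */
uint32_t set_bits(uint32_t word, unsigned shift, unsigned width, uint32_t value)
{
    assert(width >= 1 && width < 32 && shift + width <= 32);
    uint32_t mask = ((UINT32_C(1) << width) - 1u) << shift;
    return (word & ~mask) | ((value << shift) & mask);
}

/* Field 91 (uniform flag): bit 16 of a3[2], set only when value == 1. */
uint32_t pack_uniform_flag(uint32_t a3_2, int value)
{
    return set_bits(a3_2, 16, 1, value == 1);
}
```

With this helper, clearing a field is the same as writing zero into it. For instance, set_bits(0xFFFFFFFF, 7, 4, 0) yields 0xFFFFF87F, which is the mask used in the field 113 case; that mask therefore clears four bits (7 through 10).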

Sub-handler Functions

Complex opcode families delegate register class encoding to dedicated sub-functions:

| Function | Opcodes handled | Purpose |
| --- | --- | --- |
| sub_650390 | TEX, TLD, texture family | Texture register class (sampler, coordinate, LOD) |
| sub_650220 | LDG, STG, LD, ST, ATOM, RED | Memory instruction register class |
| sub_651330 | FMUL (opcode 0x0D) | FP multiply register class |
| sub_650920 | LEA, special (0x09, 0x72, 0x74, 0x7A, 0x80, 0x81) | LEA / special instruction |
| sub_650A90 | I2I, F2F, conversions (0x24--0x27, 0xE2--0xEB) | Type conversion register class |
| sub_652190 | Branch/call (0x13, 0x14, 0x17) | Branch/call register class |
| sub_653B90 | Misc (0x0C) | Miscellaneous instruction |
| sub_650C80 | Memory barrier modifiers | Applied when (a2+56) & 0x4F0 is nonzero |
| sub_651A90 | Texture modifiers (0x83) | Applied before texture encoding |
| sub_62D5D0 | Memory space computation | Computes memory space tag from operand types |

Lookup Tables

The function references 28 static lookup tables that map instruction attribute values to register class encoding values:

| Table | Size | Used by field(s) | Content |
| --- | --- | --- | --- |
| dword_21DEB80 | 5 | 94 | Data type encoding |
| dword_21DEB50 | 3 | 107, 115, 145, 157, 165 | 3-value encoding (reused across 5 fields) |
| dword_21DEB20 | 8 | 113 | Data width code |
| dword_21DEB00 | 7 | 116, 126, 131, 170 | Type encoding (reused across 4 fields) |
| dword_21DEAE0 | 5 | 119/123, 136, 143, 159 | Variant table (reused across 4 fields) |
| dword_21DEAA0 | 13 | 120 | Memory space code |
| dword_21DEA60 | 10 | 121, 135/151 | Address/texture type |
| dword_21DEA20 | 15 | 124/125 | Reduction type |
| dword_21DE9F0 | 6 | 129/130, 150 | Scope code |
| dword_2026C30 | 6 | 116 (ext path) | Sub-opcode to data type |
| dword_2026C80 | 20 | 165 (surface) | Surface operation codes |
| dword_2026E20 | 17 | 286 | Data type (wide) |
| dword_2026AC0 | 16 | 198 | WGMMA data type |
| dword_2026B20 | 23 | I2I conversion | Integer conversion type |