Register Model (R / UR / P / UP)

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

ptxas models four hardware register files plus two auxiliary barrier register files. Every Ori instruction references registers from one or more of these files. During the optimization phases (0--158), registers carry virtual numbers; the fat-point register allocator (phase 159+) maps them to physical hardware slots. This page documents the register files, the virtual/physical register descriptor, the 7 allocator register classes, wide register conventions, special registers, the operand encoding format, pressure tracking, and SM-specific limits.

Four Register Files

File	Name	Width	Usable range	Zero/True register	ABI type	Introduced
R	General-purpose	32 bits	R0 -- R254	RZ (R255)	2	sm_30
UR	Uniform	32 bits	UR0 -- UR62	URZ (UR63)	3	sm_75
P	Predicate	1 bit	P0 -- P6	PT (P7)	5	sm_30
UP	Uniform predicate	1 bit	UP0 -- UP6	UPT (UP7)	--	sm_75

R registers are per-thread 32-bit general-purpose registers. They hold integers, floating-point values, and addresses. 64-bit values occupy consecutive even/odd pairs (R4:R5); 128-bit values occupy aligned quads (R0:R1:R2:R3). The total R-register count for a function is field[159] + field[102] (reserved + allocated), stored in the Code Object at offsets +159 and +102. Maximum usable: 255 registers (R0--R254). R255 is the hardware zero register RZ -- reads return 0, writes are discarded.

UR registers (uniform general-purpose) are warp-uniform: every thread in a warp sees the same value. Available on sm_75 and later. Range: UR0--UR62 usable, UR63 is the uniform zero register URZ. The UR count is at Code Object +99. Attempting to use UR on pre-sm_75 targets triggers the diagnostic "Uniform registers were disallowed, but the compiler required (%d) uniform registers for correct code generation.".

P registers are 1-bit predicates used for conditional execution (@P0 FADD ...) and branch conditions. P0--P6 are usable; P7 is the hardwired always-true predicate PT. Writes to PT are discarded. The assembler uses PT as the default predicate for unconditional instructions. In the allocator, predicate registers support half-width packing: two virtual predicates can be packed into one physical predicate slot, with the hi/lo distinction stored in bit 23 (0x800000) of the virtual register flags.

UP registers are the uniform predicate variant. UP0--UP6 are usable; UP7 is UPT (always-true). Available on sm_75+.
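
The usable ranges above reduce to one comparison against the index of each file's hardwired register. A minimal C sketch of that check follows; the enum and function names are hypothetical and are not ptxas identifiers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for illustration; not taken from ptxas. */
typedef enum { FILE_R, FILE_UR, FILE_P, FILE_UP } reg_file;

/* Index of the hardwired register in each file: RZ, URZ, PT, UPT. */
int hardwired_index(reg_file f) {
    switch (f) {
    case FILE_R:  return 255;
    case FILE_UR: return 63;
    case FILE_P:  return 7;
    case FILE_UP: return 7;
    }
    assert(!"unknown register file");
    return -1;
}

/* True if `index` names an allocatable register of file `f`. */
bool is_usable(reg_file f, int index) {
    return index >= 0 && index < hardwired_index(f);
}
```

Under these assumptions `is_usable(FILE_R, 254)` holds while `is_usable(FILE_R, 255)` does not, because index 255 is RZ.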

Seven Allocator Register Classes

The fat-point allocator processes 7 register classes, indexed by the reg_type field at vreg+64. Class 0 is the cross-class constraint propagation channel and is skipped in the main allocation loop. Classes 1--6 are allocated independently, in order. The allocator distribution loop in sub_9721C0 (lines 520--549) reads *(int*)(vreg+64) and uses it directly as the class bucket index, guarded by reg_type <= 6:

Class ID	Name	Width	HW limit	Description
0	(unified)	--	--	Cross-class constraint propagation (skipped)
1	R	32-bit	255	General-purpose registers (R0--R254)
2	R (alt)	32-bit	255	GPR variant (used for RZ sentinel, stat collector alternate)
3	UR	32-bit	63	Uniform general-purpose (UR0--UR62)
4	UR (ext)	32-bit	63	Uniform GPR variant (triggers flag update at +1369 in constructor)
5	P / UP	1-bit	7	Predicate registers (P0--P6, UP0--UP6)
6	Tensor/Acc	32-bit	varies	Tensor/accumulator registers for MMA/WGMMA operations

The class ID is the reg_type value stored at vreg+64. The allocator class at vreg+12 is a separate field used for instruction-level classification, not for the per-class allocation passes. The allocator's per-class linked lists at alloc[3*reg_type + 138] are populated directly from vreg+64.

Per-class state is initialized via the target descriptor vtable call vtable[896](alloc_state, class_id), which populates per-class register file descriptors at alloc[114..156] (four 8-byte entries per class).
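
The bucket selection described above can be sketched as follows. This is an illustration only: the function names are hypothetical, and the constants simply restate the reg_type <= 6 guard and the alloc[3*reg_type + 138] indexing from the text.

```c
#include <assert.h>

enum { MAX_ALLOC_CLASS = 6, LIST_BASE = 138, LIST_STRIDE = 3 };

/* Returns the alloc[] index of the per-class list head, or -1 when the
   reg_type lies above the allocator cutoff (for example barrier type 9). */
int class_list_slot(int reg_type) {
    assert(reg_type >= 0);
    if (reg_type > MAX_ALLOC_CLASS)
        return -1;
    return LIST_STRIDE * reg_type + LIST_BASE;
}

/* Class 0 gets a bucket but is skipped by the main allocation loop. */
int is_allocated_class(int reg_type) {
    return reg_type >= 1 && reg_type <= MAX_ALLOC_CLASS;
}
```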

Barrier Registers

Barrier registers (B and UB) are a distinct register file used by the BAR, DEPBAR, BSSY, and BSYNC instructions for warp-level and CTA-level synchronization. B0--B15 are the non-uniform barrier registers; UB0--UB15 are the uniform variant. Barrier registers have reg_type = 9, which is above the <= 6 cutoff for the main allocator class buckets. They are handled by a separate allocation mechanism outside the 7-class system.

Tensor/Accumulator Registers (Class 6)

Class 6 registers are created during intrinsic lowering of tensor core operations (MMA, WGMMA, HMMA, DMMA). Over 30 intrinsic lowering functions in the 0x6B--0x6D address range call sub_91BF30(ptr, ctx, 6) to create these registers. The GMMA pipeline pass (sub_ADA740, sub_69E590) identifies accumulator operands by checking *(vreg+64) == 6. The accumulator counting function at sub_78C6B0 uses the pair-mode bits at vreg+48 (bits 20--21) to determine whether a type-6 register consumes 1 or 2 physical R slots.

Virtual Register Descriptor

Every virtual register in a function is represented by a 160-byte descriptor allocated from the per-function arena. The register file array is at Code Object +88, indexed as *(ctx+88) + 8*regId. The descriptor is created by sub_91BF30 (register creation function).

Descriptor Layout

Offset	Size	Type	Field	Notes
+0	8	`ptr`	`next`	Linked list pointer (allocation worklist)
+8	4	`i32`	`id`	Unique register ID within function
+12	4	`i32`	`class_index`	Allocator register class (0--6)
+20	1	`u8`	`flags_byte`	Bit 0x20 = live
+24	4	`i32`	`bb_index`	Basic block of definition
+28	4	`i32`	`epoch`	Epoch counter for liveness tracking
+32	8	`ptr`	`alias_next`	Next aliased register (coalescing chain)
+36	8	`ptr`	`alias_parent`	Coalesced parent pointer
+40	4	`f32`	`spill_cost`	Accumulated spill cost
+48	8	`u64`	`flags`	Multi-purpose flag word (see below)
+56	8	`ptr`	`def_instr`	Defining instruction pointer
+64	4	`i32`	`reg_type`	Register file type enum
+68	4	`i32`	`physical_reg`	Physical register number (-1 = unassigned)
+72	1	`u8`	`size`	0 = scalar, nonzero = encoded width
+76	4	`f32`	`secondary_cost`	Secondary spill cost
+80	4	`i32`	`spill_flag`	0 = not spilled, 1 = spilled
+97	2	`u16`	`reserved`
+104	8	`ptr`	`use_chain`	Use chain head (instruction pointer)
+112	8	`ptr`	`def_chain`	Definition chain
+120	8	`ptr`	`regfile_next`	Next in register file linked list
+128	8	`ptr`	`linked_next`	Next in linked-register chain
+136	8	`ptr`	`reserved2`
+144	8	`ptr`	`constraint_list`	Constraint list head for allocator
+152	8	`ptr`	`reserved3`

Initial values set by the constructor (sub_91BF30):

vreg->next           = NULL;            // +0
vreg->id             = ctx->reg_count + 1;  // +8, auto-incrementing
vreg->class_index    = 0;               // +12
vreg->flags_byte     = 0;               // +20
vreg->alias_parent   = (ptr)-1;         // +20..27 (qword write)
vreg->physical_reg   = -1;              // +68 (unassigned)
vreg->reg_type       = a3;              // +64 (passed as argument)
vreg->size           = 0;               // +72
vreg->spill_flag     = 0;               // +80
vreg->use_chain      = NULL;            // +104
vreg->def_chain      = NULL;            // +112
vreg->constraint_list = NULL;           // +144

For predicate types (a3 == 2 or a3 == 3), the flags word at +48 is initialized to 0x1000 (4096). For all other types, it is initialized to 0x1018 (4120). If the type is 7 (alternate predicate classification), the physical register is initialized to 0 instead of -1.

Flag Bits at +48

Bit	Mask	Meaning
9	`0x200`	Pre-assigned / fixed register
10	`0x400`	Coalesced source
11	`0x800`	Coalesced target
12	`0x1000`	Base flag (set for all types)
14	`0x4000`	Spill marker (already spilled)
18	`0x40000`	Needs-spill (allocator sets when over budget)
20--21	(pair mode)	0 = single, 1 = lo-half of pair, 3 = double-width
22	`0x400000`	Constrained to architecture limit
23	`0x800000`	Hi-half of pair (predicate half-width packing)
27	`0x8000000`	Special handling flag
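
As a reading aid, the flag word can be decoded with a few masks. The sketch below covers only some of the bits from the table; the macro and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Masks copied from the flag table; the names are illustrative. */
#define VREG_FIXED        UINT64_C(0x200)     /* bit 9  */
#define VREG_SPILLED      UINT64_C(0x4000)    /* bit 14 */
#define VREG_NEEDS_SPILL  UINT64_C(0x40000)   /* bit 18 */
#define VREG_HI_HALF      UINT64_C(0x800000)  /* bit 23 */

/* Pair mode occupies bits 20-21: 0 = single, 1 = lo-half, 3 = double-width. */
unsigned pair_mode(uint64_t flags) {
    return (unsigned)((flags >> 20) & 3);
}

bool is_double_width(uint64_t flags) { return pair_mode(flags) == 3; }
bool needs_spill(uint64_t flags)     { return (flags & VREG_NEEDS_SPILL) != 0; }
bool is_hi_half(uint64_t flags)      { return (flags & VREG_HI_HALF) != 0; }
```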

Register File Type Enum (at +64)

This enum determines the register file a VR belongs to. It is used by the register class name table at off_21D2400 to map type values to printable strings ("R", "UR", "P", etc.) for diagnostic output such as "Referencing undefined register: %s%d".

Value	File	Alloc class	Description
1	R	1	General-purpose register (32-bit)
2	R (alt)	2	GPR variant (RZ sentinel in `sub_7D82E0`, stat collector alternate)
3	UR	3	Uniform register (32-bit)
4	UR (ext)	4	Uniform GPR variant (triggers flag update at +1369 in constructor)
5	P / UP	5	Predicate register (1-bit); covers both P and UP
6	Tensor/Acc	6	Tensor/accumulator register for MMA/WGMMA operations
7	P (alt)	--	Predicate variant (physical = 0 at init); above allocator cutoff
8	--	--	Extended type (created by `sub_83EF00`); above allocator cutoff
9	B / UB	--	Barrier register; above allocator cutoff, separate allocation
10	R2	--	Extended register pair (64-bit, two consecutive R regs)
11	R4	--	Extended register quad (128-bit, four consecutive R regs)

Values 0--6 are within the allocator's class system (the distribution loop in sub_9721C0 guards with reg_type <= 6). Values 7+ are handled by separate mechanisms. The off_21D2400 name table is indexed by reg_type and provides display strings for diagnostic output.

The stat collector at sub_A60B60 (24 KB) enumerates approximately 25 register sub-classes including R, P, B, UR, UP, UB, Tensor/Acc, SRZ, PT, RZ, and others by iterating vtable getter functions per register class.

Wide Registers

NVIDIA GPUs have only 32-bit physical registers. Wider values are composed from consecutive registers.

64-Bit Pairs (R2)

A 64-bit value occupies two consecutive registers where the base register has an even index: R0:R1, R2:R3, R4:R5, and so on. The low 32 bits reside in the even register; the high 32 bits in the odd register. In the Ori IR, a 64-bit pair is represented by a single virtual register with:

vreg+64 (type) = 10 (extended pair)
vreg+48 bits 20--21 (pair mode) = 3 (double-width)

The allocator selects even-numbered physical slots by scanning with stride 2 instead of 1. The register consumption function (sub_939CE0) computes slot + (1 << (pair_mode == 3)) - 1, consuming two physical slots.

128-Bit Quads (R4)

A 128-bit value occupies four consecutive registers aligned to a 4-register boundary: R0:R1:R2:R3, R4:R5:R6:R7, etc. Used by texture instructions, wide loads/stores, and tensor core operations. In the Ori IR:

vreg+64 (type) = 11 (extended quad)
Allocator scans with stride 4

Alignment Constraints

Width	Base alignment	Stride	Example
32-bit (scalar)	Any	1	R7
64-bit (pair)	Even	2	R4:R5
128-bit (quad)	4-aligned	4	R8:R9:R10:R11
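
Because each width's stride equals its alignment, the rules in the table can be checked with modular arithmetic. The following sketch uses hypothetical function names and a caller-supplied limit (the highest usable register index, for example 254 for the R file); it is not the allocator's actual scan.

```c
#include <assert.h>
#include <stdbool.h>

/* width_regs is 1 (scalar), 2 (pair) or 4 (quad); it doubles as the stride. */
bool is_aligned(int base, int width_regs) {
    assert(width_regs == 1 || width_regs == 2 || width_regs == 4);
    return base % width_regs == 0;
}

/* First aligned base at or above `start` whose last register does not
   exceed `limit`; returns -1 when no such base exists. */
int first_aligned_base(int start, int width_regs, int limit) {
    assert(start >= 0);
    assert(width_regs == 1 || width_regs == 2 || width_regs == 4);
    int base = (start + width_regs - 1) / width_regs * width_regs;
    return (base + width_regs - 1 <= limit) ? base : -1;
}
```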

The texture instruction decoder (sub_1170920) validates even-register alignment via a dedicated helper (sub_1170680) that checks if a register index falls within the set {34, 36, 38, ..., 78} and returns 0 if misaligned.

The SASS instruction encoder for register pairs (sub_112CDA0, 8.9 KB) maps 40 register pair combinations (0/1, 2/3, ..., 78/79) to packed 5-bit encoding values at 0x2000000 (33,554,432) intervals.

Special Registers

Zero and True Registers

Register	File	Index	Internal sentinel	Behavior
RZ	R	255	1023	Reads return 0; writes discarded
URZ	UR	63	1023	Uniform zero; reads return 0
PT	P	7	31	Always-true predicate; writes discarded
UPT	UP	7	31	Uniform always-true

The internal sentinel value 1023 (0x3FF) represents "don't care" or "zero register" throughout the Ori IR and allocator. When SASS operands are decoded, hardware register index 255 is mapped to sentinel 1023 for R/UR files, and hardware index 7 is mapped to sentinel 31 for P/UP files. Encoders check for these sentinels and substitute the default register value:

// Decoder: extract register operand (sub_9B3C20)
if (reg_idx == 255)
    internal_idx = 1023;   // RZ sentinel

// Decoder: extract predicate operand (sub_9B3D60)
if (pred_idx == 7)
    internal_idx = 31;     // PT sentinel

// Encoder: emit register field
if (reg == 1023)
    use *(a1+8) as default;  // encode physical RZ

Architectural Predicate Indices

The allocator skips architectural predicate registers by index number:

Index	Register	Treatment
39	(special)	Skipped during allocation (skip predicate `sub_9446D0`)
41	PT	Skipped -- hardwired true predicate
42	P0	Skipped -- architectural predicate
43	P1	Skipped -- architectural predicate
44	P2	Skipped -- architectural predicate

The skip check in sub_9446D0 returns true (skip) for register indices 41--44 and 39, regardless of register class. For other registers, it checks whether the instruction is a CSSA phi (opcode 195 with barrier type 9) or whether the register is in the exclusion set hash table at alloc+360.
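
The index-based part of the skip check is simple enough to restate in C. The sketch below covers only that first test; the CSSA phi and exclusion-set checks are omitted, and the function name is hypothetical.

```c
#include <stdbool.h>

/* True for the architectural indices that are never allocated: 39 and 41-44. */
bool is_architectural_index(int reg_index) {
    return reg_index == 39 || (reg_index >= 41 && reg_index <= 44);
}
```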

Special System Registers (S2R / CS2R)

Thread identity and hardware state are accessed through the S2R (Special Register to Register) and CS2R (Control/Status Register to Register) instructions, which copy read-only hardware registers into R-file registers.

Common system register values (from PTX parser initialization at sub_451730):

PTX name	Hardware	Description
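`%tid` / `%ntid`	SR_TID_X/Y/Z	Thread ID within CTA (`%tid`); CTA dimensions (`%ntid`)
`%ctaid` / `%nctaid`	SR_CTAID_X/Y/Z	CTA ID within grid (`%ctaid`); grid dimensions (`%nctaid`)
`%laneid`	SR_LANEID	Lane index within warp (0--31)
`%warpid` / `%nwarpid`	SR_WARPID	Warp index within CTA (`%warpid`); maximum number of warp IDs (`%nwarpid`)
`%smid` / `%nsmid`	SR_SMID	SM index (`%smid`); maximum number of SM IDs (`%nsmid`)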
`%gridid`	SR_GRIDID	Grid identifier
`%clock` / `%clock_hi` / `%clock64`	SR_CLOCK / SR_CLOCK_HI	Cycle counter
`%lanemask_eq/lt/le/gt/ge`	SR_LANEMASK_*	Lane bitmask variants

The S2R register index must be between 0 and 255 inclusive, enforced by the string "S2R register must be between 0 and 255 inclusive". Special system register ranges are tracked at Code Object offsets +1712 (start) and +1716 (count).

Operand Encoding in Ori Instructions

Each instruction operand is encoded as a 32-bit packed value in the operand array starting at instruction offset +84. The operand at index i is at *(instr + 84 + 8*i).

Packed Operand Format (Ori IR)

 31   30      28   27              20   19                            0
+----+-----------+------------------+-------------------------------+
|sign|  type (3) |   modifier (8)   |           index (20)          |
+----+-----------+------------------+-------------------------------+
 bit 31: sign/direction flag          bits 0-19: register/symbol index
 bits 28-30: operand type (3 bits)    bits 20-27: modifiers (bit 24: pair extension flag)

Extraction pattern (50+ call sites):

uint32_t operand = *(uint32_t*)(instr + 84 + 8 * i);
int type    = (operand >> 28) & 7;     // bits 28-30
int index   = operand & 0xFFFFF;       // bits 0-19
int mods    = (operand >> 20) & 0xFF;  // bits 20-27
bool is_neg = (operand >> 31) & 1;     // bit 31

Type value	Meaning
1	Register operand (index into register file at `*(ctx+88) + 8*index`)
5	Symbol/constant operand (index into symbol table at `*(ctx+152)`)
6	Special operand (barrier, system register)

For register operands (type 1), the index is masked as operand & 0xFFFFFF (24 bits) to extract the full register ID. Indices 41--44 are architectural predicates that are never allocated.
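
The extraction pattern can be wrapped in a small round-trip helper. In this sketch the struct, its field names, and the packing function are assumptions made for illustration; the source only documents the extraction side, using the 20-bit index mask.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical struct for illustration; field names are not from ptxas. */
typedef struct {
    bool     negated;  /* bit 31 */
    unsigned type;     /* bits 28-30 */
    unsigned mods;     /* bits 20-27 */
    uint32_t index;    /* bits 0-19 */
} ori_operand;

ori_operand unpack_operand(uint32_t w) {
    ori_operand o;
    o.negated = ((w >> 31) & 1) != 0;
    o.type    = (w >> 28) & 7;
    o.mods    = (w >> 20) & 0xFF;
    o.index   = w & 0xFFFFF;
    return o;
}

/* Inverse of unpack_operand; an assumption, not a documented ptxas routine. */
uint32_t pack_operand(ori_operand o) {
    return ((uint32_t)o.negated << 31) | ((uint32_t)(o.type & 7) << 28) |
           ((uint32_t)(o.mods & 0xFF) << 20) | (o.index & 0xFFFFF);
}

/* Byte offset of operand slot i: slots start at +84 and are 8 bytes apart. */
size_t operand_offset(size_t i) { return 84 + 8 * i; }
```

For example, the word 0x10000029 unpacks to a type 1 (register) operand with index 41, one of the architectural predicate indices.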

SASS Instruction Register Encoding

During final SASS encoding, the register operand encoder (sub_7BC030, 814 bytes, 6147 callers) packs register operands into the 128-bit instruction word:

Encoded register field (16 bits at variable bit offset):
  bit 0:      presence flag (1 = register present)
  bits 1-4:   register file type (4 bits, 12 values)
  bits 5-14:  register number (10 bits)

The 4-bit register file type field in the SASS encoding maps the internal operand type tag to hardware encoding:

Operand type tag	Encoded value	Register file
1	0	R (32-bit)
2	1	R pair (64-bit)
3	2	UR (uniform 32-bit)
4	3	UR pair (uniform 64-bit)
5	4	P (predicate)
6	5	(reserved)
7	6	(reserved)
8	7	B (barrier)
16	8	(extended)
32	9	(extended)
64	10	(extended pair)
128	11	(extended quad)
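
The field layout and the tag table above can be combined into a short sketch. The function names are hypothetical, and placing the field at its variable bit offset inside the 128-bit instruction word is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Maps an operand type tag to the 4-bit encoded file value from the table;
   returns -1 for tags the table does not list. */
int encode_file_tag(unsigned tag) {
    if (tag >= 1 && tag <= 8)
        return (int)tag - 1;
    switch (tag) {
    case 16:  return 8;
    case 32:  return 9;
    case 64:  return 10;
    case 128: return 11;
    default:  return -1;
    }
}

/* Builds the register field: bit 0 = presence, bits 1-4 = file, bits 5-14 = number. */
uint16_t pack_reg_field(unsigned file_code, unsigned reg_num) {
    assert(file_code < 12);
    assert(reg_num < 1024);
    return (uint16_t)(1u | (file_code << 1) | (reg_num << 5));
}
```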

The predicate operand encoder (sub_7BCF00, 856 bytes, 1657 callers) uses a different format: 2-bit predicate type, 3-bit predicate condition, and 8-bit value. It checks for PT (operand byte[0] == 14) and handles the always-true case.

Register-Class-to-Hardware Encoding

The function sub_1B6B250 (2965 bytes, 254 callers) implements the mapping from the compiler's abstract (register_class, sub_index) pair to hardware register numbers:

hardware_reg = register_class * 32 + sub_index

For example: class 0, index 1 returns 1; class 1, index 1 returns 33; class 2, index 1 returns 65. The guard wrapper sub_1B73060 (483 callers) returns 0 for the no-register case (class=0, index=0).

The register field writer (sub_1B72F60, 483 callers) packs the encoded register number into the 128-bit instruction word with the encoding split across two bitfields:

*(v2 + 12) |= (encoded_reg << 9) & 0x3E00;       // bits [13:9]
*(v2 + 12) |= (encoded_reg << 21) & 0x1C000000;   // bits [28:26]
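
The mapping and the split bitfield write can be exercised together. In the sketch below the function names are hypothetical, the writer operates on a plain 32-bit value rather than on *(v2 + 12), and the reader is an assumed inverse that the source does not describe.

```c
#include <stdint.h>

/* hardware_reg = register_class * 32 + sub_index */
uint32_t reg_class_to_hw(uint32_t reg_class, uint32_t sub_index) {
    return reg_class * 32 + sub_index;
}

/* Low 5 bits of the number go to bits [13:9], the next 3 bits to [28:26]. */
uint32_t write_reg_field(uint32_t word, uint32_t encoded_reg) {
    word |= (encoded_reg << 9) & 0x3E00u;
    word |= (encoded_reg << 21) & 0x1C000000u;
    return word;
}

/* Assumed inverse: reassembles the 8-bit number from the two bitfields. */
uint32_t read_reg_field(uint32_t word) {
    return ((word & 0x3E00u) >> 9) | ((word & 0x1C000000u) >> 21);
}
```

With class 1 and sub-index 1 the hardware number is 33; writing it into an empty word sets bit 9 and bit 26, and reading the word back returns 33.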

Register Pressure Tracking

Scheduling Phase Pressure Counters

The scheduler maintains 10 per-block register pressure counters at offsets +4 through +40 of the per-BB scheduling record (72 bytes per basic block). At BB entry, these are copied into the scheduler context at context offsets +48 through +87. The counters track live register counts for each register class:

BB record offset	Context offset (idx)	Register class
+4	+48 (idx 12)	R (general-purpose)
+8	+52 (idx 13)	P (predicate)
+12	+56 (idx 14)	UR (uniform)
+16	+60 (idx 15)	UP (uniform predicate)
+20	+64 (idx 16)	B (barrier)
+24	+68 (idx 17)	(arch-specific class 0)
+28	+72 (idx 18)	(arch-specific class 1)
+32	+76 (idx 19)	(arch-specific class 2)
+36	+80 (idx 20)	(arch-specific class 3)
+40	+84 (idx 21)	(arch-specific class 4 / control total)
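
The two offset columns differ by a constant, which a short sketch makes explicit. The names are hypothetical; counter k runs from 0 (R) to 9 (the last arch-specific class).

```c
#include <assert.h>

enum { BB_FIRST = 4, CTX_FIRST = 48, NUM_COUNTERS = 10 };

/* Byte offset of counter k inside the per-BB scheduling record. */
int bb_offset(int k) {
    assert(k >= 0 && k < NUM_COUNTERS);
    return BB_FIRST + 4 * k;
}

/* Byte offset of the same counter inside the scheduler context. */
int ctx_offset(int k) {
    assert(k >= 0 && k < NUM_COUNTERS);
    return CTX_FIRST + 4 * k;
}

/* DWORD index of the counter in the scheduler context. */
int ctx_index(int k) { return ctx_offset(k) / 4; }
```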

The spill cost analyzer (sub_682490, 14 KB) allocates two stack arrays (v94[511] and v95[538]) as per-register-class pressure delta arrays. For each instruction, it computes pressure increments and decrements based on the instruction's register operand definitions and uses.

The register pressure coefficient is controlled by knob 740 (double, default 0.045). The pressure curve function uses a piecewise linear model with parameters (4, 2, 6) via sub_8CE520.

Liveness Bitvectors

The Code Object maintains register liveness as bitvectors:

Offset	Bitvector	Description
+832	Main register liveness	One bit per virtual register; tracks which registers are live at the current program point
+856	Uniform register liveness	Separate bitvector for UR/UP registers

These bitvectors are allocated via sub_BDBAD0 (bitvector allocation, with size = register count + 1 bits) and manipulated via the SSE2-optimized bitvector primitives at sub_BDBA60 / sub_BDC180 / sub_BDCDE0 / sub_BDC300.

For each basic block during dependency graph construction (sub_A0D800, 39 KB), the per-block liveness is computed by iterating instructions and checking operand types (((v >> 28) & 7) == 1 for register operands), then updating the bitvector at +832 with set/clear operations.
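
A plain-C sketch of the bitvector bookkeeping follows. It uses 64-bit words instead of the SSE2 primitives, and every name in it is hypothetical; only the size rule (register count + 1 bits) and the operand type test come from the description above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Number of 64-bit words needed for (register count + 1) bits. */
size_t words_for_regs(size_t reg_count) {
    size_t nbits = reg_count + 1;
    return (nbits + 63) / 64;
}

void bv_set(uint64_t *bv, size_t i)   { bv[i / 64] |=  (UINT64_C(1) << (i % 64)); }
void bv_clear(uint64_t *bv, size_t i) { bv[i / 64] &= ~(UINT64_C(1) << (i % 64)); }
int  bv_test(const uint64_t *bv, size_t i) { return (int)((bv[i / 64] >> (i % 64)) & 1); }

/* Marks every register operand (type 1) live and returns the number of
   distinct live registers, or -1 if the bitvector cannot be allocated. */
long count_live_regs(const uint32_t *ops, size_t n, size_t reg_count) {
    uint64_t *live = calloc(words_for_regs(reg_count), sizeof *live);
    if (live == NULL)
        return -1;
    long count = 0;
    for (size_t k = 0; k < n; k++) {
        if (((ops[k] >> 28) & 7) != 1)
            continue;                      /* not a register operand */
        size_t idx = ops[k] & 0xFFFFF;
        assert(idx <= reg_count);
        if (!bv_test(live, idx)) {
            bv_set(live, idx);
            count++;
        }
    }
    free(live);
    return count;
}
```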

Allocator Pressure Arrays

The fat-point allocator (sub_957160) uses two 512-DWORD (2048-byte) arrays per allocation round:

Array	Role
Primary (`v12[512]`)	Per-physical-register interference count
Secondary (`v225[512]`)	Tie-breaking cost metric

Both are zeroed with SSE2 vectorized _mm_store_si128 loops at the start of each round. For each VR being allocated, the pressure builder (sub_957020) walks the VR's constraint list and increments the corresponding physical register slots. The threshold (knob 684, default 50) filters out congested slots.

ABI Register Reservations

Reserved Registers

Registers R0--R3 are unconditionally reserved by the ABI across all SM generations. The diagnostic "Registers 0-3 are reserved by ABI and cannot be used for %s" fires if they are targeted by parameter assignment or user directives.

Minimum Register Counts by SM Generation

SM generation	Generation check	SM targets	Minimum registers
3	`(sm_target+372) >> 12 == 3`	sm_35, sm_37	(no minimum)
4	`== 4`	sm_50 -- sm_53	16
5	`== 5`	sm_60 -- sm_89	16
9	`== 9`	sm_90, sm_90a	24
>9	`> 9`	sm_100+	24

Violating the minimum emits warning 7016: "regcount %d specified below abi_minimum of %d".
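
The table can be read as a small lookup. In this sketch the function names are hypothetical, the argument of sm_generation stands for the value read from sm_target+372, and generations the table does not list (6 to 8) are treated as having no minimum, which is an assumption.

```c
#include <stdint.h>

/* Generation number as tested in the table: field >> 12. */
unsigned sm_generation(uint32_t field) { return field >> 12; }

/* ABI minimum register count; 0 means no minimum is enforced. */
int abi_min_regs(unsigned generation) {
    if (generation >= 9)
        return 24;
    if (generation == 4 || generation == 5)
        return 16;
    return 0;  /* generation 3, and values the table does not cover */
}

/* True when warning 7016 would apply to the requested count. */
int below_abi_minimum(int regcount, unsigned generation) {
    return regcount < abi_min_regs(generation);
}
```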

Per-Class Hardware Limits

Class	Limit	Notes
R	255	R0--R254 usable; controlled by `--maxrregcount` and `--register-usage-level` (0--10)
UR	63	UR0--UR62 usable; sm_75+ only
P	7	P0--P6 usable
UP	7	UP0--UP6 usable; sm_75+ only
B	16	B0--B15
UB	16	UB0--UB15

The --maxrregcount CLI option sets a per-function hard ceiling for R registers. The --register-usage-level option (0--10, default 5) modulates the register allocation target: level 0 means no restriction, level 10 means minimize register usage as aggressively as possible. The per-class budget at alloc + 32*class + 884 reflects the interaction between the CLI limit and the optimization level.

The --device-function-maxrregcount option overrides the kernel-level limit for device functions when compiling with -c.

Dynamic Register Allocation (setmaxnreg)

sm_90+ (Hopper and later) supports dynamic register allocation through the setmaxnreg.inc and setmaxnreg.dec instructions, which dynamically increase or decrease the per-thread register count at runtime. ptxas tracks these as internal states setmaxreg.try_alloc, setmaxreg.alloc, and setmaxreg.dealloc. Multiple diagnostics guard correct usage:

"setmaxnreg.dec has register count (%d) which is larger than the largest temporal register count in the program (%d)"
"setmaxreg.dealloc/release has register count (%d) less than launch min target (%d) allowed"
"Potential Performance Loss: 'setmaxnreg' ignored to maintain minimum register requirements."

Pair Modes and Coalescing

The pair mode at vreg+48 bits 20--21 controls how the allocator handles wide registers:

Pair mode	Value	Behavior
Single	0	Occupies one physical register slot
Lo-half	1	Low half of a register pair
Double-width	3	Occupies two consecutive physical slots

The allocator computes register consumption via sub_939CE0:

consumption = slot + (1 << (pair_mode == 3)) - 1;
// single:  slot + 0  = slot (1 slot)
// double:  slot + 1  = slot+1 (2 slots)

The coalescing pass (sub_9B1200, 800 lines) eliminates copy instructions by merging the source and destination VRs into the same physical register. The alias chain at vreg+36 (coalesced parent) is followed during assignment (sub_94FDD0) to propagate the physical register through all aliased VRs:

alias = vreg->alias_parent;     // vreg+36
while (alias != NULL) {
    alias->physical_reg = slot;  // alias+68
    alias = alias->alias_parent; // alias+36
}
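
A self-contained version of the alias walk is sketched below. It keeps only the two fields the loop touches and assumes a NULL-terminated chain, as the loop above does; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct vreg {
    int          physical_reg;  /* -1 = unassigned */
    struct vreg *alias_parent;  /* next register in the alias chain, NULL at the end */
} vreg;

/* Assigns `slot` to every register reachable through alias_parent and
   returns how many were updated. The starting register is not modified. */
int propagate_slot(vreg *v, int slot) {
    assert(v != NULL);
    int updated = 0;
    for (vreg *a = v->alias_parent; a != NULL; a = a->alias_parent) {
        a->physical_reg = slot;
        updated++;
    }
    return updated;
}

/* Usage example: a -> b -> c. Returns 1 if both aliases received slot 7. */
int alias_chain_example(void) {
    vreg c = { -1, NULL };
    vreg b = { -1, &c };
    vreg a = { 7, &b };
    int n = propagate_slot(&a, 7);
    return n == 2 && b.physical_reg == 7 && c.physical_reg == 7;
}
```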

Register Name Table

The register class name table at off_21D2400 is a pointer array indexed by the register file type enum (from vreg+64). Each entry points to a string: "R", "UR", "P", "UP", "B", "UB", etc. This table is used by diagnostic functions:

sub_A4B9F0 (StatsEmitter::emitUndefinedRegWarning): "Referencing undefined register: %s%d" where %s is off_21D2400[*(vreg+64)] and %d is *(vreg+68) (physical register number).
sub_A60B60 (RegisterStatCollector::collectStats, 24 KB): Enumerates ~25 register sub-classes by iterating vtable getters, one per register class. The enumerated classes include R, P, B, UR, UP, UB, SRZ, PT, RZ, and others.
"Fatpoint count for entry %s for regclass %s : %d": Prints per-function per-class allocation statistics.

Key Functions

Address	Size	Function	Description
`sub_91BF30`	99 lines	`createVirtualRegister`	Allocates 160-byte VR descriptor, initializes fields, appends to register file array
`sub_9446D0`	29 lines	`shouldSkipRegister`	Returns true for indices 41--44, 39 (architectural specials); checks CSSA phi and exclusion set
`sub_A4B8F0`	248B	`emitInstrRegStats`	Emits `"instr/R-regs: %d instructions, %d R-regs"`
`sub_A4B9F0`	774B	`emitUndefinedRegWarning`	Walks operands backward, formats `"Referencing undefined register: %s%d"`
`sub_A60B60`	4560B	`collectRegisterStats`	Enumerates ~25 register sub-classes via vtable getters
`sub_7BC030`	814B	`encodeRegOperand`	Packs register into SASS instruction: 1-bit presence + 4-bit type + 10-bit number
`sub_7BCF00`	856B	`encodePredOperand`	Packs predicate into SASS: 2-bit type + 3-bit condition + 8-bit value
`sub_9B3C20`	--	`decodeRegOperand`	Decoder helper: extracts register, maps 255 to 1023 (RZ)
`sub_9B3D60`	--	`decodePredOperand`	Decoder helper: extracts predicate, maps 7 to 31 (PT)
`sub_1B6B250`	2965B	`regClassToHardware`	Maps (class, sub_index) to hardware number: `class * 32 + sub_index`
`sub_1B73060`	19B	`regClassToHardwareGuard`	Guard wrapper: returns 0 for no-register case
`sub_1B72F60`	32B	`writeRegField`	Packs encoded register into instruction word bits [13:9] and [28:26]
`sub_112CDA0`	8.9KB	`encodeRegisterPair`	Maps 40 register pair combinations to 5-bit packed encoding values
`sub_939CE0`	23 lines	`computeConsumption`	Pair-aware register slot consumption counter
`sub_94FDD0`	155 lines	`assignRegister`	Commits physical register assignment, propagates through alias chain
`sub_A0D800`	39KB	`buildDependencyGraph`	Per-block dependency graph with register-to-instruction mapping
`sub_A06A60`	15KB	`scheduleWithPressure`	Per-block scheduling loop tracking live register set bitvector
`sub_682490`	14KB	`computeRegPressureDeltas`	Per-instruction register pressure delta computation
`sub_B28E00`	--	`getRegClass`	Returns register class (1023 = wildcard, 1 = GPR)
`sub_B28E10`	--	`isRegOperand`	Predicate: is this a register operand?
`sub_B28E20`	--	`isPredOperand`	Predicate: is this a predicate operand?
`sub_B28E90`	--	`isUReg`	Predicate: is this a uniform register?

Opcode Register Class Table

Every Ori opcode carries an implicit register class contract: which register files its operands may reference, what data widths are valid, and which addressing modes apply. The function sub_6575D0 (49 KB, buildEncodingDescriptor) is the central dispatch that translates each instruction's opcode into a packed encoding descriptor consumed by the SASS encoder.

Function Signature

// sub_6575D0 -- buildEncodingDescriptor
// a1 = compiler context
// a2 = Ori instruction node pointer
// a3 = output: 4-DWORD packed encoding descriptor
char buildEncodingDescriptor(Context *a1, Instruction *a2, uint32_t *a3);

Architecture

The function is a two-level dispatch:

Outer switch on the Ori opcode at *(instr->info + 8) -- 168 unique case values spanning opcodes 3 (IADD3) through 0xF5 (PIXLD).
Inner encoding per opcode (or group): assigns an encoding category ID to a3[0], then calls the bitfield packers to fill a3[1..2] with register class attributes.

Two helper functions pack fields into the descriptor:

Function	Role	Call count	Field ID range
`sub_917A60` (`packRegClassField`)	Bitfield encoder -- field IDs 91--340 map to specific bit positions in `a3[1]` and `a3[2]`	112	91--340
`sub_A2FF00` (`packOperandField`)	Alternate encoder for operand-level slots (data type, memory space)	28	3--71

Encoding Category Assignment

The encoding category at a3[0] selects which SASS instruction format template the downstream per-SM encoder uses. Key mappings (opcode index to encoding category):

Opcode(s)	SASS mnemonic	Category	Register class summary
3	`IADD3`	489	R dest, R/UR sources, P carry
4	`BMSK`	106	R only
5--6	`SGXT` / `LOP3`	490--491	R dest, R/UR sources
7	`ISETP`	59	P dest, R/UR sources + memory ordering fields
8	`IABS`	60	R dest, R source + memory ordering fields
0x0E--0x10	`FSET`/`FSEL`/`FSETP`	510	R/P dest, FP operation variant
0x11/0x12/0x18	`FSETP`/`MOV`/`PRMT`	517	FP comparison, combine, data width (IDs 288--299)
0x15--0x16	`P2R`/`R2P`	524--525	P-to-R or R-to-P conversion
0x19	`VOTE`	526	R dest, optional memory class
0x1A	`CS2R` variant	527	UR source width (494--496), data type from `a2+92`
0x1B	`CS2R_32`	497	Source width (494/495/496), predicate flag (ID 270)
0x1E	`IPA`	494	Interpolation mode (440--442), flat/smooth (443/444)
0x1F	`MUFU`	501	Subfunction (445--447), precision (450--459)
0x20	`SHF`	502	Direction (461--463), source class (464--466), clamp, data type
0x21	`SHFL`	503	Mode (470/471), operand classes (472--482)
0x22--0x23	`I2I`/`I2IP`	55/56	Integer conversion type (23 entries in `dword_2026B20`)
0x28--0x2A	`IPA`/`MUFU` ext	512	Extended encoding variants (428--430)
0x2B--0x2C	`F2F`/`F2F_X`	513	Conversion direction (432/433), saturation (434/435)
0x2D	`FRND`	516	Rounding variant (526), mode (528/529)
0x51--0x53	`AL2P`, `AL2P_IDX`	437--438	Bindless flag (ID 148), predicate (ID 147)
0x54--0x56	`BMOV_B`/`BMOV_R`/`BMOV`	423--424	B-register class
0x64--0x67	`SETLMEMBASE`/`ATOM`	156/463	Atom-vs-red (ID 178), data width (ID 181)
0x68	`BRX`	468	Target (ID 190), call convention (IDs 191--192)
0x6A/0x6C/0x6D	`JMP`/`JMX`/`CALL`	469	Control flow target class (ID 176)
0x77--0x79	`BSSY`/`BREAK`/`BSYNC`	528--530	Sync mode (ID 324), variant (ID 325)
0x82	`NANOTRAP`	487	Trap operation class (ID 257), has-source (ID 256)
0x9E--0x9F	Hopper+ instrs	535--536	Hopper class A/B (IDs 337--338)
0xAF--0xB2	`LD`/`ST` variants	431--446	Full modifier set: uniform (91), pair (92--102)
0xB8--0xBE	`LDG`/`STG`/`LDL`/`STL`	449--456	Cache policy (131), float mode (134), width (131)
0xC1	Conditional	10/13	Branch type (ID 167), divergent (ID 168)
0xC8	`PRMT`	24	Permute selector (ID 65/66)
0xC9--0xD3	Texture/surface	61/455	Texture data type (IDs 17/18), surface (IDs 19--22)
0xD6--0xD7	`DMMA`/`CVTA`	515	Direction (304), predicate (305), data type (306)
0xDA--0xDB	`SUATOM`	521/533	Data width (326--331), sync mode (328)
0xDC	`SURED`	534	Data width (331), type (335--336), sync (333)
0xE0	`WGMMA`	500	Data type (198), enable (199), barrier (201)
0xF5	`PIXLD`	532	Mode from `dword_2026AA0` (ID 323)

Extended Opcode Path (Memory/Atomic Sub-dispatch)

When the opcode falls in the 0xF6--0x10C range (memory/atomic extended instructions), a separate sub-dispatch applies. The function sub_44AC80 gates entry; sub_44AC60 and sub_44AC70 select among three encoding categories:

Category	Gate function	Meaning
441	default	Base memory operation
442	`sub_44AC60` true	Predicated memory variant
443	`sub_44AC70` true	Extended memory variant

Within each category, the sub-opcode selects register class fields:

Sub-opcode	Register class (field 115)	Data width (field 113)
0xF6/0xFF/0x106	69 (class A)	60 (standard)
0xF7/0x100/0x107	71 (class B)	60 (standard)
0xF8/0x102/0x109	0 (default)	63 (wide)
0xF9/0x103/0x10A	0 (default)	61 (narrow)
0xFA/0x104/0x10B	0 (default)	62 (medium)
0xFB	0 (default)	65 (type A)
0xFC	0 (default)	66 (type B)
0xFD	0 (default)	68 (type C)
0xFE/0x105/0x10C	0 (from table)	64 (from `dword_2026C30`)
0x101/0x108	72 (class C)	60 (standard)

Packed Descriptor Layout

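The output descriptor a3 is described as a 4-DWORD (16-byte) structure, although the table below also lists a fifth slot, a3[4], that is written for some opcodes: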

DWORD	Content
`a3[0]`	Encoding category ID (0--542) -- selects SASS format template
`a3[1]`	Packed bitfield: memory space (bits 0--3), address type (bits 4--7)
`a3[2]`	Packed bitfield: register class attributes (data width, type, modifiers)
`a3[3]`	Auxiliary flags (bit 1 = texture scope, bit 29 = special)
`a3[4]`	Operand count override (set to 12 for KILL/extended mem ops)

Register Class Field Groups

The 112 calls to packRegClassField (sub_917A60) use field IDs organized into functional groups. Each field ID maps to a specific bit range in the output descriptor via a mask-and-OR encoding:

// Example: field 113 (data width) -- bits 7-9 of a3[2]
case 113:
    val = dword_21DEB20[a3_value - 61];  // 8-entry lookup
    a3[2] = (val << 7) | (a3[2] & 0xFFFFF87F);
    break;

// Example: field 91 (uniform flag) -- bit 16 of a3[2]
case 91:
    a3[2] = ((value == 1) << 16) | (a3[2] & 0xFFFEFFFF);
    break;

Field group	IDs	Bits written	Purpose
Core class	91--102	`a3[2]` bits 5--22	Uniform, pair, predicate, data type, saturate, negate, abs, complement
Data width	113--117	`a3[2]` bits 0--9	Width code, uniform-mem, source regclass, type specifier, write-back
Load/store	118--134	`a3[1]` + `a3[2]`	Memory space, address type, cache policy, atomic op, scope, float mode
Texture/surface	135--165	`a3[2]` bits 1--31	Texture type, dimension, LOD mode, ordering, acquire, scope hint
Control flow	167--202	`a3[2]` bits 1--6	Branch type, divergent, WGMMA data type/enable/barrier
FP/conversion	230--264	`a3[2]` various	FP operation, comparison, combine, interpolation, MUFU, SHF, SHFL
Extended	269--299	`a3[2]` various	CS2R, FSETP, rounding, data type wide, destination regclass
Hopper/Blackwell	304--340	`a3[2]` various	DMMA, WGMMA, TMA hints, surface sync, Hopper-specific classes
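
The mask-and-OR pattern in the field 113 and field 91 examples generalizes to one helper. The sketch below is illustrative; set_field is a hypothetical name, not a function found in ptxas.

```c
#include <assert.h>
#include <stdint.h>

/* Replaces the `width`-bit field starting at bit `lsb` of `word` with `value`. */
uint32_t set_field(uint32_t word, unsigned lsb, unsigned width, uint32_t value) {
    assert(width >= 1 && width <= 32 && lsb < 32 && lsb + width <= 32);
    uint32_t ones = (width == 32) ? UINT32_MAX : ((UINT32_C(1) << width) - 1);
    uint32_t mask = ones << lsb;
    return (word & ~mask) | ((value << lsb) & mask);
}
```

With lsb = 16 and width = 1 the cleared window matches the 0xFFFEFFFF mask used for field 91. With lsb = 7 and width = 4 it matches the 0xFFFFF87F mask in the field 113 example, which clears bits 7 through 10.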

Sub-handler Functions

Complex opcode families delegate register class encoding to dedicated sub-functions:

Function	Opcodes handled	Purpose
`sub_650390`	TEX, TLD, texture family	Texture register class (sampler, coordinate, LOD)
`sub_650220`	LDG, STG, LD, ST, ATOM, RED	Memory instruction register class
`sub_651330`	FMUL (opcode 0x0D)	FP multiply register class
`sub_650920`	LEA, special (0x09, 0x72, 0x74, 0x7A, 0x80, 0x81)	LEA / special instruction
`sub_650A90`	I2I, F2F, conversions (0x24--0x27, 0xE2--0xEB)	Type conversion register class
`sub_652190`	Branch/call (0x13, 0x14, 0x17)	Branch/call register class
`sub_653B90`	Misc (0x0C)	Miscellaneous instruction
`sub_650C80`	Memory barrier modifiers	Applied when `(a2+56) & 0x4F0` is nonzero
`sub_651A90`	Texture modifiers (0x83)	Applied before texture encoding
`sub_62D5D0`	Memory space computation	Computes memory space tag from operand types

Lookup Tables

The function references 28 static lookup tables that map instruction attribute values to register class encoding values:

Table	Size	Used by field(s)	Content
`dword_21DEB80`	5	94	Data type encoding
`dword_21DEB50`	3	107, 115, 145, 157, 165	3-value encoding (reused across 5 fields)
`dword_21DEB20`	8	113	Data width code
`dword_21DEB00`	7	116, 126, 131, 170	Type encoding (reused across 4 fields)
`dword_21DEAE0`	5	119/123, 136, 143, 159	Variant table (reused across 4 fields)
`dword_21DEAA0`	13	120	Memory space code
`dword_21DEA60`	10	121, 135/151	Address/texture type
`dword_21DEA20`	15	124/125	Reduction type
`dword_21DE9F0`	6	129/130, 150	Scope code
`dword_2026C30`	6	116 (ext path)	Sub-opcode to data type
`dword_2026C80`	20	165 (surface)	Surface operation codes
`dword_2026E20`	17	286	Data type (wide)
`dword_2026AC0`	16	198	WGMMA data type
`dword_2026B20`	23	I2I conversion	Integer conversion type

Ori IR Overview -- register files in the context of the full IR
Instructions -- packed operand format and opcode encoding
Allocator Architecture -- the 7-class fat-point allocator
Fat-Point Algorithm -- pressure arrays, constraint types, selection loop
GPU ABI -- reserved registers, parameter passing, return address
Spilling -- spill/reload for each register class
Scheduler -- 10 per-block pressure counters at record +4..+40
SASS Encoding -- how the descriptor drives instruction word layout

PTXAS Reverse Engineering Reference