The Ori Internal Representation

All addresses on this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

Ori -- short for "Original IR" -- is ptxas's sole intermediate representation. It is a fully proprietary, SASS-level IR with virtual registers, its own CFG infrastructure, and a partial-SSA discipline. Ori has no relationship to LLVM IR: there is no LLVM Value hierarchy, no LLVM-style use-def chains, no SSA dominance-frontier construction. Every IR-level optimization pass in ptxas (prefixed Ori in the NamedPhases table: OriCopyProp, OriSanitize, OriBranchOpt, OriLoopSimplification, OriStrengthReduce, OriDoPredication, etc.) operates on this representation.

The key design decision that distinguishes Ori from PTX: Ori uses SASS opcode names, not PTX mnemonics. After the MercConverter pass (sub_9F1A90, 35KB) runs, every instruction carries the name of the hardware SASS instruction it will become -- IMAD, FFMA, LDG, STG, BAR, BRA, EXIT, etc. -- just with virtual (not physical) register operands. This means the optimizer already knows exactly which hardware functional unit each instruction will execute on, enabling accurate latency modeling and scheduling from the earliest optimization phases.

Key Facts

| Property | Value |
|---|---|
| Name | Ori ("Original IR") |
| Heritage | Fully proprietary (not LLVM-based) |
| Level | SASS machine-level with virtual registers |
| SSA form | Partial -- constructed by phase 23, destroyed by phase 73 |
| Code Object size | ~1136 bytes per function (C++ object) |
| Code Object vtable | 0x21EE238 |
| Register files | 4: R (GPR), UR (uniform), P (predicate), UP (uniform predicate) |
| Operand kinds | 10 distinct types |
| CFG representation | FNV-1a hash maps for successor/backedge edges |
| Opcode encoding | ROT13 of real SASS mnemonic names |
| BB entry size | 40 bytes per basic block, contiguous array |
| Instruction linkage | Doubly-linked list within each basic block |

Architecture Overview

  PTX source
      |
      v
  [Flex/Bison parser]          -- see pipeline/ptx-parser.md
      |
      v
  [PTX-to-Ori lowering]        -- see pipeline/ptx-to-ori.md
      |
      v
  +-------------------------------------------+
  |            Ori IR                          |
  |                                            |
  |  Code Object (per-function container)      |
  |    +-- Basic Block array (40B entries)     |
  |    |     +-- Instruction linked list       |
  |    |           +-- Packed operand array     |
  |    +-- CFG (FNV-1a hash map edges)         |
  |    +-- RPO array                           |
  |    +-- Register file arrays                |
  |    +-- Backedge map                        |
  +-------------------------------------------+
      |
      | 159 optimization phases (phases 0-158)
      |   phase 23: GenerateMovPhi (enter partial SSA)
      |   phase 73: ConvertAllMovPhiToMov (exit partial SSA)
      |
      v
  [Instruction selection]      -- see codegen/isel.md
      |
      v
  [Register allocation]        -- see regalloc/overview.md
      |
      v
  [Instruction scheduling]     -- see scheduling/overview.md
      |
      v
  [SASS binary encoding]       -- see codegen/encoding.md

The Code Object

Every function under compilation is represented by a single Code Object -- a ~1136-byte C++ structure that owns all IR data for that function. The Code Object vtable is at 0x21EE238. Its constructor is at sub_A3B080.

Field Map

| Offset | Type | Field | Description |
|---|---|---|---|
| +24 | u32 | sm_version | SM target (encoded: 12288=sm30, 20481=sm50, 36865=sm90) |
| +72 | ptr | code_buf | Output code object buffer |
| +88 | ptr | reg_file | Register descriptor array. *(ctx+88)+8*regId -> descriptor |
| +152 | ptr | sym_table | Symbol/constant lookup array |
| +272 | ptr | instr_head | Instruction linked-list head |
| +296 | ptr | bb_array | Basic block array pointer (40B per entry) |
| +304 | u32 | bb_index | Basic block array count/current index |
| +312 | ptr | options | OptionsManager* for knob queries |
| +648 | ptr | succ_map | CFG successor edge hash table |
| +680 | ptr | backedge_map | CFG backedge hash table |
| +720 | ptr | rpo_array | Reverse post-order array (int*) |
| +768 | ptr | const_sections | Constant memory section array |
| +776 | ptr | smem_sections | Shared memory section array |
| +976 | ptr | block_info | Block info array (40 bytes per entry, contiguous) |
| +984 | i32 | num_blocks | Number of basic blocks |
| +1584 | ptr | sm_backend | SM-specific architecture backend object (see data-structures.md) |
| +1664 | ptr | knob_container | Knob container pointer (for -knob queries) |
| +1928 | ptr | codegen_ctx | Code object / code generation context |

Register and Instruction Counts (SM Backend Object)

The register counts and instruction counts live in the SM backend object at *(code_obj+1584), accessed via DWORD-indexed fields (not Code Object byte offsets). Earlier versions of this page incorrectly listed these as Code Object offsets +99, +102, +159, +335, +341 -- those are DWORD indices, making the actual byte offsets 396, 408, 636, 1340, and 1364 respectively within the SM backend.

| DWORD Index | Byte Offset | Type | Field | Description |
|---|---|---|---|---|
| [99] | +396 | u32 | ur_count | Uniform register (UR) count |
| [102] | +408 | u32 | r_alloc | R-register count (allocated) |
| [159] | +636 | u32 | r_reserved | R-register count (reserved) |
| [335] | +1340 | u32 | instr_hi | Instruction count (upper bound) |
| [341] | +1364 | u32 | instr_lo | Instruction count (lower bound) |

Register count formula (from sub_A4B8F0, where v5 = *(_DWORD **)(ctx + 1584)):

total_R_regs      = v5[159] + v5[102]   // reserved + allocated
instruction_count = v5[335] - v5[341]   // upper - lower
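
The formula can be sketched in C as follows. This is a minimal illustration that models the SM backend object as a plain array of DWORDs; the enum and function names are hypothetical and do not come from the binary.

```c
#include <stdint.h>

/* DWORD indices into the SM backend object, as listed in the table above.
   The identifier names are made up for this sketch. */
enum {
    SMB_UR_COUNT   = 99,
    SMB_R_ALLOC    = 102,
    SMB_R_RESERVED = 159,
    SMB_INSTR_HI   = 335,
    SMB_INSTR_LO   = 341
};

/* Byte offset of a DWORD-indexed field: index * 4. */
uint32_t smb_byte_offset(uint32_t dword_index) {
    return dword_index * 4u;
}

/* total_R_regs = reserved + allocated */
uint32_t total_r_regs(const uint32_t *backend) {
    return backend[SMB_R_RESERVED] + backend[SMB_R_ALLOC];
}

/* instruction_count = upper bound - lower bound */
uint32_t instruction_count(const uint32_t *backend) {
    return backend[SMB_INSTR_HI] - backend[SMB_INSTR_LO];
}
```

Here `backend` stands for the pointer loaded from `*(code_obj + 1584)`; `smb_byte_offset(335)` reproduces the +1340 byte offset from the table.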

The stats emitter at sub_A3A7E0 prints a detailed per-function profile:

# 142 instructions, 24 R-regs
# [inst=142] [texInst=0] [tepid=0] [rregs=24]
# [est latency = 87] [LSpillB=0]
# [Occupancy = 0.750000]
# [issue thru=0.888889] [fp thru=0.000000]
# [worstcaseLat=87.000000]
# [avgcaseLat=52.500000]

Basic Blocks

Basic blocks are stored as 40-byte entries in a contiguous array at Code Object +976. The block count is at +984.

Block Entry Layout (40 bytes)

| Offset | Type | Field |
|---|---|---|
| +0 | ptr | Head instruction pointer (first instruction in BB) |
| +8 | ptr | Instruction list link / tail |
| +28 | i32 | bix -- block index (unique ID for CFG operations) |
| +32 | u64 | Flags / padding |

Blocks are additionally accessible via a sub-block array at Code Object +368, indexed as *(ctx+368) + 8*blockIndex.
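
A C mirror of the 40-byte entry makes the layout easier to check. This is a sketch under assumptions: the struct and field names are hypothetical, the bytes at +16 through +27 are not documented on this page and are left opaque, and pointers are stored as `uint64_t` so the layout does not depend on the host.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of one block entry; names are illustrative only. */
typedef struct {
    uint64_t head;          /* +0  first instruction in the block */
    uint64_t tail;          /* +8  instruction list link / tail */
    uint8_t  unknown[12];   /* +16 not documented on this page */
    int32_t  bix;           /* +28 block index */
    uint64_t flags;         /* +32 flags / padding */
} OriBlockEntry;

_Static_assert(sizeof(OriBlockEntry) == 40, "block entry must be 40 bytes");
_Static_assert(offsetof(OriBlockEntry, bix) == 28, "bix must sit at +28");

/* Entry i of the contiguous array whose base is stored at Code Object +976. */
const OriBlockEntry *block_entry(const OriBlockEntry *block_info, int32_t i) {
    return &block_info[i];
}
```

The static assertions fail at compile time if the struct drifts from the documented 40-byte size or the +28 position of `bix`.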

The debug dumper (sub_BE21D0) emits Graphviz DOT output for the CFG:

digraph f {
  node [fontname="Courier" ...]
  bix0 -> bix1
  bix0 -> bix3
  bix1 -> bix2
  bix2 -> bix1    // backedge (loop)
  bix2 -> bix3
}

Control Flow Graph

The CFG uses FNV-1a hash maps to represent edges. Two separate hash tables exist at Code Object offsets +648 (successor edges) and +680 (backedge info).

FNV-1a Hashing

All CFG hash lookups use the same parameters, confirmed across 50+ call sites:

| Parameter | Value |
|---|---|
| Initial hash | 0x811C9DC5 |
| Prime | 16777619 (0x01000193) |
| Input | 4-byte block index, hashed byte-by-byte |
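
For reference, these parameters are the standard 32-bit FNV-1a constants. The sketch below hashes a block index with them; the function names are hypothetical, and the byte order (low byte first, as on a little-endian host) is an assumption rather than something confirmed from the decompilation.

```c
#include <stddef.h>
#include <stdint.h>

#define FNV1A_INIT  0x811C9DC5u
#define FNV1A_PRIME 16777619u

/* Standard 32-bit FNV-1a: XOR each byte in, then multiply by the prime. */
uint32_t fnv1a_bytes(const uint8_t *data, size_t len) {
    uint32_t h = FNV1A_INIT;
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= FNV1A_PRIME;
    }
    return h;
}

/* Hash a 4-byte block index. Assumption: bytes are fed low byte first,
   matching the in-memory layout on a little-endian host. */
uint32_t fnv1a_bix(uint32_t bix) {
    uint8_t bytes[4];
    for (int i = 0; i < 4; i++)
        bytes[i] = (uint8_t)(bix >> (8 * i));
    return fnv1a_bytes(bytes, 4);
}
```

How the resulting hash is reduced to a bucket index is not covered here, since this page does not document it.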

Hash Map Structure

Each hash map uses chained hashing with 24-byte bucket entries:

Bucket (24 bytes):
  +0   node* head      // first node in chain
  +8   node* tail      // last node in chain
  +16  i32   count     // entries in this bucket

Full Node (64 bytes):
  +0   node* next      // chain link
  +8   i32   key       // block index
  +16  ptr   values    // successor/predecessor block list
  +32  sub-hash data   // embedded sub-table for multi-edge blocks
  +56  u32   hash      // cached FNV-1a hash

Simple Node (16 bytes):
  +0   node* next
  +8   i32   key
  +12  u32   hash

Growth policy: rehash when total_elements > num_unique_keys (load factor exceeds 1.0). Capacity doubles on each rehash.
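
The bucket and simple-node layouts translate to C roughly as below. This is an illustrative sketch for an LP64 host: the type and function names are hypothetical, and comparing the cached hash before the key during the chain walk is an assumption about how that field is used, not a decompiled fact.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the 16-byte simple node (LP64 layout). */
typedef struct SimpleNode {
    struct SimpleNode *next;  /* +0  chain link */
    int32_t            key;   /* +8  block index */
    uint32_t           hash;  /* +12 cached FNV-1a hash */
} SimpleNode;

/* Hypothetical mirror of the 24-byte bucket (LP64 layout). */
typedef struct {
    SimpleNode *head;   /* +0  first node in chain */
    SimpleNode *tail;   /* +8  last node in chain */
    int32_t     count;  /* +16 entries in this bucket */
} Bucket;

/* Walk one bucket's chain looking for a key (assumed lookup order). */
const SimpleNode *bucket_find(const Bucket *b, int32_t key, uint32_t hash) {
    for (const SimpleNode *n = b->head; n != NULL; n = n->next)
        if (n->hash == hash && n->key == key)
            return n;
    return NULL;
}

/* Growth check exactly as stated above. */
int needs_rehash(uint32_t total_elements, uint32_t num_unique_keys) {
    return total_elements > num_unique_keys;
}
```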

Key CFG Functions

| Address | Size | Function | Notes |
|---|---|---|---|
| sub_BDE150 | 9KB | CFG::computeRPO | Explicit DFS stack, assigns RPO numbers into +720 array |
| sub_BDE8B0 | 2KB | CFG::printEdges | FNV-1a lookup, prints "bix%d -> bix%d\n" |
| sub_BDEA50 | 4KB | CFG::dumpRPOAndBackedges | RPO + backedge debug dump |
| sub_BE0690 | 54KB | CFG::buildAndAnalyze | Main CFG constructor: predecessors, successors, RPO, loop detection |
| sub_BE21D0 | 1.4KB | CFG::dumpDOT | Graphviz DOT format output |
| sub_BE2330 | 4KB | CFG::computeDominators | Post-build dominator/loop analysis with bitvector ops |
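
To make "explicit DFS stack" concrete, here is a textbook iterative reverse post-order computation over a toy adjacency-list CFG. It is not a reconstruction of sub_BDE150: the `ToyCfg` type, the fixed array bounds, and the function name are all hypothetical, and the real code works on the hash-map edge tables rather than arrays.

```c
#include <stdint.h>

#define MAX_BLOCKS 64
#define MAX_SUCCS  4

/* Toy CFG: succ[b][k] is the k-th successor of block b. */
typedef struct {
    int num_blocks;
    int num_succ[MAX_BLOCKS];
    int succ[MAX_BLOCKS][MAX_SUCCS];
} ToyCfg;

/* Fills rpo_num[b] with the reverse post-order number of block b and
   returns the number of edges whose target was still on the DFS stack
   (back edges). Blocks unreachable from block 0 get rpo_num == -1. */
int compute_rpo(const ToyCfg *g, int rpo_num[MAX_BLOCKS]) {
    int stack[MAX_BLOCKS], next_succ[MAX_BLOCKS] = {0};
    int post[MAX_BLOCKS];
    uint8_t state[MAX_BLOCKS] = {0};   /* 0 unseen, 1 on stack, 2 done */
    int sp = 0, finished = 0, backedges = 0;

    if (g->num_blocks <= 0) return 0;
    for (int b = 0; b < g->num_blocks; b++) post[b] = -1;

    stack[sp++] = 0;                   /* entry block */
    state[0] = 1;
    while (sp > 0) {
        int b = stack[sp - 1];
        if (next_succ[b] < g->num_succ[b]) {
            int s = g->succ[b][next_succ[b]++];
            if (state[s] == 0) { state[s] = 1; stack[sp++] = s; }
            else if (state[s] == 1) backedges++;   /* target is an ancestor */
        } else {
            state[b] = 2;
            post[b] = finished++;      /* post-order number */
            sp--;
        }
    }
    for (int b = 0; b < g->num_blocks; b++)
        rpo_num[b] = (post[b] < 0) ? -1 : finished - 1 - post[b];
    return backedges;
}
```

Fed the edge list from the DOT example above (bix0 -> bix1, bix0 -> bix3, bix1 -> bix2, bix2 -> bix1, bix2 -> bix3), this sketch numbers the blocks 0, 1, 2, 3 and reports one back edge, bix2 -> bix1.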

The RPO dump (sub_BDEA50) produces output like:

Showing RPO state for each basic block:
  bix0 -> RPONum: 0
  bix1 -> RPONum: 1
  bix2 -> RPONum: 3
  bix3 -> RPONum: 2
RPO traversal order: bix0, bix1, bix3, bix2
Showing backedge info:
  bix2 -> backedge's successor BB: 1

Instructions

Instructions are C++ objects with a large vtable, linked into per-basic-block doubly-linked lists. Each instruction carries a unique integer ID, an opcode, and a packed operand array.

Instruction Layout

| Offset | Type | Field | Description |
|---|---|---|---|
| +8 | varies | reg_class | Register class / encoding fields |
| +16 | i32 | id | Unique instruction ID |
| +28 | u32 | opcode | SASS opcode (lower 12 bits = base, bits 12-13 = modifier) |
| +36 | u32 | flags | Flags (bits 19-21 = subtype) |
| +48 | u8 | special_flags | Volatile/special (bit 5 = volatile) |
| +72 | u32 | opcode_info | Opcode info (duplicate/extended field, confirmed 50+ sites) |
| +73 | u8 | instr_flags | Per-instruction flag byte |
| +80 | u32 | operand_count | Number of operands |
| +84 | u32[] | operands | Packed operand array (8 bytes per operand) |
| +160 | ptr | enc_buf | Encoding buffer pointer (post-selection) |
| +184 | u32 | enc_mode | Encoding mode |
| +200 | u64 | imm_value | Immediate value |

Packed Operand Encoding

Each operand occupies 8 bytes in the operand array starting at instruction offset +84:

 31  30  29  28  27       24  23  22  21  20  19                  0
+---+---+---+---+-----------+---+---+---+---+---------------------+
|     type      |  modifier bits (8 bits)    |  index (20 bits)    |
+---+---+---+---+-----------+---+---+---+---+---------------------+
                 ^                            ^
                 bit 24: extended flag         bits 0-19: reg/sym index

type field (bits 28-30):
  1 = register operand      -> index into *(ctx+88) register file
  5 = symbol/const operand  -> index into *(ctx+152) symbol table
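
The bit layout above can be exercised with a small packing helper. This is a sketch for building test values, not code recovered from ptxas; the enum and function names are hypothetical.

```c
#include <stdint.h>

/* The two type values documented above; the names are illustrative. */
enum { OPND_TYPE_REG = 1, OPND_TYPE_SYM = 5 };

/* Pack word 0: type in bits 28-30, modifiers in bits 20-27,
   register/symbol index in bits 0-19. */
uint32_t pack_operand_word0(uint32_t type, uint32_t mods, uint32_t index) {
    return ((type & 0x7u) << 28) | ((mods & 0xFFu) << 20) | (index & 0xFFFFFu);
}
```

For example, `pack_operand_word0(OPND_TYPE_REG, 0, 3)` yields 0x10000003, a register operand with index 3 and no modifier bits.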

Operand Word 1 (Upper 4 Bytes)

Each 8-byte operand slot has two DWORDs. Word 0 (documented above) carries type/modifier/index. Word 1 carries extended flags:

Word 1 (at instr + 84 + 8*i + 4):

 31  30  29  28  27  26  25  24  23                             0
+---+---+---+---+---+---+---+---+-------------------------------+
|     reserved / mod flags      |CB |      auxiliary data        |
+---+---+---+---+---+---+---+---+-------------------------------+
                             ^
                             bit 24: const-bank flag (CB)

Bits 25-31 (mask 0xFE000000): extended modifier flags
  When any bit is set, the operand has special semantics.
  Peephole matchers bail out early if (word1 & 0xFE000000) != 0.
  Bit 25 (0x2000000): operand reuse / negation extension
  Bit 26 (0x4000000): absolute-value modifier (|x|)

Bit 24 (mask 0x1000000): const-bank flag
  When set, indicates the source references a constant bank (c[N][offset]).
  The scheduler uses this to distinguish FADD (standard) from FADD (const-bank)
  for latency modeling (see scheduling/latency-model.md).

Bits 0-23: auxiliary data
  For symbol/const operands (type 5): constant bank number
  For predicate guards (type 6): predicate sense (true/false)
  For register operands (type 1): typically zero

Evidence: sub_40848E checks (word1 & 0xFE000000) != 0 across all operands; sub_405769 tests both 0x1000000 and 0x6000000 combinations; sub_404AD0 verifies (word1 & 0xFE000000) == 0 before allowing peephole transforms. Confirmed in 30+ decompiled functions (confidence 0.92).
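
The word 1 masks and the peephole guard can be summarized in a few helpers. This is a sketch of the checks described above, with hypothetical macro and function names; it takes an array of word 1 values rather than a real instruction object.

```c
#include <stdbool.h>
#include <stdint.h>

#define W1_EXT_MODS_MASK 0xFE000000u  /* bits 25-31: extended modifier flags */
#define W1_CONST_BANK    0x01000000u  /* bit 24: const-bank flag */
#define W1_BIT25         0x02000000u  /* reuse / negation extension */
#define W1_ABS           0x04000000u  /* absolute-value modifier */
#define W1_AUX_MASK      0x00FFFFFFu  /* bits 0-23: auxiliary data */

bool word1_has_ext_mods(uint32_t w1)   { return (w1 & W1_EXT_MODS_MASK) != 0; }
bool word1_has_const_bank(uint32_t w1) { return (w1 & W1_CONST_BANK) != 0; }
uint32_t word1_aux(uint32_t w1)        { return w1 & W1_AUX_MASK; }

/* Mirrors the guard described above: bits 25-31 must be clear in every
   operand's word 1 before a peephole transform is attempted. */
bool peephole_allowed(const uint32_t *word1s, int n) {
    for (int i = 0; i < n; i++)
        if (word1_has_ext_mods(word1s[i]))
            return false;
    return true;
}
```

Note that the const-bank flag alone does not block the guard, because bit 24 lies outside the 0xFE000000 mask.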

Extraction Pattern

Extraction pattern (appears in 50+ functions):

uint32_t operand = *(uint32_t*)(instr + 84 + 8 * i);
int type    = (operand >> 28) & 7;
int index   = operand & 0xFFFFF;
int mods    = (operand >> 20) & 0xFF;

uint32_t word1 = *(uint32_t*)(instr + 84 + 8 * i + 4);
bool has_const_bank = (word1 & 0x1000000) != 0;
bool has_ext_mods   = (word1 & 0xFE000000) != 0;

Opcode Constants

Selected confirmed opcodes (from multiple independent functions):

| Value | Instruction | Notes |
|---|---|---|
| 47 | NOP / barrier | |
| 72 | CALL / JMP | Function call or jump |
| 91 | ATOM | Atomic memory operation |
| 92 | RED | Reduction operation |
| 95 | STS | Store to shared memory (ROT13: FGF). Note: EXIT = opcode 77 (RKVG), RET = opcode 72 (ERG) |
| 155 | LD variant | Load instruction |
| 173 | ST variant | Store instruction |
| 183 | LD.E | Extended load (& 0xFFFFCFFF mask removes modifier bits) |
| 267 | ST variant | Store (& 0xFFFFCFFF) |
| 268 | LD variant | Load (& 0xFFFFCFFF) |
| 288 | ST.E | Extended store |

The 0xFFFFCFFF mask (clear bits 12-13) strips modifier/suboperation bits from the opcode, yielding the base instruction class. This pattern appears in InstructionClassifier, MBarrierDetector, and OperandLowering code.
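
As a minimal sketch of that masking step (the macro and function names are hypothetical):

```c
#include <stdint.h>

#define OPCODE_BASE_MASK 0xFFFFCFFFu  /* clears bits 12 and 13 */

/* Base instruction class with the modifier bits stripped. */
uint32_t base_opcode(uint32_t opcode) {
    return opcode & OPCODE_BASE_MASK;
}

/* The two stripped bits, returned as a value in 0..3. */
uint32_t opcode_modifier(uint32_t opcode) {
    return (opcode >> 12) & 0x3u;
}
```

With this, an opcode word of `183 | 0x1000` classifies as base opcode 183 (LD.E) with modifier value 1.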

ROT13 Opcode Names

All SASS opcode mnemonic strings stored in the binary are ROT13-encoded. The master table is initialized in sub_BE7390 (InstructionInfo constructor) at offset 4184 of the InstructionInfo object, with 16-byte {name, length} entries. This is lightweight obfuscation -- not a security measure.

Selected decoded names (more than 200 in total, covering the full sm_70+ SASS ISA):

| ROT13 | Real | Category |
|---|---|---|
| VZNQ | IMAD | Integer multiply-add |
| VNQQ3 | IADD3 | 3-input integer add |
| SSZN | FFMA | FP fused multiply-add |
| SNQQ | FADD | FP add |
| SZHY | FMUL | FP multiply |
| ZBI | MOV | Move |
| FRY | SEL | Select |
| YBC3 | LOP3 | 3-input logic |
| VFRGC | ISETP | Integer set-predicate |
| SFRGC | FSETP | FP set-predicate |
| YRN | LEA | Load effective address |
| FUS | SHF | Shift / funnel shift |
| ZHSH | MUFU | Multi-function unit (SFU) |
| YQT | LDG | Load global |
| FGT | STG | Store global |
| YQP | LDC | Load constant |
| YQY | LDL | Load local |
| YQF | LDS | Load shared |
| NGBZ | ATOM | Atomic |
| ONE | BAR | Barrier |
| OEN | BRA | Branch |
| PNYY | CALL | Call |
| ERG | RET | Return |
| RKVG | EXIT | Exit |
| GRK | TEX | Texture |
| ZRZONE | MEMBAR | Memory barrier |
| JNECFLAP | WARPSYNC | Warp synchronize |
| C2E | P2R | Predicate to register |
| E2C | R2P | Register to predicate |
| ABC | NOP | No-op |
| OFFL | BSSY | Branch sync stack push |
| OFLAP | BSYNC | Branch sync |
| QRCONE | DEPBAR | Dependency barrier |
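
Decoding is a plain ROT13 over the letters; digits pass through unchanged, which is why VNQQ3 decodes to IADD3. The helper below is a sketch for checking strings pulled from the binary; both function names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* ROT13 over ASCII letters; every other character is copied unchanged.
   Writes at most cap-1 characters plus a terminating NUL into out. */
void rot13(const char *in, char *out, size_t cap) {
    size_t i = 0;
    if (cap == 0) return;
    for (; in[i] != '\0' && i + 1 < cap; i++) {
        char c = in[i];
        if (c >= 'A' && c <= 'Z')
            c = (char)('A' + (c - 'A' + 13) % 26);
        else if (c >= 'a' && c <= 'z')
            c = (char)('a' + (c - 'a' + 13) % 26);
        out[i] = c;
    }
    out[i] = '\0';
}

/* Returns 1 if `encoded` (as stored in the binary) decodes to `mnemonic`. */
int decodes_to(const char *encoded, const char *mnemonic) {
    char buf[32];
    rot13(encoded, buf, sizeof buf);
    return strcmp(buf, mnemonic) == 0;
}
```

Because ROT13 is its own inverse, the same `rot13` function also encodes a real mnemonic into the form to search for in the binary.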

Register Files

Ori maintains four distinct register files, mirroring the SASS hardware register model.

Register File Summary

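| File | Width | Range | Special | ABI type | Count location (SM backend DWORD index) |
|---|---|---|---|---|---|
| R | 32-bit | R0 -- R255 | RZ (read-zero) | 2 | [102] (alloc), [159] (reserved) |
| UR | 32-bit | UR0 -- UR63 | URZ (read-zero) | 3 | [99] |
| P | 1-bit | P0 -- P6 | PT (always-true) | 5 | (tracked separately) |
| UP | 1-bit | UP0 -- UP6 | UPT (always-true) | -- | (tracked separately) |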

R registers are the main 32-bit general-purpose registers. 64-bit values occupy consecutive pairs (e.g., R4:R5). The total R-register count for a function is field[159] + field[102] (reserved + allocated). Maximum is 255 usable registers (R0-R254); R255 is the hardware zero register RZ.

UR registers (sm_75+) are uniform registers shared across the warp: every thread sees the same value. They span UR0-UR63 on supported architectures. The count is held at DWORD index [99] (byte offset +396) of the SM backend object, not at a Code Object offset.

P registers are 1-bit predicate registers used for conditional execution. P0-P6 are usable; PT is the hardwired always-true predicate (writes are discarded).

UP registers are the uniform variant of predicates, shared across the warp like UR.

Register Descriptor

Each register is described by a descriptor in the register file array, accessed as *(ctx+88) + 8*regId:

| Offset | Type | Field |
|---|---|---|
| +8 | u32 | Size / live range info |
| +12 | u32 | Register number |
| +16 | u32 | Register class (enum) |
| +20 | u32 | Physical register name (assigned after regalloc) |
| +24 | ptr | Definition info (0 = undefined / uninitialized) |
| +36 | u32 | Flags (bits 19-21 = subtype) |
| +48 | u8 | Volatile/special flags (bit 5 = volatile marker) |
| +64 | u32 | Register file type enum |
| +68 | u32 | Physical register number (post-allocation) |

Register file type values at descriptor +64:

| Value | Meaning |
|---|---|
| 2 | General-purpose (R) |
| 3 | Uniform (UR) |
| 5 | Predicate (P) |
| 6 | General register (alternate classification) |
| 7 | Predicate (alternate classification) |
| 10 | Extended register pair (64-bit) |
| 11 | Extended register quad (128-bit) |

The register class name table at off_21D2400 maps reg_type enum values to string names. The stat collector (sub_A60B60, 24KB) enumerates ~25 register sub-classes including R, P, B, UR, UP, UB, Tensor/Acc, SRZ, PT, RZ, and others. The allocator processes classes 0--6 (matching reg_type values 0--6); barrier registers (reg_type 9) are handled separately.

Partial SSA

Ori does not maintain full SSA form at all times. Instead, it uses a bounded "partial SSA" window managed by two phases in the 159-phase optimization pipeline.

Phase 23: GenerateMovPhi

Constructs phi-like MovPhi pseudo-instructions at CFG merge points. Inserted after loop unrolling (phase 22) and before pipelining (phase 24). This establishes partial SSA form -- not through LLVM-style dominance-frontier phi insertion, but through explicit MovPhi nodes that represent value merging at control-flow join points.

Phase 73: ConvertAllMovPhiToMov

Destructs SSA form by lowering every MovPhi into a plain MOV instruction. Runs after sync instruction expansion (phase 72) and before uniform register conversion (phase 74). This is SSA destruction without the need for interference-graph-based coalescing -- the MovPhi nodes simply become copies.

The SSA Window

The partial-SSA window spans phases 23 through 73, covering the bulk of the optimization pipeline:

Phase 23  GenerateMovPhi         <-- SSA construction
Phase 24  OriPipelining
Phase 25  StageAndFence
Phase 26  OriRemoveRedundantBarriers
Phase 29  GeneralOptimize
Phase 37  GeneralOptimizeMid
Phase 46  GeneralOptimizeMid2
Phase 49  GvnCse
Phase 50  OriReassociateAndCommon
Phase 54  OriDoRematEarly
Phase 58  GeneralOptimizeLate
Phase 63  OriDoPredication
Phase 65  GeneralOptimizeLate2
Phase 69  OriDoRemat
Phase 70  OriPropagateVaryingSecond
Phase 71  OptimizeSyncInstructions
Phase 72  LateExpandSyncInstructions
Phase 73  ConvertAllMovPhiToMov  <-- SSA destruction

All optimizations between these two phases can rely on the single-definition property of MovPhi nodes for reaching-definition analysis.

MovPhi Instruction Format

A MovPhi is not a distinct opcode -- it reuses the MOV opcode (19) with a distinguishing flag in the instruction's auxiliary fields. Phase 73 (ConvertAllMovPhiToMov) converts MovPhi to plain MOV by clearing this flag, without changing the opcode value.

MovPhi operand layout:
  +72  opcode         = 19 (MOV)
  +76  opcode_aux     = flag distinguishing MovPhi from plain MOV
  +80  operand_count  = 2*N + 1  (variable, one destination + N source-predecessor pairs)

  operand[0]:           destination register (the merged value)
  operand[1], [2]:      {source_reg, predecessor_bix} for predecessor 0
  operand[3], [4]:      {source_reg, predecessor_bix} for predecessor 1
  ...
  operand[2*N-1], [2*N]: {source_reg, predecessor_bix} for predecessor N-1

This is the operational equivalent of an SSA phi node. For a CFG merge with two predecessors:

;; PTX-level CFG:            ;; Ori MovPhi:
;;   bix1 defines R7         ;;
;;   bix2 defines R9         ;;   MovPhi R3, R7, bix1, R9, bix2
;;   bix3 merges             ;;
;;   uses R3                 ;;   "if from bix1, R3 = R7; if from bix2, R3 = R9"

Phase 23 (GenerateMovPhi) inserts these at merge points where a register has different reaching definitions from different predecessors. The phase 73 destructor then linearizes them: it inserts a MOV R3, R7 at the end of bix1 and a MOV R3, R9 at the end of bix2, then deletes the MovPhi.
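
The 2*N + 1 operand layout implies simple index arithmetic, sketched below. The function names are hypothetical, and `ops[]` holds plain register and block numbers rather than packed operand words, so this illustrates the layout only and is not ptxas code.

```c
#include <stdint.h>

/* Number of predecessors N encoded by operand_count = 2*N + 1.
   Returns -1 when the count is zero or even, which no MovPhi can have. */
int movphi_num_preds(uint32_t operand_count) {
    if (operand_count == 0 || (operand_count % 2) == 0)
        return -1;
    return (int)((operand_count - 1) / 2);
}

/* Operand slots for predecessor k (0-based): source at 2k+1, bix at 2k+2. */
uint32_t movphi_src_slot(uint32_t k)  { return 2 * k + 1; }
uint32_t movphi_pred_slot(uint32_t k) { return 2 * k + 2; }

/* Source register selected when control arrives from block from_bix.
   Returns -1 if from_bix is not a listed predecessor. */
int movphi_source_for(const int *ops, uint32_t operand_count, int from_bix) {
    int n = movphi_num_preds(operand_count);
    for (int k = 0; k < n; k++)
        if (ops[movphi_pred_slot((uint32_t)k)] == from_bix)
            return ops[movphi_src_slot((uint32_t)k)];
    return -1;
}
```

For the two-predecessor example above, `ops = {3, 7, 1, 9, 2}` encodes MovPhi R3, R7, bix1, R9, bix2; arriving from bix1 selects R7 and arriving from bix2 selects R9.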

Operand Kinds

The IR supports 10 distinct operand kinds, identified through the register allocator verifier (sub_A55D80) and the instruction selection pattern matcher infrastructure.

| # | Kind | Description |
|---|---|---|
| 1 | R/UR register | General-purpose or uniform register operand |
| 2 | P/UP register | Predicate or uniform-predicate register operand |
| 3 | Any register | Wildcard -- matches any register class |
| 4 | Offset | Memory offset for address computation |
| 5 | Regular | Standard immediate or constant value |
| 6 | Predicated | Guard predicate controlling conditional execution |
| 7 | Remat | Rematerialization marker (value can be recomputed instead of spilled) |
| 8 | Spill-refill | Spill/refill pair marker for register allocator |
| 9 | R2P / P2R | Register-to-predicate or predicate-to-register conversion pair |
| 10 | Bit-spill | Single-bit spill (predicate register spill to GPR) |

The regalloc verifier (sub_A55D80, confidence 0.95) classifies 10 problem categories that map to these operand kinds:

  1. Missing spill match for refill
  2. Refill reads uninitialized memory
  3. P2R-R2P pattern match failure
  4. Bit-spill-refill pattern match failure
  5. Previously defined operand now uninitialized
  6. Extra post-regalloc definitions (mixed-size check)
  7. Rematerialization problem
  8. P2R-R2P base destroyed
  9. Bit-spill-refill base destroyed
  10. Definitions disappeared without new ones added

The pattern matcher infrastructure at 0xB7D000--0xBA9D00 (~390 functions) uses a separate classification for instruction selection:

| Function | Predicate |
|---|---|
| sub_B28E10 | isRegOperand |
| sub_B28E20 | isPredOperand |
| sub_B28E40 | isImmOperand |
| sub_B28E80 | isConstOperand |
| sub_B28E90 | isUReg |
| sub_B28E00 | getRegClass (1023 = wildcard, 1 = GPR) |

Ori vs. PTX

PTX is a virtual ISA -- a stable interface between the compiler frontend and the architecture-specific backend. Ori is the architecture-specific backend representation that replaces PTX opcodes with actual SASS instructions early in compilation.

| Aspect | PTX | Ori |
|---|---|---|
| Opcode set | Virtual mnemonics (add, mul, ld, st) | SASS hardware opcodes (IMAD, FFMA, LDG, STG) |
| Register model | Unlimited virtual registers, typed | 4 hardware register files (R, UR, P, UP) with virtual numbering |
| SSA form | Not applicable (PTX is a linear ISA) | Partial SSA between phases 23 and 73 |
| CFG representation | Implicit (labels + branches) | Explicit hash-map-based CFG with RPO, backedges, dominators |
| Target dependence | Architecture-independent (forward-compatible) | Architecture-specific (per-SM instruction selection) |
| Conversion point | Input to ptxas | After MercConverter (sub_9F1A90) |

The MercConverter pass is the boundary: it transforms PTX-derived intermediate opcodes into SM-specific SASS opcodes by dispatching through a large opcode switch (sub_9ED2D0, 25KB). After MercConverter, the string "After MercConverter" appears in diagnostic output, and the IR is fully in SASS-opcode form. Each instruction then carries enough information for the scheduler to compute accurate latencies, throughputs, and functional-unit assignments.

Worked Example: add.f32 to FADD

This traces a single PTX instruction through the Ori representation, showing exactly how the opcode, operands, and register references are encoded in memory.

PTX Input

add.f32 %f3, %f1, %f2

After MercConverter (sub_9F1A90), this becomes the Ori instruction:

FADD R3, R1, R2

The type qualifier .f32 disappears -- the "F" in FADD encodes the float type. Register names %f1, %f2, %f3 become virtual register IDs R1, R2, R3 in the R (GPR) register file.

Instruction Object in Memory

FADD is opcode 12 in the ROT13 name table (ROT13: SNQQ, at InstructionInfo+4184+16*12). The 296-byte instruction object:

Offset  Value              Field
------  -----------------  ---------------------
+0      prev_ptr           Linked-list prev
+8      next_ptr           Linked-list next
+16     <id>               Unique instruction ID
+72     0x0000000C         opcode = 12 (FADD)
+80     0x00000003         operand_count = 3
+84     0x10000003         operand[0] word0: dst R3
+88     0x00000000         operand[0] word1: no ext flags
+92     0x10000001         operand[1] word0: src R1
+96     0x00000000         operand[1] word1: no ext flags
+100    0x10000002         operand[2] word0: src R2
+104    0x00000000         operand[2] word1: no ext flags

Operand Decoding

Take operand[0] word0 = 0x10000003:

  0x10000003 in binary:
    bit 31     = 0       (no sign/negate)
    bits 28-30 = 001     (type = 1 = register operand)
    bits 20-27 = 00000000 (no modifiers)
    bits 0-19  = 00003   (register index = 3)
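
The same decoding, written as a small C helper that can be applied to all three operand words of the FADD example. The struct and function names are hypothetical; the shifts and masks are the ones from the extraction pattern earlier on this page.

```c
#include <stdint.h>

/* Decoded fields of operand word 0; names are illustrative. */
typedef struct {
    uint32_t type;   /* bits 28-30 */
    uint32_t mods;   /* bits 20-27 */
    uint32_t index;  /* bits 0-19  */
} OperandWord0;

OperandWord0 decode_word0(uint32_t w) {
    OperandWord0 d;
    d.type  = (w >> 28) & 0x7u;
    d.mods  = (w >> 20) & 0xFFu;
    d.index = w & 0xFFFFFu;
    return d;
}

/* Byte offset of operand i's word 0 inside the instruction object. */
uint32_t operand_word0_offset(uint32_t i) {
    return 84u + 8u * i;
}
```

Applied to 0x10000003, 0x10000001, and 0x10000002, this gives type 1 (register) with indices 3, 1, and 2, matching the dst and src operands of FADD R3, R1, R2.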

The register index resolves through the register descriptor array:

reg_desc = *(ptr*)(*(ptr*)(code_obj + 88) + 8 * 3);
// reg_desc + 64: reg_file_type = 2 (R / GPR file)
// reg_desc + 12: register number = 3

If the source operand were a constant-bank reference (e.g., FADD R3, R1, c[0][0x10]), operand[2] would have type=5 (symbol/constant) in word0 and the const-bank flag (0x1000000) set in word1. The scheduler distinguishes these two FADD variants for latency modeling: standard FADD gets throughput class 0x3D, while const-bank FADD gets 0x78.

Memory Space Classification

Memory operands carry a space type enum, resolved by sub_91C840 which maps the PTX-level space identifier to an internal category number. The full input enumeration (from complete decompilation of sub_91C840, confidence 0.98):

| Input | PTX Space | Internal Category | Notes |
|---|---|---|---|
| 0 | (none) | -- | Unmapped, no memory space |
| 1 | Register / generic | 16 | Register file address |
| 2 | Code / function | 12 | Function address |
| 3 | (gap) | -- | Unmapped |
| 4 | .shared | 1 | Shared memory |
| 5 | .const | 3 | Constant memory |
| 6 | .global | 11 | Global memory |
| 7 | .local | 2 | Local memory |
| 8 | (gap) | -- | Unmapped |
| 9 | .local (variant) | 2 | Same as 7, alternate encoding |
| 10--11 | (gap) | -- | Unmapped |
| 12 | .param | 4 | Parameter memory |
| 13 | Generic (unqualified) | 0 | Generic address space |
| 14 | .tex | 8 | Texture memory |
| 15 | .surf | 17 | Surface memory |
| 16 | Spill space | 7 | Register spill/fill scratch |
| 17 | (gap) | -- | Unmapped |
| 18 | (instruction-dependent) | varies | Sub-classifies by opcode at a2[1] |
| 19 | .uniform | 15 | Uniform (sm_75+) |
| 20 | .global (extended) | 6 | Global, extended variant |
| 21 | .const (extended) | 5 | Constant, extended store-to-global path |
| 22 | .const (extended, alt) | 5 | Constant, alternate extended |
| 23 | .surf / tensor (ext) | 18 | Surface/tensor extended (sm_90+) |

Case 18 (0x12) uses a sub-switch on the opcode value at a2[1] to further classify: opcodes 7, 43, 45, 53 map to category 6 (global-like); opcode 111 and opcodes in the 183--199 range map to category 5 (constant-like); opcodes 54 and 189 map to category 9 (special).
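
The fixed part of the mapping table reads naturally as a switch. The sketch below is a transcription of the table, not the decompiled body of sub_91C840: the function name is hypothetical, unmapped inputs return -1 by convention of this example, and the opcode-dependent case 18 is deliberately left unmodeled.

```c
/* Internal category for a space identifier, or -1 when the input is
   unmapped. Input 18 depends on the opcode and is not modeled in this
   sketch, so it falls through to -1 as well. */
int memory_space_category(int input) {
    switch (input) {
    case 1:  return 16;  /* register / generic */
    case 2:  return 12;  /* code / function */
    case 4:  return 1;   /* .shared */
    case 5:  return 3;   /* .const */
    case 6:  return 11;  /* .global */
    case 7:  return 2;   /* .local */
    case 9:  return 2;   /* .local, alternate encoding */
    case 12: return 4;   /* .param */
    case 13: return 0;   /* generic (unqualified) */
    case 14: return 8;   /* .tex */
    case 15: return 17;  /* .surf */
    case 16: return 7;   /* spill space */
    case 19: return 15;  /* .uniform */
    case 20: return 6;   /* .global (extended) */
    case 21: return 5;   /* .const (extended) */
    case 22: return 5;   /* .const (extended, alt) */
    case 23: return 18;  /* .surf / tensor (ext) */
    default: return -1;  /* gaps, input 0, and the unmodeled case 18 */
    }
}
```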

The hot/cold classifier pair (sub_A9CDE0 / sub_A9CF90) consumes the internal category to partition instructions for scheduling. Hot memory operations (global loads/stores, certain atomics -- category 11) have long latencies and benefit from aggressive scheduling; cold operations (constant loads -- category 3) have shorter latencies and are treated more conservatively.

Key Functions

| Address | Size | Role | Confidence |
|---|---|---|---|
| sub_A3B080 | -- | Code Object constructor; allocates ~1136-byte per-function IR container (vtable at 0x21EE238) | 0.90 |
| sub_A4B8F0 | -- | Register count formula: total_R = v5[159] + v5[102], instr_count = v5[335] - v5[341] | 0.90 |
| sub_A3A7E0 | -- | Stats emitter; prints per-function profile (instruction count, register count, occupancy, latency) | 0.90 |
| sub_BE21D0 | 1.4KB | CFG::dumpDOT; emits Graphviz DOT output for the control flow graph | 0.92 |
| sub_BDE150 | 9KB | CFG::computeRPO; explicit DFS stack, assigns reverse post-order numbers into Code Object +720 array | 0.92 |
| sub_BDE8B0 | 2KB | CFG::printEdges; FNV-1a lookup, prints "bix%d -> bix%d\n" | 0.92 |
| sub_BDEA50 | 4KB | CFG::dumpRPOAndBackedges; RPO traversal order + backedge debug dump | 0.92 |
| sub_BE0690 | 54KB | CFG::buildAndAnalyze; main CFG constructor -- predecessors, successors, RPO, loop detection | 0.92 |
| sub_BE2330 | 4KB | CFG::computeDominators; post-build dominator and loop analysis with bitvector operations | 0.92 |
| sub_BE7390 | -- | InstructionInfo constructor; initializes 322-entry ROT13 opcode name table at object offset +4184 | 0.90 |
| sub_9F1A90 | 35KB | MercConverter pass; transforms PTX-derived opcodes into SM-specific SASS opcodes | 0.92 |
| sub_9ED2D0 | 25KB | Opcode switch inside MercConverter; dispatches per-opcode legalization | 0.90 |
| sub_91C840 | -- | Memory space classifier; maps PTX-level space identifiers (0--23) to internal category numbers | 0.98 |
| sub_A9CDE0 | -- | Hot/cold memory classifier (hot path); partitions instructions by memory category for scheduling | 0.85 |
| sub_A9CF90 | -- | Hot/cold memory classifier (cold path); complement of sub_A9CDE0 | 0.85 |
| sub_A60B60 | 24KB | Register stat collector; enumerates ~25 register sub-classes (R, P, B, UR, UP, UB, Tensor/Acc, etc.) | 0.85 |
| sub_A55D80 | -- | Register allocator verifier; classifies 10 operand-kind problem categories for regalloc validation | 0.95 |
| sub_40848E | -- | Operand extended-flag checker; tests (word1 & 0xFE000000) != 0 across all operands | 0.85 |
| sub_405769 | -- | Operand flag tester; tests 0x1000000 and 0x6000000 combinations in operand word 1 | 0.85 |
| sub_404AD0 | -- | Peephole guard; verifies (word1 & 0xFE000000) == 0 before allowing peephole transforms | 0.85 |
| sub_B28E10 | -- | isRegOperand; ISel pattern matcher operand predicate | 0.90 |
| sub_B28E20 | -- | isPredOperand; ISel pattern matcher operand predicate | 0.90 |
| sub_B28E40 | -- | isImmOperand; ISel pattern matcher operand predicate | 0.90 |
| sub_B28E80 | -- | isConstOperand; ISel pattern matcher operand predicate | 0.90 |
| sub_B28E90 | -- | isUReg; ISel pattern matcher operand predicate | 0.90 |
| sub_B28E00 | -- | getRegClass; returns register class (1023 = wildcard, 1 = GPR) | 0.90 |