Data Structure Layouts

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

This page documents the key internal data structures in ptxas v13.0.88: the compilation context ("god object"), the Ori Code Object, symbol tables, constant/shared memory descriptors, the pool allocator's object model, and the generic container types (hash maps, linked lists, growable arrays) that underpin nearly every subsystem.

All offsets are byte offsets from the structure base unless otherwise noted. Types are inferred from decompiled access patterns. Field names are reverse-engineered -- the binary is stripped.

Compilation Context (the "God Object")

The compilation context is the central state object passed to every phase in the pipeline. It is not the Code Object (which is per-function); it is the per-compilation-unit container that owns the Code Object, the knob system, the output stream, the function list, and all per-pass configuration. The sub_7FBB70 (PerKernelEntry) function receives this as a1, and every phase's execute() receives it as the second argument.

The context is a polymorphic C++ object with a vtable at offset +0. It is allocated by the compilation driver and persists for the lifetime of a single compilation unit. Key observations:

  • The vtable at +0 provides 263+ virtual methods (vtable spans to offset 2104+)
  • The object is at least 1936 bytes: the highest confirmed field access is the 8-byte pointer at +1928 (codegen_ctx)
  • The knob/options system is accessed through an indirection at +1664 (pointer to knob container object)
  • The output stream lives at +1440

Compilation Context Field Map

Offset  Type     Field                        Evidence
+0      vtable*  vtable                       *(_QWORD *)a1 in every virtual dispatch
+8      ptr      parent / driver_ctx          Back-pointer; sub_A3A7E0 reads v2 = *(a1+8) then v2[198] for Code Object
+80     u32      last_exit_code               sub_663C30: *(a1+80) = v2[462]
+96     u32      compile_unit_index           sub_663C30: *(a1+96) = 1 on first call
+139    u8       multi_function_flag          sub_663C30: if (!*(a1+139))
+144    ptr      name_table (via vtable+144)  sub_7FBB70: *(*a1 + 144) -> name lookup vtable
+296    ptr      current_function             sub_7FBB70: *(*(a1+296) + 164) = function index
+368    ptr      function_name_array          sub_7FBB70: *(a1+368 + 8*func_id) -> name object
+1144   ptr      function_list_head           sub_663C30: linked list of function descriptors
+1160   ptr      entry_list_head              sub_663C30: linked list of kernel entry descriptors
+1376   u32      scheduling_mode_flags        Bit 0x08 = forward, bit 0x10 = bidirectional
+1412   i8       compilation_flags_byte       sub_A3B080: *(char*)(a2+1412) < 0
+1416   u8       output_detail_flags          sub_7FBB70: *(a1+1416) |= 0x80; bits 4-5 control latency reporting mode
+1418   u8       codegen_mode_flags           sub_A3B080: *(a2+1418) & 4
+1428   i32      function_index               sub_7FBB70: *(a1+1428) < 0 means first invocation
+1440   stream*  output_stream                sub_7FBB70: sub_7FE930(a1+1440, "\nFunction name: ")
+1552   i32      pipeline_progress            Pipeline progress counter (0--21), monotonically increases; see known values
+1560   ptr      timing_records               Growable array of 32-byte timing entries
+1576   u32      timing_count                 sub_C62720: cu->timing_count++
+1584   ptr      sm_backend                   SM-specific architecture backend object (polymorphic, 1712--1992B depending on SM target); provides vtable dispatch for legalization, optimization, scheduling, and codegen; see note below
+1664   ptr      knob_container               sub_7FB6C0, sub_A3B080: options/knob dispatch object
+1864   ptr      bb_structure                 sub_7FB6C0: destroyed via sub_77F880
+1872   ptr      per_func_data                sub_7FB6C0: destroyed via sub_7937D0
+1880   ptr      function_context             sub_7FB6C0: 17 analysis-result pairs at qword offsets
+1928   ptr      codegen_ctx                  Confirmed in overview.md Code Object table

SM Backend Object at +1584

The pointer at context+0x630 (decimal 1584) is the single most confusing field in the compilation context, because it serves multiple roles through a single polymorphic C++ object. Different wiki pages historically called it different names depending on which role they observed:

  • Legalization pages see it dispatching MidExpansion, LateExpansionUnsupportedOps, etc., and call it "SM backend" or "arch_backend"
  • Scheduling pages see it providing hardware latency profiles at *(sm_backend+372) and call it "scheduler context" or "hw_profile"
  • Optimization pages see it dispatching GvnCse (vtable[23]) and OriReassociateAndCommon (vtable[44]) and call it "optimizer state" or "function manager"
  • Codegen/template pages see it holding register file capacity at +372 and hardware capability flags at +1037

It is one object. The canonical name is sm_backend. It is constructed per-compilation-unit in sub_662920 with a switch on SM version bits (v3 >> 12). Each SM generation gets a different-sized allocation and a different vtable:

SM Case  Size    Base Constructor  Vtable       SM Generations
3        1712B   sub_A99A30        off_2029DD0  sm_30 (Kepler)
4        1712B   sub_A99A30        off_21B4A50  sm_50 (Maxwell)
5        1888B   sub_A99A30        off_22B2A58  sm_60 (Pascal)
6        1912B   sub_A99A30        off_21D82B0  sm_70 (Volta)
7        1928B   sub_ACDE20        off_21B2D30  sm_80 (Ampere)
8        1992B   sub_662220        off_21C0C68  sm_89 (Ada)
9        1992B   sub_662220        off_21D6860  sm_90+ (Hopper/Blackwell)

Key sub-fields on the SM backend:

  • +372 (i32): codegen factory value / encoded SM architecture version (e.g., 28673 = sm_80)
  • +1037 (u8): hardware capability flags (bit 0 = has high-precision FP64 MUFU seeds)
  • Vtable slots provide architecture-specific dispatch for 50+ operations
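
The mapping from encoded SM version to switch case and allocation size can be sketched as follows. This is an illustrative reconstruction rather than decompiled code: the names sm_case, sm_backend_size, and sm_backend_alloc_size are hypothetical, and the sizes are copied from the table above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Allocation size per switch case, copied from the table above.
   Index = encoded SM version >> 12; 0 means no case is documented. */
static const size_t sm_backend_size[10] = {
    0, 0, 0, 1712, 1712, 1888, 1912, 1928, 1992, 1992
};

/* Hypothetical helper mirroring the `v3 >> 12` switch selector. */
int sm_case(uint32_t encoded_sm) {
    return (int)(encoded_sm >> 12);
}

/* Returns the backend allocation size, or 0 for an undocumented case. */
size_t sm_backend_alloc_size(uint32_t encoded_sm) {
    int c = sm_case(encoded_sm);
    assert(c >= 0);
    return (c < 10) ? sm_backend_size[c] : 0;
}
```

With the example value 28673 (0x7001) from the sub-field list, the selector is 7 and the allocation is 1928 bytes.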

Pipeline Progress Counter at +1552

The field at context+1552 is a monotonically increasing int32 that tracks how far the compilation has progressed through the 159-phase pipeline. It is not a legalization-only counter -- it is incremented by phases across all categories (legalization, optimization, scheduling, regalloc). Each increment is performed by a small thunk function whose sole body is *(ctx + 1552) = N.

Known values and their associated phases:

Value  Thunk Address            Phase / Context
0      (init)                   sub_7F7DC0 -- compilation context constructor
1      sub_C5F620               Early pipeline (before ConvertUnsupportedOps)
2      sub_C5F5A0               After ConvertUnsupportedOps (phase 5)
3      sub_C5EF80               After MidExpansion (phase 45)
4      sub_C5EF30               After OriDoRematEarly (phase 54) -- signals remat mode active
5      sub_1233D70              Mid-pipeline scheduling/ISel context
7      sub_6612E0 / sub_C60AA0  After LateExpansion (phase 55)
8      sub_849C60               Post-optimization context
9      sub_C5EB80               After OriBackCopyPropagate (phase 83)
10     sub_88E9D0               Late optimization
11     sub_C5EA80               After SetAfterLegalization (phase 95) region
12     sub_C5E980               Post-legalization
13     sub_13B5C80              ISel/scheduling
14     sub_C5E830               Post-scheduling
15     sub_C5E7C0               Register allocation phase
16     sub_C5E6E0               Post-regalloc
17     sub_C5E5A0               Mercury/codegen
18     sub_C5E4D0               Post-Mercury
19     sub_C5E440               Late codegen
20     sub_C5E390               Post-RA cleanup
21     sub_C5E0B0               Final pipeline stage

Downstream passes test *(ctx+1552) > N to gate behavior that should only run after a certain pipeline point. For example, the rematerialization cross-block pass checks *(ctx+1552) > 4 to enable its second-pass mode.
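
A minimal sketch of the store-and-gate pattern, assuming the context is treated as a raw byte buffer. The function names are hypothetical; only the offset (+1552), the value range (0--21), and the > 4 test come from the text above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { PROGRESS_OFFSET = 1552 };   /* context+1552, per the field map */

/* Hypothetical stand-in for a progress thunk: a single store of N. */
void set_pipeline_progress(uint8_t *ctx, int32_t n) {
    assert(n >= 0 && n <= 21);     /* documented value range */
    memcpy(ctx + PROGRESS_OFFSET, &n, sizeof n);
}

/* Hypothetical reader used by the gating checks. */
int32_t get_pipeline_progress(const uint8_t *ctx) {
    int32_t n;
    memcpy(&n, ctx + PROGRESS_OFFSET, sizeof n);
    return n;
}

/* Gate described for the cross-block rematerialization pass:
   second-pass mode is enabled once progress exceeds 4. */
int remat_second_pass_enabled(const uint8_t *ctx) {
    return get_pipeline_progress(ctx) > 4;
}
```

Because the comparison is strict, the gate stays closed at value 4 itself and opens at the next recorded value.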

Knob Container Access Pattern

The knob container at +1664 is accessed through a two-level virtual dispatch pattern that appears at 100+ call sites:

// Fast path: known vtable -> direct array read
_QWORD *v2 = *(_QWORD **)(ctx + 1664);                               // knob container
bool (*query)(__int64, int) = *(bool (**)(__int64, int))(*v2 + 72);  // vtable slot at +72
if (query == sub_6614A0)
    result = *(u8 *)(v2[9] + 72 * knob_index + offset) != 0;         // 72-byte knob slots
else
    result = query((__int64)v2, knob_index);                         // slow path: virtual call

The fast path reads directly from the knob value array at v2[9] (offset +72 of the knob state object), where each knob value occupies 72 bytes. The slow path invokes the virtual method for derived knob containers.

Function Context (at +1880)

When a function is under compilation, +1880 points to a large context object containing 17 pairs of analysis-result data structures. Each pair consists of a sorted container and a hash map, holding results such as live ranges, register maps, and scheduling data. The cleanup code in sub_7FB6C0 destroys pairs at qword offsets [102, 97, 92, 87, 82, 77, 72, 67, 62, 57, 52, 47, 42, 36, 31, 26, 21] from the context base, then handles reference-counted objects at offsets [10] and [2].

Ori Code Object (~1136 bytes)

The Code Object is the per-function container for all IR data. One instance exists for each function under compilation. Constructor is at sub_A3B080, vtable at 0x21EE238.

Constructor Analysis

The constructor (sub_A3B080) takes two arguments: a1 (the Code Object to initialize) and a2 (the compilation context). It:

  1. Sets +8 = a2 (back-pointer to compilation context)
  2. Sets +0 = &unk_21EE238 (vtable)
  3. Zeroes approximately 250 distinct fields across the 1136-byte range
  4. Loads two SSE constants from xmmword_2027600 and xmmword_21EFAE0 into offsets +96 and +112 (likely default register file descriptors or encoding parameters)
  5. Reads a2+1412 and a2+1418 to set mode flags at +1101 and +1008
  6. Accesses the knob container at a2+1664 to query knob 367 for initial configuration
  7. Sets +1008 = 0x300000050 (default) or 0x400000080 (if a2+1418 & 4)

Code Object Field Map

Rows whose offset is written [N] are DWORD indices (byte offset 4*N), the indexing used by the register count formula and the stats emitter; they are not byte offsets and are grouped at the end of the table.

Offset     Type     Field              Evidence / Notes
+0         vtable*  vtable             0x21EE238, 263+ virtual methods
+8         ptr      compilation_ctx    Back-pointer to owning compilation context
+16        u128     (zeroed)           SSE zero-store in constructor
+24        u32      sm_version         Encoded SM target (12288=sm30, 20481=sm50, 36865=sm90)
+32        u128     (zeroed)           SSE zero-store
+48        u128     (zeroed)           SSE zero-store
+64        u32      init_flags         Zeroed in constructor
+72        ptr      code_buf           Output code buffer
+80        u128     (zeroed)
+88        ptr      reg_file           Register descriptor array: *(ctx+88) + 8*regId
+96        u128     reg_defaults_1     Loaded from xmmword_2027600
+112       u128     reg_defaults_2     Loaded from xmmword_21EFAE0
+128--175  u128[3]  (zeroed)           SSE zero-stores
+152       ptr      sym_table          Symbol/constant lookup array
+176       ptr      (zeroed)
+184       u32      (zeroed)
+192       ptr      (zeroed)
+200       u128     (zeroed)
+216       u128     (zeroed)
+232       u32      (zeroed)
+236       u32      (zeroed)
+240       ptr      (zeroed)
+248       u128     (zeroed)
+264       u128     (zeroed)
+272       ptr      instr_head         Instruction linked-list head
+280       u32      (zeroed)
+288       ptr      (zeroed)
+296       ptr      bb_array           Basic block array pointer (40 bytes per entry)
+304       u32      bb_index           Current basic block count
+312       ptr      options            OptionsManager* for knob queries
+320--359  u128[3]  (zeroed)
+360       ptr      (zeroed)
+368       u32      sub_block_flags
+372       u32      instr_total        Total instruction count (triggers chunked scheduling at > 0x3FFF)
+376       u32      (zeroed)
+384--416  ptr[5]   (zeroed)
+424       u32      (zeroed)
+432       ptr      (zeroed)
+440       u32      (zeroed)
+448       ptr      (zeroed)
+464       ptr      (zeroed)
+472       u8       (zeroed)
+473       u8       (zeroed)
+536       u32      (zeroed)
+540       u32      (zeroed)
+648       ptr      succ_map           CFG successor edge hash table
+680       ptr      backedge_map       CFG backedge hash table
+720       ptr      rpo_array          Reverse post-order array (int*)
+728       ptr      bitmask_array      Grow-on-demand bitmask array for scheduling
+740       u32      bitmask_capacity   Capacity of bitmask array
+752       ptr      (zeroed)
+760       u32      (zeroed)
+764       u32      (zeroed)
+768       ptr      const_sections     Constant memory section array
+772       u8       (zeroed)
+776       ptr      smem_sections      Shared memory section array
+976       ptr      block_info         Block info array (40 bytes per entry, contiguous)
+984       i32      num_blocks         Number of basic blocks
+996       u32      annotation_offset  Current offset into annotation buffer (sub_A4B8F0)
+1000      ptr      annotation_buffer  Annotation data buffer (sub_A4B8F0)
+1008      u64      encoding_params    Default 0x300000050 or 0x400000080
+1016      ptr      (zeroed)
+1024      u32      (zeroed)
+1032      ptr      (zeroed)
+1040      ptr      (zeroed)
+1064      ptr      (zeroed)
+1080      u128     (zeroed)
+1096      u32      (zeroed)
+1100      u8       (zeroed)
+1101      u8       optimization_mode  Set from knob 367 and compilation_ctx+1412
+1102      u8       (zeroed)
+1104      ptr      (zeroed)
+1120      u128     (zeroed)
[99]       u32      ur_count           Uniform register (UR) count
[102]      u32      r_alloc            R-register allocated count
[159]      u32      r_reserved         R-register reserved count
[335]      u32      instr_hi           Instruction count upper bound
[336]      u32      tex_inst_count     Texture instruction count (stats emitter)
[338]      u32      fp16_vect_inst     FP16 vectorized instruction count
[340]      u32      inst_pairs         Instruction pair count
[341]      u32      instr_lo           Instruction count lower bound
[342]      u32      tepid_inst         Tepid instruction count

Register Count Formula

From the stats emitter at sub_A3A7E0 and the register count function at sub_A4B8F0 (which both use vtable+2104 dispatch with sub_859FC0 as the fast path):

total_R_regs      = code_obj[159] + code_obj[102]   // reserved + allocated
instruction_count = code_obj[335] - code_obj[341]   // upper - lower
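
The two formulas can be written out as a small sketch over a DWORD-indexed record. The function and enum names are hypothetical; the indices (102, 159, 335, 341) and the index * 4 byte mapping are the ones stated on this page.

```c
#include <assert.h>
#include <stdint.h>

/* DWORD indices taken from the formulas above. */
enum {
    IDX_R_ALLOC    = 102,
    IDX_R_RESERVED = 159,
    IDX_INSTR_HI   = 335,
    IDX_INSTR_LO   = 341
};

/* `rec` is the record read with 4-byte indexing. */
uint32_t total_r_regs(const uint32_t *rec) {
    return rec[IDX_R_RESERVED] + rec[IDX_R_ALLOC];   /* reserved + allocated */
}

uint32_t instruction_count(const uint32_t *rec) {
    assert(rec[IDX_INSTR_HI] >= rec[IDX_INSTR_LO]);  /* upper bound >= lower bound */
    return rec[IDX_INSTR_HI] - rec[IDX_INSTR_LO];    /* upper - lower */
}

/* A DWORD index corresponds to byte offset index * 4. */
uint32_t dword_byte_offset(uint32_t index) {
    return index * 4u;
}
```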

Stats Emitter Field Map

The stats emitter (sub_A3A7E0) reaches a per-function stats record through the SM backend. It loads v2 = *(a1+8), then v3 = v2[198]; qword index 198 is byte offset +1584 in the outer compilation context, which points to the SM backend object, and the emitter reads the per-function stats fields within it. Fields are addressed by DWORD index (4 bytes per slot), which reveals these additional fields:

DWORD Index  Byte Offset  Field               Stat String
8            +32          est_latency         [est latency = %d]
10           +40          worst_case_lat      [worstcaseLat=%f]
11           +44          avg_case_lat        [avgcaseLat=%f]
12           +48          spill_bytes         [LSpillB=%d]
13           +52          refill_bytes        [LRefillB=%d]
14           +56          s_refill_bytes      [SRefillB=%d]
15           +60          s_spill_bytes       [SSpillB=%d]
16           +64          low_lmem_spill      [LowLmemSpillSize=%d]
17           +68          frame_lmem_spill    [FrameLmemSpillSize=%d]
18           +72          non_spill_bytes     [LNonSpillB=%d]
19           +76          non_refill_bytes    [LNonRefillB=%d]
20           +80          non_spill_size      [NonSpillSize=%d]
26           +104         occupancy (float)   [Occupancy = %f]
27           +108         div_branches        [est numDivergentBranches=%d]
28           +112         attr_mem_usage      [attributeMemUsage=%d]
29           +116         program_size        [programSize=%d]
42           +168         precise_inst        [Precise inst=%d]
44           +176         udp_inst            [UDP inst=%d]
45           +180         vec_to_ur           [numVecToURConverts inst=%d]
49           +196         max_live_suspend    [maxNumLiveValuesAtSuspend=%d]
87           +348         partial_unroll      [partially unrolled loops=%d]
88           +352         non_unrolled        [non-unrolled loops=%d]
89           +356         cb_bound_tex        [CB-Bound Tex=%d]
90           +360         partial_bound_tex   [Partially Bound Tex=%d]
91           +364         bindless_tex        [Bindless Tex=%d]
92           +368         ur_bound_tex        [UR-Bound Tex=%d]
93           +372         sm_version_check    > 24575 triggers UR reporting
99           +396         ur_count_stats      [urregs=%d]
102          +408         r_alloc             R-register allocated count
159          +636         r_reserved          R-register reserved count
303          +1212        est_fp              [est fp=%d]
306          +1224        est_half            [est half=%d]
307          +1228        est_transcendental  [est trancedental=%d]
308          +1232        est_ipa             [est ipa=%d]
310          +1240        est_shared          [est shared=%d]
311          +1244        est_control_flow    [est controlFlow=%d]
315          +1260        est_load_store      [est loadStore=%d]
316          +1264        est_tex             [est tex=%d]
334          +1336        inst_pairs          [instPairs=%d]
335          +1340        instr_hi            Instruction count upper bound
336          +1344        tex_inst_count      [texInst=%d]
337          +1348        fp16_inst           [FP16 inst=%d]
338          +1352        fp16_vect_inst      [FP16 VectInst=%d]
339          +1356        inst_hint           [instHint=%d]
340          +1360        inst_pairs_2        checked for non-zero to print instHint line
341          +1364        instr_lo            Instruction count lower bound
342          +1368        tepid_inst          [tepid=%d]

Note: The stats emitter accesses the Code Object through a float pointer (v3), so a DWORD index maps to a byte offset via index * 4 for integer and float fields alike. Float fields at indices 9, 26, 50, 54, 57, 58, 59, 61, 62, 65, 84, 85, 86 hold throughput and occupancy metrics. A linked list at qword index 55 (byte +440) holds additional string annotations.

Basic Block Entry (40 bytes)

Basic blocks are stored in a contiguous array at Code Object +976, with count at +984.

BasicBlock (40 bytes)
  +0    ptr      instr_head     // first instruction in this BB
  +8    ptr      instr_tail     // last instruction (or list link)
  +16   ptr      (reserved)
  +24   u32      (reserved)
  +28   i32      bix            // block index (unique ID for CFG ops)
  +32   u64      flags          // scheduling/analysis flags

The scheduling pass (sub_8D0640) initializes per-block scheduling state by iterating the block list and zeroing qword offsets [7], [13], [19], and setting [21] = -1 on each block.

Instruction Layout

Instructions are polymorphic C++ objects linked into per-BB doubly-linked lists. The instruction format is detailed in Instructions; this section covers only the structural linkage.

Each instruction carries a unique integer ID at +16, an opcode at +72 (the peephole optimizer masks with & 0xCF on byte 1 to strip modifier bits), and a packed operand array starting at +84. The operand count is at +80. Operands are 8 bytes each.

Packed Operand Format

 31  30      28  27                      20  19                  0
+---+----------+--------------------------+---------------------+
| ? | type (3) |  modifier bits (8 bits)  |  index (20 bits)    |
+---+----------+--------------------------+---------------------+
(bit 31 lies outside the & 7 type mask used below; its meaning is not established here)

Extraction (50+ confirmed sites):
  uint32_t operand = *(uint32_t*)(instr + 84 + 8 * i);
  int type    = (operand >> 28) & 7;     // bits 28-30
  int index   = operand & 0xFFFFF;       // bits 0-19
  int mods    = (operand >> 20) & 0xFF;  // bits 20-27

Type Value  Meaning                  Resolution
1           Register operand         Index into *(code_obj+88) register file
5           Symbol/constant operand  Index into *(code_obj+152) symbol table
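
The extraction above can be packaged as a self-contained decoder. This is a sketch: the Operand struct and the function names are hypothetical, while the masks, shifts, base offset (+84), and 8-byte stride are the ones documented in this section.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical decoded view of one packed operand word. */
typedef struct {
    unsigned type;    /* bits 28-30 */
    unsigned mods;    /* bits 20-27 */
    unsigned index;   /* bits 0-19  */
} Operand;

enum { OPERAND_BASE = 84, OPERAND_STRIDE = 8 };

Operand decode_operand(uint32_t word) {
    Operand op;
    op.type  = (word >> 28) & 7;
    op.mods  = (word >> 20) & 0xFF;
    op.index = word & 0xFFFFF;
    return op;
}

/* Reads the first 32-bit word of operand i from a raw instruction buffer. */
Operand read_operand(const uint8_t *instr, int i) {
    uint32_t word;
    assert(i >= 0);
    memcpy(&word, instr + OPERAND_BASE + OPERAND_STRIDE * i, sizeof word);
    return decode_operand(word);
}
```

A word of 0x10000005 decodes to type 1 (register operand) with index 5, which would then be resolved through the register file at *(code_obj+88).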

The operand classifier functions at 0xB28E00--0xB28E90 provide predicate checks:

Function    Predicate
sub_B28E00  getRegClass (1023 = wildcard, 1 = GPR)
sub_B28E10  isRegOperand
sub_B28E20  isPredOperand
sub_B28E40  isImmOperand
sub_B28E80  isConstOperand
sub_B28E90  isUReg

Symbol Table

The symbol table is accessed through Code Object +152. Based on the symbol table builder at sub_621480 (21KB, references a1+30016 for the symbol table base), symbols are stored in a hash-map-backed structure where each symbol has a name and associated properties (address, type, section binding).

Internal Symbol Names

The following internal symbol names appear in decompiled code, indicating the kinds of entities tracked:

Symbol                Purpose
__ocg_const           OCG-generated constant data
__shared_scratch      Shared memory scratch space
__funcAddrTab_g       Global indirect function call table
__funcAddrTab_c       Constant indirect function call table
_global_ptr_%s        Global pointer for named variable
$funcID$name          Function-local relocation symbol
__cuda_dummy_entry__  Dummy entry generated by --compile-only
__cuda_sanitizer      CUDA sanitizer instrumentation symbol

Symbol Resolution Flow

Symbol resolution (sub_625800, 27KB) traverses the symbol table to resolve references during the PTX-to-Ori lowering and subsequent optimization phases. The format %s[%d] (from sub_6200A0) is used for array-subscripted symbol references, and __$endLabel$__%s markers delimit function boundaries.

Constant Buffer Layout

Constant memory is organized into banks (c[0], c[1], ...) corresponding to the CUDA .nv.constant0, .nv.constant2, etc. ELF sections. The constant section array at Code Object +768 tracks all constant banks for the current function.

Constant Bank Handling

The constant bank handler at sub_6BC560 (4.9KB) manages references to constant memory using the c[%d] (integer bank) and c[%s] (named bank, sw-compiler-bank) notation. It enforces:

  • A maximum constant register count (error: "Constant register limit exceeded; more than %d constant registers")
  • LDC (Load Constant) requires a constant or immediate bank number

ELF Constant Symbols

The ELF symbol emitter (sub_7FD6C0) creates symbols for constant bank metadata:

Symbol Name          Purpose
.nv.ptx.const0.size  Size of constant bank 0 (kernel parameters)

The constant emission function (sub_7D14C0, 5.6KB) iterates the constant section array and copies bank data into the output ELF sections.

Shared Memory Layout

Shared memory (.nv.shared) allocations are tracked through the shared memory section array at Code Object +776. Reserved shared memory regions are managed by sub_6294E0 (12.1KB) and sub_629E40 (6.1KB).

Reserved Shared Memory Symbols

The ELF emitter recognizes these special symbols for shared memory layout:

Symbol Name               Purpose
.nv.reservedSmem.begin    Start of reserved shared memory region
.nv.reservedSmem.cap      Capacity of reserved shared memory
.nv.reservedSmem.end      End of reserved shared memory region
.nv.reservedSmem.offset0  First reserved offset within shared memory
.nv.reservedSmem.offset1  Second reserved offset within shared memory

The --disable-smem-reservation CLI option disables the reservation mechanism. Shared memory intrinsic lowering (sub_6C4DA0, 15KB) validates that shared memory operations use types {b32, b64}.

Descriptor Size Symbols

Additional ELF symbols track texture/surface descriptor sizes in shared memory:

Symbol Name                         Purpose
.nv.unified.texrefDescSize          Unified texture reference descriptor size
.nv.independent.texrefDescSize      Independent texture reference descriptor size
.nv.independent.samplerrefDescSize  Independent sampler reference descriptor size
.nv.surfrefDescSize                 Surface reference descriptor size

Pool Allocator

The pool allocator (sub_424070, 3,809 callers) is the single most heavily used allocation function. Every dynamic data structure in ptxas is allocated through pools.

Pool Object Layout

Offset  Type    Field                     Notes
+0      ptr     large_block_list          Singly-linked list of large (>4999 byte) blocks
+32     u32     min_slab_size             Minimum slab allocation size
+44     u32     slab_count                Number of slabs allocated
+48     ptr     large_free_list           Free list for large blocks (boundary-tag managed)
+56     u32     fragmentation_count       Fragmentation counter (decremented on split)
+60     u32     max_order                 Maximum power-of-2 order for large blocks
+64     ...     (large block free lists)  a1 + 32*(order+2) = per-order free list head
+2112   ptr     tracking_map              Hash map for allocation metadata tracking
+2128   ptr[N]  small_free_lists          Size-binned free lists: *(pool + 8*(size>>3) + 2128) = head
+7128   mutex*  pool_mutex                pthread_mutex_t* for thread safety

Allocation Paths

Small path (size <= 4999 bytes = 0x1387):

  1. Round size up to 8-byte alignment: aligned = (size + 7) & ~7
  2. Minimum 16 bytes
  3. Compute bin: bin = pool + 8 * (aligned >> 3) + 2128
  4. If bin has a free block: pop from free list, decrement slab available bytes
  5. If bin is empty: allocate a new slab from the parent (size = aligned * ceil(min_slab_size / aligned)), carve into free-list nodes

Large path (size > 4999 bytes):

  1. Add 32 bytes for boundary tags
  2. Search power-of-2 order free lists starting from log2(size+32)
  3. If found: split block if remainder > 39 bytes, return payload
  4. If not found: call sub_423B60 to grow the pool, allocate new slab from parent
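
The size arithmetic in the two paths can be sketched as follows. All function names are hypothetical, and two points are assumptions rather than findings: the starting order is taken as floor(log2(size + 32)), and the slab-size helper uses integer ceiling division. The thresholds and offsets (4999, 16, 2128, 32) come from the steps above.

```c
#include <assert.h>
#include <stddef.h>

enum { SMALL_MAX = 4999, MIN_ALLOC = 16, BIN_BASE = 2128, TAG_BYTES = 32 };

/* Small path applies to requests of at most 4999 bytes. */
int is_small_request(size_t size) {
    return size <= SMALL_MAX;
}

/* Steps 1-2: round up to 8 bytes, with a 16-byte minimum. */
size_t small_aligned_size(size_t size) {
    size_t aligned = (size + 7) & ~(size_t)7;
    return aligned < MIN_ALLOC ? MIN_ALLOC : aligned;
}

/* Step 3: byte offset of the bin's free-list head inside the pool object. */
size_t small_bin_offset(size_t size) {
    assert(is_small_request(size));
    return 8 * (small_aligned_size(size) >> 3) + BIN_BASE;
}

/* Step 5: new slab size = aligned * ceil(min_slab_size / aligned). */
size_t slab_size(size_t aligned, size_t min_slab_size) {
    assert(aligned > 0);
    return aligned * ((min_slab_size + aligned - 1) / aligned);
}

/* Large path: first order searched, assuming floor(log2(size + 32)). */
unsigned large_start_order(size_t size) {
    size_t total = size + TAG_BYTES;
    unsigned order = 0;
    while (total >>= 1)
        order++;
    return order;
}
```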

Boundary Tag Format (Large Blocks)

Large blocks use boundary tags for coalescing on free:

Block Header (32 bytes):
  +0    i64      sentinel      // -1 = allocated, else -> next free
  +8    ptr      prev_free     // previous in free list (or 0)
  +16   u64      tag_offset    // 32 (header size)
  +24   u64      payload_size  // user-requested allocation size

Block Footer (32 bytes at end):
  +0    i64      sentinel
  +8    ptr      prev_free
  +16   u64      footer_tag    // 32
  +24   u64      block_size    // total size including headers

Slab Descriptor (56 bytes)

Each slab is tracked by a 56-byte descriptor:

Offset  Type  Field
+0      ptr   chain_link
+8      u64   total_size
+16     u64   available_size
+24     ptr   owning_pool
+32     ptr   memory_base
+40     u8    is_small_slab
+44     u32   slab_id
+48     u32   bin_size

Hierarchical Pools

Pools are hierarchical. When sub_424070 is called with a1 = NULL, it falls back to a global allocator (sub_427A10) that uses malloc directly. Non-null a1 values are pool objects that allocate from their own slabs, which are themselves allocated from a parent pool (the TLS context at offset +24 holds the per-thread pool pointer). The top-level pool is named "Top level ptxas memory pool" and is created in the compilation driver.

Hash Map

The hash map (sub_426150 insert / sub_426D60 lookup, 2,800+ and 422+ callers respectively) is the primary associative container in ptxas.

Hash Map Object Layout (~112 bytes)

Offset  Type  Field                  Notes
+0      fptr  hash_func              Custom hash function pointer
+8      fptr  compare_func           Custom compare function pointer
+16     fptr  hash_func_2            Secondary hash (or NULL)
+24     fptr  compare_func_2         Secondary compare (or NULL)
+32     u32   has_custom_compare     Flag
+40     u64   bucket_mask            capacity - 1 for power-of-2 masking
+48     u64   entry_count            Number of stored entries
+64     u64   load_factor_threshold  Resize when entry_count exceeds this
+72     u32   first_free_slot        Tracking for bitmap-based slot allocation
+76     u32   entries_capacity       Capacity of entries array
+80     u32   bitmap_capacity        Capacity of used-bits bitmap
+84     u32   flags                  Hash mode in bits 4-7
+88     ptr   entries                Array of 16-byte {key, value} pairs
+96     ptr   used_bitmap            Bitmap tracking occupied slots
+104    ptr   buckets                Array of pointers to chained index lists

Hash Modes

The hash mode is encoded in bits 4-7 of the flags field at offset +84:

Mode  Flag Bits  Hash Function                                   Use Case
0     0x00       Custom (+0 function pointer)                    User-defined hash/compare
1     0x10       Pointer hash: (key>>11) ^ (key>>8) ^ (key>>5)   Pointer-keyed maps
2     0x20       Identity: key used directly                     Integer-keyed maps

Mode selection happens automatically in the constructor (sub_425CA0): if the hash/compare pair matches (sub_427750, sub_427760), mode 2 is set; if (sub_4277F0, sub_427810), mode 1.
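
How the mode bits select a hash can be sketched as below. The hash_fn signature and the function names are hypothetical (the real custom hash is called through the pointer at +0 with an unknown signature); the mode encoding and the two built-in hashes are the ones in the table above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical signature for the custom hash stored at +0. */
typedef uint64_t (*hash_fn)(uint64_t key);

/* Hash mode lives in bits 4-7 of the flags field at +84. */
unsigned hash_mode(uint32_t flags) {
    return (flags >> 4) & 0xF;
}

uint64_t hash_key(uint32_t flags, hash_fn custom, uint64_t key) {
    switch (hash_mode(flags)) {
    case 1:  /* pointer hash */
        return (key >> 11) ^ (key >> 8) ^ (key >> 5);
    case 2:  /* identity */
        return key;
    default: /* mode 0: user-supplied function */
        assert(custom != NULL);
        return custom(key);
    }
}

/* bucket_mask is capacity - 1, so masking selects the bucket. */
uint64_t bucket_index(uint64_t hash, uint64_t bucket_mask) {
    return hash & bucket_mask;
}
```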

Lookup Algorithm

// Mode 1 (pointer hash) example:
uint64_t hash = (key >> 11) ^ (key >> 8) ^ (key >> 5);
uint64_t bucket_idx = hash & map->bucket_mask;
int32_t* chain = map->buckets[bucket_idx];
while (*++chain != -1) {                                           // pre-increment skips chain[0]; -1 terminates
    entry_t* e = (entry_t*)((char*)map->entries + 16 * (*chain));  // byte arithmetic, 16-byte entries
    if (key == e->key)
        return e->value;  // found
}
return 0;  // not found

Growth policy: the map doubles capacity and rehashes when entry_count > load_factor_threshold.

String-Keyed Maps

String-keyed maps use MurmurHash3 (sub_427630, 73 callers) as the hash function. The implementation uses the standard MurmurHash3_x86_32 constants:

Constant  Value                     Standard Name
c1        0xCC9E2D51 (-862048943)   MurmurHash3 c1
c2        0x1B873593 (461845907)    MurmurHash3 c2
fmix1     0x85EBCA6B (-2048144789)  MurmurHash3 fmix
fmix2     0xC2B2AE35 (-1028477387)  MurmurHash3 fmix
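
For reference, here is the public MurmurHash3_x86_32 algorithm that these constants belong to. This is the standard algorithm, not a transcription of sub_427630; the seed parameter and the byte-wise little-endian block read are assumptions about how the binary uses it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t rotl32(uint32_t x, int r) {
    return (x << r) | (x >> (32 - r));
}

/* Standard MurmurHash3_x86_32; `data` must be non-NULL. */
uint32_t murmur3_x86_32(const void *data, size_t len, uint32_t seed) {
    const uint8_t *p = (const uint8_t *)data;
    const uint32_t c1 = 0xCC9E2D51u, c2 = 0x1B873593u;
    uint32_t h = seed;
    size_t nblocks = len / 4;
    assert(data != NULL);

    /* Body: one 4-byte little-endian block at a time. */
    for (size_t i = 0; i < nblocks; i++) {
        uint32_t k = (uint32_t)p[4 * i]
                   | (uint32_t)p[4 * i + 1] << 8
                   | (uint32_t)p[4 * i + 2] << 16
                   | (uint32_t)p[4 * i + 3] << 24;
        k *= c1; k = rotl32(k, 15); k *= c2;
        h ^= k;  h = rotl32(h, 13); h = h * 5 + 0xE6546B64u;
    }

    /* Tail: the remaining 1-3 bytes. */
    const uint8_t *tail = p + 4 * nblocks;
    uint32_t k = 0;
    switch (len & 3) {
    case 3: k ^= (uint32_t)tail[2] << 16; /* fall through */
    case 2: k ^= (uint32_t)tail[1] << 8;  /* fall through */
    case 1: k ^= tail[0];
            k *= c1; k = rotl32(k, 15); k *= c2; h ^= k;
    }

    /* Finalization mix (fmix32) with the two fmix constants. */
    h ^= (uint32_t)len;
    h ^= h >> 16; h *= 0x85EBCA6Bu;
    h ^= h >> 13; h *= 0xC2B2AE35u;
    h ^= h >> 16;
    return h;
}
```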

CFG Hash Map (FNV-1a)

The control flow graph uses a separate hash map implementation based on FNV-1a hashing, distinct from the general-purpose hash map above. Two instances exist per Code Object at offsets +648 (successor edges) and +680 (backedge info).

Parameter     Value
Initial hash  0x811C9DC5 (-2128831035)
Prime         16777619 (0x01000193)
Input         4-byte block index, hashed byte-by-byte

Bucket entry: 24 bytes {head, tail, count}. Node: 64 bytes with chain link, key, values, sub-hash data, and cached hash. See CFG for the full CFG hash map specification.
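
The hashing step can be sketched with the standard 32-bit FNV-1a loop (XOR the byte, then multiply by the prime). The function names are hypothetical, and feeding the block index low byte first is an assumption based on an x86-64 host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Standard 32-bit FNV-1a over a byte range. */
uint32_t fnv1a_32(const uint8_t *bytes, size_t len) {
    uint32_t h = 0x811C9DC5u;          /* initial hash */
    assert(bytes != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        h ^= bytes[i];                 /* xor first ... */
        h *= 16777619u;                /* ... then multiply by the prime */
    }
    return h;
}

/* Hashes a 4-byte block index byte-by-byte, low byte first (assumed order). */
uint32_t hash_block_index(uint32_t bix) {
    uint8_t b[4] = {
        (uint8_t)bix, (uint8_t)(bix >> 8),
        (uint8_t)(bix >> 16), (uint8_t)(bix >> 24)
    };
    return fnv1a_32(b, 4);
}
```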

Linked List

The linked list (sub_42CA60 prepend, 298 callers; sub_42CC30 length, 48 callers) is a singly-linked list of 16-byte nodes:

ListNode (16 bytes, pool-allocated)
  +0    ptr      next        // pointer to next node (NULL = end)
  +8    ptr      data        // pointer to payload object

Prepend allocates a 16-byte node from the pool, sets node->data = payload, and links it at the list head. This is used for function lists, relocation lists, annotation chains, and many intermediate pass-local collections.
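
A minimal sketch of the node layout and the prepend/length operations. The signatures are hypothetical (those of sub_42CA60 and sub_42CC30 are not established here), and malloc stands in for the pool allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* 16 bytes on a 64-bit target: next at +0, data at +8. */
typedef struct ListNode {
    struct ListNode *next;
    void *data;
} ListNode;

/* Prepend: allocate a node, store the payload, link it at the head.
   malloc replaces the pool allocation in this sketch. */
ListNode *list_prepend(ListNode *head, void *payload) {
    ListNode *node = malloc(sizeof *node);
    assert(node != NULL);
    node->data = payload;
    node->next = head;
    return node;   /* new list head */
}

/* Length: walk until the NULL terminator. */
size_t list_length(const ListNode *head) {
    size_t n = 0;
    for (; head != NULL; head = head->next)
        n++;
    return n;
}
```

Prepending is O(1) and needs no tail pointer, which fits collections that are built once and then walked.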

Growable Array (Pool Vector)

Growable arrays appear throughout the PhaseManager and elsewhere. The layout is a triple of {data_ptr, count, capacity}:

PoolVector (24 bytes inline, or embedded in parent struct)
  +0    ptr      data         // pointer to element array
  +8    i32      count        // current element count
  +12   i32      capacity     // allocated capacity

Growth strategy (confirmed in the PhaseManager timing records): new_capacity = max(old + old/2 + 1, requested) (1.5x growth factor). Elements are typically 8 bytes (pointers) or 16 bytes (pointer pairs). Reallocation uses sub_424C50 (pool realloc, 27 callers).

The PhaseManager uses this pattern for the phase list (16-byte {phase_ptr, pool_ptr} pairs), the name table (8-byte string pointers), and the timing records (32-byte entries).
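
The growth rule can be written as a one-line helper. The function name is hypothetical; the formula is the one quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* new_capacity = max(old + old/2 + 1, requested) */
int32_t grow_capacity(int32_t old_cap, int32_t requested) {
    assert(old_cap >= 0 && requested >= 0);
    int32_t grown = old_cap + old_cap / 2 + 1;
    return grown > requested ? grown : requested;
}
```

The + 1 term lets an empty vector (capacity 0) grow to 1, and the max() lets a single large request skip the intermediate 1.5x steps.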

Knob Value Array

Knob values are stored in a contiguous array of 72-byte slots, accessed at knob_state[9] + 72 * knob_index (where knob_state[9] is offset +72 of the knob state object).

Knob Value Slot (72 bytes)

Offset  Type  Field
+0      u8    Type tag (0=unset, 1=bool, 2=int, ..., 12=opcode list)
+8      i64   Integer value / pointer to string / linked list head
+16     i64   Secondary value (range max, list count, etc.)
+24     i64   Tertiary value
+64     ptr   Allocator reference

Supported types:

Type             Tag   Storage
Boolean          1     Flag at +0
Integer          2     Value at +8
Integer+extra    3     Value at +8, extra at +12
Integer range    4     Min at +8, max at +16
Integer list     5     Growable array of ints
Float            6     float at +8
Double           7     double at +8
String           8/11  Pointer at +8
When-string      9     Linked list of 24-byte condition+value nodes
Value-pair list  10    Opcode:integer pairs via vtable
Opcode list      12    Opcode names resolved through vtable

Knob Descriptor (64 bytes)

Knob descriptors are stored in a table at knob_state+16, with count at knob_state+24:

Offset  Type  Field
+0      ptr   Primary name (ROT13-encoded)
+8      u64   Primary name length
+16     u32   Type tag
+24     ...   (reserved)
+40     ptr   Alias name (ROT13-encoded)
+48     u64   Alias name length
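
Because names are stored as a pointer plus an explicit length, decoding one is a length-bounded ROT13 pass. The sketch below uses a hypothetical function name and a neutral test string rather than a real knob name.

```c
#include <assert.h>
#include <stddef.h>

/* Decodes `len` ROT13 bytes from `src` into `dst` and NUL-terminates.
   `dst` must hold at least len + 1 bytes. Non-letters pass through. */
void rot13_decode(const char *src, size_t len, char *dst) {
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < len; i++) {
        char c = src[i];
        if (c >= 'a' && c <= 'z')
            c = (char)('a' + (c - 'a' + 13) % 26);
        else if (c >= 'A' && c <= 'Z')
            c = (char)('A' + (c - 'A' + 13) % 26);
        dst[i] = c;
    }
    dst[len] = '\0';
}
```

ROT13 is its own inverse, so the same routine also encodes.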

Stream Object

The output stream used for diagnostics and stats reporting (e.g., at compilation context +1440) is a C++ iostream-like object with operator overloads. Field layout (from sub_7FE5D0 and sub_7FECA0):

Offset  Type     Field
+0      vtable*  vtable (dispatch for actual I/O)
+8      u32      width
+12     u32      precision
+16     u64      char_count
+24     ptr      format_buffer
+56     u32      flags (bit 0=hex, bit 1=oct, bit 2=left-align, bit 3=uppercase, bits 7-8=sign)

ORI Record Serializer (sub_A50650)

The ORI Record Serializer (sub_A50650, 74 KB, 2,728 decompiled lines) is the central function that takes a Code Object's in-memory state and flattens it into a linear output buffer organized as a table of typed section records. It is the serialization backbone for both the DUMPIR diagnostic subsystem and the compilation output path. Despite the _ORI_ string it contains, it is not an optimization pass -- it is infrastructure.

Address      0xA50650
Size         ~74 KB
Identity     CodeObject::EmitRecords
Confidence   0.90
Called from  sub_A53840 (wrapper), sub_AACBF0 / sub_AAD2A0 (DUMPIR diagnostic path)
Calls        sub_A4BC60 (register serializer, new format), sub_A4D3F0 (legacy format), sub_A4B8F0 (register count annotation), sub_A47330 + sub_A474F0 (multi-section finalization), sub_1730890 / sub_17308C0 / sub_17309A0 (scheduling serializers), sub_1730FE0 (register file map)

Parameters

a1 is a serialization state object ("OriRecordContext") that carries the section table, compilation context back-pointer, and per-subsection index/size pairs. a2 is the output buffer write cursor, advanced as data is emitted.

Key fields on a1:

Offset  Type  Field                  Evidence
+8      ptr   compilation_ctx        Dereferenced to reach sm_backend at +1584
+24     i32   header_section_idx     v5 + 32 * (*(a1+24) + 1)
+72     ptr   section_table          Array of 32-byte section entries
+180    u32   instr_counter_1        Reset to 0 at entry
+472    u8    has_debug_info         Gates debug section emission
+916    i32   multi_section_count    > 0 triggers link-record emission and tail call to sub_A47330
+1102   u8    multi_section_enabled  Master flag for multi-section mode
+1120   ptr   scheduling_ctx         Scheduling context for barrier/scope serialization

Section Record Format

Each section occupies a 32-byte entry in the table at *(a1+72) + 32 * section_index:

Offset  Type   Field
+0      u16    type_tag           section type identifier
+4      u32    data_size          byte size of data payload
+8      ptr    data_ptr           pointer to data in output buffer
+16     u32    element_count      number of elements (or auxiliary metadata)
+20     u32    aux_field          additional per-type context
+24     u32    aux_field_2        secondary per-type context

Data payloads are 16-byte aligned: cursor += (size + 15) & ~0xF.
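
The cursor arithmetic can be sketched as two small helpers. The names are hypothetical; the rounding expression is the one quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rounds a payload size up to the next multiple of 16. */
size_t padded_size(size_t size) {
    return (size + 15) & ~(size_t)0xF;
}

/* Advances the write cursor past a payload of `size` bytes. */
uint8_t *advance_cursor(uint8_t *cursor, size_t size) {
    assert(cursor != NULL);
    return cursor + padded_size(size);
}
```

For the 340-byte instruction info snapshot, for example, the cursor moves by 352 bytes. The cursor stays 16-byte aligned only if it starts aligned, since each step adds a multiple of 16.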

Section Type Tag Catalog

The serializer emits up to 56 unique section types across three tag ranges.

Base types (0x01--0x58):

| Tag | Hex | Content | Evidence |
|---|---|---|---|
| 1 | 0x01 | Instruction stream (register-allocated code body) | Emitted via sub_A4BC60 or sub_A4D3F0 |
| 3 | 0x03 | Virtual-dispatch section (vtable+48 on state obj) | Conditional on *(a1+64) > 0 |
| 16 | 0x10 | Source operand bank (v7[199] entries at v7+97) | *(entry+48) = v7[199] |
| 17 | 0x11 | Destination operand bank (bit-packed from v7+203) | Conditional on !v7[1414] |
| 19 | 0x13 | Annotation stream | *(a1+232) counter |
| 34 | 0x22 | Original-definition name table (_ORI_ prefixed) | strcpy(v50, "_ORI_") at line 1762 |
| 35 | 0x23 | Instruction info snapshot (340 bytes from v7+4) | qmemcpy of 340 bytes |
| 46 | 0x2E | Texture/surface binding table | v7[248] entries, 16 bytes each |
| 50 | 0x32 | Live range interval table (spill map) | From compilation context +984 |
| 51 | 0x33 | Register file occupancy table | *(ctx+1424) & 4 |
| 53 | 0x35 | Source operand type bitmap (4-bit per operand) | v7[131] operands, 20-byte stride |
| 54 | 0x36 | Destination operand type bitmap | v7[134] operands, 20-byte stride |
| 55 | 0x37 | Scheduling barrier data | via sub_1730890 |
| 56 | 0x38 | Register file mapping | via sub_1730FE0 |
| 58 | 0x3A | Scheduling dependency graph | via sub_17309A0 |
| 59 | 0x3B | Multi-section link record | Conditional on *(a1+1102) |
| 64 | 0x40 | External reference (from ctx+2120) | Pointer stored, no data copy |
| 68 | 0x44 | Performance counter section | *(a1+932) counter |
| 70 | 0x46 | Spill/fill metadata | v7[408] |
| 71 | 0x47 | Call graph edge table | From v7+61, linked list traversal |
| 73 | 0x49 | Codegen context snapshot | From ctx+932 register allocation state |
| 80 | 0x50 | Hash table section | v7+207/208, hash bucket traversal |
| 81 | 0x51 | Extended call info | From v7+84 |
| 83 | 0x53 | Convergence scope data | via sub_17308C0 |
| 85 | 0x55 | Register geometry record (banks, warps, lanes) | From ctx+1600, writes bank/warp/lane counts |
| 88 | 0x58 | Extended scheduling annotations | Conditional on *(a1+1088) > 0 |
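Several base tags are marked "Conditional on" a state field. The sketch below models only those four gates as a predicate. The enum names, the EmitGates struct and its field widths are hypothetical stand-ins for the raw reads (*(a1+64), v7[1414], *(a1+1102), *(a1+1088)); gating for all other tags is not modelled and defaults to "emit".

```c
#include <assert.h>
#include <stdint.h>

/* Subset of the base tags; names are hypothetical. */
enum SectionTag {
    TAG_INSTR_STREAM       = 0x01,
    TAG_VIRTUAL_DISPATCH   = 0x03,
    TAG_DEST_OPERAND_BANK  = 0x11,
    TAG_MULTI_SECTION_LINK = 0x3B,
    TAG_EXT_SCHED_ANNOT    = 0x58
};

/* Hypothetical view of the state fields that gate emission. */
typedef struct EmitGates {
    int32_t dispatch_count;         /* stands in for *(a1+64)   */
    uint8_t dest_bank_suppressed;   /* stands in for v7[1414]   */
    uint8_t multi_section_enabled;  /* stands in for *(a1+1102) */
    int32_t ext_annot_count;        /* stands in for *(a1+1088) */
} EmitGates;

/* Returns nonzero when the tag's documented condition holds. */
int should_emit(int tag, const EmitGates *g) {
    switch (tag) {
    case TAG_VIRTUAL_DISPATCH:   return g->dispatch_count > 0;
    case TAG_DEST_OPERAND_BANK:  return !g->dest_bank_suppressed;
    case TAG_MULTI_SECTION_LINK: return g->multi_section_enabled != 0;
    case TAG_EXT_SCHED_ANNOT:    return g->ext_annot_count > 0;
    default:                     return 1; /* gating not modelled here */
    }
}
```

Keeping the gates in one switch makes it easy to diff the modelled conditions against the Evidence column when the table is revised.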

Extended types (0x1208--0x1221): Emitted only when *(char*)(ctx+1412) < 0 (that is, bit 7 of that byte is set), the condition that enables the full post-register-allocation diagnostic mode. The 20 tags listed below carry per-register-class live range and operand definition data:

| Tag | Hex | Content |
|---|---|---|
| 4616 | 0x1208 | Extended operand class 0 |
| 4617--4623 | 0x1209--0x120F | Extended operand classes 1--7 |
| 4624 | 0x1210 | Block-level operand summary |
| 4625 | 0x1211 | Live-in vector (12 bytes/element, count at *(a1+668)) |
| 4626 | 0x1212 | Live-out vector (12 bytes/element) |
| 4627 | 0x1213 | Extended operand class 8 |
| 4628--4629 | 0x1214--0x1215 | Extended operand classes 9--10 |
| 4630 | 0x1216 | Memory space descriptor (SM arch > 0x4FFF) |
| 4631 | 0x1217 | Extended scheduling flag (SM arch > 0x4FFF) |
| 4632 | 0x1218 | Instruction hash (ctx+1386 bit 3) |
| 4633 | 0x1219 | Annotation metadata |
| 4640 | 0x1220 | Extended section metadata |
| 4641 | 0x1221 | Optimization level record (from knob system, knob 988) |
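When walking a dumped section table it helps to bucket tags by range and to evaluate the extended-mode gate. The following is a rough sketch: the names TagRange, classify_tag and extended_mode_enabled are hypothetical, the classifier is a plain range check (it does not verify that a tag inside a range is actually assigned), and the gate assumes the byte at ctx+1412 is tested as a signed char, so "< 0" is the same as "bit 7 set".

```c
#include <assert.h>
#include <stdint.h>

typedef enum TagRange {
    TAG_RANGE_BASE,      /* 0x01 .. 0x58     */
    TAG_RANGE_EXTENDED,  /* 0x1208 .. 0x1221 */
    TAG_RANGE_OTHER      /* anything else    */
} TagRange;

/* Coarse range check only; gaps inside a range are not rejected. */
TagRange classify_tag(uint16_t tag) {
    if (tag >= 0x01 && tag <= 0x58)
        return TAG_RANGE_BASE;
    if (tag >= 0x1208 && tag <= 0x1221)
        return TAG_RANGE_EXTENDED;
    return TAG_RANGE_OTHER;
}

/* Equivalent of *(char*)(ctx+1412) < 0 on a two's-complement target:
 * the sign bit of that byte is bit 7. */
int extended_mode_enabled(const unsigned char *ctx) {
    return (ctx[1412] & 0x80) != 0;
}
```

Testing the bit with a mask on an unsigned byte avoids depending on whether plain char is signed on the host compiler.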

The _ORI_ Name Prefix

The _ORI_ string is not a pass name. At line 1762 of the decompiled serializer, the code iterates the linked list at v7+55 (the original-definition chain maintained for rematerialization debugging) and writes one string of the form "_ORI_<original_name>" per entry:

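// Line 1748-1770 (simplified; the per-entry idx update is elided)
for (def = v7->original_defs; def; def = def->next) {
    // 16 * index presumably reflects u16-typed indexing: one 32-byte record per entry
    entry = &section_table[16 * (state->instr_offset + idx)];
    entry->type_tag = 34;      // 0x22: original-definition name
    entry->data_ptr = cursor;
    strcpy(cursor, "_ORI_");           // 5-byte prefix
    strcpy(cursor + 5, def->name);     // name follows the prefix, NUL-terminated
    cursor += align16(strlen(def->name) + 21);   // advance to a 16-byte-aligned slot
}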

These names are consumed by the register allocation verifier (sub_A55D80) when it compares pre- and post-allocation reaching definitions. A mismatch triggers the "REMATERIALIZATION PROBLEM" diagnostic (string at 0xa55dd8), which lists original definitions under their _ORI_ names alongside the post-allocation state.
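A self-contained version of the name-emission loop can be sketched as follows. The OrigDef node layout, the emit_ori_names function and its bounds checks are hypothetical and not present in the binary. The slot size is computed as prefix + name + NUL rounded up to 16 bytes, which is one reading of the "+ 21" in the listing (5 + 1 + 15); that reading is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical node of the original-definition chain. */
typedef struct OrigDef {
    const char *name;
    struct OrigDef *next;
} OrigDef;

/* Writes "_ORI_<name>" for each node into buf, one 16-byte-aligned slot
 * per name. Records each slot's start offset in offsets[] and returns
 * the number of names written. Stops early if buf or offsets[] is full. */
size_t emit_ori_names(const OrigDef *defs, char *buf, size_t cap,
                      size_t *offsets, size_t max_entries) {
    size_t cursor = 0;
    size_t count = 0;
    for (const OrigDef *d = defs; d && count < max_entries; d = d->next) {
        size_t len = strlen(d->name);
        /* (len + 21) & ~0xF == round_up_16(5 + len + 1) */
        size_t slot = (len + 21) & ~(size_t)0xF;
        if (slot > cap - cursor)
            break;                                   /* out of space */
        memcpy(buf + cursor, "_ORI_", 5);            /* prefix       */
        memcpy(buf + cursor + 5, d->name, len + 1);  /* name + NUL   */
        offsets[count++] = cursor;
        cursor += slot;
    }
    return count;
}
```

For a chain holding "r5" and "a_much_longer_name", the first string occupies a 16-byte slot at offset 0 and the second a 32-byte slot at offset 16.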

Wrapper: sub_A53840

sub_A53840 (48 lines) is a thin wrapper that:

  1. Emits a type-44 header record if *(ctx+1600)[1193] is set (scheduling metadata header)
  2. Calls sub_A50650 with the output buffer
  3. Optionally emits a type-62 trailer record if *(ctx+1600)[48] is set

This wrapper is the typical entry point reached through vtable dispatch.
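The three steps above amount to a header/body/trailer bracket around the serializer. The sketch below reproduces only that ordering: WrapperFlags, EventLog and the function names are hypothetical, the two flags stand in for the *(ctx+1600)[1193] and *(ctx+1600)[48] reads, and serialize_body is a stub in place of sub_A50650 that logs a 0 marker instead of emitting records.

```c
#include <assert.h>
#include <stddef.h>

enum { TAG_HEADER = 44, TAG_TRAILER = 62, BODY_MARKER = 0 };

/* Hypothetical stand-ins for the two flag reads through ctx+1600. */
typedef struct WrapperFlags {
    int emit_header;   /* *(ctx+1600)[1193] */
    int emit_trailer;  /* *(ctx+1600)[48]   */
} WrapperFlags;

/* Records the order in which pieces are emitted. */
typedef struct EventLog {
    int events[8];
    size_t count;
} EventLog;

void log_event(EventLog *log, int ev) {
    if (log->count < sizeof log->events / sizeof log->events[0])
        log->events[log->count++] = ev;
}

/* Stub for the real serializer (sub_A50650). */
void serialize_body(EventLog *log) {
    log_event(log, BODY_MARKER);
}

/* Header (optional), body (always), trailer (optional). */
void emit_records_wrapper(const WrapperFlags *flags, EventLog *log) {
    if (flags->emit_header)
        log_event(log, TAG_HEADER);
    serialize_body(log);
    if (flags->emit_trailer)
        log_event(log, TAG_TRAILER);
}
```

With both flags set the log reads 44, 0, 62; with both clear it contains only the body marker.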

Function Map

| Address | Size | Callers | Identity |
|---|---|---|---|
| sub_A3B080 | ~700 B | multiple | Code Object constructor |
| sub_A3A7E0 | ~700 B | 1 | Stats emitter (per-function profile) |
| sub_A4B8F0 | ~250 B | 1 | Register count / annotation writer |
| sub_A50650 | ~74 KB | 8 | ORI Record Serializer (CodeObject::EmitRecords) |
| sub_A53840 | ~400 B | 1 | EmitRecords wrapper (adds type-44 header) |
| sub_424070 | 2,098 B | 3,809 | Pool allocator (alloc) |
| sub_4248B0 | 923 B | 1,215 | Pool deallocator (free) |
| sub_424C50 | 488 B | 27 | Pool reallocator (realloc) |
| sub_426150 | ~1.2 KB | 2,800 | Hash map insert |
| sub_426D60 | 345 B | 422 | Hash map lookup |
| sub_426EC0 | 349 B | 29 | Hash map contains |
| sub_425CA0 | 114 B | 127 | Hash map constructor |
| sub_425D20 | 121 B | 63 | Hash map destructor |
| sub_42CA60 | 81 B | 298 | Linked list prepend |
| sub_42CC30 | 34 B | 48 | Linked list length |
| sub_427630 | 273 B | 73 | MurmurHash3 string hash |
| sub_621480 | 21 KB | low | Symbol table builder |
| sub_625800 | 27 KB | low | Symbol resolution |
| sub_6BC560 | 4.9 KB | low | Constant bank handler |
| sub_6294E0 | 12.1 KB | low | Reserved shared memory management |
| sub_6C4DA0 | 15 KB | low | Shared memory intrinsic lowering |
| sub_7FD6C0 | ~800 B | 3 | ELF symbol emitter |
| sub_7FB6C0 | ~800 B | 1 | Pipeline orchestrator (context cleanup) |
| sub_7FBB70 | ~100 B | 1 | Per-kernel entry point |
| sub_663C30 | ~300 B | 1 | Compilation loop body |
| sub_662920 | varies | 1 | Global initialization (calls KnobsInit) |